LETTER Communicated by Yair Weiss

On the Uniqueness of Loopy Belief Propagation Fixed Points

Tom Heskes
[email protected]
SNN, University of Nijmegen, 6525 EZ, Nijmegen, The Netherlands

We derive sufficient conditions for the uniqueness of loopy belief propagation fixed points. These conditions depend on both the structure of the graph and the strength of the potentials and naturally extend those for convexity of the Bethe free energy. We compare them with (a strengthened version of) conditions derived elsewhere for pairwise potentials. We discuss possible implications for convergent algorithms, as well as for other approximate free energies.

1 Introduction

Loopy belief propagation is Pearl's belief propagation (Pearl, 1988) applied to networks containing cycles. It can be used to compute approximate marginals in Bayesian networks and Markov random fields. Whereas belief propagation is exact only in special cases, for example, for tree-structured (singly connected) networks with just gaussian or just discrete nodes, loopy belief propagation empirically often leads to good performance (Murphy, Weiss, & Jordan, 1999; McEliece, MacKay, & Cheng, 1998). That is, the approximate marginals computed with loopy belief propagation are in many cases close to the exact marginals. In gaussian graphical models, the means are guaranteed to coincide with the exact means (Weiss & Freeman, 2001). The notion that fixed points of loopy belief propagation correspond to extrema of the so-called Bethe free energy (Yedidia, Freeman, & Weiss, 2001) is an important step in the theoretical understanding of this success and paved the road for interesting generalizations.

However, when applied to graphs with cycles, loopy belief propagation does not always converge. So-called double-loop algorithms have been proposed that do guarantee convergence (Yuille, 2002; Teh & Welling, 2002; Heskes, Albers, & Kappen, 2003), but are an order of magnitude slower than standard loopy belief propagation. It is generally believed that there is a close connection between (non)convergence of loopy belief propagation and (non)uniqueness of loopy belief propagation fixed points. More specifically, the working hypothesis is that uniqueness of a loopy belief propagation fixed point guarantees convergence of loopy belief propagation to this fixed point. The goal of this study, then, is to derive sufficient

Neural Computation 16, 2379–2413 (2004) © 2004 Massachusetts Institute of Technology




conditions for uniqueness. Such conditions are not only relevant from a theoretical point of view, but can also be used to derive faster algorithms and suggest different free energies, as will be discussed in section 9.

2 Outline

Before getting into the mathematical details, we first sketch the line of reasoning that will be followed in this article. It is inspired by the connection between fixed points of loopy belief propagation and extrema of the Bethe free energy: by studying the Bethe free energy, we can learn about properties of loopy belief propagation.

The Bethe free energy is an approximation to the exact variational Gibbs-Helmholtz free energy. Both are concepts from (statistical) physics. Abstracting from the physical interpretation, the Gibbs-Helmholtz free energy is "just" a functional with a unique minimum, the argument of which corresponds to the exact probability distribution. However, the Gibbs-Helmholtz free energy is as intractable as the exact probability distribution. The idea is then to approximate the Gibbs-Helmholtz free energy, in the hope that the minimum of such a tractable approximate free energy relates to the minimum of the exact free energy. Examples of such approximations are the mean-field free energy, the Bethe free energy, and the Kikuchi free energy. The connections between the Gibbs-Helmholtz free energy, Bethe free energy, and loopy belief propagation are reviewed in section 3.

The Bethe free energy is a function of so-called pseudomarginals or beliefs. For the minimum of the Bethe free energy to make sense, these pseudomarginals have to be properly normalized as well as consistent. Our starting point, the upper-left corner in Figure 1, is a constrained minimization problem. In general, it is in fact a nonconvex constrained minimization problem, since the Bethe free energy is a nonconvex function of the pseudomarginals (the constraints are linear in these pseudomarginals).

However, using the constraints on the pseudomarginals, it may be possible to rewrite the Bethe free energy in a form that is convex in the pseudomarginals. When this is possible, we call the Bethe free energy "convex over the set of constraints" (Pakzad & Anantharam, 2002). Now, if the Bethe free energy is convex over the set of constraints, we have, in combination with the linearity of the constraints, a convex constrained minimization problem. Convex constrained minimization problems have a unique solution (see, e.g., Luenberger, 1984), which explains link d in Figure 1.

Sufficient conditions for convexity over the set of constraints, link b in Figure 1, can be found in Pakzad and Anantharam (2002) and Heskes et al. (2003). They are (re)derived and discussed in section 4. These conditions depend on only the structure of the graph, not on the (strength of the) potentials that make up the probability distribution defined over this graph. A corollary of these conditions, derived in section 4.3, is that the Bethe free energy for a graph with a single loop is "just" convex over the set of


Figure 1: Layout of correspondences and implications. See the text for details.

constraints; with two or more connected loops, the conditions fail (see also McEliece & Yildirim, 2003).

Milder conditions for uniqueness, which do depend on the strength of the interactions, follow from the track on the right-hand side of Figure 1. First, we note that nonconvex constrained minimization of the Bethe free energy is equivalent to an unconstrained nonconvex-concave minimax problem (Heskes, 2002), link a in Figure 1. Convergent double-loop algorithms like CCCP (Yuille, 2002) and faster variants thereof (Heskes et al., 2003) in fact solve such a minimax problem: the concave problem in the maximizing parameters (basically Lagrange multipliers) is solved by a message-passing algorithm very similar to standard loopy belief propagation in the inner loop, where the outer loop changes the minimizing parameters (a remaining set of pseudomarginals) in the proper downward direction. The transformation


from nonconvex constrained minimization problem to an unconstrained nonconvex-concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1.

Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then, in sections 6 and 8, work toward conditions under which this minimax problem is convex-concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article. Link c follows from the observation in section 5.2 that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex-concave.

As indicated by link e, convex-concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum (link f).

The concluding statement, by link g in the lower-right corner, is, to the best of our knowledge, no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

$$P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \qquad (3.1)$$

Here, $\Psi_\alpha$ is a potential, some function of the potential subset $X_\alpha$, and $Z$ is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain $X$. The convention that we adhere to in this article is that there are no potential subsets $X_\alpha$ and $X_{\alpha'}$ such that $X_{\alpha'}$ is fully subsumed by $X_\alpha$. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite.

The typical goal in Bayesian networks and Markov random fields is to compute the partition function $Z$ or marginals, for example,

$$P_{\text{exact}}(X_\alpha) = \sum_{X \setminus X_\alpha} P_{\text{exact}}(X).$$

One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be


roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

$$F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_X P(X)\log P(X),$$

with shorthand $\psi_\alpha \equiv \log \Psi_\alpha$. Minimizing this variational free energy over the set $\mathcal{P}$ of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum, and minus the log of the partition function as the value at the minimum:

$$P_{\text{exact}} = \mathop{\text{argmin}}_{P \in \mathcal{P}} F(P) \quad\text{and}\quad -\log Z = \min_{P \in \mathcal{P}} F(P).$$

Since the Gibbs-Helmholtz free energy is convex in $P$, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself we have not gained anything: the entropy may still be intractable to compute.
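These two properties can be checked numerically by brute force on a small model. The following sketch is not from the paper: the chain structure and potential values are made up for illustration. It verifies that $F$ evaluated at the exact distribution equals $-\log Z$ and lower-bounds $F$ at random normalized distributions:

```python
import itertools
import math
import random

# Two pairwise log-potentials psi_alpha = log Psi_alpha on a chain x0 - x1 - x2
# of binary variables (values are made up for illustration).
psi = {
    (0, 1): lambda a, b: 0.8 * a * b - 0.3 * a,
    (1, 2): lambda a, b: -0.5 * a * b + 0.2 * b,
}
states = list(itertools.product([0, 1], repeat=3))

def log_joint(x):
    return sum(f(x[i], x[j]) for (i, j), f in psi.items())

Z = sum(math.exp(log_joint(x)) for x in states)
P_exact = {x: math.exp(log_joint(x)) / Z for x in states}

def free_energy(P):
    # F(P) = -sum_alpha sum_{X_alpha} P(X_alpha) psi_alpha + sum_X P(X) log P(X);
    # the energy term is computed directly on the joint, which is equivalent.
    energy = -sum(P[x] * log_joint(x) for x in states)
    entropy_term = sum(p * math.log(p) for p in P.values() if p > 0)
    return energy + entropy_term

# At the exact distribution, the free energy equals -log Z ...
assert abs(free_energy(P_exact) + math.log(Z)) < 1e-12
# ... and random normalized distributions never do better.
random.seed(0)
for _ in range(100):
    w = [random.random() for _ in states]
    Q = {x: wi / sum(w) for x, wi in zip(states, w)}
    assert free_energy(Q) >= free_energy(P_exact) - 1e-12
```

The entropy term is the part that becomes intractable for large domains, since it sums over all joint states.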

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

$$\sum_X P(X)\log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha)\log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta)\log P(x_\beta),$$

with $x_\beta$ a (super)node and $n_\beta = \sum_{\alpha \supset \beta} 1$ the number of potentials that contains node $x_\beta$. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes $x_\beta$ are themselves subsets of the potential subsets, that is,

$$x_\beta \cap X_\alpha = \emptyset \quad\text{or}\quad x_\beta \cap X_\alpha = x_\beta \quad \forall_{\alpha,\beta},$$

and partition the domain $X$:

$$x_\beta \cap x_{\beta'} = \emptyset \quad \forall_{\beta \ne \beta'} \quad\text{and}\quad \bigcup_\beta x_\beta = X.$$

Typically the $x_\beta$ are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by $\beta$ and $x_\beta$ in lowercase, to contrast them with the potentials $\alpha$ and potential subsets $X_\alpha$ in uppercase.


Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta) \qquad (3.2)$$

over sets of "pseudomarginals"^1 or beliefs $\{Q_\alpha, Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,^2

$$\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad\text{and}\quad Q_\alpha(x_\beta) \equiv \sum_{X_\alpha \setminus x_\beta} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \qquad (3.3)$$

Let $\mathcal{Q}$ denote all sets of consistent and properly normalized pseudomarginals. Then our goal is to solve

$$\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).$$

The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals $P_{\text{exact}}(X_\alpha)$ and $P_{\text{exact}}(x_\beta)$.
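On a singly connected graph, the entropy approximation above is exact, so the Bethe free energy evaluated at the exact marginals equals $-\log Z$. The following numerical sketch is illustrative only (a made-up chain of three binary variables with two pairwise potentials; node 1 is the only node contained in both potentials, so $n_1 = 2$):

```python
import itertools
import math

# A made-up chain x0 - x1 - x2 of binary variables with two pairwise
# log-potentials; node 1 is the only node with n_beta = 2.
psi = {
    (0, 1): lambda a, b: 0.8 * a * b - 0.3 * a,
    (1, 2): lambda a, b: -0.5 * a * b + 0.2 * b,
}
states = list(itertools.product([0, 1], repeat=3))
joint = {x: math.exp(sum(f(x[i], x[j]) for (i, j), f in psi.items()))
         for x in states}
Z = sum(joint.values())
P = {x: v / Z for x, v in joint.items()}

def marg(dist, keep):
    """Marginalize a joint distribution onto the variables listed in `keep`."""
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

Q = {a: marg(P, a) for a in psi}   # exact potential marginals Q_alpha
Q1 = marg(P, (1,))                 # exact single-node marginal of node 1

def bethe_free_energy(Q, Q1):
    F = 0.0
    for a, f in psi.items():
        for xa, q in Q[a].items():
            F += -q * f(*xa) + q * math.log(q)      # energy + potential entropies
    F -= sum(q * math.log(q) for q in Q1.values())  # -(n_1 - 1) term, n_1 = 2
    return F

# On a tree, the Bethe free energy at the exact marginals equals -log Z.
assert abs(bethe_free_energy(Q, Q1) + math.log(Z)) < 1e-12
```

On loopy graphs this equality breaks down, which is exactly why the Bethe minimum is only an approximation there.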

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported on by Yedidia et al. (2001). It starts with the Lagrangian

$$\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = F(Q_\alpha, Q_\beta)
&+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\big[Q_\beta(x_\beta) - Q_\alpha(x_\beta)\big] \\
&+ \sum_\alpha \lambda_\alpha \Big[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Big]
+ \sum_\beta \lambda_\beta \Big[1 - \sum_{x_\beta} Q_\beta(x_\beta)\Big]. \qquad (3.4)
\end{aligned}$$

^1 Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

^2 Strictly speaking, we also have to take inequality constraints into account, namely those of the form $Q_\alpha(X_\alpha) \ge 0$. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active." For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of $L$ are zero: the ones with respect to the Lagrange multipliers $\lambda$ give back the constraints; the ones with respect to the pseudomarginals $Q$ give an extremum of the Bethe free energy. Setting the derivatives with respect to $Q_\alpha$ and $Q_\beta$ to zero, we can solve for $Q_\alpha$ and $Q_\beta$ in terms of the Lagrange multipliers:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Big[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \lambda_{\alpha\beta}(x_\beta)\Big]$$

$$Q^*_\beta(x_\beta) = \exp\Big[\frac{1}{n_\beta - 1}\Big(1 - \lambda_\beta + \sum_{\alpha \supset \beta} \lambda_{\alpha\beta}(x_\beta)\Big)\Big].$$

In terms of the "message" $\mu_{\beta \to \alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]$ from node $\beta$ to potential $\alpha$, the pseudomarginal $Q^*_\alpha(X_\alpha)$ reads

$$Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta \subset \alpha} \mu_{\beta \to \alpha}(x_\beta), \qquad (3.5)$$

where proper normalization yields the Lagrange multiplier $\lambda_\alpha$. With definition

$$\mu_{\alpha \to \beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}, \qquad (3.6)$$

the fixed-point equation for $Q^*_\beta(x_\beta)$ can, after some manipulation, be written in the form

$$Q^*_\beta(x_\beta) \propto \prod_{\alpha \supset \beta} \mu_{\alpha \to \beta}(x_\beta), \qquad (3.7)$$

where again the Lagrange multiplier $\lambda_\beta$ follows from normalization. Finally, the constraint $Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)$, in combination with equation 3.6, suggests the update

$$\mu_{\alpha \to \beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}. \qquad (3.8)$$

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller


steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

$$\log \mu^{\text{new}}_{\alpha \to \beta}(x_\beta) = \log \mu_{\alpha \to \beta}(x_\beta) + \epsilon\big[\log Q^*_\alpha(x_\beta) - \log \mu_{\beta \to \alpha}(x_\beta) - \log \mu_{\alpha \to \beta}(x_\beta)\big]. \qquad (3.9)$$

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points correspond to the zero derivatives of the Lagrangian.
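The message-passing scheme of equations 3.5 through 3.9 can be sketched in a few lines of code. This is an illustration under stated assumptions, not the paper's implementation: the factor graph and potential values are made up, and the scheduling is a simple sweep with damping $\epsilon = 0.5$. On the tree used here, the fixed point reproduces the exact marginals:

```python
import itertools
import math

# A made-up factor graph over binary variables: potentials Psi_alpha on the
# variable subsets (0, 1) and (1, 2) (a tree, so the fixed point is exact).
psi = {
    (0, 1): lambda a, b: math.exp(0.8 * a * b - 0.3 * a),
    (1, 2): lambda a, b: math.exp(-0.5 * a * b + 0.2 * b),
}
n_vars = 3
neighbors = {b: [a for a in psi if b in a] for b in range(n_vars)}

# mu[(a, b)] is the message mu_{alpha -> beta} from potential a to node b.
mu = {(a, b): [1.0, 1.0] for a in psi for b in a}

def Q_alpha(a):
    """Pseudomarginal of potential a: Psi_alpha times incoming messages (eq. 3.5)."""
    q = {}
    for xa in itertools.product([0, 1], repeat=len(a)):
        v = psi[a](*xa)
        for i, b in enumerate(a):
            # mu_{beta -> alpha} is the product of the other factors' messages to b.
            for a2 in neighbors[b]:
                if a2 != a:
                    v *= mu[(a2, b)][xa[i]]
        q[xa] = v
    s = sum(q.values())
    return {xa: v / s for xa, v in q.items()}

eps = 0.5  # damping step epsilon in the log domain (eq. 3.9)
for _ in range(200):
    for a in psi:
        qa = Q_alpha(a)
        for i, b in enumerate(a):
            q_b = [sum(v for xa, v in qa.items() if xa[i] == s) for s in (0, 1)]
            inc = [math.prod(mu[(a2, b)][s] for a2 in neighbors[b] if a2 != a)
                   for s in (0, 1)]
            for s in (0, 1):
                new = q_b[s] / inc[s]  # update 3.8: Q*_alpha(x_beta) / mu_{beta->alpha}
                mu[(a, b)][s] = math.exp(
                    (1 - eps) * math.log(mu[(a, b)][s]) + eps * math.log(new))

# On this tree, the damped iteration reproduces the exact marginals.
states = list(itertools.product([0, 1], repeat=n_vars))
joint = {x: psi[(0, 1)](x[0], x[1]) * psi[(1, 2)](x[1], x[2]) for x in states}
Z = sum(joint.values())
for a in psi:
    qa = Q_alpha(a)
    for xa, q in qa.items():
        exact = sum(v for x, v in joint.items()
                    if tuple(x[i] for i in a) == xa) / Z
        assert abs(q - exact) < 1e-6
```

The same loop runs unchanged on a loopy graph; it is then that convergence, and the uniqueness questions studied in this article, become nontrivial.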

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

$$\min_{Q_\alpha, Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta}\ L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta).$$

The ordering of the min and max operations is important: here, to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters $\{Q_\alpha, Q_\beta\}$. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in $\{Q_\alpha, Q_\beta\}$.

4.2 Conditions for Convexity. The problem is with the term

$$S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta),$$

which is concave in $Q_\beta$. Using the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, we can turn it into a functional that is convex in $Q_\alpha$ and $Q_\beta$ separately, but not necessarily jointly. That is, with the substitution $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$ for any $\alpha \supset \beta$, the entropy, and thus the Bethe free energy, is convex in $Q_\alpha$ and in $Q_\beta$, but not necessarily in $\{Q_\alpha, Q_\beta\}$. However, if we add to $S_\beta(Q_\beta)$ a convex entropy contribution

$$-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha),$$


the combination of $-S_\alpha$ and $S_\beta$ is convex in $\{Q_\alpha, Q_\beta\}$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1.

$$\Lambda_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)$$

is convex in $\{Q_\alpha, Q_\beta\}$.

Proof. The matrix with second derivatives of $\Lambda_{\alpha\beta}$ has the components

$$H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha, X'_\alpha}$$

$$H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta}$$

$$H(x_\beta, x'_\beta) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta},$$

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,

$$\begin{aligned}
0 &\le \big(R_\alpha(X_\alpha)\ \ R_\beta(x_\beta)\big)
\begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix}
\begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix} \\
&= \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2\sum_{X_\alpha} \frac{R_\alpha(X_\alpha) R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta) R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)} \\
&= \sum_{X_\alpha} Q_\alpha(X_\alpha) \left[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\right]^2.
\end{aligned}$$
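As a sanity check, the completed-square identity at the end of this proof can be verified numerically. The sketch below is not from the paper: it draws random positive $Q$'s and arbitrary $R$'s for a pair of binary variables, with $x_\beta$ taken to be the first component of $X_\alpha$:

```python
import random

random.seed(1)
# X_alpha = (x_beta, x_other): a binary pair, with x_beta the shared node.
Xs = [(b, o) for b in (0, 1) for o in (0, 1)]

for _ in range(100):
    Qa = {x: random.uniform(0.1, 1.0) for x in Xs}       # positive pseudomarginals
    Qb = {b: random.uniform(0.1, 1.0) for b in (0, 1)}
    Ra = {x: random.uniform(-1.0, 1.0) for x in Xs}      # arbitrary test "vector"
    Rb = {b: random.uniform(-1.0, 1.0) for b in (0, 1)}
    # Marginal Q_alpha(x_beta), summing out the second component.
    Qa_b = {b: sum(q for (bb, o), q in Qa.items() if bb == b) for b in (0, 1)}

    # Quadratic form of the Hessian computed in the proof ...
    quad = (sum(Ra[x] ** 2 / Qa[x] for x in Xs)
            - 2 * sum(Ra[x] * Rb[x[0]] / Qb[x[0]] for x in Xs)
            + sum(Qa_b[b] * Rb[b] ** 2 / Qb[b] ** 2 for b in (0, 1)))
    # ... equals the manifestly nonnegative complete square.
    square = sum(Qa[x] * (Ra[x] / Qa[x] - Rb[x[0]] / Qb[x[0]]) ** 2 for x in Xs)
    assert abs(quad - square) < 1e-9 and quad >= 0.0
```

Note that the identity holds for any positive $Q_\alpha$ and $Q_\beta$; consistency between them is not needed for the lemma.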

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\,\beta \subset \alpha}$ (positivity),
2. $\sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation). $\qquad$ (4.1)


Proof. First, we note that we do not have to worry about the energy terms that are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\Big[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1) S_\beta(Q_\beta)\Big],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta - 1)S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset $\alpha$ to compensate for node $\beta$. Now, in shorthand notation and with a little bit of rewriting,

$$\begin{aligned}
-S(Q) &= -\Big[\sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1) S_\beta\Big] \\
&= -\sum_\alpha \Big(1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} + \sum_{\beta \subset \alpha} A_{\alpha\beta}\Big) S_\alpha
- \sum_\beta \Big[-\sum_{\alpha \supset \beta} A_{\alpha\beta} + \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1)\Big] S_\beta \\
&= -\sum_\alpha \Big(1 - \sum_{\beta \subset \alpha} A_{\alpha\beta}\Big) S_\alpha
- \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta}\,\big[S_\alpha - S_\beta\big]
- \sum_\beta \Big[\sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1)\Big] S_\beta.
\end{aligned}$$

Convexity of the first term is guaranteed if $1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root $\beta^*$, and define

$$A_{\alpha\beta} = 1 \ \text{ iff }\ \beta \subset \alpha \text{ and } \beta \text{ closer to the root } \beta^* \text{ than any other } \beta' \subset \alpha; \qquad A_{\alpha\beta'} = 0 \text{ for all other } \beta'.$$

Obviously, this choice of $A$ satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each $\beta \ne \beta^*$ there is just a single potential $\alpha \supset \beta$ that is closer to the root $\beta^*$ than $\beta$ itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence condition 3 is also satisfied:

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta \ne \beta^*} \quad\text{and}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of $A$, we are in a sense "eating up resources toward the root." At the root we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node $\beta^*$ from a potential $\alpha^*$, such that a singly connected structure is left. Construct a matrix $A$ as in the proof of corollary 1, taking the node $\beta^*$ as the root. The matrix $A$ constructed in this way also just works for the graph with the closed loop, since still

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta \ne \beta^*} \quad\text{and now}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, $1 - n_\beta$, with $n_\beta$ the number of neighboring potentials. The arrows pointing from potentials $\alpha$ to nodes $\beta$ visualize the allocation matrix $A$, with $A_{\alpha\beta} = 1$ if there is an arrow and $A_{\alpha\beta} = 0$ otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely $n_\beta - 1$ incoming arrows, just sufficient to compensate the overcounting number $1 - n_\beta$. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes $B$ we have

$$\sum_{\beta \in B} (1 - n_\beta) + \sum_{\alpha \in \pi(B)} 1 \ge 0, \qquad (4.2)$$

where $\pi(B) \equiv \{\alpha : \exists_{\beta \in B}\ \beta \subset \alpha\}$ denotes the "parent" set of $B$, that is, those potential subsets that include at least one node in $B$.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set $B$,

$$\sum_{\beta \in B} (n_\beta - 1) \le \sum_{\beta \in B} \sum_{\alpha \supset \beta} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} \sum_{\beta \subset \alpha} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
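Condition 4.2 is straightforward to check exhaustively on small graphs. The sketch below is not from the paper: it enumerates all node subsets $B$ and confirms that a single loop passes while two connected loops (two triangles sharing an edge) fail, in line with the corollaries above:

```python
from itertools import chain, combinations

def bethe_convex_over_constraints(potentials, nodes):
    """Brute-force check of condition 4.2 over all nonempty node subsets B."""
    n = {b: sum(b in a for a in potentials) for b in nodes}
    all_B = chain.from_iterable(
        combinations(nodes, k) for k in range(1, len(nodes) + 1))
    for B in all_B:
        parents = {a for a in potentials if any(b in a for b in B)}  # pi(B)
        if sum(1 - n[b] for b in B) + len(parents) < 0:
            return False
    return True

# A single loop of pairwise potentials satisfies the condition ...
loop = [(0, 1), (1, 2), (2, 0)]
assert bethe_convex_over_constraints(loop, [0, 1, 2])
# ... but two connected loops (two triangles sharing an edge) do not:
# taking B = all four nodes gives (1-2)+(1-3)+(1-3)+(1-2) + 5 = -1 < 0.
two_loops = [(0, 1), (1, 2), (2, 0), (1, 3), (2, 3)]
assert not bethe_convex_over_constraints(two_loops, [0, 1, 2, 3])
```

The enumeration over subsets is exponential in the number of nodes, so this is only practical for toy graphs; by proposition 1, checking feasibility of the allocation conditions in equation 4.1 gives an equivalent answer.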

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term that is linear in the pseudomarginals


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,^3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \qquad (5.1)$$

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1. \qquad (5.2)$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

$$\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = &-\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) \\
&- \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta) \\
&+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Big[\frac{1}{n_\beta - 1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Big] \\
&+ \sum_\alpha \lambda_\alpha \Big[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Big] + \sum_\beta (n_\beta - 1)\Big[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Big]. \qquad (5.3)
\end{aligned}$$

^3 We would like to conjecture that this is not possible: that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:^4

$$\min_{Q_\alpha, Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ \min_{Q_\alpha}\ L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).$$

Rewriting

$$\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Big[\frac{1}{n_\beta - 1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Big] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),$$

with

$$\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1}\sum_{\alpha' \supset \beta} \lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to $Q_\alpha$:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Big[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha}\big(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\big)\Big]. \qquad (5.4)$$

^4 In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$\begin{aligned}
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) &\equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
&= -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Big[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha}\big(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\big)\Big] \\
&\quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1)\Big[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Big]. \qquad (5.5)
\end{aligned}$$

Next we find for the maximum with respect to λα:

$$
\exp\left[ 1 - \lambda^*_\alpha \right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha)
\exp\left[ \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right\} \right] \equiv Z^*_\alpha, \tag{5.6}
$$

where we have to keep in mind that Z*α by itself, like Q*α, is a function of the remaining pseudomarginals Qβ and Lagrange multipliers λ̄αβ. Substituting this solution into the dual, we arrive at

$$
G(Q_\beta, \bar\lambda_{\alpha\beta}) \equiv G(Q_\beta, \bar\lambda_{\alpha\beta}, \lambda^*_\alpha)
= -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.7}
$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Qα, has a unique minimum in Qα (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ̄αβ and then the minimum over the remaining pseudomarginals Qβ. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in Qβ. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Qα, Qβ} implies convexity of the dual, equation 5.7, in Qβ.

Proof. First we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\big(x, [y^*(x+\delta) + y^*(x-\delta)]/2\big) \ge 2 f(x, y^*(x)),
$$

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Qβ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Qα, Qβ}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ̄αβ, λα}. Next we note that the maximum of a convex or concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),
$$

where the first inequality follows from y*(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence the dual, equation 5.7, must still be convex in Qβ.
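The first of these two observations, that partial minimization preserves convexity, is easy to probe numerically. Below is a small sketch of my own (not from the paper), using the jointly convex quadratic f(x, y) = x² + xy + y²:

```python
import numpy as np

# f is jointly convex: its Hessian [[2, 1], [1, 2]] is positive definite.
def f(x, y):
    return x**2 + x * y + y**2

# g(x) = min_y f(x, y), approximated by a fine grid over y.
# (Analytically, y*(x) = -x/2 and g(x) = 0.75 * x**2.)
def g(x):
    ys = np.linspace(-10.0, 10.0, 20001)
    return float(np.min(f(x, ys)))

# Midpoint-convexity check of g on a grid of points and offsets:
# 2 g(x) <= g(x - d) + g(x + d) should hold everywhere.
for x in np.linspace(-3.0, 3.0, 13):
    for d in (0.5, 1.0, 2.0):
        assert 2.0 * g(x) <= g(x - d) + g(x + d) + 1e-6
```

The same midpoint check with argmax in place of argmin illustrates the second observation.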

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψα(Xα) ≤ Ψmax for all α and Xα, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function G(Qβ) ≡ max_{λ̄αβ} G(Qβ, λ̄αβ) is bounded from below for a particular choice of Aαβ satisfying equation 5.2. Considering Aαβ = (nβ − 1)/nβ, we then have

$$
\begin{aligned}
G(Q_\beta) &\ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)
\exp\left[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \right]
+ \sum_\beta (n_\beta - 1)\left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}
\log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta)
+ \sum_\beta (n_\beta - 1)\left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}
\log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max}
+ \sum_\beta (n_\beta - 1)\left[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}
\log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max},
\end{aligned}
$$

where the first inequality follows by substituting the choice λ̄αβ(xβ) = 0 for all α, β, and xβ in G(Qβ, λ̄αβ), the second from the concavity of the function y^{(nβ−1)/nβ}, the third from the upper bound on the potentials, and the last from z − 1 ≥ log z for z > 0, which makes the final bracket nonnegative.
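The last two steps of this chain use only elementary facts: the potentials are capped by Ψmax, and z − 1 ≥ log z for all z > 0. The latter bound is quickly sanity-checked (my own snippet, not from the paper):

```python
import math

# z - 1 >= log(z) for z > 0, with equality only at z = 1; this is what makes
# the bracket multiplying (n_beta - 1) nonnegative in the final inequality.
for z in [1e-9, 1e-3, 0.5, 1.0, 2.0, 10.0, 1e6]:
    assert z - 1.0 >= math.log(z)
assert abs((1.0 - 1.0) - math.log(1.0)) < 1e-15  # equality at z = 1
```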

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Qβ. The first derivative yields

$$
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta}\, \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},
$$


we make use of

$$
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\, \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
$$

where both β and β′ should be a subset of α, and with conventions Q*α(xβ, xβ) = Q*α(xβ) and Q*α(xβ, x′β) = 0 if xβ ≠ x′β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

$$
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) &= \sum_{\alpha \supset \beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta}
+ \sum_{\alpha \supset \beta} A^2_{\alpha\beta}\, \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)}, \\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) &= -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'}\, \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})}
\quad \text{for } \beta' \ne \beta,
\end{aligned}
$$

where δ_{xβ,x′β} = 1 if and only if xβ = x′β. Here it should be noted that both β and xβ play the role of indices; that is, xβ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ̄αβ and pseudomarginals Qβ.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Qβ, λ̄αβ}, that is, conditions that guarantee

$$
K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \ge 0
$$

for any choice of the "vector" S with elements Sβ(xβ). Straightforward manipulations yield

$$
\begin{aligned}
K &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) && (K_1) \\
&\quad + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) && (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), && (K_3)
\end{aligned}
$$

where Rβ(xβ) ≡ Sβ(xβ)/Qβ(xβ).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$
K_2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta) \right]^2 \ge 0
$$

and5

$$
\begin{aligned}
K_3 &= \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})
\times \left\{ \frac{1}{2}\left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned} \tag{6.1}
$$

we have

$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

That is, sufficient conditions for K to be nonnegative are

$$
A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha} \qquad \text{and} \qquad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha,
$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction", such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψα(Xα) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q*α(Xα) factorizes:6

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.
$$

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6 The exact marginal Pexact(Xα) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

$$
\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}
\left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}).
\end{aligned}
$$

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 for all β ⊂ α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

$$
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \ne \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.
$$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials nβ by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success", we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since K2 ≥ 0, we will concentrate on K3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})
\quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}}\ \forall_{x_\beta, x'_{\beta'}}, \tag{7.1}
$$

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

$$
\sigma_\alpha = 1 - \exp(-\omega_\alpha), \quad \text{with} \quad
\omega_\alpha \equiv \max_{X_\alpha, \hat X_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat X_{\alpha \setminus \beta}, x_\beta) \right], \tag{7.2}
$$

where nα ≡ Σβ⊂α 1.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

$$
\begin{aligned}
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
&= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)\, \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \left\{ \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'})\, \mu_\beta(x_\beta) \right\}} \\
&= \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'})},
\end{aligned} \tag{7.3}
$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$
V(\mu) = (n-1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right]
- \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x_{\beta'}) \right]
$$

with respect to the messages μ, under constraints Σxβ μβ(xβ) = 1 for all β and μβ(xβ) ≥ 0 for all β and xβ, occurs at an extreme point μβ(xβ) = δ_{xβ, x̄β} for some x̄β to be found.

Proof. Let us consider optimizing the message μ1(x1) with fixed messages μβ(xβ) for β > 1. The first and second derivatives are easily found to obey

$$
\begin{aligned}
\frac{\partial V}{\partial \mu_1(x_1)} &= (n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 \mid x^*_\beta), \\
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} &= \sum_{\beta \ne 1} Q(x_1 \mid x^*_\beta)\, Q(x'_1 \mid x^*_\beta) - (n-1)\, Q(x_1)\, Q(x'_1),
\end{aligned}
$$


where

$$
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
$$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

$$
(n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 \mid x^*_\beta) = \lambda,
$$

with λ a Lagrange multiplier implementing the constraint Σx1 μ1(x1) = 1. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix with second derivatives at such an extremum, we then have

Summing over x1 we obtain λ = 0 (in fact V is indifferent to any multi-plicative scaling of micro) For the matrix with second derivatives at such anextremum we then have

part2Vpartmicro1(x1)partmicro1(xprime1)

=sumβ =1

sumβprime =1βprime =β

Q(x1|xlowastβ)Q(xprime1|xlowastβ)

which is positive semidefinite the extremum cannot be a maximum Con-sequently any maximum must be at the boundary of the domain Sincethis holds for any choice of microβ(xβ) β gt 1 it follows by induction that themaximum with respect to all microβ(xβ) must be at an extreme point as well

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by a maximization over values X̂:

$$
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat X} \frac{\Psi(X) \left[ \Psi(\hat X) \right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)}.
$$

Next we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

$$
\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \hat X} \frac{\Psi(X) \left[ \Psi(\hat X) \right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)}. \tag{7.4}
$$


The inequality 7.1 then follows by summing out X_{∖ββ′} in

$$
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).
$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$
\omega \equiv -\log(1 - \sigma) = \max_{X, \hat X} W(X, \hat X), \quad \text{with} \quad
W(X, \hat X) = \psi(X) + (n-1)\, \psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta),
$$

where we recall that ψ(X) ≡ log Ψ(X).
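For potentials over a handful of binary nodes, equation 7.2 can be evaluated by brute-force enumeration of X and X̂. A sketch of my own (the example log-potential below is made up):

```python
import itertools
import math

def strength(psi, n):
    """omega from equation 7.2: max over X, Xhat of
    psi(X) + (n-1)*psi(Xhat) - sum_beta psi(Xhat with node beta replaced by X[beta]),
    for a log-potential psi over n binary nodes."""
    omega = 0.0  # combinations with X == Xhat yield W = 0, so omega >= 0
    for X in itertools.product([0, 1], repeat=n):
        for Xhat in itertools.product([0, 1], repeat=n):
            w = psi(X) + (n - 1) * psi(Xhat)
            for beta in range(n):
                Y = list(Xhat)
                Y[beta] = X[beta]
                w -= psi(tuple(Y))
            omega = max(omega, w)
    return omega

# Pairwise example psi(x1, x2) = w*x1*x2 plus factorizing ("threshold") terms:
# the strength comes out as |w|, independent of the factorizing terms.
w = 1.3
omega = strength(lambda X: w * X[0] * X[1] + 0.7 * X[0] - 0.2 * X[1], n=2)
sigma = 1.0 - math.exp(-omega)
assert abs(omega - abs(w)) < 1e-12
assert 0.0 <= sigma < 1.0
```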

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$
\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.
$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider

$$
\begin{aligned}
W(x_1, x_2, x_{\setminus 12};\ \hat x_1, \hat x_2, x_{\setminus 12})
&= \psi(x_1, x_2, x_{\setminus 12}) + \psi(\hat x_1, \hat x_2, x_{\setminus 12}) - \psi(x_1, \hat x_2, x_{\setminus 12}) - \psi(\hat x_1, x_2, x_{\setminus 12}) \\
&= -W(x_1, \hat x_2, x_{\setminus 12};\ \hat x_1, x_2, x_{\setminus 12}).
\end{aligned}
$$

If now also x2 = x̂2, we get W(x1, x2, x∖12; x̂1, x2, x∖12) = −W(x1, x2, x∖12; x̂1, x2, x∖12) = 0. Furthermore, if W(x1, x2, x∖12; x̂1, x̂2, x∖12) ≤ 0, then it must be that W(x1, x̂2, x∖12; x̂1, x2, x∖12) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1||x2|(|x1| − 1)(|x2| − 1)/4 combinations. And indeed, for binary nodes x1,2 ∈ {0, 1}, we immediately obtain

$$
\omega = \left| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \right|. \tag{7.5}
$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$
\Psi(x_1, x_2) \propto \exp\left[ w\, x_1 x_2 + \theta_1 x_1 + \theta_2 x_2 \right].
$$

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight". It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
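The dependence of ω on the coding convention is easy to confirm by evaluating the four-term combination of equation 7.5 on the two states of each node (my own quick check; the numbers are arbitrary):

```python
w, t1, t2 = 0.8, 0.3, -0.5

# 0/1 coding: psi(x1, x2) = w*x1*x2 + t1*x1 + t2*x2  ->  omega = |w|.
psi01 = lambda x1, x2: w * x1 * x2 + t1 * x1 + t2 * x2
omega01 = abs(psi01(0, 0) + psi01(1, 1) - psi01(0, 1) - psi01(1, 0))
assert abs(omega01 - abs(w)) < 1e-12

# -1/+1 coding: psi(s1, s2) = w*s1*s2 + t1*s1 + t2*s2  ->  omega = 4*|w|,
# again independent of the thresholds t1 and t2.
psipm = lambda s1, s2: w * s1 * s2 + t1 * s1 + t2 * s2
omegapm = abs(psipm(-1, -1) + psipm(1, 1) - psipm(-1, 1) - psipm(1, -1))
assert abs(omegapm - 4.0 * abs(w)) < 1e-12
```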

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix Aαβ between potentials α and nodes β with properties

$$
\begin{aligned}
&1.\quad A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha} && \text{(positivity)} \\
&2.\quad (1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha && \text{(sufficient amount of resources)} \\
&3.\quad \sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta && \text{(sufficient compensation)}
\end{aligned} \tag{8.1}
$$

with the strength σα a function of the potential Ψα(Xα), as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K2 + K3 ≥ 0 for any choice of Rβ(xβ).

Substituting the bound, equation 7.1, into the term K3, we obtain

$$
\begin{aligned}
K_3 &\ge -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
&\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \sum_{\substack{\beta' \subset \alpha \\ \beta' \ne \beta}} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned}
$$

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, and combining K1 and (the above lower bound on) K3, we get

$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \ne \beta} A_{\alpha\beta'} \right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

This implies

$$
(1 - \sigma_\alpha)\, A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\, \beta \subset \alpha},
$$

which in combination with Aαβ ≥ 0 and σα ≤ 1 yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σα = 0, condition 2 becomes maxβ⊂α Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3.
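Conditions 1 to 3 of theorem 4 are mechanical to verify for a candidate allocation matrix. The following sketch is my own illustration (the helper name and the example graph, a 3 × 3 toroidal Ising grid as in section 8.3, are not code from the paper):

```python
def check_theorem4(scopes, sigma, A, tol=1e-12):
    """scopes[a]: set of nodes of potential a; sigma[a]: its strength;
    A[(a, b)]: candidate allocation. Returns True iff conditions 8.1 hold."""
    nodes = set().union(*scopes)
    n = {b: sum(b in s for s in scopes) for b in nodes}  # n_beta
    for a, s in enumerate(scopes):
        if any(A[(a, b)] < -tol for b in s):  # condition 1: positivity
            return False
        # condition 2: (1 - sigma) * max A + sigma * sum A <= 1
        if (1 - sigma[a]) * max(A[(a, b)] for b in s) \
                + sigma[a] * sum(A[(a, b)] for b in s) > 1 + tol:
            return False
    # condition 3: sum over potentials containing b of A >= n_beta - 1
    return all(sum(A[(a, b)] for a, s in enumerate(scopes) if b in s)
               >= n[b] - 1 - tol for b in nodes)

# 3x3 toroidal grid: 9 nodes, 18 pairwise potentials, uniform A = 3/4.
scopes = [{(i, j), ((i + 1) % 3, j)} for i in range(3) for j in range(3)] \
       + [{(i, j), (i, (j + 1) % 3)} for i in range(3) for j in range(3)]
A = {(a, b): 0.75 for a, s in enumerate(scopes) for b in s}
assert check_theorem4(scopes, sigma=[1 / 3] * 18, A=A)    # sigma = 1/3 passes
assert not check_theorem4(scopes, sigma=[0.4] * 18, A=A)  # sigma = 0.4 fails
```

The two asserts reproduce the σ ≤ 1/3 threshold worked out for this grid in section 8.3.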

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different; it is based on the factorization

$$
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
$$

to be compared with our equation 3.1, where there are no self-potentials Ψβ(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall_\beta. \tag{8.2}
$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}
$$

with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψβ(xβ). In fact, it is valid for any choice

$$
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
$$

where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$
\begin{aligned}
\min_{\phi_1, \phi_2} & \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right] \\
&= \min_{\phi_1, \phi_2} \left\{ \max_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right]
- \min_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] \right\}.
\end{aligned}
$$

In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2 that yield the smallest gap are such that

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat x_1, \hat x_2) + \phi_1(\hat x_1) + \phi_2(\hat x_2)
\ge \psi(x_1, \hat x_2) + \phi_1(x_1) + \phi_2(\hat x_2) = \psi(\hat x_1, x_2) + \phi_1(\hat x_1) + \phi_2(x_2) \tag{8.4}
$$

for some x1, x2, x̂1, and x̂2, with x̂1 ≠ x1 and x̂2 ≠ x2. Solving for φ1 and φ2, we find

$$
\begin{aligned}
\phi_1(x_1) - \phi_1(\hat x_1) &= \frac{1}{2} \left[ \psi(\hat x_1, x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(x_1, \hat x_2) \right], \\
\phi_2(x_2) - \phi_2(\hat x_2) &= \frac{1}{2} \left[ \psi(x_1, \hat x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(\hat x_1, x_2) \right].
\end{aligned}
$$

Substitution back into equation 8.4 yields

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \hat x_2) - \phi_1(x_1) - \phi_2(\hat x_2)
= \frac{1}{2} \left[ \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(x_1, \hat x_2) - \psi(\hat x_1, x_2) \right],
$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find

$$
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right]
= \frac{1}{2} \left| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \right| = \frac{\omega}{2},
$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
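This optimization can be checked numerically by grid search over the only relevant quantities, the differences φ1(1) − φ1(0) and φ2(1) − φ2(0); a sketch of my own, with an arbitrary random 2 × 2 log-potential:

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=(2, 2))  # arbitrary pairwise binary log-potential
omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])

# Only a = phi_1(1)-phi_1(0) and b = phi_2(1)-phi_2(0) affect the gap;
# constant offsets of phi_1, phi_2 cancel between max and min.
a = np.linspace(-6, 6, 601).reshape(-1, 1, 1, 1)
b = np.linspace(-6, 6, 601).reshape(1, -1, 1, 1)
x1 = np.array([0, 1]).reshape(1, 1, -1, 1)
x2 = np.array([0, 1]).reshape(1, 1, 1, -1)
vals = psi[x1, x2] + a * x1 + b * x2           # psi-tilde on all four states
gap = vals.max(axis=(2, 3)) - vals.min(axis=(2, 3))

# The smallest achievable gap should be omega / 2, up to the grid resolution.
assert gap.min() >= omega / 2 - 1e-9
assert abs(gap.min() - omega / 2) < 0.05
```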

Next we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \tag{8.5}
$$

with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2)

$$
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.
$$

Substitution into condition 3 yields

$$
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \tag{8.6}
$$

Since ωα = −log(1 − σα) ≥ σα, condition 8.5 is weaker than condition 8.6.
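The condition-2 algebra in this proof is a one-liner to spot-check (my own snippet):

```python
# Pairwise potentials have two nodes, so with A = 1 - sigma on both:
# (1 - sigma) * max A + sigma * sum A = (1 - sigma)^2 + 2*sigma*(1 - sigma)
#                                     = 1 - sigma^2 <= 1.
for sigma in [0.0, 0.3, 0.7, 0.999]:
    A = 1.0 - sigma
    cond2 = (1.0 - sigma) * A + sigma * 2.0 * A
    assert abs(cond2 - (1.0 - sigma**2)) < 1e-12
    assert cond2 <= 1.0
```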

Summarizing, the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials, and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical αcritical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down".

In this symmetric problem, the strength of each potential is given by

$$
\omega = 2 \log \left[ \frac{\alpha}{1 - \alpha} \right] \quad \text{and thus} \quad \sigma = 1 - \left( \frac{1 - \alpha}{\alpha} \right)^2.
$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$
\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
$$

The critical value that follows from corollary 3 is in this case slightly better:

$$
\omega < 1 \quad \text{and thus} \quad \alpha \le \frac{1}{1 + e^{-1/2}} \approx 0.62.
$$
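These two thresholds follow directly from the strength formulas above; a quick numeric check of my own:

```python
import math

def sigma(alpha):   # strength of the ferromagnetic pairwise potential
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

def omega(alpha):
    return 2.0 * math.log(alpha / (1.0 - alpha))

# Theorem 4 with A = 3/4: sigma <= 1/3  <=>  alpha <= 1 / (1 + sqrt(2/3)).
a_thm4 = 1.0 / (1.0 + math.sqrt(2.0 / 3.0))
assert abs(sigma(a_thm4) - 1.0 / 3.0) < 1e-12
assert round(a_thm4, 2) == 0.55

# Corollary 3 (each node has 4 neighbors): omega < 1  <=>  alpha < 1 / (1 + exp(-1/2)).
a_cor3 = 1.0 / (1.0 + math.exp(-0.5))
assert abs(omega(a_cor3) - 1.0) < 1e-12
assert round(a_cor3, 2) == 0.62
```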

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical αcritical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem, it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$
(2 - 2A)\,\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \le 1.
$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

$$
\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$
\sigma \le \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
$$

better than the α < 0.62 from corollary 3 and to be compared with the critical αcritical ≈ 0.88.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature".

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. And the second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms", similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Qβ, λ̄αβ) with respect to Qβ. But in fact, we only need G(Qβ) ≡ max_{λ̄αβ} G(Qβ, λ̄αβ) to be convex, which is a weaker requirement. The Hessian of G(Qβ), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of Aαβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
$$

zero thresholds, and potentials

$$
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \ne x_j.
$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with Pi(xi) = 0.5 for all nodes i and xi = {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
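A minimal reimplementation of this experiment is sketched below (my own code, not from the paper; I am assuming the damping of equation 3.9 is the usual convex combination of old and new messages, and I run it here only in the convergent regime ω = 1):

```python
import numpy as np

W = np.array([[0, 1, -1, -1],
              [1, 0, 1, -1],
              [-1, 1, 0, -1],
              [-1, -1, -1, 0]], dtype=float)

def run_bp(omega, step, iters=2000, seed=1):
    w = omega * W
    # Psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj, else exp(-w_ij/4)
    def psi(i, j):
        return np.exp(np.array([[w[i, j], -w[i, j]],
                                [-w[i, j], w[i, j]]]) / 4.0)
    rng = np.random.default_rng(seed)
    m = {(i, j): rng.random(2) for i in range(4) for j in range(4) if i != j}
    m = {k: v / v.sum() for k, v in m.items()}
    for _ in range(iters):
        new = {}
        for (i, j), old in m.items():
            incoming = np.ones(2)
            for k in range(4):
                if k != i and k != j:
                    incoming *= m[(k, i)]
            msg = psi(i, j).T @ incoming            # sum over xi
            msg /= msg.sum()
            new[(i, j)] = (1 - step) * old + step * msg  # damped update
        m = new
    P = np.ones((4, 2))                              # single-node marginals
    for j in range(4):
        for i in range(4):
            if i != j:
                P[j] *= m[(i, j)]
        P[j] /= P[j].sum()
    return P

P = run_bp(omega=1.0, step=0.5)
assert np.allclose(P, 0.5, atol=1e-6)  # small weights: the trivial fixed point
```

Raising ω well above the figure's transition region reproduces the limit-cycle behavior described in the text.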

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.

2412 T Heskes

[Figure 4 appears here: a plot of weight strength (3.5 to 6) against step size (0 to 1), with two insets showing iteration traces of the marginal.]

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.


Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


conditions for uniqueness. Such conditions are not only relevant from a theoretical point of view but can also be used to derive faster algorithms and suggest different free energies, as will be discussed in section 9.

2 Outline

Before getting into the mathematical details, we first sketch the line of reasoning that will be followed in this article. It is inspired by the connection between fixed points of loopy belief propagation and extrema of the Bethe free energy: by studying the Bethe free energy, we can learn about properties of loopy belief propagation.

The Bethe free energy is an approximation to the exact variational Gibbs-Helmholtz free energy. Both are concepts from (statistical) physics. Abstracting from the physical interpretation, the Gibbs-Helmholtz free energy is "just" a functional with a unique minimum, the argument of which corresponds to the exact probability distribution. However, the Gibbs-Helmholtz free energy is as intractable as the exact probability distribution. The idea is then to approximate the Gibbs-Helmholtz free energy, in the hope that the minimum of such a tractable approximate free energy relates to the minimum of the exact free energy. Examples of such approximations are the mean-field free energy, the Bethe free energy, and the Kikuchi free energy. The connections between the Gibbs-Helmholtz free energy, Bethe free energy, and loopy belief propagation are reviewed in section 3.

The Bethe free energy is a function of so-called pseudomarginals or beliefs. For the minimum of the Bethe free energy to make sense, these pseudomarginals have to be properly normalized as well as consistent. Our starting point, the upper-left corner in Figure 1, is a constrained minimization problem. In general it is in fact a nonconvex constrained minimization problem, since the Bethe free energy is a nonconvex function of the pseudomarginals (the constraints are linear in these pseudomarginals).

However, using the constraints on the pseudomarginals, it may be possible to rewrite the Bethe free energy in a form that is convex in the pseudomarginals. When this is possible, we call the Bethe free energy "convex over the set of constraints" (Pakzad & Anantharam, 2002). Now, if the Bethe free energy is convex over the set of constraints, we have, in combination with the linearity of the constraints, a convex constrained minimization problem. Convex constrained minimization problems have a unique solution (see, e.g., Luenberger, 1984), which explains link d in Figure 1.

Sufficient conditions for convexity over the set of constraints, link b in Figure 1, can be found in Pakzad and Anantharam (2002) and Heskes et al. (2003). They are (re)derived and discussed in section 4. These conditions depend on only the structure of the graph, not on the (strength of the) potentials that make up the probability distribution defined over this graph. A corollary of these conditions, derived in section 4.3, is that the Bethe free energy for a graph with a single loop is "just" convex over the set of


Figure 1: Layout of correspondences and implications. See the text for details.

constraints; with two or more connected loops, the conditions fail (see also McEliece & Yildirim, 2003).

Milder conditions for uniqueness, which do depend on the strength of the interactions, follow from the track on the right-hand side of Figure 1. First, we note that nonconvex constrained minimization of the Bethe free energy is equivalent to an unconstrained nonconvex-concave minimax problem (Heskes, 2002), link a in Figure 1. Convergent double-loop algorithms like CCCP (Yuille, 2002) and faster variants thereof (Heskes et al., 2003) in fact solve such a minimax problem: the concave problem in the maximizing parameters (basically Lagrange multipliers) is solved by a message-passing algorithm very similar to standard loopy belief propagation in the inner loop, where the outer loop changes the minimizing parameters (a remaining set of pseudomarginals) in the proper downward direction. The transformation


from a nonconvex constrained minimization problem to an unconstrained nonconvex-concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1.

Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then, in sections 6 and 8, work toward conditions under which this minimax problem is convex-concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article. Link c follows from the observation in section 5.2 that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex-concave.

As indicated by link e, convex-concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum, link f.

The concluding statement, by link g in the lower-right corner, is to the best of our knowledge no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

$$
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \tag{3.1}
$$

Here $\Psi_\alpha$ is a potential, some function of the potential subset $X_\alpha$, and $Z$ is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain $X$. The convention that we adhere to in this article is that there are no potential subsets $X_\alpha$ and $X_{\alpha'}$ such that $X_{\alpha'}$ is fully subsumed by $X_\alpha$. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite.

The typical goal in Bayesian networks and Markov random fields is to compute the partition function $Z$ or marginals, for example,

$$
P_{\text{exact}}(X_\alpha) = \sum_{X \setminus X_\alpha} P_{\text{exact}}(X).
$$

One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be


roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

$$
F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_X P(X) \log P(X),
$$

with shorthand $\psi \equiv \log \Psi$. Minimizing this variational free energy over the set $\mathcal{P}$ of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum and minus the log of the partition function as the value at the minimum:

$$
P_{\text{exact}} = \mathop{\text{argmin}}_{P \in \mathcal{P}} F(P) \quad\text{and}\quad -\log Z = \min_{P \in \mathcal{P}} F(P).
$$

Since the Gibbs-Helmholtz free energy is convex in $P$, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself we have not gained anything: the entropy may still be intractable to compute.
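For a model small enough to enumerate, this variational characterization can be checked directly. Below is a toy sketch (our own example, not from the article) with a single potential over two binary nodes: the free energy at the exact distribution equals $-\log Z$, and any other normalized distribution scores higher.

```python
import math
import random
import itertools

# Toy model: two binary nodes, one potential Psi(x1, x2) with random
# positive entries (an assumption of this sketch, chosen for illustration).
random.seed(0)
states = list(itertools.product((0, 1), repeat=2))
psi = {s: random.uniform(0.5, 2.0) for s in states}
Z = sum(psi.values())
P_exact = {s: psi[s] / Z for s in states}

def gibbs_free_energy(P):
    # F(P) = -sum_X P(X) log Psi(X) + sum_X P(X) log P(X)
    return sum(P[s] * (math.log(P[s]) - math.log(psi[s])) for s in states)

# the minimum value is -log Z, attained at P_exact = Psi / Z
F_min = gibbs_free_energy(P_exact)
assert abs(F_min + math.log(Z)) < 1e-12

# any other properly normalized distribution has a higher free energy
for _ in range(100):
    w = [random.random() for _ in states]
    P = {s: wi / sum(w) for s, wi in zip(states, w)}
    assert gibbs_free_energy(P) >= F_min - 1e-12
```

The gap $F(P) - F(P_{\text{exact}})$ is exactly the Kullback-Leibler divergence between $P$ and $P_{\text{exact}}$, which is why the minimum is unique.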

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

$$
\sum_X P(X) \log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha) \log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta) \log P(x_\beta),
$$

with $x_\beta$ a (super)node and $n_\beta = \sum_{\alpha \supset \beta} 1$ the number of potentials that contains node $x_\beta$. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes $x_\beta$ are themselves subsets of the potential subsets, that is,

$$
x_\beta \cap X_\alpha = \emptyset \;\;\text{or}\;\; x_\beta \cap X_\alpha = x_\beta \quad \forall \alpha, \beta,
$$

and partition the domain $X$:

$$
x_\beta \cap x_{\beta'} = \emptyset \;\; \forall \beta, \beta' \quad\text{and}\quad \bigcup_\beta x_\beta = X.
$$

Typically, the $x_\beta$ are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by $\beta$ and $x_\beta$ in lowercase, to contrast them with the potentials $\alpha$ and potential subsets $X_\alpha$ in uppercase.


Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy,

$$
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta(x_\beta) \log Q_\beta(x_\beta), \tag{3.2}
$$

over sets of "pseudomarginals"1 or beliefs $\{Q_\alpha, Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,2

$$
\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad\text{and}\quad Q_\alpha(x_\beta) \equiv \sum_{X_\alpha \setminus x_\beta} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \tag{3.3}
$$

Let $\mathcal{Q}$ denote all sets of consistent and properly normalized pseudomarginals. Then our goal is to solve

$$
\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).
$$

The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals $P_{\text{exact}}(X_\alpha)$ and $P_{\text{exact}}(x_\beta)$.
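On a tree, the entropy approximation underlying equation 3.2 is exact, so the Bethe free energy evaluated at the exact marginals reproduces $-\log Z$ exactly. A toy check on a three-node chain $x_1 - x_2 - x_3$ (our own example; potential values are arbitrary assumptions):

```python
import math
import random
import itertools

# Pairwise potentials Psi_12 and Psi_23 with random positive entries.
random.seed(1)
psi12 = {(a, b): random.uniform(0.5, 2.0) for a in (0, 1) for b in (0, 1)}
psi23 = {(b, c): random.uniform(0.5, 2.0) for b in (0, 1) for c in (0, 1)}

states = list(itertools.product((0, 1), repeat=3))
joint = {s: psi12[s[0], s[1]] * psi23[s[1], s[2]] for s in states}
Z = sum(joint.values())
P = {s: joint[s] / Z for s in states}

# exact marginals on the potential subsets and on the shared node x2
Q12 = {(a, b): sum(P[a, b, c] for c in (0, 1)) for a in (0, 1) for b in (0, 1)}
Q23 = {(b, c): sum(P[a, b, c] for a in (0, 1)) for b in (0, 1) for c in (0, 1)}
Q2 = {b: sum(P[a, b, c] for a in (0, 1) for c in (0, 1)) for b in (0, 1)}

# Bethe free energy, equation 3.2: energy plus potential-subset entropies,
# minus (n_beta - 1) node entropies; here n_2 = 2 and n_1 = n_3 = 1.
F = sum(q * (math.log(q) - math.log(psi12[s])) for s, q in Q12.items())
F += sum(q * (math.log(q) - math.log(psi23[s])) for s, q in Q23.items())
F -= sum(q * math.log(q) for q in Q2.values())  # (n_2 - 1) = 1

assert abs(F + math.log(Z)) < 1e-10  # Bethe free energy is exact on a tree
```

The key step is that on a tree the joint factorizes as $P(x) = Q_{12}(x_1, x_2) Q_{23}(x_2, x_3) / Q_2(x_2)$, so the entropy decomposition holds with equality.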

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported on by Yedidia et al. (2001). It starts with the Lagrangian

$$
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = F(Q_\alpha, Q_\beta)
&+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ Q_\beta(x_\beta) - Q_\alpha(x_\beta) \right] \\
&+ \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right]
+ \sum_\beta \lambda_\beta \left[ 1 - \sum_{x_\beta} Q_\beta(x_\beta) \right].
\end{aligned} \tag{3.4}
$$

1 Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

2 Strictly speaking, we also have to take inequality constraints into account, namely those of the form $Q_\alpha(X_\alpha) \ge 0$. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active". For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of $L$ are zero: the ones with respect to the Lagrange multipliers $\lambda$ give back the constraints; the ones with respect to the pseudomarginals $Q$ give an extremum of the Bethe free energy. Setting the derivatives with respect to $Q_\alpha$ and $Q_\beta$ to zero, we can solve for $Q_\alpha$ and $Q_\beta$ in terms of the Lagrange multipliers:

$$
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \lambda_{\alpha\beta}(x_\beta) \right]
$$

$$
Q^*_\beta(x_\beta) = \exp\left[ \frac{1}{n_\beta - 1} \left( 1 - \lambda_\beta + \sum_{\alpha \supset \beta} \lambda_{\alpha\beta}(x_\beta) \right) \right].
$$

In terms of the "message" $\mu_{\beta \to \alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]$ from node $\beta$ to potential $\alpha$, the pseudomarginal $Q^*_\alpha(X_\alpha)$ reads

$$
Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta \subset \alpha} \mu_{\beta \to \alpha}(x_\beta), \tag{3.5}
$$

where proper normalization yields the Lagrange multiplier $\lambda_\alpha$. With the definition

$$
\mu_{\alpha \to \beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}, \tag{3.6}
$$

the fixed-point equation for $Q^*_\beta(x_\beta)$ can, after some manipulation, be written in the form

$$
Q^*_\beta(x_\beta) \propto \prod_{\alpha \supset \beta} \mu_{\alpha \to \beta}(x_\beta), \tag{3.7}
$$

where again the Lagrange multiplier $\lambda_\beta$ follows from normalization. Finally, the constraint $Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)$, in combination with equation 3.6, suggests the update

$$
\mu_{\alpha \to \beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}. \tag{3.8}
$$

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller


steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

$$
\log \mu^{\text{new}}_{\alpha \to \beta}(x_\beta) = \log \mu_{\alpha \to \beta}(x_\beta) + \epsilon \left[ \log Q^*_\alpha(x_\beta) - \log \mu_{\beta \to \alpha}(x_\beta) - \log \mu_{\alpha \to \beta}(x_\beta) \right]. \tag{3.9}
$$

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points correspond to zero derivatives of the Lagrangian.

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

$$
\min_{Q_\alpha, Q_\beta} \; \max_{\lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta} \; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta).
$$

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters $\{Q_\alpha, Q_\beta\}$. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in $\{Q_\alpha, Q_\beta\}$.

4.2 Conditions for Convexity. The problem is with the term

$$
S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta) \log Q_\beta(x_\beta),
$$

which is concave in $Q_\beta$. Using the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, we can turn it into a functional that is convex in $Q_\alpha$ and $Q_\beta$ separately, but not necessarily jointly. That is, with the substitution $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$ for any $\alpha \supset \beta$, the entropy and thus the Bethe free energy is convex in $Q_\alpha$ and in $Q_\beta$, but not necessarily in $\{Q_\alpha, Q_\beta\}$. However, if we add to $S_\beta(Q_\beta)$ a convex entropy contribution,

$$
-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha),
$$


the combination of $-S_\alpha$ and $S_\beta$ is convex in $\{Q_\alpha, Q_\beta\}$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1.

$$
F_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta)
$$

is convex in $\{Q_\alpha, Q_\beta\}$.

Proof. The matrix with second derivatives of $F_{\alpha\beta}$ has the components

$$
H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\, \partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\, \delta_{X_\alpha, X'_\alpha}
$$

$$
H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\, \partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta}
$$

$$
H(x_\beta, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\beta(x_\beta)\, \partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta},
$$

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,

$$
\begin{aligned}
0 &\le (R_\alpha(X_\alpha), R_\beta(x_\beta)) \begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix} \begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix} \\
&= \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2 \sum_{X_\alpha} \frac{R_\alpha(X_\alpha)\, R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta)\, R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)} \\
&= \sum_{X_\alpha} Q_\alpha(X_\alpha) \left[ \frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)} \right]^2.
\end{aligned}
$$
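The sum-of-squares identity at the end of the proof can be verified numerically. The sketch below (our own toy setup, not from the article) takes $X_\alpha$ to range over pairs of binary variables with $x_\beta$ the first, shared variable, draws random pseudomarginals and "vectors", and compares the quadratic form with the sum of squares:

```python
import random

# X_a: pairs (x_b, x_c) of binary variables; x_b is the shared (super)node.
random.seed(2)
Xa = [(b, c) for b in (0, 1) for c in (0, 1)]

for _ in range(200):
    raw = [random.uniform(0.1, 1.0) for _ in Xa]
    qa = {s: v / sum(raw) for s, v in zip(Xa, raw)}          # Q_alpha
    raw_b = [random.uniform(0.1, 1.0) for _ in (0, 1)]
    qb = {b: v / sum(raw_b) for b, v in zip((0, 1), raw_b)}  # Q_beta
    Ra = {s: random.uniform(-1, 1) for s in Xa}              # test vector
    Rb = {b: random.uniform(-1, 1) for b in (0, 1)}

    # quadratic form of the Hessian, term by term as in the proof
    quad = sum(Ra[s] ** 2 / qa[s] for s in Xa)
    quad -= 2 * sum(Ra[s] * Rb[s[0]] / qb[s[0]] for s in Xa)
    quad += sum(qa[s] * Rb[s[0]] ** 2 / qb[s[0]] ** 2 for s in Xa)

    # sum-of-squares form; note sum_{x_b} Q_a(x_b)(...) = sum_{X_a} Q_a(X_a)(...)
    sos = sum(qa[s] * (Ra[s] / qa[s] - Rb[s[0]] / qb[s[0]]) ** 2 for s in Xa)
    assert abs(quad - sos) < 1e-9 and quad >= -1e-12
```

Note that the identity holds for arbitrary positive $Q_\alpha$ and $Q_\beta$; consistency between them is not needed for the convexity of $F_{\alpha\beta}$.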

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall \alpha,\; \beta \subset \alpha$ (positivity),
2. $\sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall \alpha$ (sufficient amount of resources),
3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall \beta$ (sufficient compensation). $\qquad (4.1)$


Proof. First we note that we do not have to worry about the energy terms, which are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$
-S(Q) = -\left[ \sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1)\, S_\beta(Q_\beta) \right],
$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta - 1) S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset $\alpha$ to compensate for node $\beta$. Now, in shorthand notation and with a little bit of rewriting,

$$
\begin{aligned}
-S(Q) &= -\left[ \sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1)\, S_\beta \right] \\
&= -\sum_\alpha \left( 1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} + \sum_{\beta \subset \alpha} A_{\alpha\beta} \right) S_\alpha
- \sum_\beta \left( -\sum_{\alpha \supset \beta} A_{\alpha\beta} + \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \right) S_\beta \\
&= -\sum_\alpha \left( 1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} \right) S_\alpha
- \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \left[ S_\alpha - S_\beta \right]
- \sum_\beta \left[ \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \right] S_\beta.
\end{aligned}
$$

Convexity of the first term is guaranteed if $1 - \sum_\beta A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_\alpha A_{\alpha\beta} - (n_\beta - 1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.
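The existence of an allocation matrix satisfying equation 4.1 is a transportation-type feasibility question, and can be checked mechanically as a maximum-flow problem: unit supply per potential $\alpha$, demand $n_\beta - 1$ per node $\beta$, and arcs $\alpha \to \beta$ for $\beta \subset \alpha$. This reformulation and the code below are ours, not from the article; the function names are assumptions of the sketch.

```python
from collections import deque

def allocation_exists(potentials):
    """Check whether an allocation matrix satisfying equation 4.1 exists.

    `potentials` is a list of sets of node labels (the potential subsets)."""
    nodes = sorted({b for a in potentials for b in a})
    n = {b: sum(1 for a in potentials if b in a) for b in nodes}
    # flow network: 0 = source, 1 = sink, then potentials, then nodes
    S, T = 0, 1
    pid = {i: 2 + i for i in range(len(potentials))}
    nid = {b: 2 + len(potentials) + k for k, b in enumerate(nodes)}
    N = 2 + len(potentials) + len(nodes)
    cap = [[0.0] * N for _ in range(N)]
    for i, a in enumerate(potentials):
        cap[S][pid[i]] = 1.0                 # condition 2: at most 1 per alpha
        for b in a:
            cap[pid[i]][nid[b]] = float('inf')
    demand = 0.0
    for b in nodes:
        cap[nid[b]][T] = n[b] - 1            # condition 3: n_beta - 1 per beta
        demand += n[b] - 1
    # Edmonds-Karp maximum flow (BFS augmenting paths)
    flow = 0.0
    while True:
        prev = [-1] * N
        prev[S] = S
        q = deque([S])
        while q:
            u = q.popleft()
            for v in range(N):
                if prev[v] < 0 and cap[u][v] > 1e-12:
                    prev[v] = u
                    q.append(v)
        if prev[T] < 0:
            break
        path, v = [], T
        while v != S:
            path.append((prev[v], v))
            v = prev[v]
        aug = min(cap[u][v] for (u, v) in path)
        for (u, v) in path:
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug
    # feasible iff all demands can be met
    return abs(flow - demand) < 1e-9

print(allocation_exists([{0, 1}, {1, 2}]))                          # True: tree
print(allocation_exists([{0, 1}, {1, 2}, {2, 0}]))                  # True: single loop
print(allocation_exists([{0, 1}, {1, 2}, {2, 0}, {1, 3}, {3, 2}]))  # False: two connected loops
```

The three test cases mirror the corollaries that follow: trees and single loops pass, while two connected loops (here, two triangles sharing an edge) have more demand than supply.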

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root $\beta^*$ and define

$$
A_{\alpha\beta} = 1 \;\text{iff}\; \beta \subset \alpha \;\text{and}\; \beta \;\text{closer to the root}\; \beta^* \;\text{than any other}\; \beta' \subset \alpha; \quad A_{\alpha\beta'} = 0 \;\text{for all other}\; \beta'.
$$

Obviously, this choice of $A$ satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each $\beta \neq \beta^*$ there is just a single potential $\alpha \supset \beta$ that is closer to the root $\beta^*$ than $\beta$ itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence, condition 3 is also satisfied:

$$
\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \;\; \forall \beta \neq \beta^* \quad\text{and}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.
$$

With the above construction of $A$, we are in a sense "eating up resources toward the root". At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node $\beta^*$ from a potential $\alpha^*$, such that a singly connected structure is left. Construct a matrix $A$ as in the proof of corollary 1, taking the node $\beta^*$ as the root. The matrix $A$ constructed in this way also just works for the graph with the closed loop, since still

$$
\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \;\; \forall \beta \neq \beta^* \quad\text{and now}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.
$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints, for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, $1 - n_\beta$, with $n_\beta$ the number of neighboring potentials. The arrows pointing from potentials $\alpha$ to nodes $\beta$ visualize the allocation matrix $A$, with $A_{\alpha\beta} = 1$ if there is an arrow and $A_{\alpha\beta} = 0$ otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely $n_\beta - 1$ incoming arrows, just sufficient to compensate the overcounting number $1 - n_\beta$. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes $B$ we have

$$
\sum_{\beta \in B} (1 - n_\beta) + \sum_{\alpha \in \pi(B)} 1 \ge 0, \tag{4.2}
$$

where $\pi(B) \equiv \{\alpha : \exists \beta \in B,\; \beta \subset \alpha\}$ denotes the "parent" set of $B$, that is, those potential subsets that include at least one node in $B$.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set $B$,

$$
\sum_{\beta \in B} (n_\beta - 1) \le \sum_{\beta \in B} \sum_{\alpha \supset \beta} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} \sum_{\beta \subset \alpha} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} 1,
$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta), \tag{5.1}
$$

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

$$
\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}
$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

$$
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = &-\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) \\
&- \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) \\
&+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] \\
&+ \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right].
\end{aligned} \tag{5.3}
$$

3 We would like to conjecture that this is not possible: that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:4

$$
\min_{Q_\alpha, Q_\beta} \; \max_{\lambda_{\alpha\beta}, \lambda_\alpha} \; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta} \; \max_{\lambda_{\alpha\beta}, \lambda_\alpha} \; \min_{Q_\alpha} \; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).
$$

Rewriting

$$
\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \tilde{\lambda}_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta),
$$

with

$$
\tilde{\lambda}_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1} \sum_{\alpha' \supset \beta} \lambda_{\alpha'\beta}(x_\beta),
$$

we can easily solve for the minimum with respect to $Q_\alpha$:

$$
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta) \right\} \right]. \tag{5.4}
$$

4 In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$
\begin{aligned}
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) &\equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
&= -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta) \right\} \right] \\
&\quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right].
\end{aligned} \tag{5.5}
$$

Next, we find for the maximum with respect to $\lambda_\alpha$,

$$
\exp\left[ 1 - \lambda^*_\alpha \right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta) \right\} \right] \equiv Z^*_\alpha, \tag{5.6}
$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

$$
G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.7}
$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Q_α, has a unique minimum in Q_α (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*_α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ_αβ and then the minimum over the remaining pseudomarginals Q_β. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in


Q_β. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Q_α, Q_β} implies convexity of the dual, equation 5.7, in Q_β.

Proof. First we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

  f(x+δ, y*(x+δ)) + f(x−δ, y*(x−δ)) ≥ 2 f(x, [y*(x+δ) + y*(x−δ)]/2) ≥ 2 f(x, y*(x)),

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore the dual, equation 5.5, is convex in Q_β when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Q_α, Q_β}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ_αβ, λ_α}. Next we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

  f(x+δ, y*(x+δ)) + f(x−δ, y*(x−δ)) ≥ f(x+δ, y*(x)) + f(x−δ, y*(x)) ≥ 2 f(x, y*(x)),

where the first inequality follows from y*(x±δ) being the unique maximum of f(x±δ, y) and the second inequality from the convexity of f(x, y) in x. Hence the dual, equation 5.7, must still be convex in Q_β.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψ_α(X_α) ≤ Ψ_max for all α and X_α, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function G(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) is bounded from below for a particular choice of A_αβ satisfying equation 5.2. Considering A_αβ = (n_β − 1)/n_β, we then have

  G(Q_β) ≥ −Σ_α log Σ_{X_α} Ψ_α(X_α) exp[ Σ_{β⊂α} ((n_β − 1)/n_β) log Q_β(x_β) ]
      + Σ_β (n_β − 1) [ Σ_{x_β} Q_β(x_β) − 1 ]

    ≥ −Σ_α Σ_{β⊂α} ((n_β − 1)/n_β) log Σ_{X_α} Ψ_α(X_α) Q_β(x_β)
      + Σ_β (n_β − 1) [ Σ_{x_β} Q_β(x_β) − 1 ]

    ≥ −Σ_α Σ_{β⊂α} ((n_β − 1)/n_β) log Σ_{X_{α∖β}} Ψ_max
      + Σ_β (n_β − 1) [ −log Σ_{x_β} Q_β(x_β) + Σ_{x_β} Q_β(x_β) − 1 ]

    ≥ −Σ_α Σ_{β⊂α} ((n_β − 1)/n_β) log Σ_{X_{α∖β}} Ψ_max,

where the first inequality follows by substituting the choice λ_αβ(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_αβ); the second from the concavity of the function y^{(n_β−1)/n_β}; and the third from the upper bound on the potentials. The final step drops the nonnegative term −log z + z − 1 ≥ 0, with z = Σ_{x_β} Q_β(x_β).

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

  ∂G/∂Q_β(x_β) = −Σ_{α⊃β} A_αβ Q*_α(x_β)/Q_β(x_β) + (n_β − 1),

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

  H_ββ′(x_β, x′_β′) ≡ ∂²G / ∂Q_β(x_β) ∂Q_β′(x′_β′),


we make use of

  ∂Q*_α(x_β)/∂Q_β′(x′_β′) = A_αβ′ [ Q*_α(x_β, x′_β′) − Q*_α(x_β) Q*_α(x′_β′) ] / Q_β′(x′_β′),

where both β and β′ should be a subset of α, and with the convention Q*_α(x_β, x_β) = Q*_α(x_β) and Q*_α(x_β, x′_β) = 0 if x_β ≠ x′_β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

  H_ββ(x_β, x′_β) = Σ_{α⊃β} A_αβ (1 − A_αβ) [ Q*_α(x_β) / Q²_β(x_β) ] δ_{x_β,x′_β}
      + Σ_{α⊃β} A²_αβ [ Q*_α(x_β) Q*_α(x′_β) ] / [ Q_β(x_β) Q_β(x′_β) ],

  H_ββ′(x_β, x′_β′) = −Σ_{α⊃β,β′} A_αβ A_αβ′ [ Q*_α(x_β, x′_β′) − Q*_α(x_β) Q*_α(x′_β′) ] / [ Q_β(x_β) Q_β′(x′_β′) ]   for β′ ≠ β,

where δ_{x_β,x′_β} = 1 if and only if x_β = x′_β. Here it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_αβ and pseudomarginals Q_β.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Q_β, λ_αβ}, that is, conditions that guarantee

  K ≡ Σ_{β,β′} Σ_{x_β,x′_β′} S_β(x_β) H_ββ′(x_β, x′_β′) S_β′(x′_β′) ≥ 0

for any choice of the “vector” S with elements S_β(x_β). Straightforward manipulations yield

  K = Σ_α Σ_{β⊂α} Σ_{x_β} A_αβ (1 − A_αβ) Q*_α(x_β) R²_β(x_β)    (K1)
    + Σ_α Σ_{β,β′⊂α} Σ_{x_β,x′_β′} A_αβ A_αβ′ Q*_α(x_β) Q*_α(x′_β′) R_β(x_β) R_β′(x′_β′)    (K2)
    − Σ_α Σ_{β,β′⊂α; β′≠β} Σ_{x_β,x′_β′} A_αβ A_αβ′ Q*_α(x_β, x′_β′) R_β(x_β) R_β′(x′_β′),    (K3)

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

  K2 = Σ_α [ Σ_{β⊂α} Σ_{x_β} A_αβ Q*_α(x_β) R_β(x_β) ]² ≥ 0

and^5

  K3 = Σ_α Σ_{β,β′⊂α; β′≠β} Σ_{x_β,x′_β′} A_αβ A_αβ′ Q*_α(x_β, x′_β′)
        × { ½ [R_β(x_β) − R_β′(x′_β′)]² − ½ R²_β(x_β) − ½ R²_β′(x′_β′) }
     ≥ −Σ_α Σ_{β⊂α} Σ_{x_β} A_αβ ( Σ_{β′⊂α} A_αβ′ − A_αβ ) Q*_α(x_β) R²_β(x_β),    (6.1)

we have

  K = K1 + K2 + K3 ≥ Σ_α Σ_{β⊂α} Σ_{x_β} A_αβ ( 1 − Σ_{β′⊂α} A_αβ′ ) Q*_α(x_β) R²_β(x_β).

That is, sufficient conditions for K to be nonnegative are

  A_αβ ≥ 0 ∀α, β⊂α and Σ_{β⊂α} A_αβ ≤ 1 ∀α,

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a “fake interaction,” such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψ_α(X_α) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q*_α(X_α) factorizes:^6

  Q*_α(x_β, x′_β′) = Q*_α(x_β) Q*_α(x′_β′) ∀β, β′⊂α.

^5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

^6 The exact marginal P_exact(X_α) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

  K2 + K3 = Σ_α Σ_{β⊂α} Σ_{x_β,x′_β} A²_αβ Q*_α(x_β) Q*_α(x′_β) R_β(x_β) R_β(x′_β)    (K2)
    − Σ_α Σ_{β,β′⊂α; β′≠β} Σ_{x_β,x′_β′} A_αβ A_αβ′
        × [ Q*_α(x_β, x′_β′) − Q*_α(x_β) Q*_α(x′_β′) ] R_β(x_β) R_β′(x′_β′).    (K3)

This leaves us with the weaker requirement (from K1) A_αβ(1 − A_αβ) ≥ 0 for all β ⊂ α. The best choice is then to take A_αβ = 1, which turns condition 3 of equation 4.1 into

  Σ_{α′⊃β; α′≠α} A_α′β + 1 ≥ n_β − 1.

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials n_β by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this “success,” we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since K2 ≥ 0, we will concentrate on K3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the “strength” of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

  Q*_α(x_β, x′_β′) − Q*_α(x_β) Q*_α(x′_β′) ≤ σ_α Q*_α(x_β, x′_β′) ∀β, β′⊂α with β′ ≠ β, ∀x_β, x′_β′,    (7.1)

with the “strength” σ_α a function of the potential ψ_α(X_α) ≡ log Ψ_α(X_α) only:

  σ_α = 1 − exp(−ω_α) with ω_α ≡ max_{X_α,X̄_α} [ ψ_α(X_α) + (n_α − 1) ψ_α(X̄_α) − Σ_{β⊂α} ψ_α(X̄_{α∖β}, x_β) ],    (7.2)

where n_α ≡ Σ_{β⊂α} 1.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

  Q*(X) / Π_{β=1..n} Q*(x_β)
    = Ψ(X) Π_β μ_β(x_β) [ Σ_{X′} Ψ(X′) Π_β μ_β(x′_β) ]^{n−1}
      / Π_β { Σ_{X′_∖β} Ψ(X′_∖β, x_β) Π_{β′≠β} μ_β′(x′_β′) μ_β(x_β) }
    = Ψ(X) [ Σ_{X′} Ψ(X′) Π_β μ_β(x′_β) ]^{n−1}
      / Π_β { Σ_{X′_∖β} Ψ(X′_∖β, x_β) Π_{β′≠β} μ_β′(x′_β′) },    (7.3)

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

  V(μ) = (n − 1) log [ Σ_X Ψ(X) Π_{β=1..n} μ_β(x_β) ] − Σ_{β=1..n} log [ Σ_{X_∖β} Ψ(X_∖β, x*_β) Π_{β′≠β} μ_β′(x_β′) ],

with respect to the messages μ, under constraints Σ_{x_β} μ_β(x_β) = 1 for all β and μ_β(x_β) ≥ 0 for all β and x_β, occurs at an extreme point, μ_β(x_β) = δ_{x_β,x̄_β} for some x̄_β to be found.

Proof. Let us consider optimizing the message μ_1(x_1) with fixed messages μ_β(x_β) for β > 1. The first and second derivatives are easily found to obey, up to positive prefactors,

  ∂V/∂μ_1(x_1) = (n − 1) Q(x_1) − Σ_{β≠1} Q(x_1|x*_β),

  ∂²V/∂μ_1(x_1)∂μ_1(x′_1) = −(n − 1) Q(x_1) Q(x′_1) + Σ_{β≠1} Q(x_1|x*_β) Q(x′_1|x*_β),


where

  Q(X) ≡ Ψ(X) Π_β μ_β(x_β) / Σ_{X′} Ψ(X′) Π_β μ_β(x′_β).

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ_1(x_1) > 0 for two or more values of x_1. At such an extremum, the first derivative should obey

  (n − 1) Q(x_1) − Σ_{β≠1} Q(x_1|x*_β) = λ,

with λ a Lagrange multiplier implementing the constraint Σ_{x_1} μ_1(x_1) = 1. Summing over x_1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix with second derivatives at such an extremum, we then have

  ∂²V/∂μ_1(x_1)∂μ_1(x′_1) = (1/(2(n − 1))) Σ_{β≠1} Σ_{β′≠1; β′≠β} [ Q(x_1|x*_β) − Q(x_1|x*_β′) ] [ Q(x′_1|x*_β) − Q(x′_1|x*_β′) ],

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μ_β(x_β), β > 1, it follows by induction that the maximum with respect to all μ_β(x_β) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by maximization over values X̄:

  max_μ Q*(X) / Π_β Q*(x_β) = max_{X̄} Ψ(X) [Ψ(X̄)]^{n−1} / Π_β Ψ(X̄_∖β, x_β).

Next we take the maximum over X as well and define the “strength” σ, to be used in equation 7.1, through

  1/(1 − σ) ≡ max_{X,μ} Q*(X) / Π_β Q*(x_β) = max_{X,X̄} Ψ(X) [Ψ(X̄)]^{n−1} / Π_β Ψ(X̄_∖β, x_β).    (7.4)


The inequality 7.1 then follows by summing out X_∖{β,β′} in

  Q*(X) − Π_β Q*(x_β) ≤ σ Q*(X).

The form of equation 7.2 then follows by rewriting equation 7.4 as

  ω ≡ −log(1 − σ) = max_{X,X̄} W(X, X̄) with W(X, X̄) = ψ(X) + (n − 1) ψ(X̄) − Σ_β ψ(X̄_∖β, x_β),

where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

  if Ψ̃(X) = Ψ(X) Π_β μ_β(x_β), then ω(Ψ̃) = ω(Ψ) for any choice of μ.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̄ that differ in fewer than two nodes. To see this, consider two configurations that differ only in the first two nodes:

  W(x_1, x_2, x_∖12; x̄_1, x̄_2, x_∖12) = ψ(x_1, x_2, x_∖12) + ψ(x̄_1, x̄_2, x_∖12) − ψ(x_1, x̄_2, x_∖12) − ψ(x̄_1, x_2, x_∖12)
    = −W(x̄_1, x_2, x_∖12; x_1, x̄_2, x_∖12).

If now also x̄_2 = x_2, we get W(x_1, x_2, x_∖12; x̄_1, x_2, x_∖12) = −W(x̄_1, x_2, x_∖12; x_1, x_2, x_∖12) = 0. Furthermore, if W(x_1, x_2, x_∖12; x̄_1, x̄_2, x_∖12) ≤ 0, then it must be that W(x̄_1, x_2, x_∖12; x_1, x̄_2, x_∖12) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.

Uniqueness of Loopy Belief Propagation Fixed Points 2403

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x_1||x_2|(|x_1| − 1)(|x_2| − 1)/4 combinations. And indeed, for binary nodes x_{1,2} ∈ {0, 1}, we immediately obtain

  ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.    (7.5)

Any pairwise binary potential can be written as a Boltzmann factor,

  Ψ(x_1, x_2) ∝ exp[ w x_1 x_2 + θ_1 x_1 + θ_2 x_2 ].

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the “weight.” It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a “temperature” T; that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
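For small potential tables, the definition of the strength in equation 7.2 can be evaluated by brute-force enumeration. The sketch below (Python; the function name `strength` and the table encoding are our own, and NumPy is assumed available) computes ω from a table of log-potentials, which makes it easy to check the properties above, for example, ω = |w| for a binary Boltzmann factor, independent of the thresholds:

```python
import itertools
import numpy as np

def strength(psi):
    """Brute-force evaluation of equation 7.2: psi is an n-dimensional
    array of log-potentials psi[x_1, ..., x_n]; returns omega >= 0."""
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    omega = 0.0  # omega is nonnegative (see the symmetry argument above)
    for X in itertools.product(*states):          # maximize over X ...
        for Xbar in itertools.product(*states):   # ... and over X-bar
            w = psi[X] + (n - 1) * psi[Xbar]
            for beta in range(n):
                Xmix = list(Xbar)
                Xmix[beta] = X[beta]              # X-bar with node beta set to x_beta
                w -= psi[tuple(Xmix)]
            omega = max(omega, w)
    return omega

# Binary Boltzmann factor psi = w*x1*x2 + th1*x1 + th2*x2 with w = 0.7:
w, th1, th2 = 0.7, 0.3, -1.2
x = np.arange(2)
psi = w * np.outer(x, x) + th1 * x[:, None] + th2 * x[None, :]
omega = strength(psi)           # equals |w|, independent of the thresholds
sigma = 1.0 - np.exp(-omega)    # the corresponding strength sigma
```

The thresholds θ_1 and θ_2 drop out, in line with the invariance property: they amount to multiplying the potential by a term that factorizes over the nodes.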

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix A_αβ between potentials α and nodes β with properties

1. A_αβ ≥ 0 ∀α, β⊂α (positivity),
2. (1 − σ_α) max_{β⊂α} A_αβ + σ_α Σ_{β⊂α} A_αβ ≤ 1 ∀α (sufficient amount of resources),
3. Σ_{α⊃β} A_αβ ≥ n_β − 1 ∀β (sufficient compensation),    (8.1)

with the strength σ_α a function of the potential Ψ_α(X_α), as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K2 + K3 ≥ 0 for any choice of R_β(x_β).

Substituting the bound, equation 7.1, into the term K3, we obtain

  K3 ≥ −Σ_α Σ_{β,β′⊂α; β′≠β} Σ_{x_β,x′_β′} A_αβ A_αβ′ σ_α Q*_α(x_β, x′_β′) R_β(x_β) R_β′(x′_β′)
     ≥ −Σ_α σ_α Σ_{β⊂α} Σ_{x_β} A_αβ [ Σ_{β′⊂α; β′≠β} A_αβ′ ] Q*_α(x_β) R²_β(x_β),

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, combining K1 and (the above lower bound on) K3, we get

  K = K1 + K2 + K3 ≥ Σ_α Σ_{β⊂α} Σ_{x_β} A_αβ [ 1 − A_αβ − σ_α Σ_{β′≠β} A_αβ′ ] Q*_α(x_β) R²_β(x_β).

This implies

  (1 − σ_α) A_αβ + σ_α Σ_{β′⊂α} A_αβ′ ≤ 1 ∀α, β⊂α,

which in combination with A_αβ ≥ 0 and σ_α ≤ 1 yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σ_α = 1 for all potentials α. Furthermore, “fake interactions” play no role: with σ_α = 0, condition 2 becomes max_{β⊂α} A_αβ ≤ 1, suggesting the choice A_αβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials n_β in condition 3.
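After linearizing condition 2 as in the proof (one constraint (1 − σ_α)A_αβ + σ_α Σ_{β′⊂α} A_αβ′ ≤ 1 per pair β ⊂ α), deciding whether an allocation matrix satisfying theorem 4 exists becomes a linear feasibility problem. A minimal sketch, assuming SciPy is available; the function name `allocation_exists` and the list-of-node-tuples graph encoding are our own conventions:

```python
import numpy as np
from scipy.optimize import linprog

def allocation_exists(potentials, sigma, n_nodes):
    """potentials: list of tuples of node indices (one tuple per potential alpha);
    sigma: strength sigma_alpha per potential. Returns True if an allocation
    matrix A satisfying conditions 1-3 of equation 8.1 exists."""
    idx = {}
    for a, nodes in enumerate(potentials):
        for b in nodes:
            idx[(a, b)] = len(idx)  # one LP variable per A[alpha, beta]
    nvar = len(idx)
    A_ub, b_ub = [], []
    # condition 2, linearized: (1-sigma_a)*A[a,b] + sigma_a*sum_b' A[a,b'] <= 1
    for a, nodes in enumerate(potentials):
        for b in nodes:
            row = np.zeros(nvar)
            for b2 in nodes:
                row[idx[(a, b2)]] += sigma[a]
            row[idx[(a, b)]] += 1.0 - sigma[a]
            A_ub.append(row)
            b_ub.append(1.0)
    # condition 3: sum over alpha containing beta of A[a,b] >= n_b - 1
    for b in range(n_nodes):
        row = np.zeros(nvar)
        n_b = 0
        for a, nodes in enumerate(potentials):
            if b in nodes:
                row[idx[(a, b)]] = -1.0
                n_b += 1
        A_ub.append(row)
        b_ub.append(-(n_b - 1))
    # condition 1 is the bound A >= 0; feasibility LP with zero objective
    res = linprog(np.zeros(nvar), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * nvar, method="highs")
    return res.status == 0  # 0 = optimum found, i.e., feasible
```

On the 3 × 3 toroidal grid of section 8.3 this reproduces the hand-derived threshold (feasible for σ ≤ 1/3, infeasible above it), and for a single loop of pairwise potentials it is feasible for any σ < 1.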

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

  P_exact(X) = (1/Z) Π_α Ψ_α(X_α) Π_β Ψ_β(x_β),

to be compared with our equation 3.1, where there are no self-potentials Ψ_β(x_β). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

  Σ_{α⊃β} [ max_{X_α} ψ_α(X_α) − min_{X_α} ψ_α(X_α) ] < 2 ∀β.    (8.2)

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

  Σ_{α⊃β} ω_α < 4 ∀β,    (8.3)

with ω_α defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψ_β(x_β). In fact, it is valid for any choice

  ψ̃_α(X_α) = ψ_α(X_α) + Σ_{β⊂α} φ_αβ(x_β),

where ψ_α(X_α) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

  min_{φ_1,φ_2} [ max_{x_1,x_2} ψ̃(x_1, x_2) − min_{x_1,x_2} ψ̃(x_1, x_2) ]
    = min_{φ_1,φ_2} [ max_{x_1,x_2} { ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2) }
      − min_{x_1,x_2} { ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2) } ].

In the case of binary nodes (two-by-two matrices ψ(x_1, x_2)), it is easy to check that the optimal φ_1 and φ_2 that yield the smallest gap are such that

  ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2) = ψ(x̄_1, x̄_2) + φ_1(x̄_1) + φ_2(x̄_2)
    ≥ ψ(x_1, x̄_2) + φ_1(x_1) + φ_2(x̄_2) = ψ(x̄_1, x_2) + φ_1(x̄_1) + φ_2(x_2),    (8.4)

for some x_1, x_2, x̄_1, and x̄_2 with x̄_1 ≠ x_1 and x̄_2 ≠ x_2. Solving for φ_1 and φ_2, we find

  φ_1(x_1) − φ_1(x̄_1) = ½ [ ψ(x̄_1, x_2) − ψ(x_1, x_2) + ψ(x̄_1, x̄_2) − ψ(x_1, x̄_2) ],
  φ_2(x_2) − φ_2(x̄_2) = ½ [ ψ(x_1, x̄_2) − ψ(x_1, x_2) + ψ(x̄_1, x̄_2) − ψ(x̄_1, x_2) ].

Substitution back into equation 8.4 yields

  ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2) − ψ(x_1, x̄_2) − φ_1(x_1) − φ_2(x̄_2)
    = ½ [ ψ(x_1, x_2) + ψ(x̄_1, x̄_2) − ψ(x_1, x̄_2) − ψ(x̄_1, x_2) ],

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find

  min_{φ_1,φ_2} [ max_{x_1,x_2} ψ̃(x_1, x_2) − min_{x_1,x_2} ψ̃(x_1, x_2) ]
    = ½ |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)| = ω/2,

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next we derive the following weaker corollary of theorem 4.

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

  Σ_{α⊃β} ω_α ≤ 1 ∀β,    (8.5)

with ω_α defined in equation 7.2.

Proof. Consider the allocation matrix with components A_αβ = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

  (1 − σ_α)(1 − σ_α) + 2σ_α(1 − σ_α) = 1 − σ²_α ≤ 1.

Substitution into condition 3 yields

  Σ_{α⊃β} (1 − σ_α) ≥ Σ_{α⊃β} 1 − 1 and thus Σ_{α⊃β} σ_α ≤ 1.    (8.6)

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 implies condition 8.6.

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

  (   α     1 − α )
  ( 1 − α     α   ).

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with “spins up” and the other one with “spins down.”

In this symmetric problem, the strength of each potential is given by

  ω = 2 log[ α/(1 − α) ] and thus σ = 1 − ( (1 − α)/α )².


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions: all elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

  σ ≤ 1/3 and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

  ω < 1 and thus α ≤ 1/(1 + e^{−1/2}) ≈ 0.62.
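The thresholds on α follow by inverting ω(α) = 2 log[α/(1 − α)] and σ(α) = 1 − ((1 − α)/α)² given above. As a quick arithmetic check (the function names are ours):

```python
import math

def alpha_from_sigma(s):
    """Largest alpha with sigma(alpha) <= s, from sigma = 1 - ((1-alpha)/alpha)**2."""
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

def alpha_from_omega(w):
    """Largest alpha with omega(alpha) <= w, from omega = 2*log(alpha/(1-alpha))."""
    return 1.0 / (1.0 + math.exp(-w / 2.0))

a_theorem4 = alpha_from_sigma(1.0 / 3.0)   # toroidal grid, theorem 4: ~0.55
a_corollary3 = alpha_from_omega(1.0)       # toroidal grid, corollary 3: ~0.62
```

The same two functions reproduce the other thresholds of this section, for example σ ≤ 1/2 giving α ≈ 0.58 and σ ≤ √(1/2) giving α ≈ 0.65.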

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

  (2 − 2A) σ + 3/4 ≤ 1 and σ/2 + A ≤ 1.

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

  σ ≤ 1/2 and thus α ≤ 1/(1 + √(1/2)) ≈ 0.58,

still slightly worse than the condition from corollary 3. An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

  σ ≤ √(1/2) and thus α ≤ 1/(1 + √(1 − √(1/2))) ≈ 0.65,

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the “temperature.”

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered “bound optimization algorithms,” similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_αβ) with respect to Q_β. But in fact we need only G(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_αβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

  w = ω (  0   1  −1  −1
           1   0   1  −1
          −1   1   0  −1
          −1  −1  −1   0 ),

zero thresholds, and potentials

  Ψ_ij(x_i, x_j) = exp[w_ij/4] if x_i = x_j and Ψ_ij(x_i, x_j) = exp[−w_ij/4] if x_i ≠ x_j.

Running loopy belief propagation, possibly damped as in equation 3.9, we observe “convergent” and “nonconvergent” behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i ∈ {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this “convergent” and “nonconvergent” behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
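Damped parallel sum-product updates of the kind used in these simulations can be sketched as follows. The exact damping scheme of equation 3.9 is defined earlier in the paper; here we assume a simple linear interpolation with step size eps, and pick a weight strength well inside the “convergent” regime:

```python
import numpy as np

rng = np.random.default_rng(0)
n, scale, eps = 4, 1.2, 0.5     # small weights; step size 0.5 (assumed damping)
W = scale * np.array([[ 0.,  1., -1., -1.],
                      [ 1.,  0.,  1., -1.],
                      [-1.,  1.,  0., -1.],
                      [-1., -1., -1.,  0.]])

def psi(i, j):
    # Psi_ij(xi, xj) = exp[w_ij/4] if xi == xj else exp[-w_ij/4]
    p = np.exp(W[i, j] / 4.0)
    return np.array([[p, 1.0 / p], [1.0 / p, p]])

# random initial messages m[(i, j)](x_j), normalized
m = {(i, j): rng.random(2) + 0.5 for i in range(n) for j in range(n) if i != j}
m = {k: v / v.sum() for k, v in m.items()}

for _ in range(500):
    new = {}
    for (i, j), old in m.items():
        prod = np.ones(2)            # messages into i, excluding the one from j
        for k in range(n):
            if k != i and k != j:
                prod *= m[(k, i)]
        msg = psi(i, j).T @ prod     # sum over x_i
        msg /= msg.sum()
        new[(i, j)] = (1.0 - eps) * old + eps * msg   # damped update
    m = new

marginals = np.ones((n, 2))
for (i, j), msg in m.items():
    marginals[j] *= msg
marginals /= marginals.sum(axis=1, keepdims=True)
# for small weights the marginals approach the trivial fixed point (0.5, 0.5)
```

Raising `scale` toward the transition region reproduces the limit-cycle behavior described in the text; the damped update remains a convex combination of normalized messages, so the messages stay properly normalized throughout.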

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
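This experiment is straightforward to reproduce. The sketch below is our own minimal reimplementation, not the author's code: damped loopy belief propagation with parallel updates on the four-node Boltzmann machine above. The function name, random initialization, and update schedule are our choices.

```python
import numpy as np

def damped_bp(omega, eps, n_iter=3000, tol=1e-10, seed=0):
    """Damped loopy BP (log-domain damping, equation 3.9) on the four-node
    Boltzmann machine with the weight matrix from the text."""
    W = omega * np.array([[0., 1., -1., -1.],
                          [1., 0., 1., -1.],
                          [-1., 1., 0., -1.],
                          [-1., -1., -1., 0.]])
    n = W.shape[0]
    rng = np.random.default_rng(seed)
    log_m = rng.normal(scale=0.1, size=(n, n, 2))  # log mu_{i->j}(x_j)
    converged = False
    for _ in range(n_iter):
        new = log_m.copy()
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # psi_ij(x_i, x_j) = exp(+w_ij/4) if x_i == x_j else exp(-w_ij/4)
                psi = np.exp(W[i, j] / 4 * np.array([[1., -1.], [-1., 1.]]))
                # product of messages into i from all neighbors except j
                h = sum(log_m[k, i] for k in range(n) if k not in (i, j))
                msg = np.log(psi.T @ np.exp(h))
                new[i, j] = (1 - eps) * log_m[i, j] + eps * (msg - msg.max())
        if np.max(np.abs(new - log_m)) < tol:
            log_m, converged = new, True
            break
        log_m = new
    # single-node marginals P_i(x_i = 1)
    P = np.zeros(n)
    for i in range(n):
        h = sum(log_m[k, i] for k in range(n) if k != i)
        p = np.exp(h - h.max())
        P[i] = p[1] / p.sum()
    return P, converged
```

With small weights (for instance, $\omega = 1$), the iteration settles on the trivial fixed point with all marginals at 0.5; with strength 6 and step size 0.6, the settings of the upper right inset of Figure 4, it fails to converge.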



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief propagation iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical values $\alpha_\text{critical}$ in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Figure 1: Layout of correspondences and implications. See the text for details.

constraints; with two or more connected loops, the conditions fail (see also McEliece & Yildirim, 2003).

Milder conditions for uniqueness, which do depend on the strength of the interactions, follow from the track on the right-hand side of Figure 1. First, we note that nonconvex constrained minimization of the Bethe free energy is equivalent to an unconstrained nonconvex/concave minimax problem (Heskes, 2002), link a in Figure 1. Convergent double-loop algorithms like CCCP (Yuille, 2002) and faster variants thereof (Heskes et al., 2003) in fact solve such a minimax problem: the concave problem in the maximizing parameters (basically Lagrange multipliers) is solved by a message-passing algorithm very similar to standard loopy belief propagation in the inner loop, where the outer loop changes the minimizing parameters (a remaining set of pseudomarginals) in the proper downward direction. The transformation


from a nonconvex constrained minimization problem to an unconstrained nonconvex/concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1.

Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then, in sections 6 and 8, work toward conditions under which this minimax problem is convex/concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article. Link c follows from the observation in section 5.2 that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex/concave.

As indicated by link e, convex/concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum (link f).

The concluding statement, link g in the lower-right corner, is to the best of our knowledge no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

$$P_\text{exact}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \tag{3.1}$$

Here, $\Psi_\alpha$ is a potential, some function of the potential subset $X_\alpha$, and $Z$ is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain $X$. The convention that we adhere to in this article is that there are no potential subsets $X_\alpha$ and $X_{\alpha'}$ such that $X_{\alpha'}$ is fully subsumed by $X_\alpha$. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite.

The typical goal in Bayesian networks and Markov random fields is to compute the partition function $Z$ or marginals, for example,

$$P_\text{exact}(X_\alpha) = \sum_{X \setminus X_\alpha} P_\text{exact}(X).$$

One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be


roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

$$F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_X P(X)\log P(X),$$

with shorthand $\psi_\alpha \equiv \log \Psi_\alpha$. Minimizing this variational free energy over the set $\mathcal{P}$ of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum and minus the log of the partition function as the value at the minimum:

$$P_\text{exact} = \mathop{\mathrm{argmin}}_{P\in\mathcal{P}} F(P) \qquad\text{and}\qquad -\log Z = \min_{P\in\mathcal{P}} F(P).$$

Since the Gibbs-Helmholtz free energy is convex in $P$, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself we have not gained anything: the entropy may still be intractable to compute.

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

$$\sum_X P(X)\log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha)\log P(X_\alpha) - \sum_\beta (n_\beta - 1)\sum_{x_\beta} P(x_\beta)\log P(x_\beta),$$

with $x_\beta$ a (super)node and $n_\beta = \sum_{\alpha \supset \beta} 1$ the number of potentials that contain node $x_\beta$. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes $x_\beta$ are themselves subsets of the potential subsets, that is,

$$x_\beta \cap X_\alpha = \emptyset \quad\text{or}\quad x_\beta \cap X_\alpha = x_\beta \quad \forall_{\alpha,\beta},$$

and partition the domain $X$:

$$x_\beta \cap x_{\beta'} = \emptyset \quad \forall_{\beta \neq \beta'} \qquad\text{and}\qquad \bigcup_\beta x_\beta = X.$$

Typically the $x_\beta$ are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by $\beta$ and $x_\beta$ in lowercase, to contrast them with the potentials $\alpha$ and potential subsets $X_\alpha$ in uppercase.
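On a tree this entropy approximation is exact. A quick numerical check on a three-node chain, our own example, illustrates this:

```python
import numpy as np

# A three-node chain with potential subsets {1,2} and {2,3}: node 2 has n_2 = 2.
rng = np.random.default_rng(2)
psi12 = rng.uniform(0.5, 2.0, (2, 2))
psi23 = rng.uniform(0.5, 2.0, (2, 2))
joint = np.einsum('ab,bc->abc', psi12, psi23)
P = joint / joint.sum()

def H(p):
    """Shannon entropy of a (possibly multidimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    return float(-(p * np.log(p)).sum())

P12, P23, P2 = P.sum(axis=2), P.sum(axis=0), P.sum(axis=(0, 2))
# Bethe approximation: subset entropies minus (n_beta - 1) node corrections
bethe_entropy = H(P12) + H(P23) - (2 - 1) * H(P2)
assert np.isclose(H(P), bethe_entropy)  # exact on a tree
```

On graphs with cycles the two sides generally differ, which is precisely why the Bethe free energy is only an approximation there.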


Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy,

$$F(\{Q_\alpha, Q_\beta\}) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta-1)\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta), \tag{3.2}$$

over sets of "pseudomarginals"¹ or beliefs $\{Q_\alpha, Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,²

$$\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \qquad\text{and}\qquad Q_\alpha(x_\beta) = \sum_{X_{\alpha\setminus\beta}} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \tag{3.3}$$

Let $\mathcal{Q}$ denote all sets of consistent and properly normalized pseudomarginals. Then our goal is to solve

$$\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(\{Q_\alpha, Q_\beta\}).$$

The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals $P_\text{exact}(X_\alpha)$ and $P_\text{exact}(x_\beta)$.
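As a sanity check, our own example rather than one from the article, on a tree the Bethe free energy evaluated at the exact marginals reproduces the Gibbs-Helmholtz minimum, $-\log Z$:

```python
import numpy as np

rng = np.random.default_rng(3)
psi12 = rng.uniform(0.5, 2.0, (2, 2))
psi23 = rng.uniform(0.5, 2.0, (2, 2))
joint = np.einsum('ab,bc->abc', psi12, psi23)
Z = joint.sum()
P = joint / Z
Q12, Q23, Q2 = P.sum(2), P.sum(0), P.sum((0, 2))

def bethe_free_energy(Q12, Q23, Q2):
    """Equation 3.2 for the chain {1,2},{2,3}; only node 2 has n_beta = 2."""
    energy = -(Q12 * np.log(psi12)).sum() - (Q23 * np.log(psi23)).sum()
    neg_entropy = (Q12 * np.log(Q12)).sum() + (Q23 * np.log(Q23)).sum() \
                  - (2 - 1) * (Q2 * np.log(Q2)).sum()
    return float(energy + neg_entropy)

# on a tree, the constrained minimum is attained at the exact marginals,
# and the minimum value equals -log Z, just as in the Gibbs-Helmholtz case
assert np.isclose(bethe_free_energy(Q12, Q23, Q2), -np.log(Z))
```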

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported by Yedidia et al. (2001). It starts with the Lagrangian

$$\begin{aligned} L(\{Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta\}) = F(\{Q_\alpha, Q_\beta\}) &+ \sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\left[Q_\beta(x_\beta) - Q_\alpha(x_\beta)\right] \\ &+ \sum_\alpha \lambda_\alpha\left[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\right] + \sum_\beta \lambda_\beta\left[1 - \sum_{x_\beta} Q_\beta(x_\beta)\right]. \end{aligned} \tag{3.4}$$

1 Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

2 Strictly speaking, we also have to take inequality constraints into account, namely those of the form $Q_\alpha(X_\alpha) \ge 0$. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active". For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of $L$ are zero: the ones with respect to the Lagrange multipliers $\lambda$ give back the constraints; the ones with respect to the pseudomarginals $Q$ give an extremum of the Bethe free energy. Setting the derivatives with respect to $Q_\alpha$ and $Q_\beta$ to zero, we can solve for $Q_\alpha$ and $Q_\beta$ in terms of the Lagrange multipliers:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\lambda_{\alpha\beta}(x_\beta)\right]$$

$$Q^*_\beta(x_\beta) = \exp\left[\frac{1}{n_\beta-1}\left(1 - \lambda_\beta + \sum_{\alpha\supset\beta}\lambda_{\alpha\beta}(x_\beta)\right)\right].$$

In terms of the "message" $\mu_{\beta\to\alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]$ from node $\beta$ to potential $\alpha$, the pseudomarginal $Q^*_\alpha(X_\alpha)$ reads

$$Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta\subset\alpha} \mu_{\beta\to\alpha}(x_\beta), \tag{3.5}$$

where proper normalization yields the Lagrange multiplier $\lambda_\alpha$. With the definition

$$\mu_{\alpha\to\beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}, \tag{3.6}$$

the fixed-point equation for $Q^*_\beta(x_\beta)$ can, after some manipulation, be written in the form

$$Q^*_\beta(x_\beta) \propto \prod_{\alpha\supset\beta} \mu_{\alpha\to\beta}(x_\beta), \tag{3.7}$$

where again the Lagrange multiplier $\lambda_\beta$ follows from normalization. Finally, the constraint $Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)$, in combination with equation 3.6, suggests the update

$$\mu_{\alpha\to\beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}. \tag{3.8}$$

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller


steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

$$\log\mu^\text{new}_{\alpha\to\beta}(x_\beta) = \log\mu_{\alpha\to\beta}(x_\beta) + \epsilon\left[\log Q^*_\alpha(x_\beta) - \log\mu_{\beta\to\alpha}(x_\beta) - \log\mu_{\alpha\to\beta}(x_\beta)\right]. \tag{3.9}$$

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points correspond to zero derivatives of the Lagrangian.
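Equations 3.5 through 3.8 can be implemented directly on a factor graph. The sketch below is our own generic implementation, not the author's: it is undamped, uses a fixed schedule, and folds the node-to-factor messages of equation 3.6 into the factor-to-node update. On a tree it reproduces the exact marginals.

```python
import itertools
import numpy as np

def sum_product(factors, n_nodes, n_states=2, n_iter=50):
    """Equations 3.5-3.8 on a factor graph: `factors` maps a tuple of node
    indices (a potential subset X_alpha) to a table Psi_alpha."""
    # mu_af: factor-to-node messages mu_{alpha->beta}, initialized uniform
    mu_af = {(a, b): np.ones(n_states) for a in factors for b in a}
    for _ in range(n_iter):
        for a, psi in factors.items():
            for i, b in enumerate(a):
                msg = np.zeros(n_states)
                for x in itertools.product(range(n_states), repeat=len(a)):
                    val = psi[x]
                    for j, b2 in enumerate(a):
                        if b2 != b:
                            # node-to-factor message mu_{beta'->alpha} (eq 3.6):
                            # node belief divided by this factor's own message
                            bel = np.prod([mu_af[(a2, b2)][x[j]]
                                           for a2 in factors if b2 in a2])
                            val *= bel / mu_af[(a, b2)][x[j]]
                    msg[x[i]] += val  # marginalize Q*_alpha (eqs 3.5 and 3.8)
                mu_af[(a, b)] = msg / msg.sum()
    # node beliefs: product of incoming factor messages (eq 3.7)
    beliefs = []
    for b in range(n_nodes):
        bel = np.prod([mu_af[(a, b)] for a in factors if b in a], axis=0)
        beliefs.append(bel / bel.sum())
    return beliefs
```

For the chain `{(0, 1): psi12, (1, 2): psi23}`, the returned beliefs match the marginals obtained by exhaustive enumeration, as they must on a singly connected graph.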

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

$$\min_{\{Q_\alpha,Q_\beta\}}\;\max_{\{\lambda_{\alpha\beta},\lambda_\alpha,\lambda_\beta\}} L(\{Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta\}).$$

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, are indeed convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters $\{Q_\alpha, Q_\beta\}$. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in $\{Q_\alpha, Q_\beta\}$.

4.2 Conditions for Convexity. The problem is with the term

$$S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta),$$

which is concave in $Q_\beta$. Using the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, we can turn it into a functional that is convex in $Q_\alpha$ and $Q_\beta$ separately, but not necessarily jointly. That is, with the substitution $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$ for any $\alpha \supset \beta$, the entropy, and thus the Bethe free energy, is convex in $Q_\alpha$ and in $Q_\beta$, but not necessarily in $\{Q_\alpha, Q_\beta\}$. However, if we add to $S_\beta(Q_\beta)$ a convex entropy contribution,

$$-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha),$$


the combination of $-S_\alpha$ and $S_\beta$ is convex in $\{Q_\alpha, Q_\beta\}$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1.

$$\Delta_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)$$

is convex in $\{Q_\alpha, Q_\beta\}$.

Proof. The matrix of second derivatives of $\Delta_{\alpha\beta}$ has the components

$$H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha,X'_\alpha},$$

$$H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta},$$

$$H(x_\beta, x'_\beta) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta},$$

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,

$$\begin{aligned} 0 &\le (R_\alpha(X_\alpha)\;\; R_\beta(x_\beta))\begin{pmatrix} H(X_\alpha,X'_\alpha) & H(X_\alpha,x'_\beta) \\ H(x_\beta,X'_\alpha) & H(x_\beta,x'_\beta) \end{pmatrix}\begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix} \\ &= \sum_{X_\alpha}\frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2\sum_{X_\alpha}\frac{R_\alpha(X_\alpha)R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta}\frac{Q_\alpha(x_\beta)R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)} \\ &= \sum_{X_\alpha} Q_\alpha(X_\alpha)\left[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\right]^2, \end{aligned}$$

which, being a sum of squares, is indeed nonnegative.

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$, satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\,\beta\subset\alpha}$ (positivity),
2. $\sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation). $\qquad(4.1)$


Proof. First, we note that we do not have to worry about the energy terms, which are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\left[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta-1) S_\beta(Q_\beta)\right],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta - 1)S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset $\alpha$ to compensate for node $\beta$. Now, in shorthand notation and with a little bit of rewriting,

$$\begin{aligned} -S(Q) &= -\left[\sum_\alpha S_\alpha - \sum_\beta (n_\beta-1) S_\beta\right] \\ &= -\sum_\alpha \left(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} + \sum_{\beta\subset\alpha} A_{\alpha\beta}\right) S_\alpha - \sum_\beta \left[-\sum_{\alpha\supset\beta} A_{\alpha\beta} + \sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1)\right] S_\beta \\ &= -\sum_\alpha \left(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta}\right) S_\alpha - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta}\left[S_\alpha - S_\beta\right] - \sum_\beta \left[\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta-1)\right] S_\beta. \end{aligned}$$

Convexity of the first term is guaranteed if $1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.
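Verifying the three conditions for a candidate allocation matrix is mechanical. The helper below is our own sketch; the data layout, a list of potential subsets plus one dictionary per potential, is an assumption of this illustration, not notation from the article:

```python
def check_conditions(A, potentials, n_nodes, tol=1e-12):
    """Verify conditions 1-3 of equation 4.1 for a candidate allocation matrix.

    `potentials` is a list of tuples of node indices (the potential subsets);
    A[i][beta] is the amount of resources potential i allocates to node beta."""
    n_b = [sum(1 for a in potentials if b in a) for b in range(n_nodes)]
    ok1 = all(A[i][b] >= -tol for i, a in enumerate(potentials) for b in a)
    ok2 = all(sum(A[i][b] for b in a) <= 1 + tol
              for i, a in enumerate(potentials))
    ok3 = all(sum(A[i][b] for i, a in enumerate(potentials) if b in a)
              >= n_b[b] - 1 - tol for b in range(n_nodes))
    return ok1 and ok2 and ok3

# chain 0-1-2 with root beta* = 2: each potential allocates its one unit of
# resources to the node nearest the root, as in the proof of corollary 1 below
chain = [(0, 1), (1, 2)]
A = [{0: 0.0, 1: 1.0}, {1: 0.0, 2: 1.0}]
assert check_conditions(A, chain, 3)
```

Allocating nothing at all, by contrast, violates condition 3 at the middle node, which has $n_\beta = 2$.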

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root $\beta^*$ and define

$$A_{\alpha\beta} = 1 \;\text{ iff } \beta \subset \alpha \text{ and } \beta \text{ closer to the root } \beta^* \text{ than any other } \beta' \subset \alpha; \qquad A_{\alpha\beta'} = 0 \text{ for all other } \beta'.$$

Obviously, this choice of $A$ satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each $\beta \neq \beta^*$ there is just a single potential $\alpha \supset \beta$ that is closer to the root $\beta^*$ than $\beta$ itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence, condition 3 is also satisfied:

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta\neq\beta^*} \qquad\text{and}\qquad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of $A$, we are in a sense "eating up resources toward the root". At the root we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node $\beta^*$ from a potential $\alpha^*$, such that a singly connected structure is left. Construct a matrix $A$ as in the proof of corollary 1, taking the node $\beta^*$ as the root. The matrix $A$ constructed in this way also just works for the graph with the closed loop, since still

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta\neq\beta^*} \qquad\text{and now}\qquad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, $1 - n_\beta$, with $n_\beta$ the number of neighboring potentials. The arrows pointing from potentials $\alpha$ to nodes $\beta$ visualize the allocation matrix $A$, with $A_{\alpha\beta} = 1$ if there is an arrow and $A_{\alpha\beta} = 0$ otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely $n_\beta - 1$ incoming arrows, just sufficient to compensate the overcounting number $1 - n_\beta$. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes $B$ we have

$$\sum_{\beta\in B}(1-n_\beta) + \sum_{\alpha\in\pi(B)} 1 \ge 0, \tag{4.2}$$

where $\pi(B) \equiv \{\alpha : \exists \beta \in B,\ \beta \subset \alpha\}$ denotes the "parent" set of $B$, that is, those potential subsets that include at least one node in $B$.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set $B$,

$$\sum_{\beta\in B}(n_\beta-1) \le \sum_{\beta\in B}\sum_{\alpha\supset\beta} A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)}\sum_{\beta\subset\alpha} A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$F(\{Q_\alpha, Q_\beta\}) = -\sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta\sum_{\alpha\supset\beta} A_{\alpha\beta}\sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \tag{5.1}$$

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}$$

Second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

And second we express the consistency constraints from equation 33 interms of the potential pseudomarginals Qα alone This then yields

$$\begin{aligned} L(\{Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha\}) = &-\sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) \\ &- \sum_\alpha\sum_{\beta\subset\alpha} A_{\alpha\beta}\sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta) \\ &+ \sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] \\ &+ \sum_\alpha \lambda_\alpha\left[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\right] + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right]. \end{aligned} \tag{5.3}$$

3 We would like to conjecture that this is not possible, that is, that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:⁴

$$\min_{\{Q_\alpha,Q_\beta\}}\;\max_{\{\lambda_{\alpha\beta},\lambda_\alpha\}} L(\{Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha\}) = \min_{Q_\beta}\;\max_{\{\lambda_{\alpha\beta},\lambda_\alpha\}}\;\min_{Q_\alpha} L(\{Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha\}).$$

Rewriting

$$\sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta}\lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta}A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] = -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta}\bar\lambda_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),$$

with

$$\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta}A_{\alpha'\beta}\,\lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to $Q_\alpha$:

we can easily solve for the minimum with respect to Qα

Qlowastα(Xα) = α(Xα) exp

[λα minus 1+

sumβsubα

Aαβ log Qβ(xβ)+ λαβ(xβ)

] (54)

4 In principle we could also first take the minimum over Qβ and leave the minimumover Qα but this does not seem to lead to any useful results


Plugging this into the Lagrangian, we obtain the "dual"

$$\begin{aligned} G(\{Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha\}) &\equiv L(\{Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha\}) \\ &= -\sum_\alpha\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\left(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\right)\right] \\ &\quad + \sum_\alpha\lambda_\alpha + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]. \end{aligned} \tag{5.5}$$

Next, we find for the maximum with respect to $\lambda_\alpha$,

$$\exp\left[1-\lambda^*_\alpha\right] = \sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\left(A_{\alpha\beta}\log Q_\beta(x_\beta)+\bar\lambda_{\alpha\beta}(x_\beta)\right)\right] \equiv Z^*_\alpha, \tag{5.6}$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

where we have to keep in mind that Zlowastα by itself like Qlowastα is a function of theremaining pseudomarginals Qβ and Lagrange multipliers λαβ Substitutingthis solution into the dual we arrive at

$$\bar G(\{Q_\beta, \lambda_{\alpha\beta}\}) \equiv G(\{Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha\}) = -\sum_\alpha\log Z^*_\alpha + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]. \tag{5.7}$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions. To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$, and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $\bar G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex/concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \mathop{\mathrm{argmin}}_y f(x, y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\!\left(x, \tfrac{1}{2}\left[y^*(x+\delta)+y^*(x-\delta)\right]\right) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of a convex/concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \mathop{\mathrm{argmax}}_y f(x, y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$, and the second inequality from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

53 Boundedness of the Bethe Free Energy For completeness and tosupport link f in Figure 1 we will here prove that the Bethe free energy isbounded from below The following theorem can be considered a specialcase of the one stated in Minka (2001) on the Bethe free energy for expectationpropagation a generalization of (loopy) belief propagation

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \leq \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.

2396 T Heskes

Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = (n_\beta - 1)/n_\beta$, we then have

$$
\begin{aligned}
G(Q_\beta) &\geq -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[\sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log Q_\beta(x_\beta)\right] + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\geq -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta) + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\geq -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_{\alpha\setminus\beta}} \Psi_{\max} + \sum_\beta (n_\beta-1)\left[-\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\geq -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_{\alpha\setminus\beta}} \Psi_{\max},
\end{aligned}
$$

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$, the second from the concavity of the function $y^{(n_\beta-1)/n_\beta}$, the third from the upper bound on the potentials, and the last from $y - 1 - \log y \geq 0$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative reads

$$
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},
$$

we make use of

$$
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
$$

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the convention $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \neq x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

$$
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) &= \sum_{\alpha\supset\beta} A_{\alpha\beta}(1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta} + \sum_{\alpha\supset\beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)}, \\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) &= -\sum_{\alpha\supset\beta,\beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta,
\end{aligned}
$$

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. It should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$
K \equiv \sum_{\beta,\beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \geq 0
$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$
\begin{aligned}
K &= \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) && (K_1) \\
&\quad + \sum_\alpha \sum_{\beta,\beta'\subset\alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) && (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), && (K_3)
\end{aligned}
$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions. Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$
K_2 = \sum_\alpha \left[\sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta)\right]^2 \geq 0
$$

and⁵

$$
\begin{aligned}
K_3 &= \sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'}) \left\{\tfrac{1}{2}\left[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\right]^2 - \tfrac{1}{2} R^2_\beta(x_\beta) - \tfrac{1}{2} R^2_{\beta'}(x'_{\beta'})\right\} \\
&\geq -\sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left(\sum_{\beta'\subset\alpha} A_{\alpha\beta'} - A_{\alpha\beta}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \qquad (6.1)
\end{aligned}
$$

we have

$$
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left(1 - \sum_{\beta'\subset\alpha} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

That is, sufficient conditions for $K$ to be nonnegative are

$$
A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha,\, \beta\subset\alpha} \quad \text{and} \quad \sum_{\beta\subset\alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha,
$$

precisely the conditions for theorem 1.
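These manipulations can be checked numerically. The Python sketch below (our construction, not from the article) evaluates $K_1 + K_2 + K_3$ on a triangle of three binary nodes with allocations $A_{\alpha\beta} = 1/2$, so that $\sum_{\beta\subset\alpha} A_{\alpha\beta} = 1$; random joint tables stand in for the beliefs $Q^*_\alpha$, and the quadratic form indeed never goes negative:

```python
# Numeric sanity check of K = K1 + K2 + K3 >= 0 under the theorem-1 conditions.
import random

random.seed(0)
EDGES = [(0, 1), (1, 2), (0, 2)]
A = 0.5  # both nodes of each pairwise potential get 1/2, so the sum per edge is 1

def marginals(q):
    # q is a normalized 2x2 table q[x_i][x_j]; return its single-node marginals
    mi = [q[0][0] + q[0][1], q[1][0] + q[1][1]]
    mj = [q[0][0] + q[1][0], q[0][1] + q[1][1]]
    return mi, mj

def quadratic_form(joints, R):
    K1 = K2 = K3 = 0.0
    for (i, j), q in zip(EDGES, joints):
        mi, mj = marginals(q)
        K1 += sum(A * (1 - A) * mi[x] * R[i][x] ** 2 for x in (0, 1))
        K1 += sum(A * (1 - A) * mj[x] * R[j][x] ** 2 for x in (0, 1))
        s = sum(A * mi[x] * R[i][x] for x in (0, 1)) + \
            sum(A * mj[x] * R[j][x] for x in (0, 1))
        K2 += s * s
        # the two ordered pairs (beta, beta') give twice the same contribution
        K3 -= 2 * A * A * sum(q[x][y] * R[i][x] * R[j][y]
                              for x in (0, 1) for y in (0, 1))
    return K1 + K2 + K3

worst = float("inf")
for _ in range(2000):
    joints = []
    for _ in EDGES:
        w = [[random.random() + 1e-3 for _ in (0, 1)] for _ in (0, 1)]
        z = sum(sum(row) for row in w)
        joints.append([[v / z for v in row] for row in w])
    R = [[random.uniform(-1, 1) for _ in (0, 1)] for _ in range(3)]
    worst = min(worst, quadratic_form(joints, R))
```

The bound above only uses that each $Q^*_\alpha$ is a normalized distribution, which is why arbitrary random tables suffice here.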

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that this is not the case here: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta,\beta'\subset\alpha}.
$$

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

$$
\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) \\
&\quad - \sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}).
\end{aligned}
$$

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \geq 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$
\sum_{\substack{\alpha'\supset\beta \\ \alpha'\neq\alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1.
$$

The net effect is equivalent to ignoring the interaction: the number of neighboring potentials $n_\beta$ is reduced by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder, and thus better, conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition into $K_2$ and $K_3$, where, since $K_2 \geq 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}}\; \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1)
$$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \bar{X}_\alpha} \left[\psi_\alpha(X_\alpha) + (n_\alpha - 1)\,\psi_\alpha(\bar{X}_\alpha) - \sum_{\beta\subset\alpha} \psi_\alpha(\bar{X}_{\alpha\setminus\beta}, x_\beta)\right], \qquad (7.2)
$$

where $n_\alpha \equiv \sum_{\beta\subset\alpha} 1$, the number of nodes contained in $\alpha$.


Proof. For convenience, and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$
\begin{aligned}
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} &= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta \left[\sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta'\neq\beta} \mu_{\beta'}(x'_{\beta'})\right] \mu_\beta(x_\beta)} \\
&= \frac{\Psi(X) \left[\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta'\neq\beta} \mu_{\beta'}(x'_{\beta'})}, \qquad (7.3)
\end{aligned}
$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$
V(\mu) = (n-1) \log\left[\sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta)\right] - \sum_{\beta=1}^n \log\left[\sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta'\neq\beta} \mu_{\beta'}(x_{\beta'})\right]
$$

with respect to the messages $\mu$, under the constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \geq 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar{x}_\beta}$ for some $\bar{x}_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey

$$
\frac{\partial V}{\partial \mu_1(x_1)} = \frac{1}{\mu_1(x_1)} \left[(n-1)\, Q(x_1) - \sum_{\beta\neq 1} Q(x_1 \mid x^*_\beta)\right],
$$

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = \frac{1}{\mu_1(x_1)\, \mu_1(x'_1)} \left[-(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta\neq 1} Q(x_1 \mid x^*_\beta)\, Q(x'_1 \mid x^*_\beta)\right],
$$

where

$$
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$
\frac{1}{\mu_1(x_1)} \left[(n-1)\, Q(x_1) - \sum_{\beta\neq 1} Q(x_1 \mid x^*_\beta)\right] = \lambda,
$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Multiplying by $\mu_1(x_1)$ and summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix of second derivatives at such an extremum, we then have, substituting $(n-1)\, Q(x_1) = \sum_{\beta\neq 1} Q(x_1 \mid x^*_\beta)$,

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = \frac{1}{2(n-1)\, \mu_1(x_1)\, \mu_1(x'_1)} \sum_{\beta\neq 1} \sum_{\substack{\beta'\neq 1 \\ \beta'\neq\beta}} \left[Q(x_1 \mid x^*_\beta) - Q(x_1 \mid x^*_{\beta'})\right] \left[Q(x'_1 \mid x^*_\beta) - Q(x'_1 \mid x^*_{\beta'})\right],
$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.
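Lemma 3 can also be illustrated by brute force. In the Python sketch below (our construction, with an arbitrary random three-node potential), the ratio of equation 7.3 is evaluated at all delta messages and at many random interior messages; the interior values never exceed the extreme-point maximum:

```python
# Brute-force illustration of lemma 3 for a random three-node potential.
import itertools
import math
import random

random.seed(1)
N = 3
STATES = list(itertools.product((0, 1), repeat=N))
PSI = {X: math.exp(random.uniform(-1.0, 1.0)) for X in STATES}

def ratio(X, mu):
    # equation 7.3, with the mu_beta(x_beta) factors already cancelled
    Z = sum(PSI[Y] * math.prod(mu[b][Y[b]] for b in range(N)) for Y in STATES)
    denom = 1.0
    for b in range(N):
        denom *= sum(PSI[Y] * math.prod(mu[c][Y[c]] for c in range(N) if c != b)
                     for Y in STATES if Y[b] == X[b])
    return PSI[X] * Z ** (N - 1) / denom

X = (0, 1, 0)  # an arbitrary fixed configuration

# maximum over the extreme points mu_beta = delta_{x_beta, xbar_beta}
best_extreme = max(
    ratio(X, [[1.0 if x == Xbar[b] else 0.0 for x in (0, 1)] for b in range(N)])
    for Xbar in STATES)

def rand_mu():
    out = []
    for _ in range(N):
        p = random.uniform(0.01, 0.99)
        out.append([p, 1.0 - p])
    return out

# random interior messages never exceed the extreme-point maximum
worst_gap = min(best_extreme - ratio(X, rand_mu()) for _ in range(500))
```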

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\bar{X}$:

$$
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar{X}} \frac{\Psi(X) \left[\Psi(\bar{X})\right]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus\beta}, x_\beta)}.
$$

Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$
\frac{1}{1-\sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \bar{X}} \frac{\Psi(X) \left[\Psi(\bar{X})\right]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus\beta}, x_\beta)}. \qquad (7.4)
$$


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta,\beta'\}}$ in

$$
Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma\, Q^*(X).
$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$
\omega \equiv -\log(1-\sigma) = \max_{X, \bar{X}} W(X, \bar{X}) \quad \text{with} \quad W(X, \bar{X}) = \psi(X) + (n-1)\,\psi(\bar{X}) - \sum_\beta \psi(\bar{X}_{\setminus\beta}, x_\beta),
$$

where we recall that $\psi(X) \equiv \log \Psi(X)$.
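The resulting bound is easy to test. The Python sketch below (our construction) builds pairwise binary beliefs from random potentials and random messages, computes $\sigma$ from the strength $\omega$, and checks inequality 7.1 in all four cells:

```python
# Numeric check of lemma 2 for pairwise binary potentials: beliefs Q* formed
# from a random potential and random messages satisfy
# Q*(x1, x2) - Q*(x1) Q*(x2) <= sigma * Q*(x1, x2).
import math
import random

random.seed(2)
worst = float("inf")
for _ in range(1000):
    psi = [[random.uniform(-1.0, 1.0) for _ in (0, 1)] for _ in (0, 1)]
    omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])  # equation 7.5
    sigma = 1.0 - math.exp(-omega)
    mu1 = [random.random() + 0.01 for _ in (0, 1)]
    mu2 = [random.random() + 0.01 for _ in (0, 1)]
    q = [[math.exp(psi[a][b]) * mu1[a] * mu2[b] for b in (0, 1)] for a in (0, 1)]
    z = sum(sum(row) for row in q)
    q = [[v / z for v in row] for row in q]
    m1 = [q[0][0] + q[0][1], q[1][0] + q[1][1]]
    m2 = [q[0][0] + q[1][0], q[0][1] + q[1][1]]
    for a in (0, 1):
        for b in (0, 1):
            worst = min(worst, sigma * q[a][b] - (q[a][b] - m1[a] * m2[b]))
```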

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication by any term that factorizes over the nodes; that is,

$$
\text{if } \tilde{\Psi}(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde{\Psi}) = \omega(\Psi) \text{ for any choice of } \mu.
$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential by a term that depends only on the overlap, and dividing the other by the same term, does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar{X}$ that differ in fewer than two nodes. To see this, consider combinations that differ only in the first two nodes:

$$
\begin{aligned}
W(x_1, x_2, \bar{X}_{\setminus 12};\, \bar{x}_1, \bar{x}_2, \bar{X}_{\setminus 12}) &= \psi(x_1, x_2, \bar{X}_{\setminus 12}) + \psi(\bar{x}_1, \bar{x}_2, \bar{X}_{\setminus 12}) - \psi(x_1, \bar{x}_2, \bar{X}_{\setminus 12}) - \psi(\bar{x}_1, x_2, \bar{X}_{\setminus 12}) \\
&= -W(\bar{x}_1, x_2, \bar{X}_{\setminus 12};\, x_1, \bar{x}_2, \bar{X}_{\setminus 12}).
\end{aligned}
$$

If now also $x_2 = \bar{x}_2$, we get $W = -W = 0$. Furthermore, if $W(x_1, x_2, \bar{X}_{\setminus 12};\, \bar{x}_1, \bar{x}_2, \bar{X}_{\setminus 12}) \leq 0$, then it must be that $W(\bar{x}_1, x_2, \bar{X}_{\setminus 12};\, x_1, \bar{x}_2, \bar{X}_{\setminus 12}) \geq 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \leq \omega < \infty$ and $0 \leq \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \qquad (7.5)
$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
$$

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
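The pairwise binary case can be verified directly. The Python sketch below brute-forces equation 7.2 for a Boltzmann factor (the numeric weight and thresholds are arbitrary choices of ours) and confirms $\omega = |w|$, the invariance to thresholds, and the temperature scaling $\sigma(T) = 1 - [1 - \sigma(1)]^{1/T}$:

```python
# Brute-force strength of a binary pairwise Boltzmann factor.
import itertools
import math

def omega_pairwise(w, t1, t2):
    psi = lambda a, b: w * a * b + t1 * a + t2 * b   # log-potential, x in {0, 1}
    # equation 7.2 with n = 2: psi(X) + psi(Xbar) - psi(x1, xbar2) - psi(xbar1, x2)
    return max(psi(a, b) + psi(c, d) - psi(a, d) - psi(c, b)
               for a, b, c, d in itertools.product((0, 1), repeat=4))

omega = omega_pairwise(1.7, 0.4, -2.3)
sigma = 1.0 - math.exp(-omega)
# temperature T = 2 divides all log-potentials by 2
sigma_T = 1.0 - math.exp(-omega_pairwise(1.7 / 2, 0.4 / 2, -2.3 / 2))
```

Algebraically the maximized expression is $w(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$, which is why the thresholds drop out.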

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha,\, \beta\subset\alpha}$ (positivity),
2. $(1-\sigma_\alpha) \max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \geq n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), $\qquad$ (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.
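Checking conditions 8.1 for a given candidate allocation matrix is mechanical. The Python sketch below (the data layout and names are ours, not the paper's) implements the three checks and applies them to a single loop of three nodes, where $A_{\alpha\beta} = 1/2$ works for any strength $\sigma_\alpha < 1$, while an undersized allocation fails condition 3:

```python
# A checker for conditions 8.1 given a candidate allocation matrix.
def conditions_hold(members, A, sigma, tol=1e-12):
    """members: dict potential -> list of its nodes; A: dict (potential, node)
    -> allocation; sigma: dict potential -> strength in [0, 1)."""
    nodes = {b for bs in members.values() for b in bs}
    # condition 1: positivity
    if any(A[a, b] < -tol for a, bs in members.items() for b in bs):
        return False
    # condition 2: sufficient amount of resources
    for a, bs in members.items():
        load = (1 - sigma[a]) * max(A[a, b] for b in bs) + \
               sigma[a] * sum(A[a, b] for b in bs)
        if load > 1 + tol:
            return False
    # condition 3: sufficient compensation (n_beta potentials contain node b)
    for b in nodes:
        containing = [a for a, bs in members.items() if b in bs]
        if sum(A[a, b] for a in containing) < len(containing) - 1 - tol:
            return False
    return True

# a single loop of three nodes: A = 1/2 satisfies all three conditions
# for any strength sigma < 1, while A = 0.3 fails condition 3
tri = {"e01": [0, 1], "e12": [1, 2], "e02": [0, 2]}
A_half = {(a, b): 0.5 for a, bs in tri.items() for b in bs}
A_low = {(a, b): 0.3 for a, bs in tri.items() for b in bs}
strong = {a: 0.99 for a in tri}
```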

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \geq 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$
\begin{aligned}
K_3 &\geq -\sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
&\geq -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left(\sum_{\substack{\beta'\subset\alpha \\ \beta'\neq\beta}} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned}
$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \geq 0$, and combining $K_1$ with (the above lower bound on) $K_3$, we get

$$
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left[1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta'\neq\beta} A_{\alpha\beta'}\right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

This implies the sufficient condition

$$
(1-\sigma_\alpha)\, A_{\alpha\beta} + \sigma_\alpha \sum_{\beta'\subset\alpha} A_{\alpha\beta'} \leq 1 \quad \forall_{\alpha,\, \beta\subset\alpha},
$$

which, in combination with $A_{\alpha\beta} \geq 0$ and $\sigma_\alpha \leq 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality of condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for the uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002), for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha\supset\beta} \left(\max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha)\right) < 2 \quad \forall_\beta. \qquad (8.2)
$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We focus on the case of binary pairwise potentials. Since the definition of the self-potentials is arbitrary, and condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$
\sum_{\alpha\supset\beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)
$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Condition 8.2 applies to any arbitrary definition of the self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$
\tilde{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha} \phi_{\alpha\beta}(x_\beta),
$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$
\min_{\phi_1,\phi_2} \left[\max_{x_1,x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1,x_2} \tilde{\psi}(x_1, x_2)\right] = \min_{\phi_1,\phi_2} \left[\max_{x_1,x_2} \left[\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)\right] - \min_{x_1,x_2} \left[\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)\right]\right].
$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar{x}_1, \bar{x}_2) + \phi_1(\bar{x}_1) + \phi_2(\bar{x}_2) \geq \psi(x_1, \bar{x}_2) + \phi_1(x_1) + \phi_2(\bar{x}_2) = \psi(\bar{x}_1, x_2) + \phi_1(\bar{x}_1) + \phi_2(x_2), \qquad (8.4)
$$

for some $x_1$, $x_2$, $\bar{x}_1$, and $\bar{x}_2$ with $\bar{x}_1 \neq x_1$ and $\bar{x}_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$
\begin{aligned}
\phi_1(x_1) - \phi_1(\bar{x}_1) &= \tfrac{1}{2}\left[\psi(\bar{x}_1, x_2) - \psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(x_1, \bar{x}_2)\right], \\
\phi_2(x_2) - \phi_2(\bar{x}_2) &= \tfrac{1}{2}\left[\psi(x_1, \bar{x}_2) - \psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(\bar{x}_1, x_2)\right].
\end{aligned}
$$

Substitution back into equation 8.4 yields

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \bar{x}_2) - \phi_1(x_1) - \phi_2(\bar{x}_2) = \tfrac{1}{2}\left[\psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(x_1, \bar{x}_2) - \psi(\bar{x}_1, x_2)\right],
$$

which has to be nonnegative. Of all four possible combinations, two are valid and yield the same positive gap; the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$
\min_{\phi_1,\phi_2} \left[\max_{x_1,x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1,x_2} \tilde{\psi}(x_1, x_2)\right] = \tfrac{1}{2}\left|\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)\right| = \frac{\omega}{2},
$$

from equation 7.5. Substitution into condition 8.2 then yields equation 8.3.
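The construction in this proof can be verified numerically. The Python sketch below (our code) applies the displayed optimal offsets, written out for $x_1 = x_2 = 0$ and $\bar{x}_1 = \bar{x}_2 = 1$, to random binary log-potentials and confirms that the resulting gap equals $\omega/2$ and that random offsets never do better:

```python
# Numeric check: the optimal self-potential offsets reduce the gap to omega / 2.
import random

random.seed(3)

def gap(psi, d1, d2):
    # max - min of psi(x1, x2) + phi1(x1) + phi2(x2), parameterized by the
    # differences d_i = phi_i(1) - phi_i(0) (constant offsets drop out)
    vals = [psi[a][b] + d1 * a + d2 * b for a in (0, 1) for b in (0, 1)]
    return max(vals) - min(vals)

worst = 0.0
slack = float("inf")
for _ in range(1000):
    psi = [[random.uniform(-2.0, 2.0) for _ in (0, 1)] for _ in (0, 1)]
    omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])
    # the optimal differences, from the displayed formulas with x = 0, xbar = 1
    d1 = 0.5 * (psi[0][0] + psi[0][1] - psi[1][0] - psi[1][1])
    d2 = 0.5 * (psi[0][0] + psi[1][0] - psi[0][1] - psi[1][1])
    worst = max(worst, abs(gap(psi, d1, d2) - 0.5 * omega))
    # random offsets can only do worse
    slack = min(slack, gap(psi, random.uniform(-3, 3), random.uniform(-3, 3))
                - 0.5 * omega)
```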

Next we derive the following weaker corollary of theorem 4.

Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha\supset\beta} \omega_\alpha \leq 1 \quad \forall_\beta, \qquad (8.5)
$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)

$$
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.
$$

Substitution into condition 3 yields

$$
\sum_{\alpha\supset\beta} (1-\sigma_\alpha) \geq \sum_{\alpha\supset\beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha\supset\beta} \sigma_\alpha \leq 1. \qquad (8.6)
$$

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002), for binary pairwise potentials and when strengthened as above, are at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.
$$

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other with "spins down."

In this symmetric problem, the strength of each potential is given by

$$
\omega = 2 \log\left[\frac{\alpha}{1-\alpha}\right] \quad \text{and thus} \quad \sigma = 1 - \left(\frac{1-\alpha}{\alpha}\right)^2.
$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix are equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
$$

The critical value that follows from corollary 3 is in this case slightly better:

$$
\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + e^{-1/2}} \approx 0.62.
$$

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we could solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$
(2 - 2A)\,\sigma + \frac{3}{4} \leq 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \leq 1.
$$

The optimal choice of $A$ is the one for which both conditions turn out to be identical. In this way we obtain $A = 3/4$, yielding

$$
\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. A straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

$$
\sigma \leq \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
$$

better than the $\alpha < 0.62$ from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.
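The threshold algebra of this section fits in a few lines. The Python sketch below (our helper names) reproduces the numbers quoted above:

```python
# The alpha-thresholds of section 8.3 for the symmetric Ising potential.
import math

def sigma_of_alpha(alpha):
    # strength of the symmetric potential (alpha, 1-alpha; 1-alpha, alpha):
    # omega = 2 log[alpha/(1-alpha)], sigma = 1 - exp(-omega)
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

def alpha_of_sigma(sigma):
    # inverse of the above: alpha = 1 / (1 + sqrt(1 - sigma))
    return 1.0 / (1.0 + math.sqrt(1.0 - sigma))

toroidal = alpha_of_sigma(1.0 / 3.0)         # theorem 4 with A = 3/4, Figure 3a
corollary3 = 1.0 / (1.0 + math.exp(-0.5))    # omega < 1
aperiodic = alpha_of_sigma(1.0 / 2.0)        # theorem 4, Figure 3b
two_loops = alpha_of_sigma(math.sqrt(0.5))   # theorem 4, Figure 3c
```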

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and, in that sense, should be seen as no more than a first step. They do have the following positive features.


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of the potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlations of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides their theoretical usefulness, these conditions have more practical uses. First, algorithms with guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as with the standard Bethe free energy), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter, and thus closer to necessary conditions.

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. In fact, we only need $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but it may lead to stronger results in specific cases (e.g., with only pairwise interactions, or after substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
$$

zero thresholds, and potentials

$$
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \neq x_j.
$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point, with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower-left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper-right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions, and for many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (the same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply $\omega < 4/3$ for corollary 3 and $\omega \leq \log 2 \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
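The qualitative behavior described above is easy to reproduce. The Python sketch below (our code) runs loopy belief propagation on the four-node Boltzmann machine, with plain linear damping standing in for equation 3.9; at a small weight strength, well inside the "convergent" regime, it settles on the trivial fixed point with all marginals equal to 0.5:

```python
# Damped loopy BP on the four-node Boltzmann machine of section 9.
import math
import random

random.seed(4)
W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]
OMEGA = 1.0    # weight strength, far below the observed transition
STEP = 0.5     # step size of the damping (an illustrative value)
N = 4

def psi(i, j, xi, xj):
    w = OMEGA * W[i][j] / 4.0
    return math.exp(w if xi == xj else -w)

# messages m[i][j][x_j], randomly initialized near uniform
m = [[[0.5 + random.uniform(-0.1, 0.1) for _ in (0, 1)] if i != j else None
      for j in range(N)] for i in range(N)]

for _ in range(2000):
    new = [[None] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            upd = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    prod = psi(i, j, xi, xj)
                    for k in range(N):
                        if k != i and k != j:
                            prod *= m[k][i][xi]   # incoming messages to i
                    s += prod
                upd.append(s)
            z = upd[0] + upd[1]
            new[i][j] = [(1 - STEP) * m[i][j][x] + STEP * upd[x] / z
                         for x in (0, 1)]
    m = new

marginals = []
for i in range(N):
    b = [math.prod(m[k][i][x] for k in range(N) if k != i) for x in (0, 1)]
    marginals.append(b[0] / (b[0] + b[1]))
```

Raising `OMEGA` toward the transition region while lowering `STEP` lets one probe the step-size dependence discussed in the text.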


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation (horizontal axis, 0 to 1) and the weight strength (vertical axis, 3.5 to 6). Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


from a nonconvex constrained minimization problem to an unconstrained nonconvex-concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1.

Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then in sections 6 and 8 work toward conditions under which this minimax problem is convex-concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article. Link c follows from the observation in section 5.2 that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex-concave.

As indicated by link e, convex-concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum (link f).

The concluding statement, by link g in the lower-right corner, is, to the best of our knowledge, no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

$$P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \tag{3.1}$$

Here, Ψ_α is a potential, some function of the potential subset X_α, and Z is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain X. The convention that we adhere to in this article is that there are no potential subsets X_α and X_α′ such that X_α′ is fully subsumed by X_α. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite.

The typical goal in Bayesian networks and Markov random fields is to compute the partition function Z or marginals, for example,

$$P_{\text{exact}}(X_\alpha) = \sum_{X \setminus X_\alpha} P_{\text{exact}}(X).$$

One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be


roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

$$F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_X P(X)\log P(X),$$

with shorthand ψ_α ≡ log Ψ_α. Minimizing this variational free energy over the set P of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum and minus the log of the partition function as the value at the minimum:

$$P_{\text{exact}} = \mathop{\mathrm{argmin}}_{P \in \mathcal{P}} F(P) \quad\text{and}\quad -\log Z = \min_{P \in \mathcal{P}} F(P).$$

Since the Gibbs-Helmholtz free energy is convex in P, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself we have not gained anything: the entropy may still be intractable to compute.
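A small brute-force check makes the variational identity concrete; the two pairwise potentials below are arbitrary illustrative choices, and evaluating F at P_exact indeed returns −log Z:

```python
import itertools
import math

# Tiny model: three binary variables with two pairwise potentials,
# Psi_a(x0, x1) and Psi_b(x1, x2); entries are arbitrary positive numbers.
psi_a = {(0, 0): 1.2, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 2.0}
psi_b = {(0, 0): 0.9, (0, 1): 1.5, (1, 0): 2.2, (1, 1): 0.3}

states = list(itertools.product([0, 1], repeat=3))
unnorm = {x: psi_a[x[0], x[1]] * psi_b[x[1], x[2]] for x in states}
Z = sum(unnorm.values())                   # partition function of equation 3.1
P = {x: unnorm[x] / Z for x in states}     # P_exact

# Marginals on the two potential subsets, needed for the energy term.
P_a, P_b = {}, {}
for x, p in P.items():
    P_a[x[0], x[1]] = P_a.get((x[0], x[1]), 0.0) + p
    P_b[x[1], x[2]] = P_b.get((x[1], x[2]), 0.0) + p

# Gibbs-Helmholtz free energy evaluated at P_exact: energy plus entropy term.
energy = -sum(P_a[k] * math.log(psi_a[k]) for k in psi_a) \
         - sum(P_b[k] * math.log(psi_b[k]) for k in psi_b)
F = energy + sum(p * math.log(p) for p in P.values())

assert abs(F + math.log(Z)) < 1e-12        # F(P_exact) = -log Z
```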

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

$$\sum_X P(X)\log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha)\log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta)\log P(x_\beta),$$

with x_β a (super)node and n_β = Σ_{α ⊃ β} 1 the number of potentials that contain node x_β. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes x_β are themselves subsets of the potential subsets, that is,

$$x_\beta \cap X_\alpha = \emptyset \quad\text{or}\quad x_\beta \cap X_\alpha = x_\beta \qquad \forall \alpha, \beta,$$

and partition the domain X:

$$x_\beta \cap x_{\beta'} = \emptyset \;\;\forall \beta \neq \beta' \quad\text{and}\quad \bigcup_\beta x_\beta = X.$$

Typically, the x_β are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by β and x_β in lowercase, to contrast them with the potentials α and potential subsets X_α in uppercase.
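On a singly connected graph, the entropy approximation above is exact. A quick numerical illustration (with an arbitrary three-node chain; the potential values are not from the text): the joint factorizes as P(x0, x1)P(x1, x2)/P(x1), so the pairwise entropies minus the (n_β − 1)-weighted node entropy reproduce the exact entropy.

```python
import itertools
import math

# Chain x0 - x1 - x2 with arbitrary positive pairwise potentials.
psi = [{(0, 0): 1.3, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 2.1},
       {(0, 0): 0.6, (0, 1): 1.9, (1, 0): 1.1, (1, 1): 0.7}]
states = list(itertools.product([0, 1], repeat=3))
w = {x: psi[0][x[0], x[1]] * psi[1][x[1], x[2]] for x in states}
Z = sum(w.values())
P = {x: w[x] / Z for x in states}

def marginal(idx):
    m = {}
    for x, p in P.items():
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, 0.0) + p
    return m

def entropy(m):
    return -sum(p * math.log(p) for p in m.values() if p > 0)

exact = entropy(P)
# Bethe decomposition: only x1 sits in n = 2 potentials, so it is
# discounted with n - 1 = 1; x0 and x2 have n - 1 = 0.
bethe = entropy(marginal((0, 1))) + entropy(marginal((1, 2))) - entropy(marginal((1,)))
assert abs(exact - bethe) < 1e-12
```

On graphs with cycles, the same decomposition is only an approximation, which is exactly what the remainder of the article analyzes.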


Note that the Bethe free energy depends only on the marginals P(X_α) and P(x_β). We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy,

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1)\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta), \tag{3.2}$$

over sets of "pseudomarginals"¹ or beliefs {Q_α, Q_β}. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,²

$$\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad\text{and}\quad Q_\alpha(x_\beta) \equiv \sum_{X_\alpha \setminus x_\beta} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \tag{3.3}$$

Let Q denote the set of all consistent and properly normalized pseudomarginals. Then our goal is to solve

$$\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).$$

The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals P_exact(X_α) and P_exact(x_β).

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported by Yedidia et al. (2001). It starts with the Lagrangian

$$\begin{aligned} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = {} & F(Q_\alpha, Q_\beta) + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\bigl[Q_\beta(x_\beta) - Q_\alpha(x_\beta)\bigr] \\ & + \sum_\alpha \lambda_\alpha \Bigl[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigr] + \sum_\beta \lambda_\beta \Bigl[1 - \sum_{x_\beta} Q_\beta(x_\beta)\Bigr]. \end{aligned} \tag{3.4}$$

¹ Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

² Strictly speaking, we also have to take inequality constraints into account, namely, those of the form Q_α(X_α) ≥ 0. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active." For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of L are zero: the ones with respect to the Lagrange multipliers λ give back the constraints; the ones with respect to the pseudomarginals Q give an extremum of the Bethe free energy. Setting the derivatives with respect to Q_α and Q_β to zero, we can solve for Q_α and Q_β in terms of the Lagrange multipliers:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \lambda_{\alpha\beta}(x_\beta)\Bigr]$$

$$Q^*_\beta(x_\beta) = \exp\Bigl[\frac{1}{n_\beta - 1}\Bigl(1 - \lambda_\beta + \sum_{\alpha \supset \beta} \lambda_{\alpha\beta}(x_\beta)\Bigr)\Bigr].$$

In terms of the "message" μ_{β→α}(x_β) ≡ exp[λ_{αβ}(x_β)] from node β to potential α, the pseudomarginal Q*_α(X_α) reads

$$Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta \subset \alpha} \mu_{\beta\to\alpha}(x_\beta), \tag{3.5}$$

where proper normalization yields the Lagrange multiplier λ_α. With the definition

$$\mu_{\alpha\to\beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}, \tag{3.6}$$

the fixed-point equation for Q*_β(x_β) can, after some manipulation, be written in the form

$$Q^*_\beta(x_\beta) \propto \prod_{\alpha \supset \beta} \mu_{\alpha\to\beta}(x_\beta), \tag{3.7}$$

where again the Lagrange multiplier λ_β follows from normalization. Finally, the constraint Q*_α(x_β) = Q*_β(x_β), in combination with equation 3.6, suggests the update

$$\mu_{\alpha\to\beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}. \tag{3.8}$$

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller


steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

$$\log \mu^{\text{new}}_{\alpha\to\beta}(x_\beta) = \log \mu_{\alpha\to\beta}(x_\beta) + \epsilon\bigl[\log Q^*_\alpha(x_\beta) - \log \mu_{\beta\to\alpha}(x_\beta) - \log \mu_{\alpha\to\beta}(x_\beta)\bigr]. \tag{3.9}$$

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points are the zero derivatives of the Lagrangian.
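The propagation scheme just described can be sketched in code. The implementation below is a minimal illustration (the variable names and the parallel update schedule are our choices, not fixed by the text): factor beliefs follow equation 3.5, outgoing messages follow equation 3.8 with the log-domain damping of equation 3.9, and node beliefs follow equation 3.7.

```python
import numpy as np

def loopy_bp(factors, n_states, eps=0.5, iters=300):
    """Damped loopy belief propagation (equations 3.5-3.9).

    `factors` maps a tuple of node indices to an array over those nodes'
    states. Returns the single-node beliefs of equation 3.7."""
    m_fn = {(a, b): np.ones(n_states) for a in factors for b in a}  # mu_{alpha->beta}
    m_nf = {(a, b): np.ones(n_states) for a in factors for b in a}  # mu_{beta->alpha}
    for _ in range(iters):
        for a, psi in factors.items():
            for i, b in enumerate(a):
                # Q*_alpha: potential times incoming messages (equation 3.5)
                q = psi.copy()
                for j, b2 in enumerate(a):
                    shape = [1] * len(a)
                    shape[j] = n_states
                    q = q * m_nf[a, b2].reshape(shape)
                qb = q.sum(axis=tuple(j for j in range(len(a)) if j != i))
                qb = qb / qb.sum()
                # damped mu_{alpha->beta} = Q*_alpha(x_beta) / mu_{beta->alpha}
                # (equations 3.8 and 3.9: mixing with weight eps in the log domain)
                m_fn[a, b] = m_fn[a, b] ** (1 - eps) * (qb / m_nf[a, b]) ** eps
        for a in factors:
            for b in a:
                # mu_{beta->alpha}: product of the other factors' messages to beta
                msg = np.ones(n_states)
                for a2 in factors:
                    if b in a2 and a2 != a:
                        msg = msg * m_fn[a2, b]
                m_nf[a, b] = msg / msg.sum()
    beliefs = {}
    for b in {b for a in factors for b in a}:
        q = np.ones(n_states)
        for a in factors:
            if b in a:
                q = q * m_fn[a, b]
        beliefs[b] = q / q.sum()
    return beliefs
```

On a tree such as {(0, 1), (1, 2)} the resulting beliefs coincide with the exact marginals; on loopy graphs they are only approximations, and `eps` plays the role of ε in equation 3.9.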

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

$$\min_{Q_\alpha, Q_\beta}\; \max_{\lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta}\; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta).$$

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, are indeed convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters {Q_α, Q_β}. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in {Q_α, Q_β}.

4.2 Conditions for Convexity. The problem is with the term

$$S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta),$$

which is concave in Q_β. Using the constraint Q_β(x_β) = Q_α(x_β), we can turn it into a functional that is convex in Q_α and Q_β separately, but not necessarily jointly. That is, with the substitution Q_β(x_β) = Q_α(x_β) for any α ⊃ β, the entropy, and thus the Bethe free energy, is convex in Q_α and in Q_β, but not necessarily in {Q_α, Q_β}. However, if we add to S_β(Q_β) a convex entropy contribution,

$$-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha),$$


the combination of −S_α and S_β is convex in {Q_α, Q_β}, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The functional

$$F_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)$$

is convex in {Q_α, Q_β}.

Proof. The matrix with second derivatives of F_{αβ} has the components

$$H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha, X'_\alpha}$$

$$H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta}$$

$$H(x_\beta, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta},$$

where we note that X_α and x_β should be interpreted as indices. Convexity requires that for any "vector" (R_α(X_α), R_β(x_β)),

$$\begin{aligned} 0 & \le (R_\alpha(X_\alpha)\;\; R_\beta(x_\beta)) \begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix} \begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix} \\ & = \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2\sum_{X_\alpha} \frac{R_\alpha(X_\alpha)\, R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta)\, R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)} \\ & = \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigl[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\Bigr]^2. \end{aligned}$$
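The sum-of-squares identity at the heart of this proof is easy to verify numerically; the sketch below (with randomly drawn positive tables, an illustration added here) checks that the quadratic form equals the sum of squares and is therefore nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
# X_alpha = (x_beta, x_gamma), both binary: a random positive table Q_alpha
# and an independent random positive node marginal Q_beta.
Qa = rng.uniform(0.1, 1.0, size=(2, 2))
Qa /= Qa.sum()
Qb = rng.uniform(0.1, 1.0, size=2)
Qb /= Qb.sum()
Qa_b = Qa.sum(axis=1)                       # Q_alpha(x_beta), x_beta on axis 0

Ra = rng.normal(size=(2, 2))                # "vector" components R_alpha(X_alpha)
Rb = rng.normal(size=2)                     # and R_beta(x_beta)

# Quadratic form from the proof of lemma 1 ...
quad = (Ra ** 2 / Qa).sum() \
       - 2 * (Ra * (Rb / Qb)[:, None]).sum() \
       + (Qa_b * Rb ** 2 / Qb ** 2).sum()
# ... and its sum-of-squares rewriting.
sos = (Qa * (Ra / Qa - (Rb / Qb)[:, None]) ** 2).sum()

assert abs(quad - sos) < 1e-10 and quad >= 0.0
```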

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources Q_α log Q_α to compensate for the concave −Q_β log Q_β terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix A_{αβ} between potentials α and nodes β satisfying

1. A_{αβ} ≥ 0 ∀α, β ⊂ α (positivity),
2. Σ_{β⊂α} A_{αβ} ≤ 1 ∀α (sufficient amount of resources),
3. Σ_{α⊃β} A_{αβ} ≥ n_β − 1 ∀β (sufficient compensation). (4.1)


Proof. First, we note that we do not have to worry about the energy terms, which are linear in Q_α. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\Bigl[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1)\, S_\beta(Q_\beta)\Bigr],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation. For each concave contribution (n_β − 1)S_β, we have to find convex contributions −S_α to compensate for it. Let A_{αβ} denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

$$\begin{aligned} -S(Q) & = -\Bigl[\sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1) S_\beta\Bigr] \\ & = -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} + \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\beta \Bigl[-\sum_{\alpha\supset\beta} A_{\alpha\beta} + \sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr] S_\beta \\ & = -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta}\,\bigl[S_\alpha - S_\beta\bigr] - \sum_\beta \Bigl[\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr] S_\beta. \end{aligned}$$

Convexity of the first term is guaranteed if 1 − Σ_{β⊂α} A_{αβ} ≥ 0 (condition 2); of the second term if A_{αβ} ≥ 0 (condition 1 and lemma 1); and of the third term if Σ_{α⊃β} A_{αβ} − (n_β − 1) ≥ 0 (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root β* and define

A_{αβ} = 1 iff β ⊂ α and β is closer to the root β* than any other β′ ⊂ α; A_{αβ′} = 0 for all other β′.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β* there is just a single potential α ⊃ β that is closer to the root β* than β itself (see the illustration in Figure 2), and thus there are precisely n_β − 1 contributions A_{αβ} = 1. The root itself gets n_{β*} contributions A_{αβ*} = 1, which is even better. Hence, condition 3 is also satisfied:

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall \beta \neq \beta^* \quad\text{and}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node β* from a potential α*, such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β* as the root. The matrix A constructed in this way also works for the graph with the closed loop, since still

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall \beta \neq \beta^* \quad\text{and now}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.
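Whether an allocation matrix satisfying equation 4.1 exists is a transportation-type feasibility question and can be decided mechanically. The sketch below is an added illustration (the graph encoding and the max-flow routine are our own choices, not from the text): each potential supplies at most one unit, each node β demands n_β − 1 units, and an allocation matrix exists exactly when the maximum flow meets the total demand.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow; cap[u][v] is the residual capacity of edge u -> v."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap.setdefault(v, {})[u] = cap.get(v, {}).get(u, 0) + bottleneck
        flow += bottleneck

def allocation_exists(potentials):
    """Existence of A satisfying equation 4.1, as a flow problem:
    source -> potential (capacity 1), potential -> its nodes (capacity 1),
    node beta -> sink (capacity n_beta - 1). A exists iff the flow
    saturates the total demand sum_beta (n_beta - 1)."""
    n = {}
    for a in potentials:
        for b in a:
            n[b] = n.get(b, 0) + 1
    demand = sum(nb - 1 for nb in n.values())
    cap = {'s': {('f', i): 1 for i in range(len(potentials))}}
    for i, a in enumerate(potentials):
        cap['f', i] = {('n', b): 1 for b in a}
    for b, nb in n.items():
        cap['n', b] = {'t': nb - 1}
    return max_flow(cap, 's', 't') == demand

tree = [(0, 1), (1, 2), (1, 3)]
single_loop = [(0, 1), (1, 2), (2, 0)]
two_loops = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)]  # two loops sharing nodes 1 and 2
assert allocation_exists(tree) and allocation_exists(single_loop)
assert not allocation_exists(two_loops)
```

The two-connected-loops example fails simply because the total demand (6) exceeds the total supply (5 potentials), matching the resource-counting argument above.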

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − n_β, with n_β the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with A_{αβ} = 1 if there is an arrow and A_{αβ} = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely n_β − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − n_β. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (Adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

$$\sum_{\beta\in B} (1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \ge 0, \tag{4.2}$$

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.
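For small graphs, condition 4.2 can also be checked by brute force over all node subsets B. The sketch below (an added illustration; potentials are encoded as tuples of node labels) shows that a tree and a single loop pass while two connected loops fail, in line with the corollaries above:

```python
from itertools import combinations

def pakzad_condition(potentials):
    """Check condition 4.2 for every nonempty set of nodes B:
    sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0, with pi(B) the set of
    potentials containing at least one node of B."""
    nodes = sorted({b for a in potentials for b in a})
    n = {b: sum(1 for a in potentials if b in a) for b in nodes}
    for r in range(1, len(nodes) + 1):
        for B in combinations(nodes, r):
            parents = sum(1 for a in potentials if any(b in a for b in B))
            if sum(1 - n[b] for b in B) + parents < 0:
                return False
    return True

assert pakzad_condition([(0, 1), (1, 2), (1, 3)])                      # tree
assert pakzad_condition([(0, 1), (1, 2), (2, 0)])                      # single loop
assert not pakzad_condition([(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)])  # two connected loops
```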

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix A_{αβ} satisfying the conditions of equation 4.1. Then for any set B,

$$\sum_{\beta\in B} (n_\beta - 1) \le \sum_{\beta\in B}\sum_{\alpha\supset\beta} A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)}\sum_{\beta\subset\alpha} A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials Ψ_α(X_α) do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψ_α(X_α) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Q_α(x_β) = Q_β(x_β) to write the Bethe free energy in the "more convex" form

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha\supset\beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \tag{5.1}$$

where the allocation matrix A_{αβ} can be any matrix that satisfies

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Q_α alone. This then yields

$$\begin{aligned} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = {} & -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) \\ & - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta) \\ & + \sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Bigl[\frac{1}{n_\beta - 1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr] \\ & + \sum_\alpha \lambda_\alpha\Bigl[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \end{aligned} \tag{5.3}$$

³ We would like to conjecture that this is not possible: the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint Q_β(x_β) = Q_α(x_β), as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to Q_β. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in {Q_α, Q_β}, they are convex in Q_α and Q_β separately. Therefore, we can interchange the minimum over the pseudomarginals Q_α and the maximum over the Lagrange multipliers, as long as we leave the minimum over Q_β as the final operation:⁴

$$\min_{Q_\alpha, Q_\beta}\;\max_{\lambda_{\alpha\beta}, \lambda_\alpha}\; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\;\max_{\lambda_{\alpha\beta}, \lambda_\alpha}\;\min_{Q_\alpha}\; L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).$$

Rewriting

$$\sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Bigl[\frac{1}{n_\beta - 1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr] = -\sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta),$$

with

$$\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{1}{n_\beta - 1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\,\lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to Q_α:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr)\Bigr]. \tag{5.4}$$

⁴ In principle, we could also first take the minimum over Q_β and leave the minimum over Q_α, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual,"

$$\begin{aligned} G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) & \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\ & = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr)\Bigr] \\ & \quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \end{aligned} \tag{5.5}$$

Next, we find for the maximum with respect to λ_α,

$$\exp\bigl[1 - \lambda^*_\alpha\bigr] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\sum_{\beta\subset\alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr)\Bigr] \equiv Z^*_\alpha, \tag{5.6}$$

where we have to keep in mind that Z*_α by itself, like Q*_α, is a function of the remaining pseudomarginals Q_β and Lagrange multipliers λ_{αβ}. Substituting this solution into the dual, we arrive at

$$G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \tag{5.7}$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Q_α, has a unique minimum in Q_α (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*_α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ_{αβ} and then the minimum over the remaining pseudomarginals Q_β. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in Q_β. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Q_α, Q_β} implies convexity of the dual, equation 5.7, in Q_β.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\bigl(x, \tfrac{1}{2}(y^*(x+\delta) + y^*(x-\delta))\bigr) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of f in (x, y) and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Q_β when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Q_α, Q_β}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ_{αβ}, λ_α}. Next, we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from y*(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Q_β.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψ_α(X_α) ≤ Ψ_max for all α and X_α, the Bethe free energy is bounded from below on the set of constraints.

Proof. It is sufficient to prove that the function G(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) is bounded from below for a particular choice of A_{αβ} satisfying equation 5.2. Considering A_{αβ} = (n_β − 1)/n_β, we then have

$$\begin{aligned} G(Q_\beta) & \ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\sum_{\beta\subset\alpha}\frac{n_\beta - 1}{n_\beta}\log Q_\beta(x_\beta)\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr] \\ & \ge -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta) + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr] \\ & \ge -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta - 1}{n_\beta}\log\Bigl[\sum_{X_\alpha\setminus x_\beta} \Psi_{\max}\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[-\log\sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr] \\ & \ge -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta - 1}{n_\beta}\log\Bigl[\sum_{X_\alpha\setminus x_\beta} \Psi_{\max}\Bigr], \end{aligned}$$

where the first inequality follows by substituting the choice λ_{αβ}(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_{αβ}); the second from the concavity of the function y^{(n_β−1)/n_β}; and the third from the upper bound on the potentials.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta} A_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},$$


we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both β and β′ should be a subset of α, and with the convention Q*_α(x_β, x_β) = Q*_α(x_β) and Q*_α(x_β, x′_β) = 0 if x_β ≠ x′_β. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

$$H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha\supset\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\,\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta} + \sum_{\alpha\supset\beta} A^2_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)}$$

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha\supset\beta,\beta'} A_{\alpha\beta} A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad\text{for } \beta' \neq \beta,$$

where δ_{x_β, x′_β} = 1 if and only if x_β = x′_β. Here, it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_{αβ} and pseudomarginals Q_β.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Q_β, λ_{αβ}}, that is, conditions that guarantee

$$K \equiv \sum_{\beta,\beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \ge 0,$$

for any choice of the "vector" S with elements S_β(x_β). Straightforward manipulations yield

$$\begin{aligned} K & = \sum_{\beta,\beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \\ & = \sum_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) & (K_1) \\ & \quad + \sum_\alpha \sum_{\beta,\beta'\subset\alpha}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) & (K_2) \\ & \quad - \sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), & (K_3) \end{aligned}$$

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha \Bigl[\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta)\Bigr]^2 \ge 0$$

and⁵

$$\begin{aligned} K_3 & = \sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\neq\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'}) \Bigl\{\frac{1}{2}\bigl[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\bigr]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'})\Bigr\} \\ & \ge -\sum_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Bigl(\sum_{\beta'\subset\alpha} A_{\alpha\beta'} - A_{\alpha\beta}\Bigr) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \end{aligned} \tag{6.1}$$

we have

K = K1 + K2 + K3 gesumα

sumβsubα

sumxβ

Aαβ

(1minus

sumβ primesubα

Aαβ prime

)Qlowastα(xβ)R

2β(xβ)

That is sufficient conditions for K to be nonnegative are

Aαβ ge 0 forallαβsubα andsumβsubα

Aαβ le 1 forallα

precisely the conditions for theorem 1

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that this is not the case here: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:^6

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta) \, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.
\]

^5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

^6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

\[
K_2 + K_3 = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta} \, Q^*_\alpha(x_\beta) \, Q^*_\alpha(x'_\beta) \, R_\beta(x_\beta) R_\beta(x'_\beta) \tag{K2}
\]
\[
\quad - \sum_\alpha \sum_{\beta, \beta' \subset \alpha;\, \beta' \neq \beta} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \big[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) \, Q^*_\alpha(x'_{\beta'}) \big] R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) \tag{K3}
\]

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \geq 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

\[
\sum_{\alpha' \supset \beta;\, \alpha' \neq \alpha} A_{\alpha'\beta} + 1 \geq n_\beta - 1.
\]

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \geq 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) \, Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha \, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha;\, \beta' \neq \beta} \ \forall_{x_\beta, x'_{\beta'}} \tag{7.1}
\]

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \bar{X}_\alpha} \Big[ \psi_\alpha(X_\alpha) + (n_\alpha - 1) \, \psi_\alpha(\bar{X}_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\bar{X}_{\alpha \setminus \beta}, x_\beta) \Big], \tag{7.2}
\]

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$ and $(\bar{X}_{\alpha \setminus \beta}, x_\beta)$ denotes $\bar{X}_\alpha$ with the value of node $\beta$ replaced by $x_\beta$.
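Since equation 7.2 is a finite maximization over pairs of joint states, the strength of a small potential table can be computed by brute force. The following sketch is ours, not from the article (the function name `potential_strength` is made up); it enumerates all pairs $(X, \bar{X})$:

```python
import itertools
import numpy as np

def potential_strength(psi):
    """Brute-force the strength omega (and sigma) of a potential, eq. 7.2.

    psi: n-dimensional array of log-potentials psi(X) = log Psi(X),
    one axis per node in the potential. Returns (omega, sigma).
    """
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    omega = 0.0  # the maximum is always nonnegative (X = Xbar gives W = 0)
    for X in itertools.product(*states):          # first argument of W
        for Xbar in itertools.product(*states):   # second argument of W
            # W(X, Xbar) = psi(X) + (n-1) psi(Xbar) - sum_b psi(Xbar with node b <- x_b)
            w = psi[X] + (n - 1) * psi[Xbar]
            for b in range(n):
                Xmix = list(Xbar)
                Xmix[b] = X[b]
                w -= psi[tuple(Xmix)]
            omega = max(omega, w)
    return omega, 1.0 - np.exp(-omega)
```

For a pairwise binary Boltzmann factor $\psi(x_1, x_2) = w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2$ this reproduces $\omega = |w|$, independently of the thresholds, in agreement with the properties derived in section 7.2.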


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

\[
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \Big[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \Big] \mu_\beta(x_\beta)} = \frac{\Psi(X) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})}, \tag{7.3}
\]

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

\[
V(\mu) = (n-1) \log \Big[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \Big] - \sum_{\beta=1}^n \log \Big[ \sum_{X_{\setminus \beta}} \Psi(X_{\setminus \beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \Big]
\]

with respect to the messages $\mu$, under the constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \geq 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point, $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar{x}_\beta}$, for some $\bar{x}_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey (up to irrelevant positive factors)

\[
\frac{\partial V}{\partial \mu_1(x_1)} = (n-1) \, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)
\]

\[
\frac{\partial^2 V}{\partial \mu_1(x_1) \, \partial \mu_1(x'_1)} = -(n-1) \, Q(x_1) Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) \, Q(x'_1 | x^*_\beta)
\]


where

\[
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
\]

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

\[
(n-1) \, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda,
\]

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix of second derivatives at such an extremum, we then have

\[
\frac{\partial^2 V}{\partial \mu_1(x_1) \, \partial \mu_1(x'_1)} = \frac{1}{2(n-1)} \sum_{\beta \neq 1} \sum_{\beta' \neq 1;\, \beta' \neq \beta} \big[ Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'}) \big] \big[ Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'}) \big],
\]

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\bar{X}$:

\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar{X}} \frac{\Psi(X) \big[ \Psi(\bar{X}) \big]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus \beta}, x_\beta)}.
\]

Next, we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

\[
\frac{1}{1-\sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \bar{X}} \frac{\Psi(X) \big[ \Psi(\bar{X}) \big]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus \beta}, x_\beta)}. \tag{7.4}
\]


The inequality 7.1 then follows by summing out $X_{\setminus \beta\beta'}$ in

\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma \, Q^*(X).
\]

The form of equation 7.2 then follows by rewriting equation 7.4 as

\[
\omega \equiv -\log(1-\sigma) = \max_{X, \bar{X}} W(X, \bar{X}) \quad \text{with} \quad W(X, \bar{X}) = \psi(X) + (n-1) \, \psi(\bar{X}) - \sum_\beta \psi(\bar{X}_{\setminus \beta}, x_\beta),
\]

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is, if $\tilde{\Psi}(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta)$, then $\omega(\tilde{\Psi}) = \omega(\Psi)$ for any choice of $\mu$. This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar{X}$ that differ in fewer than two nodes. To see this, consider

\[
W(x_1, x_2, x_{\setminus 12};\, \bar{x}_1, \bar{x}_2, x_{\setminus 12}) = \psi(x_1, x_2, x_{\setminus 12}) + \psi(\bar{x}_1, \bar{x}_2, x_{\setminus 12}) - \psi(x_1, \bar{x}_2, x_{\setminus 12}) - \psi(\bar{x}_1, x_2, x_{\setminus 12}) = -W(x_1, \bar{x}_2, x_{\setminus 12};\, \bar{x}_1, x_2, x_{\setminus 12}).
\]

If now also $x_2 = \bar{x}_2$, we get $W(x_1, x_2, x_{\setminus 12};\, \bar{x}_1, x_2, x_{\setminus 12}) = -W(x_1, x_2, x_{\setminus 12};\, \bar{x}_1, x_2, x_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, x_{\setminus 12};\, \bar{x}_1, \bar{x}_2, x_{\setminus 12}) \leq 0$, then it must be that $W(x_1, \bar{x}_2, x_{\setminus 12};\, \bar{x}_1, x_2, x_{\setminus 12}) \geq 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \leq \omega < \infty$ and $0 \leq \sigma < 1$.

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

\[
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \tag{7.5}
\]

Any pairwise binary potential can be written as a Boltzmann factor,

\[
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
\]

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models, there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \geq 0 \ \ \forall_{\alpha,\, \beta \subset \alpha}$ (positivity),

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \ \ \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \geq n_\beta - 1 \ \ \forall_\beta$ (sufficient compensation), \hfill (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$ as defined in equation 7.2.
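Once the max in condition 2 is expanded per node, requiring $(1-\sigma_\alpha)A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \leq 1$ for every $\beta \subset \alpha$, the existence of an allocation matrix is a linear feasibility problem. Below is a sketch of such a check; the graph encoding and function name are ours, and SciPy's `linprog` (assumed available) is used with a zero objective as a pure feasibility solver.

```python
import numpy as np
from scipy.optimize import linprog

def uniqueness_certificate(potentials, n_nodes, sigma):
    """Test feasibility of the conditions of theorem 4 (eq. 8.1).

    potentials: list of tuples of node indices (the nodes beta in each alpha)
    sigma: list of strengths sigma_alpha, one per potential
    Returns True if an allocation matrix A satisfying eq. 8.1 exists.
    """
    # one LP variable per (alpha, beta) incidence
    index = {}
    for a, nodes in enumerate(potentials):
        for b in nodes:
            index[(a, b)] = len(index)
    nvar = len(index)
    A_ub, b_ub = [], []
    # condition 2, linearized: (1-s) A_ab + s * sum_b' A_ab' <= 1 for each b in alpha
    for a, nodes in enumerate(potentials):
        for b in nodes:
            row = np.zeros(nvar)
            for b2 in nodes:
                row[index[(a, b2)]] += sigma[a]
            row[index[(a, b)]] += 1.0 - sigma[a]
            A_ub.append(row)
            b_ub.append(1.0)
    # condition 3: sum_{alpha containing beta} A_ab >= n_beta - 1
    for b in range(n_nodes):
        row = np.zeros(nvar)
        n_b = 0
        for a, nodes in enumerate(potentials):
            if b in nodes:
                row[index[(a, b)]] = -1.0
                n_b += 1
        A_ub.append(row)
        b_ub.append(-(n_b - 1))
    # condition 1 (A >= 0) via variable bounds; zero objective = feasibility only
    res = linprog(np.zeros(nvar), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0, None), method="highs")
    return res.status == 0
```

On the $3 \times 3$ toroidal Ising grid of section 8.3, this certificate succeeds for $\sigma = 1/3$ (with the uniform allocation $A = 3/4$) and fails just above it, matching the hand analysis there.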

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \geq 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

\[
K_3 \geq -\sum_\alpha \sum_{\beta, \beta' \subset \alpha;\, \beta' \neq \beta} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \, \sigma_\alpha \, Q^*_\alpha(x_\beta, x'_{\beta'}) \, R_\beta(x_\beta) R_{\beta'}(x'_{\beta'})
\]
\[
\geq -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big( \sum_{\beta' \subset \alpha;\, \beta' \neq \beta} A_{\alpha\beta'} \Big) Q^*_\alpha(x_\beta) \, R^2_\beta(x_\beta),
\]

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \geq 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

\[
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \Big] Q^*_\alpha(x_\beta) \, R^2_\beta(x_\beta).
\]

This implies

\[
(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \leq 1 \quad \forall_{\alpha,\, \beta \subset \alpha},
\]

which in combination with $A_{\alpha\beta} \geq 0$ and $\sigma_\alpha \leq 1$ yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \Big( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Big) < 2 \quad \forall_\beta. \tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

\[
\tilde{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
\]

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1, \phi_2} \Big\{ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \Big\} = \min_{\phi_1, \phi_2} \Big\{ \max_{x_1, x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] - \min_{x_1, x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] \Big\}.
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

\[
\psi(\bar{x}_1, \bar{x}_2) + \phi_1(\bar{x}_1) + \phi_2(\bar{x}_2) = \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \geq \psi(x_1, \bar{x}_2) + \phi_1(x_1) + \phi_2(\bar{x}_2) = \psi(\bar{x}_1, x_2) + \phi_1(\bar{x}_1) + \phi_2(x_2), \tag{8.4}
\]

for some $x_1$, $x_2$, $\bar{x}_1$, and $\bar{x}_2$ with $\bar{x}_1 \neq x_1$ and $\bar{x}_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\phi_1(x_1) - \phi_1(\bar{x}_1) = \tfrac{1}{2} \big[ \psi(\bar{x}_1, x_2) - \psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(x_1, \bar{x}_2) \big]
\]
\[
\phi_2(x_2) - \phi_2(\bar{x}_2) = \tfrac{1}{2} \big[ \psi(x_1, \bar{x}_2) - \psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(\bar{x}_1, x_2) \big].
\]

Substitution back into equation 8.4 yields

\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \bar{x}_2) - \phi_1(x_1) - \phi_2(\bar{x}_2) = \tfrac{1}{2} \big[ \psi(x_1, x_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(x_1, \bar{x}_2) - \psi(\bar{x}_1, x_2) \big],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1, \phi_2} \Big\{ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \Big\} = \tfrac{1}{2} \big| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \big| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.

Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \quad \forall_\beta, \tag{8.5}
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)

\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha(1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \geq \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \leq 1. \tag{8.6}
\]

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: for binary pairwise potentials, the conditions in Tatikonda and Jordan (2002), when strengthened as above, are at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other with "spins down."

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2 \log \Big[ \frac{\alpha}{1-\alpha} \Big] \quad \text{and thus} \quad \sigma = 1 - \Big( \frac{1-\alpha}{\alpha} \Big)^2.
\]


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad \text{and thus} \quad \alpha < \frac{1}{1 + e^{-1/2}} \approx 0.62.
\]
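Both thresholds follow in closed form from the strength formula above and are easy to verify numerically (plain Python; the variable names are ours):

```python
import math

# Theorem 4 with the uniform allocation A = 3/4 requires sigma <= 1/3;
# sigma = 1 - ((1-alpha)/alpha)^2 <= 1/3  <=>  alpha <= 1/(1 + sqrt(2/3))
alpha_thm4 = 1.0 / (1.0 + math.sqrt(2.0 / 3.0))

# Corollary 3 requires omega = 2*log(alpha/(1-alpha)) < 1
# <=>  alpha < 1/(1 + exp(-1/2))
alpha_cor3 = 1.0 / (1.0 + math.exp(-0.5))

print(round(alpha_thm4, 3), round(alpha_cor3, 3))  # 0.551 0.622
```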

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2 - 2A)\sigma + \frac{3}{4} \leq 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \leq 1.
\]

The optimal choice for $A$ is the one for which both conditions turn out to be identical. In this way, we obtain $A = 3/4$, yielding

\[
\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

\[
\sigma \leq \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
\]

better than the $\alpha < 0.62$ from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:

• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides their theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(\{Q_\beta, \lambda_{\alpha\beta}\})$ with respect to $Q_\beta$. But in fact, we need only $G(\{Q_\beta\}) \equiv \max_{\lambda_{\alpha\beta}} G(\{Q_\beta, \lambda_{\alpha\beta}\})$ to be convex, which is a weaker requirement. The Hessian of $G(\{Q_\beta\})$, however, appears to be more difficult to compute and to analyze in general, but it may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \ \text{if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \ \text{if } x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
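The small-weight regime of this experiment is easy to reproduce. The sketch below is our own implementation: it runs damped parallel sum-product on the four-node Boltzmann machine above, with a simple linear damping $m \leftarrow (1-\epsilon)m + \epsilon\, m_{\text{new}}$ standing in for equation 3.9 (which is not reproduced in this excerpt). For small $\omega$, it settles at the trivial fixed point $P_i(x_i) = 0.5$.

```python
import numpy as np

W0 = np.array([[0, 1, -1, -1],
               [1, 0, 1, -1],
               [-1, 1, 0, -1],
               [-1, -1, -1, 0]], dtype=float)

def run_bp(omega, step=0.5, iters=2000, tol=1e-10, seed=0):
    """Damped parallel sum-product on the four-node Boltzmann machine.

    Potentials: Psi_ij = exp(+w_ij/4) if x_i == x_j, exp(-w_ij/4) otherwise.
    Returns (beliefs P_i(x_i = 1), converged_flag).
    """
    w = omega * W0
    n = 4
    rng = np.random.default_rng(seed)
    # one message m[(i, j)] over x_j for each ordered pair i != j
    m = {(i, j): rng.uniform(0.4, 0.6, size=2)
         for i in range(n) for j in range(n) if i != j}
    for key in m:
        m[key] /= m[key].sum()
    psi = {(i, j): np.array([[np.exp(w[i, j] / 4), np.exp(-w[i, j] / 4)],
                             [np.exp(-w[i, j] / 4), np.exp(w[i, j] / 4)]])
           for i in range(n) for j in range(n) if i != j}
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            # product of messages into i from all neighbors except j
            prod = np.ones(2)
            for k in range(n):
                if k != i and k != j:
                    prod *= m[(k, i)]
            msg = psi[(i, j)].T @ prod   # sum over x_i
            msg /= msg.sum()
            new[(i, j)] = (1 - step) * m[(i, j)] + step * msg
        delta = max(np.abs(new[k] - m[k]).max() for k in m)
        m = new
        if delta < tol:
            break
    beliefs = []
    for i in range(n):
        b = np.ones(2)
        for k in range(n):
            if k != i:
                b *= m[(k, i)]
        b /= b.sum()
        beliefs.append(b[1])
    return np.array(beliefs), delta < tol
```

With, say, `run_bp(omega=1.0)`, the messages contract to the uniform fixed point; probing larger $\omega$ and different `step` values reproduces the step-size-dependent transition discussed above.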

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

^7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

\[
F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha) \, \psi_\alpha(X_\alpha) + \sum_X P(X) \log P(X),
\]

with shorthand $\psi \equiv \log \Psi$. Minimizing this variational free energy over the set $\mathcal{P}$ of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum, and minus the log of the partition function as the value at the minimum:

\[
P_{\text{exact}} = \mathop{\mathrm{argmin}}_{P \in \mathcal{P}} F(P) \quad \text{and} \quad -\log Z = \min_{P \in \mathcal{P}} F(P).
\]

Since the Gibbs-Helmholtz free energy is convex in $P$, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself, we have not gained anything: the entropy may still be intractable to compute.

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

\[
\sum_X P(X) \log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha) \log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta) \log P(x_\beta),
\]

with $x_\beta$ a (super)node and $n_\beta = \sum_{\alpha \supset \beta} 1$ the number of potentials that contain node $x_\beta$. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes $x_\beta$ are themselves subsets of the potential subsets, that is,

\[
x_\beta \cap X_\alpha = \emptyset \quad \text{or} \quad x_\beta \cap X_\alpha = x_\beta \quad \forall_{\alpha, \beta},
\]

and partition the domain $X$:

\[
x_\beta \cap x_{\beta'} = \emptyset \quad \forall_{\beta, \beta'} \quad \text{and} \quad \bigcup_\beta x_\beta = X.
\]

Typically, the $x_\beta$ are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by $\beta$ and $x_\beta$ in lowercase, to contrast them with the potentials $\alpha$ and potential subsets $X_\alpha$ in uppercase.
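For a tree-structured distribution, this entropy approximation is in fact exact, which is one way to see why belief propagation is exact on trees. A small numerical check on a three-node chain (our code, not from the article; the joint is built from two hypothetical pairwise potentials):

```python
import numpy as np

rng = np.random.default_rng(1)
psi12 = rng.uniform(0.5, 2.0, (2, 2))   # potential on (x1, x2)
psi23 = rng.uniform(0.5, 2.0, (2, 2))   # potential on (x2, x3)

# joint distribution over the chain x1 - x2 - x3
P = np.einsum('ab,bc->abc', psi12, psi23)
P /= P.sum()

def H(p):
    """Shannon entropy of a (flattened) distribution table."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

exact = H(P)
# Bethe entropy: pair entropies minus (n_beta - 1) node entropies;
# here only x2 has n_beta = 2, so only H(P2) is discounted
bethe = H(P.sum(axis=2)) + H(P.sum(axis=0)) - H(P.sum(axis=(0, 2)))
assert abs(exact - bethe) < 1e-12
```

On a graph with cycles, the same computation generally yields `bethe != exact`, which is exactly the approximation the Bethe free energy makes.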


Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta) \tag{3.2}$$

over sets of "pseudomarginals"¹ or beliefs $\{Q_\alpha, Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,²

$$\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad\text{and}\quad Q_\alpha(x_\beta) \equiv \sum_{X_{\alpha \setminus \beta}} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \tag{3.3}$$

Let $\mathcal{Q}$ denote all sets of consistent and properly normalized pseudomarginals. Then our goal is to solve

$$\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).$$

The hope is that the pseudomarginals at this minimum are accurate approximations of the exact marginals $P_{\text{exact}}(X_\alpha)$ and $P_{\text{exact}}(x_\beta)$.
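As an illustration (not from the paper; the chain model below is invented), the following sketch evaluates equation 3.2 on a small tree, where the Bethe approximation of the entropy is exact, and checks that plugging in the exact marginals reproduces $-\log Z$:

```python
import numpy as np

# Toy tree: binary nodes 0-1-2, pairwise potentials on edges (0,1) and (1,2).
Psi = {(0, 1): np.array([[2.0, 0.5], [0.5, 2.0]]),
       (1, 2): np.array([[1.5, 0.8], [0.8, 1.5]])}

# Exact joint P(x0,x1,x2) proportional to Psi01[x0,x1] * Psi12[x1,x2].
joint = np.einsum('ij,jk->ijk', Psi[(0, 1)], Psi[(1, 2)])
Z = joint.sum()
P = joint / Z

# Exact pair and single-node marginals.
Q_alpha = {(0, 1): P.sum(axis=2), (1, 2): P.sum(axis=0)}
Q_beta = {0: P.sum(axis=(1, 2)), 1: P.sum(axis=(0, 2)), 2: P.sum(axis=(0, 1))}
n_beta = {0: 1, 1: 2, 2: 1}  # number of potentials containing each node

def bethe_free_energy(Q_alpha, Q_beta, Psi, n_beta):
    """Equation 3.2: energy + potential entropies - overcounted node entropies."""
    F = 0.0
    for a, Qa in Q_alpha.items():
        F += float(-(Qa * np.log(Psi[a])).sum() + (Qa * np.log(Qa)).sum())
    for b, Qb in Q_beta.items():
        F -= (n_beta[b] - 1) * float((Qb * np.log(Qb)).sum())
    return F

# On a tree the Bethe free energy is exact: evaluated at the exact
# marginals, it equals -log Z.
assert np.isclose(bethe_free_energy(Q_alpha, Q_beta, Psi, n_beta), -np.log(Z))
print("Bethe free energy matches -log Z on the tree")
```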

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported by Yedidia et al. (2001). It starts with the Lagrangian

$$L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = F(Q_\alpha, Q_\beta) + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\bigl[Q_\beta(x_\beta) - Q_\alpha(x_\beta)\bigr] + \sum_\alpha \lambda_\alpha \Bigl[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigr] + \sum_\beta \lambda_\beta \Bigl[1 - \sum_{x_\beta} Q_\beta(x_\beta)\Bigr]. \tag{3.4}$$

¹ Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

² Strictly speaking, we also have to take inequality constraints into account, namely those of the form $Q_\alpha(X_\alpha) \ge 0$. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active." For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of $L$ are zero: the ones with respect to the Lagrange multipliers $\lambda$ give back the constraints; the ones with respect to the pseudomarginals $Q$ give an extremum of the Bethe free energy. Setting the derivatives with respect to $Q_\alpha$ and $Q_\beta$ to zero, we can solve for $Q_\alpha$ and $Q_\beta$ in terms of the Lagrange multipliers:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \lambda_{\alpha\beta}(x_\beta)\Bigr]$$

$$Q^*_\beta(x_\beta) = \exp\Bigl[\frac{1}{n_\beta - 1}\Bigl(1 - n_\beta - \lambda_\beta + \sum_{\alpha \supset \beta} \lambda_{\alpha\beta}(x_\beta)\Bigr)\Bigr].$$

In terms of the "message" $\mu_{\beta \to \alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]$ from node $\beta$ to potential $\alpha$, the pseudomarginal $Q^*_\alpha(X_\alpha)$ reads

$$Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta \subset \alpha} \mu_{\beta \to \alpha}(x_\beta), \tag{3.5}$$

where proper normalization yields the Lagrange multiplier $\lambda_\alpha$. With the definition

$$\mu_{\alpha \to \beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}, \tag{3.6}$$

the fixed-point equation for $Q^*_\beta(x_\beta)$ can, after some manipulation, be written in the form

$$Q^*_\beta(x_\beta) \propto \prod_{\alpha \supset \beta} \mu_{\alpha \to \beta}(x_\beta), \tag{3.7}$$

where again the Lagrange multiplier $\lambda_\beta$ follows from normalization. Finally, the constraint $Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)$, in combination with equation 3.6, suggests the update

$$\mu_{\alpha \to \beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta \to \alpha}(x_\beta)}. \tag{3.8}$$

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

$$\log \mu^{\text{new}}_{\alpha \to \beta}(x_\beta) = \log \mu_{\alpha \to \beta}(x_\beta) + \epsilon\bigl[\log Q^*_\alpha(x_\beta) - \log \mu_{\beta \to \alpha}(x_\beta) - \log \mu_{\alpha \to \beta}(x_\beta)\bigr]. \tag{3.9}$$

Summarizing: loopy belief propagation is equivalent to fixed-point iteration of these equations, and its fixed points correspond to the zero derivatives of the Lagrangian.
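Equations 3.5 through 3.9 can be turned into a small message-passing loop. The sketch below is illustrative only: the triangle graph, the potentials, the damping factor $\epsilon$, and the synchronous schedule are my own choices, not prescribed by the paper.

```python
import numpy as np

# Toy pairwise model: three binary nodes in a single loop (a triangle).
edges = [(0, 1), (1, 2), (0, 2)]
Psi = {e: np.array([[2.0, 0.6], [0.6, 1.2]]) for e in edges}

# mu_a2b[(e, b)]: message from potential (edge) e to node b.
# mu_b2a[(b, e)]: message from node b to potential e.
mu_a2b = {(e, b): np.ones(2) for e in edges for b in e}
mu_b2a = {(b, e): np.ones(2) for e in edges for b in e}

eps = 0.5  # damping step size in the log domain, equation 3.9
for _ in range(200):
    for e in edges:
        for b in e:
            other = e[0] if b == e[1] else e[1]
            # Q*_alpha(x_b): potential times incoming messages (equation 3.5),
            # marginalized onto node b.
            P = Psi[e] if b == e[0] else Psi[e].T
            Q = (P @ mu_b2a[(other, e)]) * mu_b2a[(b, e)]
            Q /= Q.sum()
            # Damped update mu_{alpha->beta} = Q*_alpha / mu_{beta->alpha},
            # equations 3.8 and 3.9.
            new = np.exp((1 - eps) * np.log(mu_a2b[(e, b)])
                         + eps * (np.log(Q) - np.log(mu_b2a[(b, e)])))
            mu_a2b[(e, b)] = new / new.sum()
    for e in edges:
        for b in e:
            # mu_{beta->alpha}: product of the other incoming messages
            # (equations 3.6 and 3.7).
            m = np.ones(2)
            for e2 in edges:
                if b in e2 and e2 != e:
                    m *= mu_a2b[(e2, b)]
            mu_b2a[(b, e)] = m / m.sum()

# Single-node beliefs, equation 3.7.
beliefs = {}
for b in range(3):
    q = np.ones(2)
    for e in edges:
        if b in e:
            q *= mu_a2b[(e, b)]
    beliefs[b] = q / q.sum()
print(beliefs[0])
```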

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

$$\min_{Q_\alpha, Q_\beta} \;\max_{\lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta).$$

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, are indeed convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters $\{Q_\alpha, Q_\beta\}$. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in $\{Q_\alpha, Q_\beta\}$.

4.2 Conditions for Convexity. The problem is with the term

$$S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta),$$

which is concave in $Q_\beta$. Using the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, we can turn it into a functional that is convex in $Q_\alpha$ and $Q_\beta$ separately, but not necessarily jointly. That is, with the substitution $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$ for any $\alpha \supset \beta$, the entropy and thus the Bethe free energy is convex in $Q_\alpha$ and in $Q_\beta$, but not necessarily in $\{Q_\alpha, Q_\beta\}$. However, if we add to $S_\beta(Q_\beta)$ a convex entropy contribution,

$$-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha),$$

the combination of $-S_\alpha$ and $S_\beta$ is convex in $\{Q_\alpha, Q_\beta\}$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The functional

$$\Delta_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)$$

is convex in $\{Q_\alpha, Q_\beta\}$.

Proof. The matrix of second derivatives of $\Delta_{\alpha\beta}$ has the components

$$H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha, X'_\alpha}$$

$$H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta}$$

$$H(x_\beta, x'_\beta) \equiv \frac{\partial^2 \Delta_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta},$$

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,

$$0 \le \bigl(R_\alpha \;\; R_\beta\bigr)\begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix}\begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix}$$
$$= \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2\sum_{X_\alpha} \frac{R_\alpha(X_\alpha) R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta) R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)}$$
$$= \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigl[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\Bigr]^2,$$

which is manifestly nonnegative.

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall_\alpha, \forall_{\beta \subset \alpha}$ (positivity),
2. $\sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation). (4.1)


Proof. First, we note that we do not have to worry about the energy terms, which are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\Bigl[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1) S_\beta(Q_\beta)\Bigr],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta - 1)S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset $\alpha$ to compensate for node $\beta$. Now, in shorthand notation and with a little bit of rewriting,

$$-S(Q) = -\Bigl[\sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1) S_\beta\Bigr]$$
$$= -\sum_\alpha \Bigl(1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} + \sum_{\beta \subset \alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\beta \Bigl(-\sum_{\alpha \supset \beta} A_{\alpha\beta} + \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr) S_\beta$$
$$= -\sum_\alpha \Bigl(1 - \sum_{\beta \subset \alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta}\bigl[S_\alpha - S_\beta\bigr] - \sum_\beta \Bigl[\sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr] S_\beta.$$

Convexity of the first term is guaranteed if $1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.
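Conditions 4.1 form a linear feasibility problem in the entries $A_{\alpha\beta}$, which for a given factor graph can be checked mechanically. One way to do so, sketched below, recasts it as a max-flow problem (this reformulation is my own, not from the paper): the source feeds each factor with capacity 1, each factor feeds its member nodes, and each node $\beta$ feeds the sink with capacity $n_\beta - 1$; an allocation matrix exists iff the flow saturates the total demand $\sum_\beta (n_\beta - 1)$.

```python
from collections import defaultdict, deque

def bethe_convexity_certificate(factors, n_nodes):
    """Check conditions 4.1 via max flow: an allocation matrix A exists iff
    flow source -> factors (capacity 1) -> member nodes -> sink
    (capacity n_b - 1) saturates the total demand sum_b (n_b - 1)."""
    n_b = [sum(b in f for f in factors) for b in range(n_nodes)]
    S, T = "s", "t"
    cap = defaultdict(lambda: defaultdict(float))
    for a, f in enumerate(factors):
        cap[S][("f", a)] = 1.0
        for b in f:
            cap[("f", a)][("n", b)] = 1.0
    for b in range(n_nodes):
        cap[("n", b)][T] = float(max(n_b[b] - 1, 0))
    demand = sum(max(n - 1, 0) for n in n_b)

    def augmenting_path():
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 1e-9 and v not in parent:
                    parent[v] = u
                    if v == T:
                        return parent
                    queue.append(v)
        return None

    flow = 0.0
    while (parent := augmenting_path()) is not None:
        # Walk back from sink to source, find the bottleneck, push flow.
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= push
            cap[w][u] += push
        flow += push
    return flow >= demand - 1e-9

# A tree (corollary 1) and a single loop (corollary 2) pass ...
assert bethe_convexity_certificate([(0, 1), (1, 2)], 3)
assert bethe_convexity_certificate([(0, 1), (1, 2), (0, 2)], 3)
# ... but two connected loops have too little capacity.
assert not bethe_convexity_certificate(
    [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)], 4)
print("conditions 4.1 checked")
```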

4.3 Some Implications.

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root $\beta^*$ and define

$$A_{\alpha\beta} = 1 \;\;\text{iff}\;\; \beta \subset \alpha \text{ and } \beta \text{ is closer to the root } \beta^* \text{ than any other } \beta' \subset \alpha; \qquad A_{\alpha\beta'} = 0 \text{ for all other } \beta'.$$

Obviously, this choice of $A$ satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each $\beta \neq \beta^*$ there is just a single potential $\alpha \supset \beta$ that is closer to the root $\beta^*$ than $\beta$ itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence condition 3 is also satisfied:

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall_{\beta \neq \beta^*} \quad\text{and}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of $A$, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node $\beta^*$ from a potential $\alpha^*$, such that a singly connected structure is left. Construct a matrix $A$ as in the proof of corollary 1, taking the node $\beta^*$ as the root. The matrix $A$ constructed in this way also works for the graph with the closed loop, since still

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall_{\beta \neq \beta^*}, \quad\text{and now}\quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, $1 - n_\beta$, with $n_\beta$ the number of neighboring potentials. The arrows pointing from potentials $\alpha$ to nodes $\beta$ visualize the allocation matrix $A$, with $A_{\alpha\beta} = 1$ if there is an arrow and $A_{\alpha\beta} = 0$ otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely $n_\beta - 1$ incoming arrows, just sufficient to compensate the overcounting number $1 - n_\beta$. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes $B$ we have

$$\sum_{\beta \in B} (1 - n_\beta) + \sum_{\alpha \in \pi(B)} 1 \ge 0, \tag{4.2}$$

where $\pi(B) \equiv \{\alpha : \exists_{\beta \in B}\; \beta \subset \alpha\}$ denotes the "parent" set of $B$, that is, those potential subsets that include at least one node in $B$.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then, for any set $B$,

$$\sum_{\beta \in B} (n_\beta - 1) \le \sum_{\beta \in B} \sum_{\alpha \supset \beta} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} \sum_{\beta \subset \alpha} A_{\alpha\beta} \le \sum_{\alpha \in \pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
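On small graphs, the condition of theorem 2 can also be verified by brute force over all node subsets $B$, which gives an independent cross-check of the allocation view of theorem 1. A sketch (the graph encoding below is my own illustration):

```python
from itertools import combinations

def pakzad_anantharam_condition(factors, n_nodes):
    """Equation 4.2: for every node set B,
    sum_{beta in B} (1 - n_beta) + |parents(B)| >= 0."""
    n_b = [sum(b in f for f in factors) for b in range(n_nodes)]
    for size in range(1, n_nodes + 1):
        for B in combinations(range(n_nodes), size):
            parents = sum(1 for f in factors if any(b in f for b in B))
            if sum(1 - n_b[b] for b in B) + parents < 0:
                return False
    return True

# Single loop: condition holds (corollary 2).
assert pakzad_anantharam_condition([(0, 1), (1, 2), (0, 2)], 3)
# Two connected loops: the condition fails for B = all four nodes.
assert not pakzad_anantharam_condition(
    [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)], 4)
print("theorem 2 condition checked")
```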

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \tag{5.1}$$

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}$$

Second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

$$L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)$$
$$+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Bigl[\frac{1}{n_\beta - 1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr]$$
$$+ \sum_\alpha \lambda_\alpha \Bigl[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \tag{5.3}$$

³ We would like to conjecture that this is not possible: that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:⁴

$$\min_{Q_\alpha, Q_\beta}\;\max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\;\max_{\lambda_{\alpha\beta}, \lambda_\alpha}\;\min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).$$

Rewriting

$$\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\Bigl[\frac{1}{n_\beta - 1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \tilde{\lambda}_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),$$

with

$$\tilde{\lambda}_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1}\sum_{\alpha' \supset \beta} \lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to $Q_\alpha$:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta)\bigr)\Bigr]. \tag{5.4}$$

⁴ In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta)\bigr)\Bigr] + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \tag{5.5}$$

Next, we find for the maximum with respect to $\lambda_\alpha$,

$$\exp\bigl[1 - \lambda^*_\alpha\bigr] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\sum_{\beta \subset \alpha}\bigl(A_{\alpha\beta}\log Q_\beta(x_\beta) + \tilde{\lambda}_{\alpha\beta}(x_\beta)\bigr)\Bigr] \equiv Z^*_\alpha, \tag{5.6}$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

$$G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]. \tag{5.7}$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters, and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$, and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \mathop{\mathrm{argmin}}_y f(x, y)$,

$$f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \ge 2 f\bigl(x, [y^*(x + \delta) + y^*(x - \delta)]/2\bigr) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of a convex or concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \mathop{\mathrm{argmax}}_y f(x, y)$,

$$f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \ge f(x + \delta, y^*(x)) + f(x - \delta, y^*(x)) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$ and the second inequality from the convexity of $f(x, y)$ in $x$. Hence the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta - 1}{n_\beta}$, we then have

$$G(Q_\beta) \ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\exp\Bigl[\sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}\log Q_\beta(x_\beta)\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}\log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\,Q_\beta(x_\beta) + \sum_\beta (n_\beta - 1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}\log\Bigl[\sum_{X_{\alpha \setminus \beta}} \Psi_{\max}\Bigr] + \sum_\beta (n_\beta - 1)\Bigl[-\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta}\log\Bigl[\sum_{X_{\alpha \setminus \beta}} \Psi_{\max}\Bigr],$$

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$; the second from the concavity of the function $y^{\frac{n_\beta - 1}{n_\beta}}$; the third from the upper bound on the potentials; and the last from $t - 1 \ge \log t$ for $t > 0$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian: the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta}\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},$$


we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both $\beta$ and $\beta'$ should be subsets of $\alpha$, and with the conventions $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \neq x'_\beta$. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

$$H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha \supset \beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta}\frac{Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\,Q_\beta(x'_\beta)}$$

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta}A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\,Q_{\beta'}(x'_{\beta'})} \quad\text{for } \beta' \neq \beta,$$

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. It should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables of) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta, x'_{\beta'})\,S_{\beta'}(x'_{\beta'}) \ge 0$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$K = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\,Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) \tag{K1}$$
$$+ \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) \tag{K2}$$
$$- \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}), \tag{K3}$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K2 = \sum_\alpha \Bigl[\sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,R_\beta(x_\beta)\Bigr]^2 \ge 0$$

and⁵

$$K3 = \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\Bigl\{\frac{1}{2}\bigl[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\bigr]^2 - \frac{1}{2}R^2_\beta(x_\beta) - \frac{1}{2}R^2_{\beta'}(x'_{\beta'})\Bigr\}$$
$$\ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\Bigl(\sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta}\Bigr) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta), \tag{6.1}$$

we have

$$K = K1 + K2 + K3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\Bigl(1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'}\Bigr) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).$$

That is, sufficient conditions for $K$ to be nonnegative are

$$A_{\alpha\beta} \ge 0 \;\;\forall_\alpha, \forall_{\beta \subset \alpha} \quad\text{and}\quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\;\forall_\alpha,$$

precisely the conditions for theorem 1.
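The decomposition into K1, K2, and K3 can be probed numerically. The sketch below (my own illustration; the random single-potential setup is invented) draws a random joint pseudomarginal over three binary nodes and confirms that $K1 + K2 + K3 \ge 0$ whenever the allocation row satisfies $A_{\alpha\beta} \ge 0$ and $\sum_\beta A_{\alpha\beta} \le 1$:

```python
import numpy as np

rng = np.random.default_rng(1)

def K_terms(Q, A, R):
    """K1, K2, K3 of section 6, restricted to a single potential alpha over
    n binary nodes: joint pseudomarginal Q (shape (2,)*n), allocation row A,
    and per-node vectors R[b] (R_beta = S_beta / Q_beta)."""
    n = len(A)
    other = lambda keep: tuple(i for i in range(n) if i not in keep)
    marg = [Q.sum(axis=other((b,))) for b in range(n)]  # Q*_alpha(x_b)
    K1 = sum(A[b] * (1 - A[b]) * (marg[b] * R[b] ** 2).sum() for b in range(n))
    K2 = sum(A[b] * (marg[b] * R[b]).sum() for b in range(n)) ** 2
    K3 = 0.0
    for b in range(n):
        for b2 in range(n):
            if b2 == b:
                continue
            Qbb = Q.sum(axis=other((b, b2)))  # Q*_alpha(x_b, x_b2)
            if b2 < b:
                Qbb = Qbb.T
            K3 -= A[b] * A[b2] * (R[b][:, None] * Qbb * R[b2][None, :]).sum()
    return K1, K2, K3

# Under A >= 0 and sum_b A_b <= 1 (the conditions of theorem 1),
# K = K1 + K2 + K3 stays nonnegative for random Q* and R.
for _ in range(200):
    Q = rng.random((2, 2, 2))
    Q /= Q.sum()
    A = rng.random(3)
    A /= max(1.0, float(A.sum()))
    R = [rng.standard_normal(2) for _ in range(3)]
    assert sum(K_terms(Q, A, R)) >= -1e-10
print("K >= 0 under the conditions of theorem 1")
```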

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.$$

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

$$K2 + K3 = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)\,R_\beta(x_\beta)\,R_\beta(x'_\beta) \tag{$\widetilde{K2}$}$$
$$- \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\bigl[Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\bigr] R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}). \tag{$\widetilde{K3}$}$$

This leaves us with the weaker requirement (from K1) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.$$

The net effect is equivalent to ignoring the interaction: the number of neighboring potentials $n_\beta$ is reduced by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition into $\widetilde{K2}$ and $\widetilde{K3}$, where, since $\widetilde{K2} \ge 0$, we will concentrate on $\widetilde{K3}$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $\widetilde{K3}$, such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\,Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}}\;\forall_{x_\beta, x'_{\beta'}}, \tag{7.1}$$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$\sigma_\alpha = 1 - \exp(-\omega_\alpha), \quad\text{with}\quad \omega_\alpha \equiv \max_{X_\alpha, \hat{X}_\alpha}\Bigl[\psi_\alpha(X_\alpha) + (n_\alpha - 1)\,\psi_\alpha(\hat{X}_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat{X}_{\alpha \setminus \beta}, x_\beta)\Bigr], \tag{7.2}$$

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.


Proof. For convenience, and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X)\prod_\beta \mu_\beta(x_\beta)\Bigl[\sum_{X'}\Psi(X')\prod_\beta \mu_\beta(x'_\beta)\Bigr]^{n-1}}{\prod_\beta \mu_\beta(x_\beta)\Bigl[\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta' \neq \beta}\mu_{\beta'}(x'_{\beta'})\Bigr]}$$
$$= \frac{\Psi(X)\Bigl[\sum_{X'}\Psi(X')\prod_\beta \mu_\beta(x'_\beta)\Bigr]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta' \neq \beta}\mu_{\beta'}(x'_{\beta'})}, \tag{7.3}$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n - 1)\log\Bigl[\sum_X \Psi(X)\prod_{\beta=1}^n \mu_\beta(x_\beta)\Bigr] - \sum_{\beta=1}^n \log\Bigl[\sum_{X_{\setminus\beta}}\Psi(X_{\setminus\beta}, x^*_\beta)\prod_{\beta' \neq \beta}\mu_{\beta'}(x_{\beta'})\Bigr],$$

with respect to the messages $\mu$, under the constraints $\sum_{x_\beta}\mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point, $\mu_\beta(x_\beta) = \delta_{x_\beta, \hat{x}_\beta}$, for some $\hat{x}_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey (up to irrelevant positive factors $1/\mu_1$)

$$\frac{\partial V}{\partial \mu_1(x_1)} = (n - 1)\,Q(x_1) - \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta)$$

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\,\partial \mu_1(x'_1)} = -(n - 1)\,Q(x_1)\,Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta)\,Q(x'_1 \mid x^*_\beta),$$


where

$$Q(X) \equiv \frac{\Psi(X)\prod_\beta \mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta \mu_\beta(x'_\beta)}.$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$(n - 1)\,Q(x_1) - \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta) = \lambda,$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1}\mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix of second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\,\partial \mu_1(x'_1)} = \frac{1}{2(n - 1)}\sum_{\beta \neq 1}\sum_{\substack{\beta' \neq 1 \\ \beta' \neq \beta}}\bigl[Q(x_1 \mid x^*_\beta) - Q(x_1 \mid x^*_{\beta'})\bigr]\bigl[Q(x'_1 \mid x^*_\beta) - Q(x'_1 \mid x^*_{\beta'})\bigr],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\tilde X$:
\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\tilde X} \frac{\Psi(X)\,\big[\Psi(\tilde X)\big]^{n-1}}{\prod_\beta \Psi(X_{\bar\beta},\tilde x_\beta)}.
\]

Next, we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through
\[
\frac{1}{1-\sigma} \equiv \max_{X,\mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\tilde X} \frac{\Psi(X)\,\big[\Psi(\tilde X)\big]^{n-1}}{\prod_\beta \Psi(X_{\bar\beta},\tilde x_\beta)}. \qquad (7.4)
\]


The inequality 7.1 then follows by summing out $X_{\overline{\beta\beta'}}$ in
\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).
\]

The form of equation 7.2 then follows by rewriting equation 7.4 as
\[
\omega \equiv -\log(1-\sigma) = \max_{X,\tilde X} W(X,\tilde X), \quad\text{with}\quad W(X,\tilde X) = \psi(X) + (n-1)\,\psi(\tilde X) - \sum_\beta \psi(X_{\bar\beta},\tilde x_\beta),
\]
where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes, that is,
\[
\text{if } \tilde\Psi(X) = \Psi(X)\prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.
\]

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap, and dividing the other by the same term, does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\tilde X$ that differ in fewer than two nodes. To see this, consider
\[
W(x_1,x_2,X_{\overline{12}};\,\tilde x_1,\tilde x_2,X_{\overline{12}}) = \psi(x_1,x_2,X_{\overline{12}}) + \psi(\tilde x_1,\tilde x_2,X_{\overline{12}}) - \psi(\tilde x_1,x_2,X_{\overline{12}}) - \psi(x_1,\tilde x_2,X_{\overline{12}}) = -W(x_1,\tilde x_2,X_{\overline{12}};\,\tilde x_1,x_2,X_{\overline{12}}).
\]
If now also $\tilde x_2 = x_2$, we get $W(x_1,x_2,X_{\overline{12}};\,\tilde x_1,x_2,X_{\overline{12}}) = -W(x_1,x_2,X_{\overline{12}};\,\tilde x_1,x_2,X_{\overline{12}}) = 0$. Furthermore, if $W(x_1,x_2,X_{\overline{12}};\,\tilde x_1,\tilde x_2,X_{\overline{12}}) \le 0$, then it must be that $W(x_1,\tilde x_2,X_{\overline{12}};\,\tilde x_1,x_2,X_{\overline{12}}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2}\in\{0,1\}$, we immediately obtain
\[
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \qquad (7.5)
\]
Any pairwise binary potential can be written as a Boltzmann factor,
\[
\Psi(x_1,x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
\]
In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1,1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models, there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - \big(1-\sigma(1)\big)^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
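The strength defined in equation 7.2 can be computed by brute-force enumeration over all pairs of configurations. The sketch below is illustrative only (the function name and the table encoding are mine, not from the article); it checks the binary pairwise property $\omega = |w|$ for the Boltzmann parameterization above.

```python
import itertools
import math

def strength(psi):
    """Brute-force strength omega = max over (X, Xtilde) of
    W(X, Xtilde) = psi(X) + (n-1)*psi(Xtilde) - sum_beta psi(X with x_beta -> xtilde_beta),
    where psi is an n-dimensional nested list of log-potentials."""
    dims, t = [], psi
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    n = len(dims)

    def lookup(idx):
        v = psi
        for i in idx:
            v = v[i]
        return v

    omega = 0.0  # W vanishes whenever X = Xtilde, so omega >= 0
    for X in itertools.product(*map(range, dims)):
        for Xt in itertools.product(*map(range, dims)):
            w = lookup(X) + (n - 1) * lookup(Xt)
            for beta in range(n):
                mixed = list(X)
                mixed[beta] = Xt[beta]  # replace node beta's value by its tilde value
                w -= lookup(mixed)
            omega = max(omega, w)
    return omega

# Binary pairwise Boltzmann factor psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2:
# the strength should equal |w|, independent of the thresholds.
w, th1, th2 = 1.3, -0.7, 0.4
psi_table = [[w * x1 * x2 + th1 * x1 + th2 * x2 for x2 in (0, 1)] for x1 in (0, 1)]
omega = strength(psi_table)
sigma = 1.0 - math.exp(-omega)
print(omega)  # 1.3
```

The double loop over configurations is exponential in the number of nodes per potential, which is fine here since a potential typically contains only a few nodes.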

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall \alpha,\ \beta\subset\alpha$ (positivity)

2. $(1-\sigma_\alpha)\max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall \alpha$ (sufficient amount of resources)

3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall \beta$ (sufficient compensation)   (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.
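The three conditions are easy to check mechanically for a given graph, allocation matrix, and vector of strengths. The following sketch (my own encoding, not code from the article) verifies them for the 3 × 3 toroidal Ising grid of section 8.3, where every element of the allocation matrix equals 3/4:

```python
def theorem4_holds(potentials, n_nodes, A, sigma, tol=1e-12):
    """Check conditions 1-3 of theorem 4. potentials[a] lists the nodes of
    potential a, A[a][b] is the allocation matrix, sigma[a] the strength."""
    for a, nodes in enumerate(potentials):
        vals = [A[a][b] for b in nodes]
        if min(vals) < -tol:                                   # condition 1
            return False
        if (1 - sigma[a]) * max(vals) + sigma[a] * sum(vals) > 1 + tol:
            return False                                       # condition 2
    for b in range(n_nodes):
        around = [a for a, nodes in enumerate(potentials) if b in nodes]
        if sum(A[a][b] for a in around) < len(around) - 1 - tol:
            return False                                       # condition 3
    return True

# 3 x 3 Ising grid with toroidal boundary conditions: 9 nodes, 18 pairwise
# potentials, every node in n_beta = 4 of them; take A = 3/4 everywhere.
cells = [(i, j) for i in range(3) for j in range(3)]
idx = {p: k for k, p in enumerate(cells)}
potentials = []
for i, j in cells:
    potentials.append([idx[(i, j)], idx[(i, (j + 1) % 3)]])
    potentials.append([idx[(i, j)], idx[((i + 1) % 3, j)]])
A = [{b: 0.75 for b in p} for p in potentials]
print(theorem4_holds(potentials, 9, A, [1 / 3] * 18))  # True: sigma = 1/3 is on the boundary
print(theorem4_holds(potentials, 9, A, [0.4] * 18))    # False: condition 2 is violated
```

With $\sigma = 1/3$, condition 2 gives $(2/3)(3/4) + (1/3)(3/2) = 1$ exactly, matching the critical value derived in section 8.3.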

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain
\[
K_3 \ge -\sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\,\sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'})
\]
\[
\ge -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Big[\sum_{\substack{\beta'\subset\alpha\\ \beta'\neq\beta}} A_{\alpha\beta'}\Big] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\]

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, and combining $K_1$ and (the above lower bound on) $K_3$, we get
\[
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta}\Big[1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta'\neq\beta} A_{\alpha\beta'}\Big] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]

This implies
\[
(1-\sigma_\alpha)\,A_{\alpha\beta} + \sigma_\alpha \sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad \forall \alpha,\ \beta\subset\alpha,
\]
which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta\subset\alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization
\[
P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha \Psi_\alpha(X_\alpha)\prod_\beta \Psi_\beta(x_\beta),
\]
to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha\supset\beta} \Big(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\Big) < 2 \quad \forall \beta. \qquad (8.2)
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if
\[
\sum_{\alpha\supset\beta} \omega_\alpha < 4 \quad \forall \beta, \qquad (8.3)
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice
\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha} \phi_{\alpha\beta}(x_\beta),
\]
where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have
\[
\min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big] = \min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\big[\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)\big] - \min_{x_1,x_2}\big[\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)\big]\Big].
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1,x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that
\[
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\tilde x_1,\tilde x_2) + \phi_1(\tilde x_1) + \phi_2(\tilde x_2) \ge \psi(x_1,\tilde x_2) + \phi_1(x_1) + \phi_2(\tilde x_2) = \psi(\tilde x_1,x_2) + \phi_1(\tilde x_1) + \phi_2(x_2) \qquad (8.4)
\]
for some $x_1$, $x_2$, $\tilde x_1$, and $\tilde x_2$ with $x_1 \neq \tilde x_1$ and $x_2 \neq \tilde x_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\phi_1(x_1) - \phi_1(\tilde x_1) = \tfrac{1}{2}\big[\psi(\tilde x_1,x_2) - \psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(x_1,\tilde x_2)\big]
\]
\[
\phi_2(x_2) - \phi_2(\tilde x_2) = \tfrac{1}{2}\big[\psi(x_1,\tilde x_2) - \psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(\tilde x_1,x_2)\big].
\]

Substitution back into equation 8.4 yields
\[
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\tilde x_2) - \phi_1(x_1) - \phi_2(\tilde x_2) = \tfrac{1}{2}\big[\psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(x_1,\tilde x_2) - \psi(\tilde x_1,x_2)\big],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find
\[
\min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big] = \tfrac{1}{2}\,|\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)| = \frac{\omega}{2},
\]
from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
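The closed-form optimum in this proof is easy to verify numerically. The sketch below (illustrative, with my own variable names) checks, for a random binary potential, that the optimal self-potential differences achieve a gap of exactly $\omega/2$ and that random choices never do better:

```python
import itertools
import random

random.seed(0)
psi = [[random.gauss(0, 1) for _ in range(2)] for _ in range(2)]
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

def gap(a, b):
    # spread of psi(x1,x2) + phi1(x1) + phi2(x2); only the differences
    # a = phi1(1) - phi1(0) and b = phi2(1) - phi2(0) matter
    vals = [psi[x1][x2] + a * x1 + b * x2
            for x1, x2 in itertools.product((0, 1), repeat=2)]
    return max(vals) - min(vals)

# closed-form optimum obtained by solving the equalities in the proof
a_opt = 0.5 * (psi[0][0] + psi[0][1] - psi[1][0] - psi[1][1])
b_opt = 0.5 * (psi[0][0] + psi[1][0] - psi[0][1] - psi[1][1])
best = gap(a_opt, b_opt)
print(abs(best - omega / 2))  # ~0: the minimal gap equals omega/2
worst_case = min(gap(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000))
print(worst_case >= best - 1e-12)  # True: random choices never beat the optimum
```

With these differences, the four shifted values satisfy $v_{00} = v_{11}$ and $v_{01} = v_{10}$, so the spread collapses to $|v_{00} - v_{01}| = \omega/2$.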

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha\supset\beta} \omega_\alpha \le 1 \quad \forall \beta, \qquad (8.5)
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta\subset\alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)
\[
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.
\]
Substitution into condition 3 yields
\[
\sum_{\alpha\supset\beta} (1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta} \sigma_\alpha \le 1. \qquad (8.6)
\]
Since $\omega_\alpha = -\log(1-\sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to
\[
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by
\[
\omega = 2\log\Big[\frac{\alpha}{1-\alpha}\Big] \quad\text{and thus}\quad \sigma = 1 - \Big(\frac{1-\alpha}{\alpha}\Big)^2.
\]


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields
\[
\sigma \le \frac{1}{3} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:
\[
\omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62.
\]
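The two thresholds above can be reproduced in a few lines (a sketch; the variable names are mine):

```python
import math

def omega_of(alpha):
    # strength of the symmetric potential [[alpha, 1-alpha], [1-alpha, alpha]]
    return 2 * math.log(alpha / (1 - alpha))

# theorem 4 with A = 3/4 everywhere: sigma <= 1/3, i.e. (1-alpha)/alpha >= sqrt(2/3)
alpha_theorem4 = 1 / (1 + math.sqrt(2 / 3))
# corollary 3: each node has 4 neighboring potentials, so 4*omega < 4, i.e. omega < 1
alpha_corollary3 = 1 / (1 + math.exp(-0.5))
print(round(alpha_theorem4, 2), round(alpha_corollary3, 2))  # 0.55 0.62
```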

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\,\sigma + A \le 1.
\]

The optimal choice for $A$ is the one for which both conditions turn out to be identical. In this way, we obtain $A = 3/4$, yielding
\[
\sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3. An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

\[
\sigma \le \sqrt{\frac{1}{2}} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,
\]
better than the $\alpha < 0.62$ from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we need only $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights
\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]
zero thresholds, and potentials
\[
\Psi_{ij}(x_i,x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i,x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point, with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
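A minimal sketch of this experiment (not the author's original code; the damped update follows the spirit of equations 3.5 through 3.9 for pairwise potentials, and all names are mine). For weight strengths well below the transition, the damped updates settle into the trivial fixed point with all marginals at 0.5:

```python
import itertools
import math

SIGN = [[0, 1, -1, -1],
        [1, 0, 1, -1],
        [-1, 1, 0, -1],
        [-1, -1, -1, 0]]

def run_damped_bp(weight_strength, eps=0.5, iters=500):
    """Damped loopy BP on the four-node Boltzmann machine; returns P_i(x_i=1)."""
    n = 4
    w = [[weight_strength * SIGN[i][j] / 4 for j in range(n)] for i in range(n)]

    def psi(i, j, xi, xj):
        # Psi_ij = exp[w_ij/4] if x_i == x_j, else exp[-w_ij/4] (w already holds w_ij/4)
        return math.exp(w[i][j] if xi == xj else -w[i][j])

    # messages m[i][j][x_j] from node i to node j, slightly asymmetric start
    m = [[[1.0, 1.0 + 0.1 * ((i + j) % 2)] for j in range(n)] for i in range(n)]
    for _ in range(iters):
        new = [[[1.0, 1.0] for _ in range(n)] for _ in range(n)]
        for i, j in itertools.permutations(range(n), 2):
            for xj in (0, 1):
                s = sum(psi(i, j, xi, xj)
                        * math.prod(m[k][i][xi] for k in range(n) if k not in (i, j))
                        for xi in (0, 1))
                # damping with step size eps, done in the log domain
                new[i][j][xj] = m[i][j][xj] ** (1 - eps) * s ** eps
            z = new[i][j][0] + new[i][j][1]
            new[i][j] = [v / z for v in new[i][j]]
        m = new
    marginals = []
    for i in range(n):
        b = [math.prod(m[k][i][x] for k in range(n) if k != i) for x in (0, 1)]
        marginals.append(b[1] / (b[0] + b[1]))
    return marginals

marginals = run_damped_bp(1.0)  # well below the transition regime
print(all(abs(p - 0.5) < 1e-6 for p in marginals))  # True
```

Increasing `weight_strength` toward the values reported in Figure 4 lets one probe where, for a given `eps`, the iteration stops settling down.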

⁷ Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy
\[
F(\{Q_\alpha,Q_\beta\}) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta-1)\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta) \qquad (3.2)
\]

over sets of "pseudomarginals"¹ or beliefs $\{Q_\alpha,Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,²
\[
\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad\text{and}\quad Q_\alpha(x_\beta) \equiv \sum_{X_{\alpha\setminus\beta}} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \qquad (3.3)
\]
Let $\mathcal{Q}$ denote all subsets of consistent and properly normalized pseudomarginals. Then our goal is to solve
\[
\min_{\{Q_\alpha,Q_\beta\}\in\mathcal{Q}} F(\{Q_\alpha,Q_\beta\}).
\]
The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals $P_{\text{exact}}(X_\alpha)$ and $P_{\text{exact}}(x_\beta)$.
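As a sanity check on equation 3.2 (an illustrative sketch with my own encoding, not code from the article): on a singly connected graph, the Bethe free energy evaluated at the exact marginals equals the exact free energy $-\log Z$.

```python
import itertools
import math
import random

def bethe_free_energy(factors, n_nodes, Q_alpha, Q_beta):
    """Equation 3.2. factors[a] = (nodes, psi) with psi a dict from state
    tuples to log-potentials; Q_alpha[a] a dict of subset pseudomarginals,
    Q_beta[b] a list of single-node pseudomarginals."""
    n_b = [0] * n_nodes
    for nodes, _ in factors:
        for b in nodes:
            n_b[b] += 1
    F = 0.0
    for (nodes, psi), Qa in zip(factors, Q_alpha):
        for states, q in Qa.items():
            if q > 0:
                F += -q * psi[states] + q * math.log(q)  # energy + subset entropy
    for b, Qb in enumerate(Q_beta):
        for q in Qb:
            if q > 0:
                F -= (n_b[b] - 1) * q * math.log(q)      # single-node correction
    return F

# three-node chain with random pairwise log-potentials
random.seed(1)
factors = [((i, i + 1), {s: random.gauss(0, 1)
                         for s in itertools.product(range(2), repeat=2)})
           for i in range(2)]
# exact joint by brute force
Z, joint = 0.0, {}
for X in itertools.product(range(2), repeat=3):
    p = math.exp(sum(psi[(X[i], X[j])] for (i, j), psi in factors))
    joint[X] = p
    Z += p
Q_alpha = [{s: sum(p for X, p in joint.items() if (X[i], X[j]) == s) / Z
            for s in itertools.product(range(2), repeat=2)}
           for (i, j), _ in factors]
Q_beta = [[sum(p for X, p in joint.items() if X[b] == x) / Z for x in (0, 1)]
          for b in range(3)]
F = bethe_free_energy(factors, 3, Q_alpha, Q_beta)
print(abs(F + math.log(Z)) < 1e-9)  # True: F equals -log Z on a tree
```

On graphs with cycles, the same function evaluated at the loopy-BP pseudomarginals only approximates $-\log Z$.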

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported on by Yedidia et al. (2001). It starts with the Lagrangian

\[
L(\{Q_\alpha,Q_\beta\},\{\lambda_{\alpha\beta},\lambda_\alpha,\lambda_\beta\}) = F(\{Q_\alpha,Q_\beta\}) + \sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)\big[Q_\beta(x_\beta) - Q_\alpha(x_\beta)\big] + \sum_\alpha \lambda_\alpha\Big[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Big] + \sum_\beta \lambda_\beta\Big[1 - \sum_{x_\beta} Q_\beta(x_\beta)\Big]. \qquad (3.4)
\]

¹ Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals.

² Strictly speaking, we also have to take inequality constraints into account, namely those of the form $Q_\alpha(X_\alpha) \ge 0$. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those: they never become "active." For convenience, we will not consider them any further.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of $L$ are zero: the ones with respect to the Lagrange multipliers $\lambda$ give back the constraints; the ones with respect to the pseudomarginals $Q$ give an extremum of the Bethe free energy. Setting the derivatives with respect to $Q_\alpha$ and $Q_\beta$ to zero, we can solve for $Q_\alpha$ and $Q_\beta$ in terms of the Lagrange multipliers:

\[
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\Big[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha} \lambda_{\alpha\beta}(x_\beta)\Big]
\]
\[
Q^*_\beta(x_\beta) = \exp\Big[\frac{1}{n_\beta - 1}\Big(1 - \lambda_\beta + \sum_{\alpha\supset\beta} \lambda_{\alpha\beta}(x_\beta)\Big)\Big].
\]

In terms of the "message" $\mu_{\beta\to\alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]$ from node $\beta$ to potential $\alpha$, the pseudomarginal $Q^*_\alpha(X_\alpha)$ reads
\[
Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha)\prod_{\beta\subset\alpha} \mu_{\beta\to\alpha}(x_\beta), \qquad (3.5)
\]

where proper normalization yields the Lagrange multiplier $\lambda_\alpha$. With the definition
\[
\mu_{\alpha\to\beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}, \qquad (3.6)
\]

the fixed-point equation for $Q^*_\beta(x_\beta)$ can, after some manipulation, be written in the form
\[
Q^*_\beta(x_\beta) \propto \prod_{\alpha\supset\beta} \mu_{\alpha\to\beta}(x_\beta), \qquad (3.7)
\]

where again the Lagrange multiplier $\lambda_\beta$ follows from normalization. Finally, the constraint $Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)$, in combination with equation 3.6, suggests the update
\[
\mu_{\alpha\to\beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}. \qquad (3.8)
\]

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

\[
\log \mu^{\text{new}}_{\alpha\to\beta}(x_\beta) = \log \mu_{\alpha\to\beta}(x_\beta) + \epsilon\big[\log Q^*_\alpha(x_\beta) - \log \mu_{\beta\to\alpha}(x_\beta) - \log \mu_{\alpha\to\beta}(x_\beta)\big]. \qquad (3.9)
\]

Summarizing: loopy belief propagation is equivalent to fixed-point iteration, where the fixed points are the zero derivatives of the Lagrangian.
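The message-passing recipe above translates almost line by line into code. The sketch below (my own illustrative implementation and naming, not code from the article) runs the updates on a small factor graph; on a singly connected graph the resulting beliefs reproduce the exact marginals.

```python
import itertools
import math
import random

def loopy_bp(factors, n_states, iters=50):
    """Sum-product updates: Q*_alpha is the potential times incoming messages
    (eq. 3.5); an outgoing message is the marginal of Q*_alpha divided by the
    incoming message (eq. 3.8); node beliefs multiply incoming messages (3.7)."""
    pairs = [(a, b) for a, (nodes, _) in enumerate(factors) for b in nodes]
    n2f = {(b, a): [1.0] * n_states[b] for a, b in pairs}  # node -> factor
    f2n = {(a, b): [1.0] * n_states[b] for a, b in pairs}  # factor -> node
    for _ in range(iters):
        for a, (nodes, psi) in enumerate(factors):
            for b in nodes:
                pos = nodes.index(b)
                marg = [0.0] * n_states[b]
                for X in itertools.product(*[range(n_states[v]) for v in nodes]):
                    q = psi[X]
                    for v, x in zip(nodes, X):
                        q *= n2f[(v, a)][x]
                    marg[X[pos]] += q
                out = [mx / mi for mx, mi in zip(marg, n2f[(b, a)])]
                z = sum(out)
                f2n[(a, b)] = [o / z for o in out]
        for a, b in pairs:
            msg = [1.0] * n_states[b]
            for (a2, b2), mv in f2n.items():
                if b2 == b and a2 != a:
                    msg = [u * v for u, v in zip(msg, mv)]
            z = sum(msg)
            n2f[(b, a)] = [u / z for u in msg]
    beliefs = []
    for b in range(len(n_states)):
        bel = [1.0] * n_states[b]
        for (a2, b2), mv in f2n.items():
            if b2 == b:
                bel = [u * v for u, v in zip(bel, mv)]
        z = sum(bel)
        beliefs.append([u / z for u in bel])
    return beliefs

# small chain: node states 2, 3, 2, two pairwise potentials
random.seed(0)
n_states = [2, 3, 2]
factors = [((0, 1), {X: math.exp(random.gauss(0, 1))
                     for X in itertools.product(range(2), range(3))}),
           ((1, 2), {X: math.exp(random.gauss(0, 1))
                     for X in itertools.product(range(3), range(2))})]
beliefs = loopy_bp(factors, n_states)
```

On a graph with cycles, the same loop runs unchanged, but the beliefs are then only the approximate pseudomarginals discussed in this article.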

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely
\[
\min_{\{Q_\alpha,Q_\beta\}}\ \max_{\{\lambda_{\alpha\beta},\lambda_\alpha,\lambda_\beta\}} L(\{Q_\alpha,Q_\beta\},\{\lambda_{\alpha\beta},\lambda_\alpha,\lambda_\beta\}).
\]

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters $\{Q_\alpha,Q_\beta\}$. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in $\{Q_\alpha,Q_\beta\}$.

4.2 Conditions for Convexity. The problem is with the term
\[
S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta)\log Q_\beta(x_\beta),
\]

which is concave in $Q_\beta$. Using the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, we can turn it into a functional that is convex in $Q_\alpha$ and $Q_\beta$ separately, but not necessarily jointly. That is, with the substitution $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$ for any $\alpha\supset\beta$, the entropy, and thus the Bethe free energy, is convex in $Q_\alpha$ and in $Q_\beta$, but not necessarily in $\{Q_\alpha,Q_\beta\}$. However, if we add to $S_\beta(Q_\beta)$ a convex entropy contribution,
\[
-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha),
\]


the combination of $-S_\alpha$ and $S_\beta$ is convex in $\{Q_\alpha,Q_\beta\}$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The functional
\[
\Lambda_{\alpha\beta}(Q_\alpha,Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)
\]
is convex in $\{Q_\alpha,Q_\beta\}$.

Proof. The matrix with second derivatives of $\Lambda_{\alpha\beta}$ has the components
\[
H(X_\alpha,X'_\alpha) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha,X'_\alpha}
\]
\[
H(X_\alpha,x'_\beta) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta}
\]
\[
H(x_\beta,x'_\beta) \equiv \frac{\partial^2 \Lambda_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta},
\]

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that, for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,
\[
0 \le (R_\alpha\ \ R_\beta)\begin{pmatrix} H(X_\alpha,X'_\alpha) & H(X_\alpha,x'_\beta) \\ H(x_\beta,X'_\alpha) & H(x_\beta,x'_\beta) \end{pmatrix}\begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix} = \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2\sum_{X_\alpha} \frac{R_\alpha(X_\alpha)\,R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta)\,R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)} = \sum_{X_\alpha} Q_\alpha(X_\alpha)\Big[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\Big]^2.
\]

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall \alpha,\ \beta\subset\alpha$ (positivity)

2. $\sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall \alpha$ (sufficient amount of resources)

3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall \beta$ (sufficient compensation)   (4.1)


Proof. First, we note that we do not have to worry about the energy terms, which are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,
\[
-S(Q) = -\Big[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta-1)\,S_\beta(Q_\beta)\Big],
\]

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta-1)S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset $\alpha$ to compensate for node $\beta$. Now, in shorthand notation and with a little bit of rewriting,
\[
-S(Q) = -\Big[\sum_\alpha S_\alpha - \sum_\beta (n_\beta-1)S_\beta\Big] = -\sum_\alpha \Big(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} + \sum_{\beta\subset\alpha} A_{\alpha\beta}\Big) S_\alpha - \sum_\beta \Big[-\sum_{\alpha\supset\beta} A_{\alpha\beta} + \sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta-1)\Big] S_\beta
\]
\[
= -\sum_\alpha \Big(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta}\Big) S_\alpha - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta}\,\big[S_\alpha - S_\beta\big] - \sum_\beta \Big[\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta-1)\Big] S_\beta.
\]
Convexity of the first term is guaranteed if $1 - \sum_\beta A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_\alpha A_{\alpha\beta} - (n_\beta-1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.
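Conditions 1 through 3 are mechanical to verify for a candidate allocation matrix. The following is a minimal sketch of such a check (the data layout and function name are mine, not from the paper): potentials are sets of node indices, and the allocation matrix is a sparse dictionary.

```python
def check_convexity_conditions(potentials, A, tol=1e-12):
    """Check conditions 1-3 of theorem 1 (equation 4.1).

    potentials: list of sets of node indices (the subsets X_alpha).
    A: dict mapping (alpha_index, node) -> allocation A_{alpha,beta}.
    Returns True iff A certifies convexity of the Bethe free energy.
    """
    nodes = set().union(*potentials)
    # n_beta = number of potentials containing node beta
    n = {b: sum(1 for P in potentials if b in P) for b in nodes}
    # Condition 1: positivity
    if any(A.get((i, b), 0.0) < -tol for i, P in enumerate(potentials) for b in P):
        return False
    # Condition 2: sum over beta in alpha of A_{alpha,beta} <= 1
    if any(sum(A.get((i, b), 0.0) for b in P) > 1 + tol
           for i, P in enumerate(potentials)):
        return False
    # Condition 3: sum over alpha containing beta of A_{alpha,beta} >= n_beta - 1
    return all(sum(A.get((i, b), 0.0) for i, P in enumerate(potentials) if b in P)
               >= n[b] - 1 - tol for b in nodes)

# A chain a-b-c (a tree): allocate each pairwise potential fully to the
# node nearer the root c.
chain = [{0, 1}, {1, 2}]
A_chain = {(0, 1): 1.0, (1, 2): 1.0}
print(check_convexity_conditions(chain, A_chain))  # True
```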

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.

Uniqueness of Loopy Belief Propagation Fixed Points 2389

Proof. The proof is by construction. Choose one of the leaf nodes as the root $\beta^*$ and define
$$
A_{\alpha\beta} = 1 \ \text{ iff }\ \beta\subset\alpha \text{ and } \beta \text{ is closer to the root } \beta^* \text{ than any other } \beta'\subset\alpha; \qquad A_{\alpha\beta'} = 0 \ \text{ for all other } \beta'.
$$
Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each $\beta \ne \beta^*$ there is just a single potential $\alpha\supset\beta$ that is closer to the root $\beta^*$ than $\beta$ itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence, condition 3 is also satisfied:
$$
\sum_{\alpha\supset\beta}A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta\ne\beta^*} \qquad\text{and}\qquad \sum_{\alpha\supset\beta^*}A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1 .
$$
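The constructive proof can be run directly: orient the tree toward the chosen root and let every potential allocate its single unit of resources to its node nearest the root. A sketch (potentials are given as sets of node indices; helper names are mine, not from the paper):

```python
from collections import deque

def tree_allocation(potentials, root):
    """Allocate A_{alpha,beta}=1 to the node of alpha closest to the root,
    as in the proof of corollary 1. potentials: list of sets of node indices."""
    nodes = set().union(*potentials)
    # BFS distances from the root in the bipartite factor graph
    dist = {root: 0}
    queue = deque([root])
    adj_p = {b: [i for i, P in enumerate(potentials) if b in P] for b in nodes}
    while queue:
        b = queue.popleft()
        for i in adj_p[b]:
            for b2 in potentials[i]:
                if b2 not in dist:
                    dist[b2] = dist[b] + 1
                    queue.append(b2)
    # each potential points its one unit of resources at its closest node
    return {(i, min(P, key=lambda b: dist[b])): 1.0
            for i, P in enumerate(potentials)}

# Star tree: center node 0 with leaves 1, 2, 3; root at leaf 1.
star = [{0, 1}, {0, 2}, {0, 3}]
A = tree_allocation(star, root=1)
print(A)
```

Here the center node receives two incoming units, exactly $n_\beta - 1 = 2$, while the root keeps the unneeded unit, matching the "eating up resources toward the root" picture.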

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place; that is, remove one node $\beta^*$ from a potential $\alpha^*$ such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node $\beta^*$ as the root. The matrix A constructed in this way also works for the graph with the closed loop, since still
$$
\sum_{\alpha\supset\beta}A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta\ne\beta^*} \qquad\text{and now}\qquad \sum_{\alpha\supset\beta^*}A_{\alpha\beta^*} = n_{\beta^*} - 1 .
$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, $1 - n_\beta$, with $n_\beta$ the number of neighboring potentials. The arrows pointing from potentials $\alpha$ to nodes $\beta$ visualize the allocation matrix A, with $A_{\alpha\beta} = 1$ if there is an arrow and $A_{\alpha\beta} = 0$ otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely $n_\beta - 1$ incoming arrows, just sufficient to compensate the overcounting number $1 - n_\beta$. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have
$$
\sum_{\beta\in B}(1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \ge 0, \qquad (4.2)
$$
where $\pi(B) \equiv \{\alpha \mid \exists_{\beta\in B}:\ \beta\subset\alpha\}$ denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set B,
$$
\sum_{\beta\in B}(n_\beta - 1) \le \sum_{\beta\in B}\sum_{\alpha\supset\beta}A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)}\sum_{\beta\subset\alpha}A_{\alpha\beta} \le \sum_{\alpha\in\pi(B)} 1,
$$
where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
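For small graphs, the subset condition of theorem 2 can be checked by exhaustive enumeration, which by proposition 1 also decides whether a suitable allocation matrix exists at all. A brute-force sketch (naming is mine, not from the paper); the second example is the two-connected-loops structure for which the condition fails:

```python
from itertools import combinations

def bethe_convexity(potentials):
    """Theorem 2: sum_{beta in B} (1 - n_beta) + |parents(B)| >= 0 for all B."""
    nodes = sorted(set().union(*potentials))
    n = {b: sum(1 for P in potentials if b in P) for b in nodes}
    for k in range(1, len(nodes) + 1):
        for B in combinations(nodes, k):
            # parent set: potentials touching at least one node of B
            parents = sum(1 for P in potentials if P & set(B))
            if sum(1 - n[b] for b in B) + parents < 0:
                return False
    return True

single_loop = [{0, 1}, {1, 2}, {2, 0}]                  # one cycle: convex
two_loops = [{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 0}]    # two loops sharing an edge
print(bethe_convexity(single_loop), bethe_convexity(two_loops))  # True False
```

Taking B as all four nodes of the two-loop component gives $-6 + 5 = -1 < 0$, in line with the remark above that two connected loops break the construction.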

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term that is linear in the pseudomarginals


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$
F(\{Q_\alpha, Q_\beta\}) = -\sum_\alpha\sum_{X_\alpha}Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha}Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta\sum_{\alpha\supset\beta}A_{\alpha\beta}\sum_{x_\beta}Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \qquad (5.1)
$$
where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies
$$
\sum_{\alpha\supset\beta}A_{\alpha\beta} = n_\beta - 1 . \qquad (5.2)
$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

$$
L(\{Q_\alpha, Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda_\alpha\}) = -\sum_\alpha\sum_{X_\alpha}Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha}Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\alpha\sum_{\beta\subset\alpha}A_{\alpha\beta}\sum_{x_\beta}Q_\alpha(x_\beta)\log Q_\beta(x_\beta)
$$
$$
+ \sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta}\lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta}A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] + \sum_\alpha\lambda_\alpha\left[1 - \sum_{X_\alpha}Q_\alpha(X_\alpha)\right] + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta) - 1\right]. \qquad (5.3)
$$

³ We would like to conjecture that this is not possible: that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:⁴
$$
\min_{Q_\alpha, Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ L(\{Q_\alpha, Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda_\alpha\}) = \min_{Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ \min_{Q_\alpha}\ L(\{Q_\alpha, Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda_\alpha\}) .
$$

Rewriting
$$
\sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta}\lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta}A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] = -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta}\bar\lambda_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),
$$
with
$$
\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta-1}\sum_{\alpha'\supset\beta}\lambda_{\alpha'\beta}(x_\beta),
$$
we can easily solve for the minimum with respect to $Q_\alpha$:

$$
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right]. \qquad (5.4)
$$

⁴ In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual,"
$$
G(\{Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda_\alpha\}) \equiv L(\{Q^*_\alpha, Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda_\alpha\}) = -\sum_\alpha\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right] + \sum_\alpha\lambda_\alpha + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta) - 1\right]. \qquad (5.5)
$$

Next, we find for the maximum with respect to $\lambda_\alpha$,
$$
\exp\bigl[1 - \lambda^*_\alpha\bigr] = \sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right] \equiv Z^*_\alpha , \qquad (5.6)
$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

$$
G(\{Q_\beta\}, \{\lambda_{\alpha\beta}\}) \equiv G(\{Q_\beta\}, \{\lambda_{\alpha\beta}, \lambda^*_\alpha\}) = -\sum_\alpha\log Z^*_\alpha + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta) - 1\right]. \qquad (5.7)
$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters, and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.
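As a numerical sanity check on this construction, one can verify that substituting $\lambda_\alpha = \lambda^*_\alpha$ from equation 5.6 into equation 5.4 makes $Q^*_\alpha$ a normalized distribution for an arbitrary setting of the remaining parameters. A sketch for a single potential over two three-state nodes (all numbers and variable names are arbitrary choices of mine):

```python
import math, itertools, random

random.seed(0)
states = [0, 1, 2]                      # two nodes, three states each
psi = {X: random.uniform(-1, 1) for X in itertools.product(states, states)}
logQ = [[random.uniform(-2, 0) for _ in states] for _ in range(2)]   # log Q_beta
lam = [[random.uniform(-1, 1) for _ in states] for _ in range(2)]    # lambda-bar
A = [0.5, 0.5]                          # allocations A_{alpha,beta}

def inner(X):
    # sum over beta of { A log Q_beta(x_beta) + lambda-bar(x_beta) }
    return sum(A[b] * logQ[b][X[b]] + lam[b][X[b]] for b in range(2))

# Equation 5.6: Z*_alpha = exp(1 - lambda*_alpha)
Z = sum(math.exp(psi[X]) * math.exp(inner(X)) for X in psi)
lam_star = 1 - math.log(Z)
# Equation 5.4 with lambda_alpha = lambda*_alpha
Qstar = {X: math.exp(psi[X]) * math.exp(lam_star - 1 + inner(X)) for X in psi}
print(abs(sum(Qstar.values()) - 1.0) < 1e-12)  # True
```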

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$ and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \mathop{\rm argmin}_y f(x, y)$,
$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\bigl(x, [y^*(x+\delta) + y^*(x-\delta)]/2\bigr) \ge 2 f(x, y^*(x)),
$$
where the first inequality follows from the convexity of f in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \mathop{\rm argmax}_y f(x, y)$,
$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),
$$
where the first inequality follows from $y^*(x\pm\delta)$ being the unique maximum of $f(x\pm\delta, y)$, and the second inequality from the convexity of $f(x, y)$ in x. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $\tilde G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta - 1}{n_\beta}$, we then have

$$
\tilde G(Q_\beta) \ge -\sum_\alpha\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log Q_\beta(x_\beta)\right] + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta) - 1\right]
$$
$$
\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\,Q_\beta(x_\beta) + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta) - 1\right]
$$
$$
\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\left[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\right] + \sum_\beta(n_\beta-1)\left[-\log\sum_{x_\beta}Q_\beta(x_\beta) + \sum_{x_\beta}Q_\beta(x_\beta) - 1\right]
$$
$$
\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\left[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\right],
$$
where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$; the second from the concavity of the function $y^{(n_\beta-1)/n_\beta}$; the third from the upper bound on the potentials; and the last since $\sum_{x_\beta}Q_\beta(x_\beta) - 1 \ge \log\sum_{x_\beta}Q_\beta(x_\beta)$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian: the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta}A_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},
$$


we make use of
$$
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
$$
where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the conventions $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \ne x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \ne \beta'$, we then have

$$
H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha\supset\beta}A_{\alpha\beta}(1 - A_{\alpha\beta})\,\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta, x'_\beta} + \sum_{\alpha\supset\beta}A^2_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\,Q_\beta(x'_\beta)}
$$
$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha\supset\{\beta,\beta'\}}A_{\alpha\beta}A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\,Q_{\beta'}(x'_{\beta'})} \qquad \text{for } \beta' \ne \beta,
$$
where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee
$$
K \equiv \sum_{\beta,\beta'}\sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta, x'_{\beta'})\,S_{\beta'}(x'_{\beta'}) \ge 0
$$

for any choice of the "vector" S with elements $S_\beta(x_\beta)$. Straightforward manipulations yield
$$
\sum_{\beta,\beta'}\sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta, x'_{\beta'})\,S_{\beta'}(x'_{\beta'}) \qquad (K)
$$
$$
= \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\,Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) \qquad (K_1)
$$
$$
+ \sum_\alpha\sum_{\beta,\beta'\subset\alpha}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) \qquad (K_2)
$$
$$
- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\ne\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) \qquad (K_3)
$$
where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since
$$
K_2 = \sum_\alpha\left[\sum_{\beta\subset\alpha}\sum_{x_\beta}A_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,R_\beta(x_\beta)\right]^2 \ge 0
$$

and⁵
$$
K_3 = \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\ne\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\times\left\{\frac{1}{2}\bigl[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\bigr]^2 - \frac{1}{2}R^2_\beta(x_\beta) - \frac{1}{2}R^2_{\beta'}(x'_{\beta'})\right\}
$$
$$
\ge -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\beta'\subset\alpha}A_{\alpha\beta'} - A_{\alpha\beta}\right)Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta), \qquad (6.1)
$$

we have
$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(1 - \sum_{\beta'\subset\alpha}A_{\alpha\beta'}\right)Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) .
$$

That is, sufficient conditions for K to be nonnegative are
$$
A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\,\beta\subset\alpha} \qquad\text{and}\qquad \sum_{\beta\subset\alpha}A_{\alpha\beta} \le 1 \quad \forall_\alpha ,
$$
precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶
$$
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta,\beta'\subset\alpha} .
$$

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:
$$
K_2 + K_3 = \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)\,R_\beta(x_\beta)\,R_\beta(x'_\beta) \qquad (K_2)
$$
$$
- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\ne\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\bigl[Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\bigr]R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) . \qquad (K_3)
$$

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta\subset\alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into
$$
\sum_{\substack{\alpha'\supset\beta \\ \alpha'\ne\alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1 .
$$
The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound
$$
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\,Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta,\beta'\subset\alpha \\ \beta'\ne\beta}}\ \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1)
$$
with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log\Psi_\alpha(X_\alpha)$ only:
$$
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad\text{with}\quad \omega_\alpha \equiv \max_{X_\alpha, \bar X_\alpha}\left[\psi_\alpha(X_\alpha) + (n_\alpha - 1)\,\psi_\alpha(\bar X_\alpha) - \sum_{\beta\subset\alpha}\psi_\alpha(\bar X_{\alpha\setminus\beta}, x_\beta)\right], \qquad (7.2)
$$
where $n_\alpha \equiv \sum_{\beta\subset\alpha} 1$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

$$
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta\left\{\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta'\ne\beta}\mu_{\beta'}(x'_{\beta'})\,\mu_\beta(x_\beta)\right\}}
= \frac{\Psi(X)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta'\ne\beta}\mu_{\beta'}(x'_{\beta'})}, \qquad (7.3)
$$
where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages $\mu$ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function
$$
V(\mu) = (n-1)\log\left[\sum_X\Psi(X)\prod_{\beta=1}^n\mu_\beta(x_\beta)\right] - \sum_{\beta=1}^n\log\left[\sum_{X_{\setminus\beta}}\Psi(X_{\setminus\beta}, x^*_\beta)\prod_{\beta'\ne\beta}\mu_{\beta'}(x_{\beta'})\right]
$$
with respect to the messages $\mu$, under constraints $\sum_{x_\beta}\mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar x_\beta}$ for some $\bar x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey
$$
\frac{\partial V}{\partial\mu_1(x_1)} = (n-1)\,Q(x_1) - \sum_{\beta\ne1}Q(x_1|x^*_\beta)
$$
$$
\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} = -(n-1)\,Q(x_1)\,Q(x'_1) + \sum_{\beta\ne1}Q(x_1|x^*_\beta)\,Q(x'_1|x^*_\beta),
$$


where
$$
Q(X) \equiv \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)} .
$$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey
$$
(n-1)\,Q(x_1) - \sum_{\beta\ne1}Q(x_1|x^*_\beta) = \lambda,
$$
with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1}\mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, V is indifferent to any multiplicative scaling of $\mu$). For the matrix with second derivatives at such an extremum, we then have

$$
\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} = \frac{1}{2(n-1)}\sum_{\beta\ne1}\sum_{\substack{\beta'\ne1 \\ \beta'\ne\beta}}\bigl[Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'})\bigr]\bigl[Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'})\bigr],
$$
which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\bar X$:

$$
\max_\mu\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar X}\frac{\Psi(X)\bigl[\Psi(\bar X)\bigr]^{n-1}}{\prod_\beta\Psi(\bar X_{\setminus\beta}, x_\beta)} .
$$

Next, we take the maximum over X as well, and define the "strength" $\sigma$, to be used in equation 7.1, through
$$
\frac{1}{1-\sigma} \equiv \max_{X,\mu}\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\bar X}\frac{\Psi(X)\bigl[\Psi(\bar X)\bigr]^{n-1}}{\prod_\beta\Psi(\bar X_{\setminus\beta}, x_\beta)} . \qquad (7.4)
$$


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta,\beta'\}}$ in
$$
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\,Q^*(X) .
$$

The form of equation 7.2 then follows by rewriting equation 7.4 as
$$
\omega \equiv -\log(1-\sigma) = \max_{X,\bar X}W(X, \bar X) \quad\text{with}\quad W(X, \bar X) = \psi(X) + (n-1)\,\psi(\bar X) - \sum_\beta\psi(\bar X_{\setminus\beta}, x_\beta),
$$
where we recall that $\psi(X) \equiv \log\Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is, if $\tilde\Psi(X) = \Psi(X)\prod_\beta\mu_\beta(x_\beta)$, then $\omega(\tilde\Psi) = \omega(\Psi)$ for any choice of $\mu$. This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and $\bar X$ that differ in fewer than two nodes. To see this, consider combinations that share all values $x_{\setminus12}$ except those of two nodes, say 1 and 2:
$$
W(x_1, x_2, x_{\setminus12};\ \bar x_1, \bar x_2, x_{\setminus12}) = \psi(x_1, x_2, x_{\setminus12}) + \psi(\bar x_1, \bar x_2, x_{\setminus12}) - \psi(x_1, \bar x_2, x_{\setminus12}) - \psi(\bar x_1, x_2, x_{\setminus12}) = -W(\bar x_1, x_2, x_{\setminus12};\ x_1, \bar x_2, x_{\setminus12}) .
$$
If now also $\bar x_2 = x_2$, we get $W(x_1, x_2, x_{\setminus12};\ \bar x_1, x_2, x_{\setminus12}) = -W(\bar x_1, x_2, x_{\setminus12};\ x_1, x_2, x_{\setminus12}) = 0$. Furthermore, if $W(x_1, x_2, x_{\setminus12};\ \bar x_1, \bar x_2, x_{\setminus12}) \le 0$, then it must be that $W(\bar x_1, x_2, x_{\setminus12};\ x_1, \bar x_2, x_{\setminus12}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain
$$
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)| . \qquad (7.5)
$$
Any pairwise binary potential can be written as a Boltzmann factor,
$$
\Psi(x_1, x_2) \propto \exp[w\,x_1 x_2 + \theta_1 x_1 + \theta_2 x_2] .
$$
In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" T; that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
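The strength in equation 7.2 is directly computable by enumerating pairs of configurations. The sketch below (function name and layout are mine) does this for an arbitrary table of log potentials and confirms the pairwise binary special case $\omega = |w|$ for a Boltzmann factor in {0, 1} coding:

```python
import itertools

def strength_omega(psi, card):
    """omega from equation 7.2 by enumeration.
    psi: function mapping a state tuple X to log Psi(X).
    card: list with the number of states of each of the n nodes."""
    n = len(card)
    states = list(itertools.product(*[range(c) for c in card]))
    best = 0.0  # combinations differing in < 2 nodes yield W = 0
    for X in states:
        for Xb in states:  # the second configuration (X-bar)
            w = psi(X) + (n - 1) * psi(Xb)
            for b in range(n):  # replace node b of X-bar by its value in X
                Xmix = Xb[:b] + (X[b],) + Xb[b + 1:]
                w -= psi(Xmix)
            best = max(best, w)
    return best

# Pairwise binary Boltzmann factor: psi = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = 1.3, 0.7, -0.4
psi = lambda X: w * X[0] * X[1] + th1 * X[0] + th2 * X[1]
omega = strength_omega(psi, [2, 2])
print(round(omega, 10))  # 1.3 : the thresholds drop out
```

Halving all log potentials (temperature T = 2) halves the strength, and a constant (fake) potential has strength zero, in line with the properties listed above.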

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\,\beta\subset\alpha}$ (positivity),
2. $(1-\sigma_\alpha)\max_{\beta\subset\alpha}A_{\alpha\beta} + \sigma_\alpha\sum_{\beta\subset\alpha}A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha\supset\beta}A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation), $\qquad$ (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain
$$
K_3 \ge -\sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\ne\beta}}\sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,\sigma_\alpha\,Q^*_\alpha(x_\beta, x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'})
$$
$$
\ge -\sum_\alpha\sigma_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\substack{\beta'\subset\alpha \\ \beta'\ne\beta}}A_{\alpha\beta'}\right)Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta),
$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, and combining $K_1$ and (the above lower bound on) $K_3$, we get
$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left[1 - A_{\alpha\beta} - \sigma_\alpha\sum_{\beta'\ne\beta}A_{\alpha\beta'}\right]Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) .
$$

This implies the requirement
$$
(1-\sigma_\alpha)\,A_{\alpha\beta} + \sigma_\alpha\sum_{\beta'\subset\alpha}A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\,\beta\subset\alpha},
$$
which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha}A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta\subset\alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
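Verifying the conditions of equation 8.1 for a given allocation matrix is again mechanical. The sketch below (data layout and names are mine) uses the uniform choice $A_{\alpha\beta} = 1 - \sigma_\alpha$ from the proof of corollary 4 on a graph with two connected loops: the structure alone ($\sigma = 1$, the theorem 1 limit) fails, but sufficiently weak interactions pass.

```python
def theorem4_conditions(potentials, sigma, A, tol=1e-12):
    """Conditions 1-3 of equation 8.1 for given strengths sigma[alpha]."""
    nodes = set().union(*potentials)
    n = {b: sum(1 for P in potentials if b in P) for b in nodes}
    for i, P in enumerate(potentials):
        a = [A.get((i, b), 0.0) for b in P]
        if any(x < -tol for x in a):
            return False                                  # condition 1
        if (1 - sigma[i]) * max(a) + sigma[i] * sum(a) > 1 + tol:
            return False                                  # condition 2
    return all(sum(A.get((i, b), 0.0) for i, P in enumerate(potentials) if b in P)
               >= n[b] - 1 - tol for b in nodes)          # condition 3

two_loops = [{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 0}]

def uniform_A(potentials, sigma):
    # the choice A_{alpha,beta} = 1 - sigma_alpha from the proof of corollary 4
    return {(i, b): 1 - sigma[i] for i, P in enumerate(potentials) for b in P}

weak = [0.3] * 5     # sum of sigma around every node <= 0.9 < 1
strong = [1.0] * 5   # theorem 1 limit: structure alone decides
print(theorem4_conditions(two_loops, weak, uniform_A(two_loops, weak)),
      theorem4_conditions(two_loops, strong, uniform_A(two_loops, strong)))
# True False
```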

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization
$$
P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha\Psi_\alpha(X_\alpha)\prod_\beta\Psi_\beta(x_\beta),
$$
to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if
$$
\sum_{\alpha\supset\beta}\left(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\right) < 2 \quad \forall_\beta . \qquad (8.2)
$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if
$$
\sum_{\alpha\supset\beta}\omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)
$$
with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice
$$
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\phi_{\alpha\beta}(x_\beta),
$$
where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$
\min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\right\} = \min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr] - \min_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr]\right\}.
$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that
$$
\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2) = \psi(\bar x_1,\bar x_2)+\phi_1(\bar x_1)+\phi_2(\bar x_2) \ge \psi(x_1,\bar x_2)+\phi_1(x_1)+\phi_2(\bar x_2) = \psi(\bar x_1,x_2)+\phi_1(\bar x_1)+\phi_2(x_2) \qquad (8.4)
$$
for some $x_1$, $x_2$, $\bar x_1$, and $\bar x_2$ with $\bar x_1 \ne x_1$ and $\bar x_2 \ne x_2$.

for some \(x_1\), \(x_2\), \(\bar x_1\), and \(\bar x_2\) with \(\bar x_1 \neq x_1\) and \(\bar x_2 \neq x_2\). Solving for \(\phi_1\) and \(\phi_2\), we find
\[
\phi_1(x_1)-\phi_1(\bar x_1) = \tfrac{1}{2}\bigl[\psi(\bar x_1,\bar x_2)-\psi(x_1,x_2)+\psi(\bar x_1,x_2)-\psi(x_1,\bar x_2)\bigr]
\]
\[
\phi_2(x_2)-\phi_2(\bar x_2) = \tfrac{1}{2}\bigl[\psi(\bar x_1,\bar x_2)-\psi(x_1,x_2)+\psi(x_1,\bar x_2)-\psi(\bar x_1,x_2)\bigr].
\]

Substitution back into equation 8.4 yields
\[
\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)-\psi(x_1,\bar x_2)-\phi_1(x_1)-\phi_2(\bar x_2)
= \tfrac{1}{2}\bigl[\psi(x_1,x_2)+\psi(\bar x_1,\bar x_2)-\psi(x_1,\bar x_2)-\psi(\bar x_1,x_2)\bigr],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find
\[
\min_{\phi_1,\phi_2}\Bigl[\max_{x_1,x_2} \tilde\psi(x_1,x_2) - \min_{x_1,x_2} \tilde\psi(x_1,x_2)\Bigr]
= \tfrac{1}{2}\,|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)| = \frac{\omega}{2},
\]
with \(\omega\) from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.

Uniqueness of Loopy Belief Propagation Fixed Points 2407

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \quad \forall\beta , \qquad (8.5)
\]

with \(\omega_\alpha\) defined in equation 7.2.

Proof. Consider the allocation matrix with components \(A_{\alpha\beta} = 1-\sigma_\alpha\) for all \(\beta \subset \alpha\). With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) \(\sigma_\alpha \leq 1\) and (condition 2)
\[
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1 .
\]

Substitution into condition 3 yields
\[
\sum_{\alpha\supset\beta} (1-\sigma_\alpha) \geq \sum_{\alpha\supset\beta} 1 \;-\; 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta} \sigma_\alpha \leq 1 . \qquad (8.6)
\]

Since \(\omega_\alpha = -\log(1-\sigma_\alpha) \geq \sigma_\alpha\), condition 8.5 is more restrictive than condition 8.6, and the corollary follows.

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to
\[
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix} .
\]

The trivial solution, which is the only minimum of the Bethe free energy for small \(\alpha\), is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical \(\alpha_{\text{critical}} = 2/3 \approx 0.67\). For \(\alpha > 2/3\), we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by
\[
\omega = 2 \log\Bigl[\frac{\alpha}{1-\alpha}\Bigr] \quad\text{and thus}\quad \sigma = 1 - \Bigl(\frac{1-\alpha}{\alpha}\Bigr)^2 .
\]
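With these closed-form strengths, the critical values quoted below are easy to reproduce. A small check (the function names are mine; the threshold values follow from the conditions on the torus, where each node has four neighboring potentials):

```python
import math

def strength_omega(alpha):
    # omega = 2 log[alpha / (1 - alpha)]
    return 2 * math.log(alpha / (1 - alpha))

def strength_sigma(alpha):
    # sigma = 1 - ((1 - alpha) / alpha)^2, i.e. sigma = 1 - exp(-omega)
    return 1 - ((1 - alpha) / alpha) ** 2

# theorem 4 on the torus yields sigma <= 1/3, i.e. alpha <= 1/(1 + sqrt(2/3))
alpha_thm4 = 1 / (1 + math.sqrt(2.0 / 3.0))
# corollary 3 on the torus yields 4*omega < 4, i.e. omega < 1, i.e. alpha <= 1/(1 + e^(-1/2))
alpha_cor3 = 1 / (1 + math.exp(-0.5))
print(round(alpha_thm4, 2), round(alpha_cor3, 2))
```

The two printed values are the 0.55 and 0.62 discussed in the text.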


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops (left). The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 - 2A in (b) and C = 1 - A in (c), the optimal settings for the single remaining variable A then boil down to 3/4 and \(1 - \sqrt{1/8}\), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields
\[
\sigma \leq \frac{1}{3} \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{2/3}} \approx 0.55 .
\]

The critical value that follows from corollary 3 is in this case slightly better:
\[
\omega < 1 \quad\text{and thus}\quad \alpha \leq \frac{1}{1+e^{-1/2}} \approx 0.62 .
\]

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical \(\alpha_{\text{critical}} \approx 0.79\). The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for \(\alpha < 0.62\). Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem, it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:
\[
(2-2A)\sigma + \frac{3}{4} \leq 1 \quad\text{and}\quad \frac{1}{2}\sigma + A \leq 1 .
\]

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding
\[
\sigma \leq \frac{1}{2} \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{1/2}} \approx 0.58 ,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields \(A = 1 - \sqrt{1/8}\), with
\[
\sigma \leq \sqrt{\frac{1}{2}} \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65 ,
\]
better than the \(\alpha < 0.62\) from corollary 3, and to be compared with the critical \(\alpha_{\text{critical}} \approx 0.88\).

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. And the second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual \(G(Q_\beta, \lambda_{\alpha\beta})\) with respect to \(Q_\beta\). But in fact, we need only \(\tilde G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})\) to be convex, which is a weaker requirement. The Hessian of \(\tilde G(Q_\beta)\), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of \(A_{\alpha\beta}\)).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights
\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with \(P_i(x_i) = 0.5\) for all nodes \(i\) and \(x_i = 0, 1\), as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.\(^7\) This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
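A minimal pairwise loopy BP sketch for this four-node Boltzmann machine can reproduce the convergent regime. This is an illustration under my own assumptions: the random initialization, parallel schedule, step size, and iteration count are choices not specified in the text.

```python
import math
import random

W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_bp(omega, eps=0.5, iters=2000, seed=0):
    n = len(W)
    # precompute w_ij / 4 with w = omega * W; psi_ij = exp[+-w_ij/4]
    w = [[omega * W[i][j] / 4.0 for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    # log-messages mu_{i->j}(x_j), randomly initialized near uniform
    msg = {(i, j): [rng.uniform(-0.5, 0.5) for _ in (0, 1)]
           for i in range(n) for j in range(n) if i != j}
    for _ in range(iters):
        new = {}
        for (i, j), m in msg.items():
            out = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    inc = sum(msg[(k, i)][xi] for k in range(n) if k not in (i, j))
                    s += math.exp((w[i][j] if xi == xj else -w[i][j]) + inc)
                out.append(math.log(s))
            z = (out[0] + out[1]) / 2.0
            # damped update in the log domain, step size eps (cf. equation 3.9)
            new[(i, j)] = [(1 - eps) * m[x] + eps * (out[x] - z) for x in (0, 1)]
        msg = new
    marg = []
    for i in range(n):
        l = [sum(msg[(k, i)][x] for k in range(n) if k != i) for x in (0, 1)]
        e = [math.exp(v - max(l)) for v in l]
        marg.append(e[1] / (e[0] + e[1]))
    return marg

print(run_bp(1.0))   # a weight strength well inside the convergent regime
```

At weight strength 1 (below even the uniqueness bound of the footnote), the marginals settle at the trivial fixed point (0.5, 0.5); raising the strength toward the transition region reproduces the oscillatory behavior described above.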

7 Note that the conditions for guaranteed uniqueness imply \(\omega = 4/3\) for corollary 3 and \(\omega = \log(2) \approx 0.69\) for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal \(P_1(x_1 = 1)\) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical \(\alpha_{\text{critical}}\)'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


At an extremum of the Bethe free energy satisfying the constraints, all derivatives of L are zero: the ones with respect to the Lagrange multipliers λ give back the constraints; the ones with respect to the pseudomarginals Q give an extremum of the Bethe free energy. Setting the derivatives with respect to \(Q_\alpha\) and \(Q_\beta\) to zero, we can solve for \(Q_\alpha\) and \(Q_\beta\) in terms of the Lagrange multipliers:

\[
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha} \lambda_{\alpha\beta}(x_\beta)\Bigr]
\]
\[
Q^*_\beta(x_\beta) = \exp\Bigl[\frac{1}{n_\beta - 1}\Bigl(1 - \lambda_\beta + \sum_{\alpha\supset\beta} \lambda_{\alpha\beta}(x_\beta)\Bigr)\Bigr] .
\]

In terms of the "message" \(\mu_{\beta\to\alpha}(x_\beta) \equiv \exp[\lambda_{\alpha\beta}(x_\beta)]\) from node \(\beta\) to potential \(\alpha\), the pseudomarginal \(Q^*_\alpha(X_\alpha)\) reads
\[
Q^*_\alpha(X_\alpha) \propto \Psi_\alpha(X_\alpha) \prod_{\beta\subset\alpha} \mu_{\beta\to\alpha}(x_\beta), \qquad (3.5)
\]

where proper normalization yields the Lagrange multiplier \(\lambda_\alpha\). With the definition
\[
\mu_{\alpha\to\beta}(x_\beta) \equiv \frac{Q^*_\beta(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)}, \qquad (3.6)
\]

the fixed-point equation for \(Q^*_\beta(x_\beta)\) can, after some manipulation, be written in the form
\[
Q^*_\beta(x_\beta) \propto \prod_{\alpha\supset\beta} \mu_{\alpha\to\beta}(x_\beta), \qquad (3.7)
\]

where again the Lagrange multiplier \(\lambda_\beta\) follows from normalization. Finally, the constraint \(Q^*_\alpha(x_\beta) = Q^*_\beta(x_\beta)\), in combination with equation 3.6, suggests the update
\[
\mu_{\alpha\to\beta}(x_\beta) = \frac{Q^*_\alpha(x_\beta)}{\mu_{\beta\to\alpha}(x_\beta)} . \qquad (3.8)
\]

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; the outgoing message is the pseudomarginal divided by the incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

\[
\log \mu^{\text{new}}_{\alpha\to\beta}(x_\beta) = \log \mu_{\alpha\to\beta}(x_\beta) + \epsilon\,[\log Q^*_\alpha(x_\beta) - \log \mu_{\beta\to\alpha}(x_\beta) - \log \mu_{\alpha\to\beta}(x_\beta)] . \qquad (3.9)
\]

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points are the zero derivatives of the Lagrangian.
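On a tree, iterating the two message types of equations 3.5 through 3.8 reproduces the exact marginals. The sketch below is a hedged illustration on a small hypothetical chain with one bias factor (the potential tables are my own choices); it runs the factor-to-node and node-to-factor updates and compares with brute-force marginals:

```python
import math
from itertools import product

# a small tree-structured factor graph: chain 0-1-2 plus a bias factor on node 0
factors = [
    ((0,), {(0,): 1.0, (1,): 2.0}),
    ((0, 1), {(a, b): math.exp(0.8 if a == b else -0.8) for a in (0, 1) for b in (0, 1)}),
    ((1, 2), {(a, b): math.exp(-0.5 if a == b else 0.5) for a in (0, 1) for b in (0, 1)}),
]
n_nodes = 3

def bp_marginals(iters=50):
    # mu_n2f: node -> factor messages; mu_f2n: factor -> node messages
    mu_n2f = {(f, v): [1.0, 1.0] for f, (vs, _) in enumerate(factors) for v in vs}
    mu_f2n = {(f, v): [1.0, 1.0] for f, (vs, _) in enumerate(factors) for v in vs}
    for _ in range(iters):
        for f, (vs, table) in enumerate(factors):
            for v in vs:  # factor-to-node: sum out the other nodes of the factor
                out = [0.0, 0.0]
                for states in product((0, 1), repeat=len(vs)):
                    p = table[states]
                    for u, s in zip(vs, states):
                        if u != v:
                            p *= mu_n2f[(f, u)][s]
                    out[states[vs.index(v)]] += p
                z = sum(out)
                mu_f2n[(f, v)] = [o / z for o in out]
        for f, (vs, _) in enumerate(factors):
            for v in vs:  # node-to-factor: product of the other incoming messages
                prod_in = [1.0, 1.0]
                for g, (ws, _) in enumerate(factors):
                    if g != f and v in ws:
                        prod_in = [prod_in[x] * mu_f2n[(g, v)][x] for x in (0, 1)]
                z = sum(prod_in)
                mu_n2f[(f, v)] = [o / z for o in prod_in]
    marg = []
    for v in range(n_nodes):  # node pseudomarginal = product of incoming messages
        b = [1.0, 1.0]
        for f, (vs, _) in enumerate(factors):
            if v in vs:
                b = [b[x] * mu_f2n[(f, v)][x] for x in (0, 1)]
        z = sum(b)
        marg.append([o / z for o in b])
    return marg

def exact_marginals():
    marg = [[0.0, 0.0] for _ in range(n_nodes)]
    for states in product((0, 1), repeat=n_nodes):
        p = 1.0
        for vs, table in factors:
            p *= table[tuple(states[v] for v in vs)]
        for v in range(n_nodes):
            marg[v][states[v]] += p
    return [[a / (a + b), b / (a + b)] for a, b in marg]

print(bp_marginals())
print(exact_marginals())
```

Because the graph is singly connected, the two printouts agree; on loopy graphs the same iteration yields only approximate marginals, as discussed throughout the article.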

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,
\[
\min_{Q_\alpha,Q_\beta} \; \max_{\lambda_{\alpha\beta},\lambda_\alpha,\lambda_\beta} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) .
\]

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters \(\{Q_\alpha, Q_\beta\}\). This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in \(\{Q_\alpha, Q_\beta\}\).

4.2 Conditions for Convexity. The problem is with the term
\[
S_\beta(Q_\beta) \equiv -\sum_{x_\beta} Q_\beta(x_\beta) \log Q_\beta(x_\beta),
\]

which is concave in \(Q_\beta\). Using the constraint \(Q_\beta(x_\beta) = Q_\alpha(x_\beta)\), we can turn it into a functional that is convex in \(Q_\alpha\) and \(Q_\beta\) separately, but not necessarily jointly. That is, with the substitution \(Q_\beta(x_\beta) = Q_\alpha(x_\beta)\) for any \(\alpha \supset \beta\), the entropy, and thus the Bethe free energy, is convex in \(Q_\alpha\) and in \(Q_\beta\), but not necessarily in \(\{Q_\alpha, Q_\beta\}\). However, if we add to \(S_\beta(Q_\beta)\) a convex entropy contribution,
\[
-S_\alpha(Q_\alpha) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha),
\]


the combination of \(-S_\alpha\) and \(S_\beta\) is convex in \(\{Q_\alpha, Q_\beta\}\), as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1.
\[
F_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta)
\]
is convex in \(\{Q_\alpha, Q_\beta\}\).

Proof. The matrix with second derivatives of \(F_{\alpha\beta}\) has the components
\[
H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\,\delta_{X_\alpha,X'_\alpha}
\]
\[
H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\,\partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta}
\]
\[
H(x_\beta, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\beta(x_\beta)\,\partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta},
\]

where we note that \(X_\alpha\) and \(x_\beta\) should be interpreted as indices. Convexity requires that for any "vector" \((R_\alpha(X_\alpha), R_\beta(x_\beta))\),
\[
0 \leq \bigl(R_\alpha(X_\alpha)\;\; R_\beta(x_\beta)\bigr) \begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix} \begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix}
\]
\[
= \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2 \sum_{X_\alpha} \frac{R_\alpha(X_\alpha)\, R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta)\, R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)}
= \sum_{X_\alpha} Q_\alpha(X_\alpha) \Bigl[\frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)}\Bigr]^2 .
\]
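The final identity can be checked numerically. In this sketch, which is my own illustration (the sizes, the particular \(Q_\beta\), and the random seed are arbitrary choices), \(X_\alpha\) ranges over two binary nodes and \(\beta\) is the first of them:

```python
import random
random.seed(1)

# X_alpha = (x1, x2) with binary states, enumerated by i with x1 = i // 2;
# beta is node 1, so Q_alpha(x_beta) marginalizes over x2
qa = [random.random() for _ in range(4)]
Z = sum(qa)
qa = [q / Z for q in qa]                      # Q_alpha over (x1, x2)
qa_b = [qa[0] + qa[1], qa[2] + qa[3]]         # marginal Q_alpha(x_beta)
qb = [0.3, 0.7]                               # a positive pseudomarginal Q_beta

ra = [random.uniform(-1, 1) for _ in range(4)]  # arbitrary direction (R_alpha, R_beta)
rb = [random.uniform(-1, 1) for _ in range(2)]

# closed form from the text: sum_Xa Q_a [R_a/Q_a - R_b(x_beta)/Q_b(x_beta)]^2
quad = sum(qa[i] * (ra[i] / qa[i] - rb[i // 2] / qb[i // 2]) ** 2 for i in range(4))
# the same quadratic form written out term by term
quad2 = (sum(ra[i] ** 2 / qa[i] for i in range(4))
         - 2 * sum(ra[i] * rb[i // 2] / qb[i // 2] for i in range(4))
         + sum(qa_b[x] * rb[x] ** 2 / qb[x] ** 2 for x in (0, 1)))
print(quad, quad2)
```

The two expressions agree to machine precision and are nonnegative, which is exactly the convexity statement of the lemma.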

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources \(Q_\alpha \log Q_\alpha\) to compensate for the concave \(-Q_\beta \log Q_\beta\) terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix \(A_{\alpha\beta}\) between potentials \(\alpha\) and nodes \(\beta\) satisfying

1. \(A_{\alpha\beta} \geq 0 \quad \forall \alpha, \beta\subset\alpha\) (positivity)

2. \(\sum_{\beta\subset\alpha} A_{\alpha\beta} \leq 1 \quad \forall\alpha\) (sufficient amount of resources)

3. \(\sum_{\alpha\supset\beta} A_{\alpha\beta} \geq n_\beta - 1 \quad \forall\beta\) (sufficient compensation)

(4.1)


Proof. First, we note that we do not have to worry about the energy terms that are linear in \(Q_\alpha\). In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,
\[
-S(Q) = -\Bigl[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1) S_\beta(Q_\beta)\Bigr],
\]

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation. For each concave contribution \((n_\beta - 1) S_\beta\), we have to find convex contributions \(-S_\alpha\) to compensate for it. Let \(A_{\alpha\beta}\) denote the "amount of resources" that we take from potential subset \(\alpha\) to compensate for node \(\beta\). Now, in shorthand notation and with a little bit of rewriting,
\[
-S(Q) = -\Bigl[\sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1) S_\beta\Bigr]
\]
\[
= -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} + \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\beta \Bigl[-\sum_{\alpha\supset\beta} A_{\alpha\beta} + \sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr] S_\beta
\]
\[
= -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta}\,[S_\alpha - S_\beta] - \sum_\beta \Bigl[\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta - 1)\Bigr] S_\beta .
\]

Convexity of the first term is guaranteed if \(1 - \sum_\beta A_{\alpha\beta} \geq 0\) (condition 2), of the second term if \(A_{\alpha\beta} \geq 0\) (condition 1 and lemma 1), and of the third term if \(\sum_\alpha A_{\alpha\beta} - (n_\beta - 1) \geq 0\) (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root \(\beta^*\) and define
\[
A_{\alpha\beta} = 1 \;\text{iff}\; \beta \subset \alpha \;\text{and}\; \beta \;\text{closer to the root}\; \beta^* \;\text{than any other}\; \beta' \subset \alpha; \qquad A_{\alpha\beta'} = 0 \;\text{for all other}\; \beta'.
\]

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each \(\beta \neq \beta^*\), there is just a single potential \(\alpha \supset \beta\) that is closer to the root \(\beta^*\) than \(\beta\) itself (see the illustration in Figure 2), and thus there are precisely \(n_\beta - 1\) contributions \(A_{\alpha\beta} = 1\). The root itself gets \(n_{\beta^*}\) contributions \(A_{\alpha\beta^*} = 1\), which is even better. Hence, condition 3 is also satisfied:
\[
\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall \beta \neq \beta^* \quad\text{and}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1 .
\]

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place; that is, remove one node \(\beta^*\) from a potential \(\alpha^*\) such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node \(\beta^*\) as the root. The matrix A constructed in this way also just works for the graph with the closed loop, since still
\[
\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall \beta \neq \beta^*, \quad\text{and now}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1 .
\]

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, \(1 - n_\beta\), with \(n_\beta\) the number of neighboring potentials. The arrows pointing from potentials \(\alpha\) to nodes \(\beta\) visualize the allocation matrix A, with \(A_{\alpha\beta} = 1\) if there is an arrow and \(A_{\alpha\beta} = 0\) otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely \(n_\beta - 1\) incoming arrows, just sufficient to compensate the overcounting number \(1 - n_\beta\). The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have
\[
\sum_{\beta\in B} (1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \geq 0, \qquad (4.2)
\]
where \(\pi(B) \equiv \{\alpha : \exists \beta \in B,\, \beta \subset \alpha\}\) denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.
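Condition 4.2 can be enumerated directly on small graphs. The sketch below uses two toy graphs of my own choosing and confirms the picture painted above: a single loop passes, while two loops sharing an edge fail.

```python
from itertools import combinations

def convex_over_constraints(edges):
    # theorem 2: for every node set B, sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0
    nodes = sorted({v for e in edges for v in e})
    n = {v: sum(1 for e in edges if v in e) for v in nodes}
    for r in range(1, len(nodes) + 1):
        for B in combinations(nodes, r):
            parents = [e for e in edges if any(v in B for v in e)]
            if sum(1 - n[v] for v in B) + len(parents) < 0:
                return False
    return True

triangle = [(0, 1), (1, 2), (0, 2)]                      # a single loop
two_loops = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)]     # two loops sharing an edge
print(convex_over_constraints(triangle), convex_over_constraints(two_loops))
```

For the two connected loops, the violating set B is the full node set, matching the proof of proposition 1 below the statement.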

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix \(A_{\alpha\beta}\) satisfying the conditions of equation 4.1. Then for any set B,
\[
\sum_{\beta\in B} (n_\beta - 1) \leq \sum_{\beta\in B} \sum_{\alpha\supset\beta} A_{\alpha\beta} \leq \sum_{\alpha\in\pi(B)} \sum_{\beta\subset\alpha} A_{\alpha\beta} \leq \sum_{\alpha\in\pi(B)} 1,
\]

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials \(\Psi_\alpha(X_\alpha)\) do not play any role. These potentials appear only in the energy term that is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential \(\Psi_\alpha(X_\alpha) = 1\) can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,\(^3\) the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint \(Q_\alpha(x_\beta) = Q_\beta(x_\beta)\) to write the Bethe free energy in the "more convex" form

\[
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha\supset\beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta), \qquad (5.1)
\]

where the allocation matrix \(A_{\alpha\beta}\) can be any matrix that satisfies
\[
\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 . \qquad (5.2)
\]

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals \(Q_\alpha\) alone. This then yields

\[
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = {} & -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) \\
& - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) \\
& + \sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \Bigl[\frac{1}{n_\beta - 1} \sum_{\alpha'\supset\beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr] \\
& + \sum_\alpha \lambda_\alpha \Bigl[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\Bigr] + \sum_\beta (n_\beta - 1) \Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr] . \qquad (5.3)
\end{aligned}
\]

3 We would like to conjecture that this is not possible, that is, that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint \(Q_\beta(x_\beta) = Q_\alpha(x_\beta)\), as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to \(Q_\beta\). It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in \(\{Q_\alpha, Q_\beta\}\), they are convex in \(Q_\alpha\) and \(Q_\beta\) separately. Therefore, we can interchange the minimum over the pseudomarginals \(Q_\alpha\) and the maximum over the Lagrange multipliers, as long as we leave the minimum over \(Q_\beta\) as the final operation:\(^4\)
\[
\min_{Q_\alpha,Q_\beta} \max_{\lambda_{\alpha\beta},\lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta} \max_{\lambda_{\alpha\beta},\lambda_\alpha} \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) .
\]

Rewriting
\[
\sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \Bigl[\frac{1}{n_\beta - 1} \sum_{\alpha'\supset\beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\Bigr] = -\sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),
\]
with
\[
\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1} \sum_{\alpha'\supset\beta} \lambda_{\alpha'\beta}(x_\beta),
\]

we can easily solve for the minimum with respect to \(Q_\alpha\):
\[
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha} \bigl(A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr)\Bigr] . \qquad (5.4)
\]

4 In principle, we could also first take the minimum over \(Q_\beta\) and leave the minimum over \(Q_\alpha\), but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"
\[
\begin{aligned}
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) & \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
& = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\Bigl[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha} \bigl(A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr)\Bigr] \\
& \quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr] . \qquad (5.5)
\end{aligned}
\]

Next we find for the maximum with respect to λα

exp[1minus λlowastα

] =sumXα

α(Xα) exp

[sumβsubα

Aαβ log Qβ(xβ)+ λαβ(xβ)

]

equiv Zlowastα (56)

where we have to keep in mind that Zlowastα by itself like Qlowastα is a function of theremaining pseudomarginals Qβ and Lagrange multipliers λαβ Substitutingthis solution into the dual we arrive at

G(Qβ λαβ) equiv G(Qβ λαβ λlowastα)

= minussumα

log Zlowastα +sumβ

(nβ minus 1)

[sumxβ

Qβ(xβ)minus 1

] (57)

Let us pause here for a moment and reflect on what we have done so farThe Lagrangian equation 53 being convex in Qα has a unique minimumin Qα (given all other parameters fixed) which is also the only extremumIt happens to be relatively straightforward to express the value at this mini-mum in terms of the remaining parameters and then also to find the optimal(maximal) λlowastα Plugging these values into the Lagrangian equation 53 wehave not lost anything That is zero derivatives of the Lagrangian are still inone-to-one correspondence with zero derivatives of the dual equation 57and thus with fixed points of loopy belief propagation

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ_{αβ}, and then the minimum over the remaining pseudomarginals Q_β. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in Q_β. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Q_α, Q_β} implies convexity of the dual, equation 5.7, in Q_β.

Proof. First we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y^*(x) ≡ argmin_y f(x, y),

\[
f(x+\delta,y^*(x+\delta)) + f(x-\delta,y^*(x-\delta)) \ge 2f\bigl(x,\tfrac12[y^*(x+\delta)+y^*(x-\delta)]\bigr) \ge 2f(x,y^*(x)),
\]

where the first inequality follows from the convexity of f in (x, y) and the second from y^*(x) being the unique minimum of f(x, y). Therefore the dual, equation 5.5, is convex in Q_β when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Q_α, Q_β}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ_{αβ}, λ_α}. Next we note that the maximum of a convex (or concave) function over its maximizing parameters is again convex: with y^*(x) ≡ argmax_y f(x, y),

\[
f(x+\delta,y^*(x+\delta)) + f(x-\delta,y^*(x-\delta)) \ge f(x+\delta,y^*(x)) + f(x-\delta,y^*(x)) \ge 2f(x,y^*(x)),
\]

where the first inequality follows from y^*(x ± δ) being the unique maximum of f(x ± δ, y) and the second from the convexity of f(x, y) in x. Hence the dual, equation 5.7, must still be convex in Q_β.

So far, we have neither gained nor lost anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof leave a little room, which will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we prove here that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψ_α(X_α) ≤ Ψ_max for all α and X_α, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function G(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) is bounded from below for a particular choice of A_{αβ} satisfying equation 5.2. Considering A_{αβ} = (n_β − 1)/n_β, we then have

\[
\begin{aligned}
G(Q_\beta) &\ge -\sum_\alpha\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log Q_\beta(x_\beta)\right]
+ \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
&\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\,Q_\beta(x_\beta)
+ \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
&\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}
+ \sum_\beta(n_\beta-1)\left[-\log\sum_{x_\beta}Q_\beta(x_\beta)+\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
&\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_{\alpha\setminus\beta}}\Psi_{\max},
\end{aligned}
\]

where the first inequality follows by substituting the choice λ_{αβ}(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_{αβ}); the second from the concavity of the function y^{(n_β−1)/n_β}; the third from the upper bound on the potentials; and the fourth from log y ≤ y − 1.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

\[
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta} A_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta-1),
\]

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

\[
H_{\beta\beta'}(x_\beta,x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},
\]


we make use of

\[
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta,x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
\]

where both β and β' should be a subset of α, and with the conventions Q^*_α(x_β, x_β) = Q^*_α(x_β) and Q^*_α(x_β, x'_β) = 0 if x_β ≠ x'_β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β' and β ≠ β', we then have

\[
\begin{aligned}
H_{\beta\beta}(x_\beta,x'_\beta) &= \sum_{\alpha\supset\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\,\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta}
+ \sum_{\alpha\supset\beta} A^2_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\,Q_\beta(x'_\beta)}\\
H_{\beta\beta'}(x_\beta,x'_{\beta'}) &= -\sum_{\alpha\supset\beta,\beta'} A_{\alpha\beta}A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta,x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\,Q_{\beta'}(x'_{\beta'})}
\quad\text{for }\beta'\neq\beta,
\end{aligned}
\]

where δ_{x_β,x'_β} = 1 if and only if x_β = x'_β. Here it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_{αβ} and pseudomarginals Q_β.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Q_β, λ_{αβ}}, that is, conditions that guarantee

\[
K \equiv \sum_{\beta\beta'}\sum_{x_\beta,x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta,x'_{\beta'})\,S_{\beta'}(x'_{\beta'}) \ge 0
\]

for any choice of the "vector" S with elements S_β(x_β). Straightforward manipulation yields

\[
\begin{aligned}
K = \sum_{\beta\beta'}\sum_{x_\beta,x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta,x'_{\beta'})\,S_{\beta'}(x'_{\beta'})
&= \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\,Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) &(K_1)\\
&\quad+ \sum_\alpha\sum_{\beta,\beta'\subset\alpha}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) &(K_2)\\
&\quad- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\\beta'\neq\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta,x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}), &(K_3)
\end{aligned}
\]

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

\[
K_2 = \sum_\alpha\left[\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,R_\beta(x_\beta)\right]^2 \ge 0
\]

and^5

\[
\begin{aligned}
K_3 &= \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\\beta'\neq\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta,x'_{\beta'})
\left\{\frac12\bigl[R_\beta(x_\beta)-R_{\beta'}(x'_{\beta'})\bigr]^2 - \frac12 R^2_\beta(x_\beta) - \frac12 R^2_{\beta'}(x'_{\beta'})\right\}\\
&\ge -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\beta'\subset\alpha} A_{\alpha\beta'} - A_{\alpha\beta}\right) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta), \tag{6.1}
\end{aligned}
\]

we have

\[
K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(1-\sum_{\beta'\subset\alpha} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).
\]

That is, sufficient conditions for K to be nonnegative are

\[
A_{\alpha\beta} \ge 0 \quad\forall_{\alpha,\,\beta\subset\alpha} \qquad\text{and}\qquad \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad\forall_\alpha,
\]

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction", such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψ_α(X_α) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q^*_α(X_α) factorizes:^6

\[
Q^*_\alpha(x_\beta,x'_{\beta'}) = Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \quad\forall_{\beta,\beta'\subset\alpha}.
\]

^5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

^6 The exact marginal P_exact(X_α) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K_3 cancel with those in K_2, which is most easily seen when we combine K_2 and K_3 in a different way:

\[
\begin{aligned}
K_2 + K_3 &= \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta,x'_\beta} A^2_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)\,R_\beta(x_\beta)\,R_\beta(x'_\beta) &(K_2)\\
&\quad- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\\beta'\neq\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\bigl[Q^*_\alpha(x_\beta,x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\bigr] R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}). &(K_3)
\end{aligned}
\]

This leaves us with the weaker requirement (from K_1) A_{αβ}(1 − A_{αβ}) ≥ 0 for all β ⊂ α. The best choice is then to take A_{αβ} = 1, which turns condition 3 of equation 4.1 into

\[
\sum_{\substack{\alpha'\supset\beta\\\alpha'\neq\alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.
\]

The net effect is equivalent to ignoring the interaction: the number of neighboring potentials n_β is reduced by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success", we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K_2 and K_3, where, since K_2 ≥ 0, we will concentrate on K_3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K_3 such that we can again combine this bound with the (positive) term K_1. Before we get to that, however, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

\[
Q^*_\alpha(x_\beta,x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\,Q^*_\alpha(x_\beta,x'_{\beta'})
\quad\forall_{\beta,\beta'\subset\alpha,\;\beta'\neq\beta}\;\;\forall_{x_\beta,x'_{\beta'}}, \tag{7.1}
\]

with the "strength" σ_α a function of the potential ψ_α(X_α) ≡ log Ψ_α(X_α) only:

\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad\text{with}\quad
\omega_\alpha \equiv \max_{X_\alpha,\hat X_\alpha}\left[\psi_\alpha(X_\alpha) + (n_\alpha-1)\,\psi_\alpha(\hat X_\alpha) - \sum_{\beta\subset\alpha}\psi_\alpha(\hat X_{\alpha\setminus\beta}, x_\beta)\right], \tag{7.2}
\]

where n_α ≡ Σ_{β⊂α} 1 is the number of nodes contained in α, and (X̂_{α∖β}, x_β) denotes the configuration X̂_α with the value of node β replaced by its value x_β in X_α.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

\[
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
= \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}
{\prod_\beta\left[\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta},x_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x'_{\beta'})\right]\prod_\beta\mu_\beta(x_\beta)}
= \frac{\Psi(X)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}
{\prod_\beta\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta},x_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x'_{\beta'})}, \tag{7.3}
\]

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times the incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over the messages μ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

\[
V(\mu) = (n-1)\log\left[\sum_X \Psi(X)\prod_{\beta=1}^n\mu_\beta(x_\beta)\right]
- \sum_{\beta=1}^n\log\left[\sum_{X_{\setminus\beta}}\Psi(X_{\setminus\beta},x^*_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x_{\beta'})\right]
\]

with respect to the messages μ, under the constraints Σ_{x_β} μ_β(x_β) = 1 for all β and μ_β(x_β) ≥ 0 for all β and x_β, occurs at an extreme point μ_β(x_β) = δ_{x_β, x̂_β}, for some x̂_β to be found.

Proof. Let us consider optimizing the message μ_1(x_1) with fixed messages μ_β(x_β) for β > 1. The first and second derivatives are easily found to obey

\[
\begin{aligned}
\frac{\partial V}{\partial\mu_1(x_1)} &= (n-1)\,Q(x_1) - \sum_{\beta\neq1} Q(x_1\mid x^*_\beta)\\
\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} &= \sum_{\beta\neq1} Q(x_1\mid x^*_\beta)\,Q(x'_1\mid x^*_\beta) - (n-1)\,Q(x_1)\,Q(x'_1),
\end{aligned}
\]

up to an irrelevant positive rescaling of the messages, where

\[
Q(X) \equiv \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)}.
\]

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ_1(x_1) > 0 for two or more values of x_1. At such an extremum, the first derivative should obey

\[
(n-1)\,Q(x_1) - \sum_{\beta\neq1} Q(x_1\mid x^*_\beta) = \lambda,
\]

with λ a Lagrange multiplier implementing the constraint Σ_{x_1} μ_1(x_1) = 1. Summing over x_1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). Substituting (n − 1)Q(x_1) = Σ_{β≠1} Q(x_1 | x^*_β) into the matrix of second derivatives at such an extremum, we then have

\[
\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} = \frac{1}{2(n-1)}\sum_{\beta\neq1}\sum_{\beta'\neq1}\bigl[Q(x_1\mid x^*_\beta)-Q(x_1\mid x^*_{\beta'})\bigr]\bigl[Q(x'_1\mid x^*_\beta)-Q(x'_1\mid x^*_{\beta'})\bigr],
\]

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μ_β(x_β), β > 1, it follows by induction that the maximum with respect to all μ_β(x_β) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by a maximization over configurations X̂:

\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{\hat X}\frac{\Psi(X)\bigl[\Psi(\hat X)\bigr]^{n-1}}{\prod_\beta\Psi(\hat X_{\setminus\beta},x_\beta)}.
\]

Next we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

\[
\frac{1}{1-\sigma} \equiv \max_{X,\mu}\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{X,\hat X}\frac{\Psi(X)\bigl[\Psi(\hat X)\bigr]^{n-1}}{\prod_\beta\Psi(\hat X_{\setminus\beta},x_\beta)}. \tag{7.4}
\]


The inequality 7.1 then follows by summing out X_{∖{β,β'}} in

\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\,Q^*(X).
\]

The form of equation 7.2 then follows by rewriting equation 7.4 as

\[
\omega \equiv -\log(1-\sigma) = \max_{X,\hat X} W(X,\hat X)
\quad\text{with}\quad
W(X,\hat X) = \psi(X) + (n-1)\,\psi(\hat X) - \sum_\beta\psi(\hat X_{\setminus\beta},x_\beta),
\]

where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is, if Ψ̃(X) = Ψ(X) Π_β μ_β(x_β), then ω(Ψ̃) = ω(Ψ) for any choice of μ. This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap, and dividing the other by the same term, does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider configurations that differ only in the first two nodes, for which

\[
W(x_1,x_2,x_{\setminus12};\,\hat x_1,\hat x_2,x_{\setminus12})
= \psi(x_1,x_2,x_{\setminus12}) + \psi(\hat x_1,\hat x_2,x_{\setminus12}) - \psi(\hat x_1,x_2,x_{\setminus12}) - \psi(x_1,\hat x_2,x_{\setminus12})
= -W(x_1,\hat x_2,x_{\setminus12};\,\hat x_1,x_2,x_{\setminus12}).
\]

If now also x̂_2 = x_2, we get W(x_1,x_2,x_{∖12}; x̂_1,x_2,x_{∖12}) = −W(x_1,x_2,x_{∖12}; x̂_1,x_2,x_{∖12}) = 0. Furthermore, if W(x_1,x_2,x_{∖12}; x̂_1,x̂_2,x_{∖12}) ≤ 0, then it must be that W(x_1,x̂_2,x_{∖12}; x̂_1,x_2,x_{∖12}) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x_1||x_2|(|x_1|−1)(|x_2|−1)/4 combinations. And indeed, for binary nodes x_{1,2} ∈ {0, 1}, we immediately obtain

\[
\omega = \bigl|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)\bigr|. \tag{7.5}
\]

Any pairwise binary potential can be written as a Boltzmann factor,

\[
\Psi(x_1,x_2) \propto \exp\bigl[w\,x_1x_2 + \theta_1 x_1 + \theta_2 x_2\bigr].
\]

In this notation we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight". It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T, and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
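The enumeration behind equation 7.2 is easy to code directly, which gives a concrete check on these properties. The sketch below (function and variable names are ours, not from the paper) computes ω for an arbitrary potential given as a log-potential ψ, and verifies the pairwise binary shortcut of equation 7.5:

```python
import itertools
import math

def strength_omega(psi, sizes):
    """Brute-force the strength omega of a potential, following equation 7.2.

    psi: function mapping a configuration tuple X to the log-potential psi(X).
    sizes: number of states of each node contained in the potential.
    """
    n = len(sizes)
    configs = list(itertools.product(*[range(s) for s in sizes]))
    omega = 0.0  # omega is nonnegative (see the properties above)
    for X in configs:
        for Xh in configs:  # Xh plays the role of X-hat in equation 7.2
            w = psi(X) + (n - 1) * psi(Xh)
            for beta in range(n):
                # X-hat with node beta replaced by its value in X
                Xmix = Xh[:beta] + (X[beta],) + Xh[beta + 1:]
                w -= psi(Xmix)
            omega = max(omega, w)
    return omega

# Pairwise binary check: psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = 1.3, 0.7, -0.4
psi = lambda X: w * X[0] * X[1] + th1 * X[0] + th2 * X[1]
omega = strength_omega(psi, (2, 2))  # equals |w| = 1.3, independent of thresholds
sigma = 1.0 - math.exp(-omega)
```

As expected from equation 7.5, the thresholds drop out of the maximization, and a constant ("fake") potential yields zero strength.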

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix A_{αβ} between potentials α and nodes β with the properties

1. A_{αβ} ≥ 0 ∀_{α, β⊂α} (positivity);

2. (1 − σ_α) max_{β⊂α} A_{αβ} + σ_α Σ_{β⊂α} A_{αβ} ≤ 1 ∀_α (sufficient amount of resources);

3. Σ_{α⊃β} A_{αβ} ≥ n_β − 1 ∀_β (sufficient compensation); (8.1)

with the strength σ_α a function of the potential Ψ_α(X_α), as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K_1 + K_2 + K_3 ≥ 0 for any choice of R_β(x_β).

Substituting the bound, equation 7.1, into the term K_3, we obtain

\[
\begin{aligned}
K_3 &\ge -\sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\\beta'\neq\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,\sigma_\alpha\,Q^*_\alpha(x_\beta,x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'})\\
&\ge -\sum_\alpha\sigma_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\substack{\beta'\subset\alpha\\\beta'\neq\beta}} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta),
\end{aligned}
\]

where in the last step we applied the same trick as in equation 6.1. Since K_2 ≥ 0, combining K_1 and (the above lower bound on) K_3, we get

\[
K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left[1 - A_{\alpha\beta} - \sigma_\alpha\sum_{\beta'\neq\beta} A_{\alpha\beta'}\right] Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).
\]

This implies

\[
(1-\sigma_\alpha)\,A_{\alpha\beta} + \sigma_\alpha\sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad\forall_{\alpha,\,\beta\subset\alpha},
\]

which, in combination with A_{αβ} ≥ 0 and σ_α ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σ_α = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σ_α = 0, condition 2 becomes max_{β⊂α} A_{αβ} ≤ 1, suggesting the choice A_{αβ} = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials n_β in condition 3.
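Checking the conditions of theorem 4 for a given graph and given strengths is mechanical once condition 2 is unfolded into the per-node inequalities (1 − σ_α)A_{αβ} + σ_α Σ_{β'⊂α} A_{αβ'} ≤ 1 used in the proof: the existence of an allocation matrix then becomes a linear feasibility problem. The following sketch (our own construction, assuming SciPy is available) phrases it as a linear program with a zero objective:

```python
import numpy as np
from scipy.optimize import linprog

def unique_fixed_point(factors, sigma):
    """Feasibility check for the conditions of theorem 4 (a sketch).

    factors: dict alpha -> list of (hashable) nodes beta contained in alpha.
    sigma:   dict alpha -> strength sigma_alpha in [0, 1).
    Returns True if an allocation matrix A satisfying conditions 1-3 exists.
    """
    pairs = [(a, b) for a, bs in factors.items() for b in bs]
    idx = {p: i for i, p in enumerate(pairs)}
    nodes = sorted({b for bs in factors.values() for b in bs})
    n_b = {b: sum(b in bs for bs in factors.values()) for b in nodes}

    A_ub, b_ub = [], []
    # Condition 2, per (alpha, beta):
    # (1 - sigma_a) * A_ab + sigma_a * sum_{b' in alpha} A_ab' <= 1
    for a, bs in factors.items():
        for b in bs:
            row = np.zeros(len(pairs))
            for b2 in bs:
                row[idx[(a, b2)]] += sigma[a]
            row[idx[(a, b)]] += 1.0 - sigma[a]
            A_ub.append(row)
            b_ub.append(1.0)
    # Condition 3: sum_{a containing b} A_ab >= n_b - 1
    for b in nodes:
        row = np.zeros(len(pairs))
        for a, bs in factors.items():
            if b in bs:
                row[idx[(a, b)]] = -1.0
        A_ub.append(row)
        b_ub.append(-(n_b[b] - 1))
    # Condition 1 (A >= 0) is the default variable bound in linprog.
    res = linprog(c=np.zeros(len(pairs)), A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=(0, None))
    return res.status == 0  # status 0: a feasible point was found
```

For the 3 × 3 toroidal Ising grid of section 8.3, this check is feasible up to σ = 1/3, matching the uniform allocation A = 3/4 found by hand there.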

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for the uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha\Psi_\alpha(X_\alpha)\prod_\beta\Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials Ψ_β(x_β). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta}\left(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\right) < 2 \quad\forall_\beta. \tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta}\omega_\alpha < 4 \quad\forall_\beta, \tag{8.3}
\]

with ω_α defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψ_β(x_β). In fact, it is valid for any choice

\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\phi_{\alpha\beta}(x_\beta),
\]

where ψ_α(X_α) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\right\}
= \min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr]
- \min_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr]\right\}.
\]

In the case of binary nodes (two-by-two matrices ψ(x_1, x_2)), it is easy to check that the optimal φ_1 and φ_2, those that yield the smallest gap, are such that

\[
\psi(\hat x_1,\hat x_2)+\phi_1(\hat x_1)+\phi_2(\hat x_2) = \psi(\bar x_1,\bar x_2)+\phi_1(\bar x_1)+\phi_2(\bar x_2)
\ge \psi(\hat x_1,\bar x_2)+\phi_1(\hat x_1)+\phi_2(\bar x_2) = \psi(\bar x_1,\hat x_2)+\phi_1(\bar x_1)+\phi_2(\hat x_2), \tag{8.4}
\]

for some x̂_1, x̂_2, x̄_1, and x̄_2 with x̄_1 ≠ x̂_1 and x̄_2 ≠ x̂_2. Solving for φ_1 and φ_2, we find

\[
\begin{aligned}
\phi_1(\hat x_1)-\phi_1(\bar x_1) &= \frac12\bigl[\psi(\bar x_1,\bar x_2)-\psi(\hat x_1,\hat x_2)+\psi(\bar x_1,\hat x_2)-\psi(\hat x_1,\bar x_2)\bigr]\\
\phi_2(\hat x_2)-\phi_2(\bar x_2) &= \frac12\bigl[\psi(\bar x_1,\bar x_2)-\psi(\hat x_1,\hat x_2)+\psi(\hat x_1,\bar x_2)-\psi(\bar x_1,\hat x_2)\bigr].
\end{aligned}
\]

Substitution back into equation 8.4 yields

\[
\psi(\hat x_1,\hat x_2)+\phi_1(\hat x_1)+\phi_2(\hat x_2) - \psi(\hat x_1,\bar x_2)-\phi_1(\hat x_1)-\phi_2(\bar x_2)
= \frac12\bigl[\psi(\hat x_1,\hat x_2)+\psi(\bar x_1,\bar x_2)-\psi(\hat x_1,\bar x_2)-\psi(\bar x_1,\hat x_2)\bigr],
\]

which has to be nonnegative. Of all four possible combinations, two are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\tilde\psi(x_1,x_2)-\min_{x_1,x_2}\tilde\psi(x_1,x_2)\right\}
= \frac12\bigl|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)\bigr| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into condition 8.2 then yields equation 8.3.
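The closed-form gap ω/2 in this proof can be checked numerically: parameterize the self-potentials by the two differences a = φ_1(1) − φ_1(0) and b = φ_2(1) − φ_2(0) (additive constants cancel in the spread) and minimize by grid search. A small sketch with illustrative values:

```python
import itertools
import random

def spread(psi, a, b):
    """max minus min of psi(x1, x2) + phi1(x1) + phi2(x2), with the
    self-potentials reduced to the differences a and b."""
    vals = [psi[x1][x2] + a * x1 + b * x2
            for x1, x2 in itertools.product((0, 1), repeat=2)]
    return max(vals) - min(vals)

random.seed(0)
psi = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

# Grid search over (a, b); the proof says the minimum equals omega / 2.
best = min(spread(psi, a / 100.0, b / 100.0)
           for a in range(-300, 301) for b in range(-300, 301))
print(best, omega / 2)  # best approaches omega / 2 from above
```

Since the optimal differences are bounded by the entries of ψ, the grid [−3, 3] with step 0.01 is enough here; the grid minimum can only overshoot the true minimum ω/2, never undershoot it.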

Next we derive the following weaker corollary of theorem 4


Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta}\omega_\alpha \le 1 \quad\forall_\beta, \tag{8.5}
\]

with ω_α defined in equation 7.2.

Proof. Consider the allocation matrix with components A_{αβ} = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

\[
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma^2_\alpha \le 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha\supset\beta}(1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta}\sigma_\alpha \le 1. \tag{8.6}
\]

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 implies condition 8.6, and uniqueness follows.

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix}\alpha & 1-\alpha\\ 1-\alpha & \alpha\end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3 we find two minima, one with "spins up" and the other with "spins down".

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2\log\left[\frac{\alpha}{1-\alpha}\right] \quad\text{and thus}\quad \sigma = 1-\left(\frac{1-\alpha}{\alpha}\right)^2.
\]


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \le \frac13 \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62.
\]

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle we could solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2-2A)\,\sigma + \frac34 \le 1 \qquad\text{and}\qquad \frac12\,\sigma + A \le 1.
\]

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

\[
\sigma \le \frac12 \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

\[
\sigma \le \sqrt{\frac12} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,
\]

better than the α < 0.62 from corollary 3, and to be compared with the critical α_critical ≈ 0.88.
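The critical values of α quoted above all follow from inverting the strength formulas: a bound σ ≤ s translates into α ≤ 1/(1 + √(1 − s)), and a bound ω ≤ c into α ≤ 1/(1 + e^{−c/2}). A quick numerical check (our code, not from the paper):

```python
import math

def sigma(alpha):
    """Strength of the symmetric Ising potential (alpha, 1-alpha; 1-alpha, alpha)."""
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

def alpha_from_sigma_bound(s):
    """Largest alpha (>= 1/2) with sigma(alpha) <= s."""
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

def alpha_from_omega_bound(c):
    """Largest alpha with omega(alpha) = 2*log(alpha/(1-alpha)) <= c."""
    return 1.0 / (1.0 + math.exp(-c / 2.0))

print(alpha_from_sigma_bound(1.0 / 3.0))       # torus, A = 3/4:  about 0.551
print(alpha_from_omega_bound(1.0))             # corollary 3:     about 0.622
print(alpha_from_sigma_bound(0.5))             # Figure 3b:       about 0.586
print(alpha_from_sigma_bound(math.sqrt(0.5)))  # Figure 3c:       about 0.649
```

The computed values reproduce the thresholds 0.55, 0.62, 0.58, and 0.65 used in this section.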

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and, in that sense, should be seen as no more than a first step. The conditions have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature".

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms", similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here sharper, and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_{αβ}) with respect to Q_β. But in fact we need only G(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_{αβ}).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega\begin{pmatrix}0&1&-1&-1\\ 1&0&1&-1\\ -1&1&0&-1\\ -1&-1&-1&0\end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i,x_j) = \exp[w_{ij}/4]\ \text{if}\ x_i = x_j
\qquad\text{and}\qquad
\Psi_{ij}(x_i,x_j) = \exp[-w_{ij}/4]\ \text{if}\ x_i \neq x_j.
\]

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
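The four-node experiment above can be reproduced in outline with a few lines of code. The sketch below is illustrative only (the helper name `run_damped_bp` is hypothetical, not the author's implementation): it runs synchronous pairwise loopy belief propagation with log-domain damping on the Boltzmann machine defined above.

```python
import numpy as np

# Sketch of the four-node Boltzmann machine experiment; `run_damped_bp`
# is a hypothetical helper, not the paper's implementation. Pair potentials
# are Psi_ij(x_i, x_j) = exp(+w_ij/4) if x_i == x_j, exp(-w_ij/4) otherwise.
def run_damped_bp(omega, step=0.5, iters=1000, seed=0):
    W = omega * np.array([[0., 1., -1., -1.],
                          [1., 0., 1., -1.],
                          [-1., 1., 0., -1.],
                          [-1., -1., -1., 0.]])
    n = len(W)
    rng = np.random.default_rng(seed)
    m = rng.uniform(0.5, 1.5, size=(n, n, 2))   # m[i, j]: message i -> j
    m /= m.sum(axis=2, keepdims=True)
    for _ in range(iters):
        new = m.copy()
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # product of messages into i, excluding the one from j
                prod = np.ones(2)
                for k in range(n):
                    if k not in (i, j):
                        prod *= m[k, i]
                psi = np.array([[np.exp(W[i, j] / 4), np.exp(-W[i, j] / 4)],
                                [np.exp(-W[i, j] / 4), np.exp(W[i, j] / 4)]])
                target = psi.T @ prod           # target[x_j] = sum_xi psi * prod
                target /= target.sum()
                # damping in the log domain, in the spirit of equation 3.9
                upd = np.exp((1 - step) * np.log(m[i, j])
                             + step * np.log(target))
                new[i, j] = upd / upd.sum()
        m = new
    P = np.ones((n, 2))                         # single-node marginals
    for i in range(n):
        for k in range(n):
            if k != i:
                P[i] *= m[k, i]
        P[i] /= P[i].sum()
    return P
```

For ω well below the transition (e.g., ω = 1), the marginals settle at the trivial fixed point Pi(xi) = 0.5; raising ω toward the reported transition and increasing the step size should reproduce the oscillatory behavior described above.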

2412 T Heskes

[Figure 4 appears here: a plot of weight strength (3.5 to 6) against step size (0 to 1), with two insets showing P1(x1 = 1) over the iterations.]

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

log μ_new_{α→β}(xβ) = log μ_{α→β}(xβ) + ε [ log Q*α(xβ) − log μ_{β→α}(xβ) − log μ_{α→β}(xβ) ].   (3.9)
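In code, the damped update of equation 3.9 amounts to linear interpolation in the log domain between the old message and the undamped target. The sketch below (a hypothetical helper; message tables given as nonnegative vectors) makes this explicit.

```python
import numpy as np

# Log-domain damping as in equation 3.9 (hypothetical helper): with step
# eps = 1 this is the plain update mu_new proportional to Q*_alpha / mu_{beta->alpha};
# smaller eps interpolates between the old and the undamped new message.
def damped_message(mu_ab, mu_ba, q_star, eps):
    log_new = np.log(mu_ab) + eps * (np.log(q_star)
                                     - np.log(mu_ba) - np.log(mu_ab))
    mu_new = np.exp(log_new)
    return mu_new / mu_new.sum()
```

With eps = 1 and a uniform reverse message, the new message is just the normalized Q*α; with eps = 0, the old message is returned unchanged.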

Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points correspond to the zero derivatives of the Lagrangian.

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

min_{Qα,Qβ} max_{λαβ,λα,λβ} L(Qα, Qβ, λαβ, λα, λβ).

The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints, enforcing nonnegativity of the pseudomarginals, indeed are convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters {Qα, Qβ}. This is what makes it a difficult optimization problem.

Luckily, the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in {Qα, Qβ}.

4.2 Conditions for Convexity. The problem is with the term

Sβ(Qβ) ≡ −Σxβ Qβ(xβ) log Qβ(xβ),

which is concave in Qβ. Using the constraint Qβ(xβ) = Qα(xβ), we can turn it into a functional that is convex in Qα and Qβ separately, but not necessarily jointly. That is, with the substitution Qβ(xβ) = Qα(xβ) for any α ⊃ β, the entropy and thus the Bethe free energy is convex in Qα and in Qβ, but not necessarily in {Qα, Qβ}. However, if we add to Sβ(Qβ) a convex entropy contribution,

−Sα(Qα) ≡ ΣXα Qα(Xα) log Qα(Xα),

the combination of −Sα and Sβ is convex in {Qα, Qβ}, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The functional

Fαβ(Qα, Qβ) ≡ ΣXα Qα(Xα) log Qα(Xα) − Σxβ Qα(xβ) log Qβ(xβ)

is convex in {Qα, Qβ}.

Proof. The matrix with second derivatives of Fαβ has the components

H(Xα, X′α) ≡ ∂²Fαβ / [∂Qα(Xα) ∂Qα(X′α)] = [1/Qα(Xα)] δ_{Xα,X′α},

H(Xα, x′β) ≡ ∂²Fαβ / [∂Qα(Xα) ∂Qβ(x′β)] = −[1/Qβ(xβ)] δ_{xβ,x′β},

H(xβ, x′β) ≡ ∂²Fαβ / [∂Qβ(xβ) ∂Qβ(x′β)] = [Qα(xβ)/Q²β(xβ)] δ_{xβ,x′β},

where we note that Xα and xβ should be interpreted as indices. Convexity requires that, for any "vector" (Rα(Xα), Rβ(xβ)),

0 ≤ (Rα(Xα)  Rβ(xβ)) ( H(Xα, X′α)  H(Xα, x′β) ; H(xβ, X′α)  H(xβ, x′β) ) ( Rα(X′α) ; Rβ(x′β) )

= ΣXα R²α(Xα)/Qα(Xα) − 2 ΣXα Rα(Xα) Rβ(xβ)/Qβ(xβ) + Σxβ Qα(xβ) R²β(xβ)/Q²β(xβ)

= ΣXα Qα(Xα) [ Rα(Xα)/Qα(Xα) − Rβ(xβ)/Qβ(xβ) ]²,

which is indeed nonnegative.
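The closing identity of this proof (the quadratic form equals a sum of squares) is easy to verify numerically. The script below is an illustrative check, not from the paper, for a single potential over two binary nodes with β taken to be the first node:

```python
import numpy as np

# Numeric check of lemma 1 for one potential on two binary nodes,
# with beta the first node (illustrative script, not from the paper).
rng = np.random.default_rng(1)
Q_a = rng.random(4); Q_a /= Q_a.sum()      # Q_alpha over joint states X
Q_b = rng.random(2); Q_b /= Q_b.sum()      # Q_beta over x_beta
R_a = rng.standard_normal(4)
R_b = rng.standard_normal(2)
beta_of = [0, 0, 1, 1]                     # x_beta-component of each X

# marginal Q_alpha(x_beta)
Q_a_marg = np.array([Q_a[0] + Q_a[1], Q_a[2] + Q_a[3]])

# quadratic form R^T H R assembled from the three Hessian blocks
K = (np.sum(R_a ** 2 / Q_a)
     - 2 * sum(R_a[X] * R_b[beta_of[X]] / Q_b[beta_of[X]] for X in range(4))
     + np.sum(Q_a_marg * R_b ** 2 / Q_b ** 2))

# sum-of-squares form from the end of the proof
K_sq = sum(Q_a[X] * (R_a[X] / Q_a[X] - R_b[beta_of[X]] / Q_b[beta_of[X]]) ** 2
           for X in range(4))
```

The two quantities agree to numerical precision for any random draw, and the quadratic form is nonnegative.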

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources Qα log Qα to compensate for the concave −Qβ log Qβ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix Aαβ between potentials α and nodes β satisfying

1. Aαβ ≥ 0 ∀α, ∀β⊂α (positivity),

2. Σβ⊂α Aαβ ≤ 1 ∀α (sufficient amount of resources),

3. Σα⊃β Aαβ ≥ nβ − 1 ∀β (sufficient compensation).   (4.1)


Proof. First, we note that we do not have to worry about the energy terms that are linear in Qα. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

−S(Q) = −[ Σα Sα(Qα) − Σβ (nβ − 1) Sβ(Qβ) ],

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution (nβ − 1)Sβ, we have to find convex contributions −Sα to compensate for it. Let Aαβ denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

−S(Q) = −[ Σα Sα − Σβ (nβ − 1) Sβ ]

= −Σα ( 1 − Σβ⊂α Aαβ + Σβ⊂α Aαβ ) Sα − Σβ [ −Σα⊃β Aαβ + Σα⊃β Aαβ − (nβ − 1) ] Sβ

= −Σα ( 1 − Σβ⊂α Aαβ ) Sα − Σα Σβ⊂α Aαβ [ Sα − Sβ ] − Σβ [ Σα⊃β Aαβ − (nβ − 1) ] Sβ.

Convexity of the first term is guaranteed if 1 − Σβ⊂α Aαβ ≥ 0 (condition 2), of the second term if Aαβ ≥ 0 (condition 1 and lemma 1), and of the third term if Σα⊃β Aαβ − (nβ − 1) ≥ 0 (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root β* and define

Aαβ = 1 iff β ⊂ α and β is closer to the root β* than any other β′ ⊂ α; Aαβ′ = 0 for all other β′.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β* there is just a single potential α ⊃ β that is closer to the root β* than β itself (see the illustration in Figure 2), and thus there are precisely nβ − 1 contributions Aαβ = 1. The root itself gets nβ* contributions Aαβ* = 1, which is even better. Hence condition 3 is also satisfied:

Σα⊃β Aαβ = nβ − 1 ∀β ≠ β*   and   Σα⊃β* Aαβ* = nβ* > nβ* − 1.
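For pairwise potentials, where each factor α is an edge {i, j}, this construction is mechanical: orient every edge toward the root and allocate its unit of resources to the endpoint nearer the root. The helper names below (`tree_allocation`, `check_conditions`) are hypothetical; the sketch verifies conditions 1 to 3 of equation 4.1 on a small chain.

```python
from collections import deque

# Corollary 1 construction for a pairwise tree (hypothetical helpers):
# each edge potential alpha = {i, j} gives its unit of entropy resources
# to whichever endpoint lies closer to the chosen root.
def tree_allocation(n_nodes, edges, root):
    adj = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    depth = {root: 0}                       # BFS depths from the root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    A = {}
    for i, j in edges:                      # A[(edge, node)] in {0, 1}
        near, far = (i, j) if depth[i] < depth[j] else (j, i)
        A[((i, j), near)] = 1.0
        A[((i, j), far)] = 0.0
    return A

def check_conditions(n_nodes, edges, A):
    n_beta = {b: sum(1 for e in edges if b in e) for b in range(n_nodes)}
    positivity = all(v >= 0 for v in A.values())
    resources = all(sum(A[(e, b)] for b in e) <= 1 for e in edges)
    compensation = all(sum(A[(e, b)] for e in edges if b in e) >= n_beta[b] - 1
                       for b in range(n_nodes))
    return positivity and resources and compensation
```

On the chain 0–1–2–3 with root 0, every nonroot node collects exactly nβ − 1 incoming units and the root one extra, as in Figure 2a.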

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node β* from a potential α* such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β* as the root. The matrix A constructed in this way also just works for the graph with the closed loop, since still

Σα⊃β Aαβ = nβ − 1 ∀β ≠ β*,   and now   Σα⊃β* Aαβ* = nβ* − 1.

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − nβ, with nβ the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with Aαβ = 1 if there is an arrow and Aαβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely nβ − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − nβ. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

Σ_{β∈B} (1 − nβ) + Σ_{α∈π(B)} 1 ≥ 0,   (4.2)

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.
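For small graphs, condition 4.2 can be verified by brute force over all node subsets B. The sketch below uses a hypothetical helper name, with factors given as tuples of node indices; it returns True for singly connected and single-loop graphs and False as soon as two connected loops appear.

```python
from itertools import combinations

# Brute-force check of condition 4.2 (illustrative, exponential in the
# number of nodes): for every nonempty subset B of nodes, the overcounting
# numbers 1 - n_beta plus one unit per "parent" factor must be nonnegative.
def bethe_convex_over_constraints(n_nodes, factors):
    n_beta = [sum(1 for a in factors if b in a) for b in range(n_nodes)]
    for r in range(1, n_nodes + 1):
        for B in combinations(range(n_nodes), r):
            parents = sum(1 for a in factors if set(B) & set(a))
            if sum(1 - n_beta[b] for b in B) + parents < 0:
                return False
    return True
```

A square (one loop) passes, while two triangles sharing an edge (two connected loops) fail, in line with corollaries 1 and 2.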

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix Aαβ satisfying the conditions of equation 4.1. Then for any set B,

Σ_{β∈B} (nβ − 1) ≤ Σ_{β∈B} Σα⊃β Aαβ ≤ Σ_{α∈π(B)} Σβ⊂α Aαβ ≤ Σ_{α∈π(B)} 1,

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials Ψα(Xα) do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψα(Xα) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Qα(xβ) = Qβ(xβ) to write the Bethe free energy in the "more convex" form

F(Qα, Qβ) = −Σα ΣXα Qα(Xα) ψα(Xα) + Σα ΣXα Qα(Xα) log Qα(Xα) − Σβ Σα⊃β Aαβ Σxβ Qα(xβ) log Qβ(xβ),   (5.1)

where the allocation matrix Aαβ can be any matrix that satisfies

Σα⊃β Aαβ = nβ − 1.   (5.2)

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Qα alone. This then yields

L(Qα, Qβ, λαβ, λα) = −Σα ΣXα Qα(Xα) ψα(Xα) + Σα ΣXα Qα(Xα) log Qα(Xα) − Σα Σβ⊂α Aαβ Σxβ Qα(xβ) log Qβ(xβ)

+ Σβ Σα⊃β Σxβ λαβ(xβ) × [ (1/(nβ − 1)) Σα′⊃β Aα′β Qα′(xβ) − Qα(xβ) ]

+ Σα λα [ 1 − ΣXα Qα(Xα) ] + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.3)

³ We would like to conjecture that this is not possible, that is, that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint Qβ(xβ) = Qα(xβ), as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to Qβ. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in {Qα, Qβ}, they are convex in Qα and Qβ separately. Therefore, we can interchange the minimum over the pseudomarginals Qα and the maximum over the Lagrange multipliers, as long as we leave the minimum over Qβ as the final operation:⁴

min_{Qα,Qβ} max_{λαβ,λα} L(Qα, Qβ, λαβ, λα) = min_{Qβ} max_{λαβ,λα} min_{Qα} L(Qα, Qβ, λαβ, λα).

Rewriting

Σβ Σα⊃β Σxβ λαβ(xβ) [ (1/(nβ − 1)) Σα′⊃β Aα′β Qα′(xβ) − Qα(xβ) ] = −Σα Σβ⊂α Σxβ λ̃αβ(xβ) Qα(xβ),

with

λ̃αβ(xβ) ≡ λαβ(xβ) − (Aαβ/(nβ − 1)) Σα′⊃β λα′β(xβ),

we can easily solve for the minimum with respect to Qα:

Q*α(Xα) = Ψα(Xα) exp[ λα − 1 + Σβ⊂α ( Aαβ log Qβ(xβ) + λ̃αβ(xβ) ) ].   (5.4)

⁴ In principle, we could also first take the minimum over Qβ and leave the minimum over Qα, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

G(Qβ, λαβ, λα) ≡ L(Q*α, Qβ, λαβ, λα) = −Σα ΣXα Ψα(Xα) exp[ λα − 1 + Σβ⊂α ( Aαβ log Qβ(xβ) + λ̃αβ(xβ) ) ] + Σα λα + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.5)

Next, we find for the maximum with respect to λα,

exp[1 − λ*α] = ΣXα Ψα(Xα) exp[ Σβ⊂α ( Aαβ log Qβ(xβ) + λ̃αβ(xβ) ) ] ≡ Z*α,   (5.6)

where we have to keep in mind that Z*α by itself, like Q*α, is a function of the remaining pseudomarginals Qβ and Lagrange multipliers λαβ. Substituting this solution into the dual, we arrive at

G(Qβ, λαβ) ≡ G(Qβ, λαβ, λ*α) = −Σα log Z*α + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.7)
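For a single pairwise factor, equations 5.4 and 5.6 reduce to a handful of array operations. The sketch below is a hypothetical helper (not from the paper); `lam1`/`lam2` denote the λ̃αβ tables and `A1`/`A2` the allocations Aαβ for the two nodes of the factor.

```python
import numpy as np

# Equations 5.4 and 5.6 for a single pairwise factor alpha = (beta1, beta2)
# (hypothetical helper): psi is Psi_alpha as a 2-D array, Q1/Q2 the node
# pseudomarginals, lam1/lam2 the lambda~ tables, A1/A2 the allocations.
def q_star_and_z(psi, Q1, Q2, lam1, lam2, A1, A2):
    w1 = (Q1 ** A1) * np.exp(lam1)     # per-node factor exp[A log Q + lam~]
    w2 = (Q2 ** A2) * np.exp(lam2)
    unnorm = psi * np.outer(w1, w2)    # Psi times exp[sum over beta in alpha]
    z = unnorm.sum()                   # Z*_alpha, equation 5.6
    return unnorm / z, z               # normalized Q*_alpha, equation 5.4
```

Dividing by z implements the factor exp[λα − 1] = 1/Z*α, so the returned table is the properly normalized Q*α.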

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Qα, has a unique minimum in Qα (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λαβ and then the minimum over the remaining pseudomarginals Qβ. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in Qβ. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Qα, Qβ} implies convexity of the dual, equation 5.7, in Qβ.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

f(x + δ, y*(x + δ)) + f(x − δ, y*(x − δ)) ≥ 2 f(x, [y*(x + δ) + y*(x − δ)]/2) ≥ 2 f(x, y*(x)),

where the first inequality follows from the convexity of f in {x, y} and the second from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Qβ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Qα, Qβ}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λαβ, λα}. Next, we note that the maximum of such a convex-concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

f(x + δ, y*(x + δ)) + f(x − δ, y*(x − δ)) ≥ f(x + δ, y*(x)) + f(x − δ, y*(x)) ≥ 2 f(x, y*(x)),

where the first inequality follows from y*(x ± δ) being the unique maximum of f(x ± δ, y) and the second from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Qβ.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψα(Xα) ≤ Ψmax for all α and Xα, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function G(Qβ) ≡ max_λαβ G(Qβ, λαβ) is bounded from below for a particular choice of Aαβ satisfying equation 5.2. Considering Aαβ = (nβ − 1)/nβ, we then have

G(Qβ) ≥ −Σα log ΣXα Ψα(Xα) exp[ Σβ⊂α ((nβ − 1)/nβ) log Qβ(xβ) ] + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ]

≥ −Σα Σβ⊂α ((nβ − 1)/nβ) log ΣXα Ψα(Xα) Qβ(xβ) + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ]

≥ −Σα Σβ⊂α ((nβ − 1)/nβ) log [ ΣXα\β Ψmax ] + Σβ (nβ − 1) [ −log Σxβ Qβ(xβ) + Σxβ Qβ(xβ) − 1 ]

≥ −Σα Σβ⊂α ((nβ − 1)/nβ) log [ ΣXα\β Ψmax ],

where the first inequality follows by substituting the choice λαβ(xβ) = 0 for all α, β, and xβ in G(Qβ, λαβ); the second from the concavity of the function y^((nβ−1)/nβ); the third from the upper bound on the potentials; and the last from z − 1 − log z ≥ 0.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Qβ. The first derivative yields

∂G/∂Qβ(xβ) = −Σα⊃β Aαβ Q*α(xβ)/Qβ(xβ) + (nβ − 1),

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

Hββ′(xβ, x′β′) ≡ ∂²G / [∂Qβ(xβ) ∂Qβ′(x′β′)],


we make use of

∂Q*α(xβ)/∂Qβ′(x′β′) = Aαβ′ [ Q*α(xβ, x′β′) − Q*α(xβ) Q*α(x′β′) ] / Qβ′(x′β′),

where both β and β′ should be a subset of α, and with the convention Q*α(xβ, xβ) = Q*α(xβ) and Q*α(xβ, x′β) = 0 if xβ ≠ x′β. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

Hββ(xβ, x′β) = Σα⊃β Aαβ (1 − Aαβ) [ Q*α(xβ)/Q²β(xβ) ] δ_{xβ,x′β} + Σα⊃β A²αβ Q*α(xβ) Q*α(x′β) / [ Qβ(xβ) Qβ(x′β) ],

Hββ′(xβ, x′β′) = −Σα⊃β,β′ Aαβ Aαβ′ [ Q*α(xβ, x′β′) − Q*α(xβ) Q*α(x′β′) ] / [ Qβ(xβ) Qβ′(x′β′) ]   for β′ ≠ β,

where δ_{xβ,x′β} = 1 if and only if xβ = x′β. Here, it should be noted that both β and xβ play the role of indices; that is, xβ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λαβ and pseudomarginals Qβ.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Qβ, λαβ}, that is, conditions that guarantee

K ≡ Σβ,β′ Σxβ,x′β′ Sβ(xβ) Hββ′(xβ, x′β′) Sβ′(x′β′) ≥ 0

for any choice of the "vector" S with elements Sβ(xβ). Straightforward manipulations yield

K = Σα Σβ⊂α Σxβ Aαβ (1 − Aαβ) Q*α(xβ) R²β(xβ)   (K1)

+ Σα Σβ,β′⊂α Σxβ,x′β′ Aαβ Aαβ′ Q*α(xβ) Q*α(x′β′) Rβ(xβ) Rβ′(x′β′)   (K2)

− Σα Σβ,β′⊂α; β′≠β Σxβ,x′β′ Aαβ Aαβ′ Q*α(xβ, x′β′) Rβ(xβ) Rβ′(x′β′),   (K3)

where Rβ(xβ) ≡ Sβ(xβ)/Qβ(xβ).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

K2 = Σα [ Σβ⊂α Σxβ Aαβ Q*α(xβ) Rβ(xβ) ]² ≥ 0

and⁵

K3 = Σα Σβ,β′⊂α; β′≠β Σxβ,x′β′ Aαβ Aαβ′ Q*α(xβ, x′β′) × { ½ [Rβ(xβ) − Rβ′(x′β′)]² − ½ R²β(xβ) − ½ R²β′(x′β′) }

≥ −Σα Σβ⊂α Σxβ Aαβ ( Σβ′⊂α Aαβ′ − Aαβ ) Q*α(xβ) R²β(xβ),   (6.1)

we have

K = K1 + K2 + K3 ≥ Σα Σβ⊂α Σxβ Aαβ ( 1 − Σβ′⊂α Aαβ′ ) Q*α(xβ) R²β(xβ).

That is, sufficient conditions for K to be nonnegative are

Aαβ ≥ 0 ∀α, ∀β⊂α   and   Σβ⊂α Aαβ ≤ 1 ∀α,

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case, and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψα(Xα) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q*α(Xα) factorizes:⁶

Q*α(xβ, x′β′) = Q*α(xβ) Q*α(x′β′) ∀β,β′⊂α.

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal P_exact(Xα) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

K2 + K3 = Σα Σβ⊂α Σxβ,x′β A²αβ Q*α(xβ) Q*α(x′β) Rβ(xβ) Rβ(x′β)   (K2)

− Σα Σβ,β′⊂α; β′≠β Σxβ,x′β′ Aαβ Aαβ′ × [ Q*α(xβ, x′β′) − Q*α(xβ) Q*α(x′β′) ] Rβ(xβ) Rβ′(x′β′).   (K3)

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 for all β ⊂ α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

Σα′⊃β; α′≠α Aα′β + 1 ≥ nβ − 1.

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials nβ by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since K2 ≥ 0, we will concentrate on K3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

Q*α(xβ, x′β′) − Q*α(xβ) Q*α(x′β′) ≤ σα Q*α(xβ, x′β′)   ∀β,β′⊂α with β′ ≠ β, and ∀xβ, x′β′,   (7.1)

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

σα = 1 − exp(−ωα),   with   ωα ≡ max_{Xα,X̃α} [ ψα(Xα) + (nα − 1) ψα(X̃α) − Σβ⊂α ψα(X̃α\β, xβ) ],   (7.2)

where nα ≡ Σβ⊂α 1 and (X̃α\β, xβ) denotes X̃α with its β component replaced by the β component xβ of Xα.
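For discrete potentials, ωα in equation 7.2 is a finite maximization and can be evaluated directly. In the sketch below (a hypothetical helper), `psi` is the log-potential stored as an array with one axis per node in α; the mixed argument is formed by replacing one component of the second state at a time.

```python
import numpy as np
from itertools import product

# Strength of a potential, equations 7.1-7.2 (hypothetical helper):
# psi is log Psi_alpha as an n-dimensional array, one axis per node.
def potential_strength(psi):
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    omega = -np.inf
    for X in product(*states):
        for Xt in product(*states):
            val = psi[X] + (n - 1) * psi[Xt]
            for b in range(n):
                mixed = list(Xt)
                mixed[b] = X[b]        # X~ with its beta component from X
                val -= psi[tuple(mixed)]
            omega = max(omega, val)
    return 1.0 - np.exp(-omega)        # sigma_alpha = 1 - exp(-omega_alpha)
```

For the pairwise Boltzmann potential used earlier, ψ(xi, xj) = ±w/4, this gives ω = w and hence σ = 1 − exp(−w).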


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

Q*(X) / Πβ Q*(xβ) = Ψ(X) Πβ μβ(xβ) [ ΣX′ Ψ(X′) Πβ μβ(x′β) ]^(n−1) / Πβ { μβ(xβ) ΣX′\β Ψ(X′\β, xβ) Πβ′≠β μβ′(x′β′) }

= Ψ(X) [ ΣX′ Ψ(X′) Πβ μβ(x′β) ]^(n−1) / Πβ { ΣX′\β Ψ(X′\β, xβ) Πβ′≠β μβ′(x′β′) },   (7.3)

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times the incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over the messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n-1) \log \Bigg[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \Bigg] - \sum_{\beta=1}^n \log \Bigg[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x_{\beta'}) \Bigg]$$

with respect to the messages μ, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all β and $\mu_\beta(x_\beta) \ge 0$ for all β and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar x_\beta}$, for some $\bar x_\beta$ to be found.

Proof. Let us consider optimizing the message μ1(x1) with fixed messages μβ(xβ) for β > 1. The first and second derivatives are easily found to obey, up to positive prefactors that do not affect the argument,

$$\frac{\partial V}{\partial \mu_1(x_1)} \propto (n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)$$

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta) - (n-1)\, Q(x_1)\, Q(x'_1),$$


where

$$Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)},$$

and $Q(x_1 | x^*_\beta)$ denotes the corresponding conditional with $x_\beta$ clamped to $x^*_\beta$.

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

$$(n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta) = \lambda,$$

with λ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix with second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto \sum_{\beta \ne 1} \sum_{\substack{\beta' \ne 1 \\ \beta' \ne \beta}} \big[ Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'}) \big] \big[ Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'}) \big],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μβ(xβ), β > 1, it follows by induction that the maximum with respect to all μβ(xβ) must be at an extreme point as well.
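Lemma 3 is easy to probe numerically: for a small random log-potential (a hypothetical three-node binary example of our own, not from the paper), no interior message setting should exceed the best delta-function ("vertex") messages.

```python
import itertools, math, random

random.seed(0)
n = 3
states = list(itertools.product([0, 1], repeat=n))
psi = {X: random.uniform(-1.0, 1.0) for X in states}  # random log-potential
xstar = (0, 1, 0)                                     # arbitrary fixed x*_beta

def V(mu):
    """V(mu) from lemma 3; mu[b] = [mu_b(0), mu_b(1)] for binary nodes."""
    Z = sum(math.exp(psi[X]) * math.prod(mu[b][X[b]] for b in range(n))
            for X in states)
    val = (n - 1) * math.log(Z)
    for b in range(n):
        Zb = sum(math.exp(psi[X]) * math.prod(mu[c][X[c]]
                                              for c in range(n) if c != b)
                 for X in states if X[b] == xstar[b])
        val -= math.log(Zb)
    return val

def delta(xbar):
    # extreme-point (vertex) messages mu_b = delta_{x_b, xbar_b}
    return [[1.0, 0.0] if xb == 0 else [0.0, 1.0] for xb in xbar]

vertex_max = max(V(delta(xbar)) for xbar in states)

# no randomly drawn interior message setting should beat the best vertex
interior_max = -math.inf
for _ in range(2000):
    mu = []
    for b in range(n):
        p = random.uniform(0.01, 0.99)
        mu.append([p, 1.0 - p])
    interior_max = max(interior_max, V(mu))
```

This is only a sanity check by sampling, of course, not a proof; the proof is the induction argument above.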

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by maximization over values $\bar X$:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar X} \frac{\Psi(X) \big[ \Psi(\bar X) \big]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus \beta}, x_\beta)}.$$

Next, we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

$$\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \bar X} \frac{\Psi(X) \big[ \Psi(\bar X) \big]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus \beta}, x_\beta)}. \qquad (7.4)$$


The inequality 7.1 then follows by summing out $X_{\setminus \beta\beta'}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1 - \sigma) = \max_{X, \bar X} W(X, \bar X), \quad \text{with} \quad W(X, \bar X) = \psi(X) + (n-1)\, \psi(\bar X) - \sum_\beta \psi(\bar X_{\setminus \beta}, x_\beta),$$

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and $\bar X$ that differ in fewer than two nodes. To see this, consider (the terms for nodes on which X and $\bar X$ agree cancel)

$$W(x_1, x_2, \bar X_{\setminus 12};\ \bar x_1, \bar x_2, \bar X_{\setminus 12}) = \psi(x_1, x_2, \bar X_{\setminus 12}) + \psi(\bar x_1, \bar x_2, \bar X_{\setminus 12}) - \psi(x_1, \bar x_2, \bar X_{\setminus 12}) - \psi(\bar x_1, x_2, \bar X_{\setminus 12})$$
$$= -W(x_1, \bar x_2, \bar X_{\setminus 12};\ \bar x_1, x_2, \bar X_{\setminus 12}).$$

If now also $\bar x_2 = x_2$, we get $W(x_1, x_2, \bar X_{\setminus 12}; \bar x_1, x_2, \bar X_{\setminus 12}) = -W(x_1, x_2, \bar X_{\setminus 12}; \bar x_1, x_2, \bar X_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, \bar X_{\setminus 12}; \bar x_1, \bar x_2, \bar X_{\setminus 12}) \le 0$, then it must be that $W(x_1, \bar x_2, \bar X_{\setminus 12}; \bar x_1, x_2, \bar X_{\setminus 12}) \ge 0$, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \qquad (7.5)$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
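The pairwise-binary and temperature properties above can be verified directly from equation 7.5. The snippet below is our own check with arbitrary weight and threshold values; it confirms ω = |w| for {0, 1} coding, ω = 4|w| for {−1, 1} coding, and the scaling $\sigma(T) = 1 - [1 - \sigma(1)]^{1/T}$.

```python
from math import exp

def omega_pairwise(psi):
    """Strength of a pairwise binary potential, equation 7.5:
    omega = |psi(0,0) + psi(1,1) - psi(0,1) - psi(1,0)|."""
    return abs(psi(0, 0) + psi(1, 1) - psi(0, 1) - psi(1, 0))

w, th1, th2 = 0.8, 1.5, -2.0

# {0,1} coding: omega = |w|, independent of the thresholds
om01 = omega_pairwise(lambda x1, x2: w * x1 * x2 + th1 * x1 + th2 * x2)

# {-1,+1} coding (map x in {0,1} to s = 2x - 1): omega = 4|w|
ompm = omega_pairwise(lambda x1, x2: w * (2 * x1 - 1) * (2 * x2 - 1)
                      + th1 * (2 * x1 - 1) + th2 * (2 * x2 - 1))

# temperature scaling: psi/T gives omega(T) = omega(1)/T and
# sigma(T) = 1 - (1 - sigma(1))**(1/T)
T = 2.5
omT = omega_pairwise(lambda x1, x2: (w * x1 * x2 + th1 * x1 + th2 * x2) / T)
sigma1 = 1 - exp(-om01)
sigmaT = 1 - exp(-omT)
```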

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials α and nodes β with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha, \beta \subset \alpha}$ (positivity),

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation),   (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.
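For a given graph, candidate allocation matrix, and strengths, the three conditions of theorem 4 can be checked mechanically. A minimal sketch (the data structures and the single-loop example are our own, not from the paper):

```python
def theorem4_holds(A, sigma, nodes_of, eps=1e-12):
    """Check conditions 8.1 for a candidate allocation matrix.

    A[a][b]: resources allocated from potential a to node b;
    sigma[a]: strength of potential a;
    nodes_of[a]: the nodes contained in potential a.
    """
    # condition 1: positivity
    if any(A[a][b] < -eps for a in nodes_of for b in nodes_of[a]):
        return False
    # condition 2: (1 - sigma_a) max_b A_ab + sigma_a sum_b A_ab <= 1
    for a in nodes_of:
        row = [A[a][b] for b in nodes_of[a]]
        if (1 - sigma[a]) * max(row) + sigma[a] * sum(row) > 1 + eps:
            return False
    # condition 3: sum over potentials containing b of A_ab >= n_b - 1
    nodes = {b for a in nodes_of for b in nodes_of[a]}
    for b in nodes:
        parents = [a for a in nodes_of if b in nodes_of[a]]
        if sum(A[a][b] for a in parents) < len(parents) - 1 - eps:
            return False
    return True

# Example: four pairwise potentials on a single loop 0-1-2-3-0, equal
# strengths sigma = 0.4, uniform allocation A = 1/2 everywhere.
edges = {i: (i, (i + 1) % 4) for i in range(4)}
A = {a: {b: 0.5 for b in edges[a]} for a in edges}
ok = theorem4_holds(A, {a: 0.4 for a in edges}, edges)
# an allocation that is too small violates condition 3
bad = theorem4_holds({a: {b: 0.2 for b in edges[a]} for a in edges},
                     {a: 0.4 for a in edges}, edges)
```

Finding a feasible A in general is a (linear-constraint) feasibility problem; the checker above only verifies a given candidate.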

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K2 + K3 ≥ 0 for any choice of Rβ(xβ).

Substituting the bound, equation 7.1, into the term K3, we obtain

$$K_3 \ge -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'})$$

$$\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigg[ \sum_{\substack{\beta' \subset \alpha \\ \beta' \ne \beta}} A_{\alpha\beta'} \Bigg] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),$$

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, combining K1 and (the above lower bound on) K3, we get

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigg[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \ne \beta} A_{\alpha\beta'} \Bigg] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).$$

This is nonnegative for any choice of Rβ(xβ) if

$$(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha, \beta \subset \alpha},$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σα = 0, condition 2 becomes maxβ⊂α Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \Big( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Big) < 2 \quad \forall_\beta. \qquad (8.2)$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as


well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$\min_{\phi_1, \phi_2} \Big[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \Big] = \min_{\phi_1, \phi_2} \Big\{ \max_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)] - \min_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)] \Big\}.$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

$$\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar x_1, \bar x_2) + \phi_1(\bar x_1) + \phi_2(\bar x_2) \ge \psi(x_1, \bar x_2) + \phi_1(x_1) + \phi_2(\bar x_2) = \psi(\bar x_1, x_2) + \phi_1(\bar x_1) + \phi_2(x_2), \qquad (8.4)$$

for some $x_1, x_2, \bar x_1$, and $\bar x_2$ with $\bar x_1 \ne x_1$ and $\bar x_2 \ne x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$\phi_1(x_1) - \phi_1(\bar x_1) = \frac{1}{2} \big[ \psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) \big]$$

$$\phi_2(x_2) - \phi_2(\bar x_2) = \frac{1}{2} \big[ \psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2) \big].$$

Substitution back into equation 8.4 yields

$$\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(\bar x_1, x_2) - \phi_1(\bar x_1) - \phi_2(x_2) = \frac{1}{2} \big[ \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) - \psi(\bar x_1, x_2) \big],$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\phi_1, \phi_2} \Big[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \Big] = \frac{1}{2} \big| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \big| = \frac{\omega}{2},$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \qquad (8.5)$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \qquad (8.6)$$

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6.
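For pairwise potentials, corollary 4 thus gives a particularly simple test. A sketch, assuming the Boltzmann-factor parameterization of section 7.2 so that ω = |w| (the node labels and weights below are hypothetical):

```python
from math import exp

def unique_by_corollary4(omegas_per_node):
    """Corollary 4: uniqueness is guaranteed if, for every node beta, the
    strengths omega_alpha of the potentials containing beta sum to at most 1."""
    return all(sum(oms) <= 1.0 for oms in omegas_per_node.values())

# per-node lists of |w| for the pairwise potentials touching that node
ws = {"b1": [0.2, 0.3, 0.4], "b2": [0.5, 0.5]}
ok = unique_by_corollary4({b: [abs(w) for w in ws[b]] for b in ws})

# consistency with equation 8.6: sigma = 1 - exp(-omega) <= omega, so the
# sigma-sums are bounded by the omega-sums
sig_sums = {b: sum(1.0 - exp(-abs(w)) for w in ws[b]) for b in ws}
```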

Summarizing: the conditions in Tatikonda and Jordan (2002), for binary pairwise potentials and when strengthened as above, are at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$\omega = 2 \log \Big[ \frac{\alpha}{1-\alpha} \Big] \quad \text{and thus} \quad \sigma = 1 - \Big( \frac{1-\alpha}{\alpha} \Big)^2.$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops (left). The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in (b) and C = 1 − A in (c), the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad \text{and thus} \quad \alpha \le \frac{1}{1 + e^{-1/2}} \approx 0.62.$$

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2 - 2A)\,\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \le 1.$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

$$\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

$$\sigma \le \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3, and to be compared with the critical α_critical ≈ 0.88.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and, in that sense, should be seen as no more than a first step. These conditions have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact we need only $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but this may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \ne x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with Pi(xi) = 0.5 for all nodes i and xi ∈ {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
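The experiment can be reproduced with a few lines of damped sum-product updates. The sketch below is our reading of the setup (message damping as a convex combination with step size η is our interpretation of equation 3.9), run at a weight strength well inside the "convergent" regime.

```python
import math, random

random.seed(1)
omega = 1.0   # weight strength, well below the reported transition region
W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]
n, eta = 4, 0.5   # eta: step size used for damping

def pot(i, j, xi, xj):
    w = omega * W[i][j]
    return math.exp(w / 4.0 if xi == xj else -w / 4.0)

# messages m[(i, j)][x_j], initialized randomly to perturb the symmetric point
m = {(i, j): [0.5 + random.uniform(-0.1, 0.1), 0.5]
     for i in range(n) for j in range(n) if i != j}
for k in m:
    s = sum(m[k]); m[k] = [v / s for v in m[k]]

for _ in range(500):
    new = {}
    for (i, j) in m:
        msg = []
        for xj in (0, 1):
            msg.append(sum(pot(i, j, xi, xj)
                           * math.prod(m[(k, i)][xi] for k in range(n)
                                       if k != i and k != j)
                           for xi in (0, 1)))
        s = sum(msg)
        msg = [v / s for v in msg]
        # damped update: convex combination of old and new message
        new[(i, j)] = [(1 - eta) * a + eta * b for a, b in zip(m[(i, j)], msg)]
    m = new

def marginal(i):
    b = [math.prod(m[(k, i)][xi] for k in range(n) if k != i) for xi in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

p1 = marginal(0)[1]
```

At this strength the updates contract and the marginals settle at the trivial fixed point; raising `omega` toward the values in Figure 4 reproduces the oscillatory regime.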

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and for many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


the combination of $-S_\alpha$ and $S_\beta$ is convex in $(Q_\alpha, Q_\beta)$, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The function

$$F_{\alpha\beta}(Q_\alpha, Q_\beta) \equiv \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta)$$

is convex in $(Q_\alpha, Q_\beta)$.

Proof. The matrix with second derivatives of $F_{\alpha\beta}$ has the components

$$H(X_\alpha, X'_\alpha) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\, \partial Q_\alpha(X'_\alpha)} = \frac{1}{Q_\alpha(X_\alpha)}\, \delta_{X_\alpha, X'_\alpha}$$

$$H(X_\alpha, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\alpha(X_\alpha)\, \partial Q_\beta(x'_\beta)} = -\frac{1}{Q_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta}$$

$$H(x_\beta, x'_\beta) \equiv \frac{\partial^2 F_{\alpha\beta}}{\partial Q_\beta(x_\beta)\, \partial Q_\beta(x'_\beta)} = \frac{Q_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta},$$

where we note that $X_\alpha$ and $x_\beta$ should be interpreted as indices. Convexity requires that, for any "vector" $(R_\alpha(X_\alpha), R_\beta(x_\beta))$,

$$0 \le (R_\alpha(X_\alpha)\ \ R_\beta(x_\beta)) \begin{pmatrix} H(X_\alpha, X'_\alpha) & H(X_\alpha, x'_\beta) \\ H(x_\beta, X'_\alpha) & H(x_\beta, x'_\beta) \end{pmatrix} \begin{pmatrix} R_\alpha(X'_\alpha) \\ R_\beta(x'_\beta) \end{pmatrix}$$

$$= \sum_{X_\alpha} \frac{R^2_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - 2 \sum_{X_\alpha} \frac{R_\alpha(X_\alpha)\, R_\beta(x_\beta)}{Q_\beta(x_\beta)} + \sum_{x_\beta} \frac{Q_\alpha(x_\beta)\, R^2_\beta(x_\beta)}{Q^2_\beta(x_\beta)}$$

$$= \sum_{X_\alpha} Q_\alpha(X_\alpha) \Bigg[ \frac{R_\alpha(X_\alpha)}{Q_\alpha(X_\alpha)} - \frac{R_\beta(x_\beta)}{Q_\beta(x_\beta)} \Bigg]^2,$$

which is indeed nonnegative.
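The final identity in this proof can be confirmed numerically. In the sketch below (the toy sizes and the name F_αβ are our own), Q_β is drawn independently of Q_α, as in the lemma, where it is a separate argument:

```python
import random

random.seed(2)
# toy setup: subset alpha with two nodes (3 x 2 states); beta = node 0
states = [(i, j) for i in range(3) for j in range(2)]

def rand_dist(keys):
    v = {k: random.uniform(0.1, 1.0) for k in keys}
    s = sum(v.values())
    return {k: x / s for k, x in v.items()}

Qa = rand_dist(states)                      # Q_alpha(X_alpha)
Qb = rand_dist(range(3))                    # Q_beta(x_beta), independent of Qa
Qa_marg = {xb: sum(Qa[X] for X in states if X[0] == xb) for xb in range(3)}

Ra = {X: random.uniform(-1.0, 1.0) for X in states}
Rb = {xb: random.uniform(-1.0, 1.0) for xb in range(3)}

# quadratic form of the Hessian of F_alphabeta, written out as in the proof ...
quad = (sum(Ra[X] ** 2 / Qa[X] for X in states)
        - 2 * sum(Ra[X] * Rb[X[0]] / Qb[X[0]] for X in states)
        + sum(Qa_marg[xb] * Rb[xb] ** 2 / Qb[xb] ** 2 for xb in range(3)))

# ... equals the manifestly nonnegative sum of squares
sos = sum(Qa[X] * (Ra[X] / Qa[X] - Rb[X[0]] / Qb[X[0]]) ** 2 for X in states)
```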

The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources $Q_\alpha \log Q_\alpha$ to compensate for the concave $-Q_\beta \log Q_\beta$ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix $A_{\alpha\beta}$ between potentials α and nodes β satisfying

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha, \beta \subset \alpha}$ (positivity),

2. $\sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation).   (4.1)


Proof. First, we note that we do not have to worry about the energy terms, which are linear in $Q_\alpha$. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\Bigg[ \sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1)\, S_\beta(Q_\beta) \Bigg],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution $(n_\beta - 1) S_\beta$, we have to find convex contributions $-S_\alpha$ to compensate for it. Let $A_{\alpha\beta}$ denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

$$-S(Q) = -\Bigg[ \sum_\alpha S_\alpha - \sum_\beta (n_\beta - 1)\, S_\beta \Bigg]$$

$$= -\sum_\alpha \Bigg( 1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} + \sum_{\beta \subset \alpha} A_{\alpha\beta} \Bigg) S_\alpha - \sum_\beta \Bigg[ -\sum_{\alpha \supset \beta} A_{\alpha\beta} + \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \Bigg] S_\beta$$

$$= -\sum_\alpha \Bigg( 1 - \sum_{\beta \subset \alpha} A_{\alpha\beta} \Bigg) S_\alpha - \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \big[ S_\alpha - S_\beta \big] - \sum_\beta \Bigg[ \sum_{\alpha \supset \beta} A_{\alpha\beta} - (n_\beta - 1) \Bigg] S_\beta.$$

Convexity of the first term is guaranteed if $1 - \sum_\beta A_{\alpha\beta} \ge 0$ (condition 2), of the second term if $A_{\alpha\beta} \ge 0$ (condition 1 and lemma 1), and of the third term if $\sum_\alpha A_{\alpha\beta} - (n_\beta - 1) \ge 0$ (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root β* and define

$$A_{\alpha\beta} = 1 \ \text{iff}\ \beta \subset \alpha \ \text{and}\ \beta \ \text{closer to the root}\ \beta^* \ \text{than any other}\ \beta' \subset \alpha; \quad A_{\alpha\beta'} = 0 \ \text{for all other}\ \beta'.$$

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β* there is just a single potential α ⊃ β that is closer to the root β* than β itself (see the illustration in Figure 2), and thus there are precisely $n_\beta - 1$ contributions $A_{\alpha\beta} = 1$. The root itself gets $n_{\beta^*}$ contributions $A_{\alpha\beta^*} = 1$, which is even better. Hence, condition 3 is also satisfied:

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta \ne \beta^*} \quad \text{and} \quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$
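The construction in this proof is easy to state as code: orient every potential (here, the edges of a hypothetical tree of our own choosing) toward the root, and verify the three conditions of theorem 1 (equation 4.1).

```python
# parent[b] = the node one step closer to the root; the potentials are the
# tree's edges, and each edge allocates its resource to its root-side node.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}      # hypothetical tree, root = 0
potentials = [frozenset({b, p}) for b, p in parent.items()]

A = {}
for b, p in parent.items():
    edge = frozenset({b, p})
    A[(edge, p)] = 1.0                        # toward the root
    A[(edge, b)] = 0.0

nodes = set(parent) | set(parent.values())
n = {b: sum(1 for a in potentials if b in a) for b in nodes}

cond1 = all(v >= 0 for v in A.values())                         # positivity
cond2 = all(sum(A[(a, b)] for b in a) <= 1 for a in potentials)  # resources
cond3 = all(sum(A[(a, b)] for a in potentials if b in a) >= n[b] - 1
            for b in nodes)                                      # compensation
```

As in the proof, every nonroot node collects exactly $n_\beta - 1$ incoming units, and the root collects one unit it does not strictly need.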

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place, that is, remove one node β* from a potential α*, such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β* as the root. The matrix A constructed in this way also just works for the graph with the closed loop, since still

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 \quad \forall_{\beta \ne \beta^*}, \quad \text{and now} \quad \sum_{\alpha \supset \beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − nβ, with nβ the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with Aαβ = 1 if there is an arrow and Aαβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely nβ − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − nβ. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.

Uniqueness of Loopy Belief Propagation Fixed Points 2391

Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex for the set of constraints if for any set of nodes B we have

∑_{β∈B} (1 − nβ) + ∑_{α∈π(B)} 1 ≥ 0,  (4.2)

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.
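For small graphs, the condition of theorem 2 can be checked by brute force, enumerating all node subsets B. A minimal sketch (our own illustration; the factor lists below are made-up examples, with each factor given as a tuple of its member nodes):

```python
from itertools import combinations

def bethe_convexity_condition(factors, nodes):
    """Theorem 2 check: for every nonempty subset B of nodes, require
    sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0, where n_beta counts the
    factors containing beta and pi(B) is the set of factors touching B."""
    n = {b: sum(1 for f in factors if b in f) for b in nodes}
    for r in range(1, len(nodes) + 1):
        for B in combinations(nodes, r):
            parents = [f for f in factors if any(b in f for b in B)]
            if sum(1 - n[b] for b in B) + len(parents) < 0:
                return False
    return True

# Single loop: the condition holds (Bethe free energy convex on the constraints).
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(bethe_convexity_condition(loop, range(4)))               # True

# Two connected loops (extra chord): the condition fails.
print(bethe_convexity_condition(loop + [(0, 2)], range(4)))    # False
```

The exponential enumeration is only meant to make the condition concrete; it is not a practical algorithm for large graphs.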

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix Aαβ satisfying the conditions of equation 4.1. Then for any set B,

∑_{β∈B} (nβ − 1) ≤ ∑_{β∈B} ∑_{α⊃β} Aαβ ≤ ∑_{α∈π(B)} ∑_{β⊂α} Aαβ ≤ ∑_{α∈π(B)} 1,

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.
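The structural criterion above can also be checked mechanically: a connected component of the (bipartite) factor graph with V vertices and E edges contains E − V + 1 independent cycles, and the conditions fail as soon as some component has more than one. A sketch (function name and examples are our own):

```python
from collections import defaultdict

def max_cycle_rank(factors, num_nodes):
    """Cycle rank (number of independent loops) of the worst connected
    component of the bipartite factor graph: one vertex per node, one per
    factor, and one edge per (factor, member-node) pair."""
    parent = list(range(num_nodes + len(factors)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, f in enumerate(factors):
        for b in f:
            parent[find(num_nodes + i)] = find(b)
    verts, edges = defaultdict(int), defaultdict(int)
    for v in range(num_nodes + len(factors)):
        verts[find(v)] += 1
    for i, f in enumerate(factors):
        edges[find(num_nodes + i)] += len(f)
    # For each connected component: #independent cycles = E - V + 1.
    return max(edges[c] - verts[c] + 1 for c in verts)

print(max_cycle_rank([(0, 1), (1, 2)], 3))                  # 0: tree
print(max_cycle_rank([(0, 1), (1, 2), (2, 0)], 3))          # 1: single loop
print(max_cycle_rank([(0, 1), (1, 2), (2, 0), (0, 2)], 3))  # 2: two connected loops
```

A result of at most 1 for every component corresponds to the regime where the convexity conditions of theorem 1 can still be satisfied.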

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials Ψα(Xα) do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψα(Xα) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Qα(xβ) = Qβ(xβ) to write the Bethe free energy in the "more convex" form

F(Qα, Qβ) = −∑_α ∑_{Xα} Qα(Xα) ψα(Xα) + ∑_α ∑_{Xα} Qα(Xα) log Qα(Xα)

− ∑_β ∑_{α⊃β} Aαβ ∑_{xβ} Qα(xβ) log Qβ(xβ),  (5.1)

where the allocation matrix Aαβ can be any matrix that satisfies

∑_{α⊃β} Aαβ = nβ − 1.  (5.2)

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Qα alone. This then yields

L(Qα, Qβ, λαβ, λα) = −∑_α ∑_{Xα} Qα(Xα) ψα(Xα)

+ ∑_α ∑_{Xα} Qα(Xα) log Qα(Xα)

− ∑_α ∑_{β⊂α} Aαβ ∑_{xβ} Qα(xβ) log Qβ(xβ)

+ ∑_β ∑_{α⊃β} ∑_{xβ} λαβ(xβ) × [ (1/(nβ − 1)) ∑_{α'⊃β} Aα'β Qα'(xβ) − Qα(xβ) ]

+ ∑_α λα [ 1 − ∑_{Xα} Qα(Xα) ]

+ ∑_β (nβ − 1) [ ∑_{xβ} Qβ(xβ) − 1 ].  (5.3)

3. We would like to conjecture that this is not possible: the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint Qβ(xβ) = Qα(xβ), as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to Qβ. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in (Qα, Qβ), they are convex in Qα and Qβ separately. Therefore, we can interchange the minimum over the pseudomarginals Qα and the maximum over the Lagrange multipliers, as long as we leave the minimum over Qβ as the final operation:4

min_{Qα,Qβ} max_{λαβ,λα} L(Qα, Qβ, λαβ, λα) = min_{Qβ} max_{λαβ,λα} min_{Qα} L(Qα, Qβ, λαβ, λα).

Rewriting

∑_β ∑_{α⊃β} ∑_{xβ} λαβ(xβ) [ (1/(nβ − 1)) ∑_{α'⊃β} Aα'β Qα'(xβ) − Qα(xβ) ] = −∑_α ∑_{β⊂α} ∑_{xβ} λ̄αβ(xβ) Qα(xβ),

with

λ̄αβ(xβ) ≡ λαβ(xβ) − (Aαβ/(nβ − 1)) ∑_{α'⊃β} λα'β(xβ),

we can easily solve for the minimum with respect to Qα:

Q*α(Xα) = Ψα(Xα) exp[ λα − 1 + ∑_{β⊂α} {Aαβ log Qβ(xβ) + λ̄αβ(xβ)} ].  (5.4)

4. In principle, we could also first take the minimum over Qβ and leave the minimum over Qα, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

G(Qβ, λαβ, λα) ≡ L(Q*α, Qβ, λαβ, λα)

= −∑_α ∑_{Xα} Ψα(Xα) exp[ λα − 1 + ∑_{β⊂α} {Aαβ log Qβ(xβ) + λ̄αβ(xβ)} ]

+ ∑_α λα + ∑_β (nβ − 1) [ ∑_{xβ} Qβ(xβ) − 1 ].  (5.5)

Next we find, for the maximum with respect to λα,

exp[1 − λ*α] = ∑_{Xα} Ψα(Xα) exp[ ∑_{β⊂α} {Aαβ log Qβ(xβ) + λ̄αβ(xβ)} ] ≡ Z*α,  (5.6)

where we have to keep in mind that Z*α by itself, like Q*α, is a function of the remaining pseudomarginals Qβ and Lagrange multipliers λαβ. Substituting this solution into the dual, we arrive at

G(Qβ, λαβ) ≡ G(Qβ, λαβ, λ*α) = −∑_α log Z*α + ∑_β (nβ − 1) [ ∑_{xβ} Qβ(xβ) − 1 ].  (5.7)

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Qα, has a unique minimum in Qα (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.
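For concreteness, the algorithm whose fixed points are being studied can be sketched in a few lines. The code below is our own minimal sum-product implementation on a pairwise binary model with synchronous updates and uniform initialization; it is an illustration, not code from the article. On the symmetric single-loop example the fixed point is unique and the pseudomarginals settle at (0.5, 0.5):

```python
import numpy as np

def loopy_bp(psi, edges, num_nodes, iters=200):
    """Minimal synchronous sum-product loopy belief propagation on a
    pairwise MRF with binary nodes; psi[(i, j)] is a 2x2 potential
    indexed as psi[x_i, x_j]. Returns single-node pseudomarginals."""
    msgs = {}
    for (i, j) in edges:
        msgs[(i, j)] = np.full(2, 0.5)   # message i -> j
        msgs[(j, i)] = np.full(2, 0.5)   # message j -> i
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # Product over messages into i, excluding the one from j.
            prod = np.ones(2)
            for (k, l) in msgs:
                if l == i and k != j:
                    prod = prod * msgs[(k, l)]
            pot = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            out = pot.T @ prod           # sum over x_i
            new[(i, j)] = out / out.sum()
        msgs = new
    beliefs = []
    for i in range(num_nodes):
        b = np.ones(2)
        for (k, l) in msgs:
            if l == i:
                b = b * msgs[(k, l)]
        beliefs.append(b / b.sum())
    return beliefs

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a single loop of four nodes
alpha = 0.6
psi = {e: np.array([[alpha, 1 - alpha], [1 - alpha, alpha]]) for e in edges}
print(loopy_bp(psi, edges, 4))             # all pseudomarginals (0.5, 0.5)
```

Self-potentials are omitted, consistent with the factorization of equation 3.1 assumed throughout the article.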

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λαβ and then the minimum over the remaining pseudomarginals Qβ. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in Qβ. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in (Qα, Qβ) implies convexity of the dual, equation 5.7, in Qβ.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

f(x+δ, y*(x+δ)) + f(x−δ, y*(x−δ)) ≥ 2 f(x, [y*(x+δ) + y*(x−δ)]/2) ≥ 2 f(x, y*(x)),

where the first inequality follows from the convexity of f in (x, y) and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Qβ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in (Qα, Qβ). Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers (λαβ, λα). Next, we note that the maximum, over the maximizing parameters, of a function that is convex in the remaining parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

f(x+δ, y*(x+δ)) + f(x−δ, y*(x−δ)) ≥ f(x+δ, y*(x)) + f(x−δ, y*(x)) ≥ 2 f(x, y*(x)),

where the first inequality follows from y*(x±δ) being the unique maximum of f(x±δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Qβ.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψα(Xα) ≤ Ψmax for all α and Xα, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function G(Qβ) ≡ max_{λαβ} G(Qβ, λαβ) is bounded from below for a particular choice of Aαβ satisfying equation 5.2. Considering Aαβ = (nβ − 1)/nβ, we then have

G(Qβ) ≥ −∑_α log ∑_{Xα} Ψα(Xα) exp[ ∑_{β⊂α} ((nβ − 1)/nβ) log Qβ(xβ) ] + ∑_β (nβ − 1) [ ∑_{xβ} Qβ(xβ) − 1 ]

≥ −∑_α ∑_{β⊂α} ((nβ − 1)/nβ) log ∑_{Xα} Ψα(Xα) Qβ(xβ) + ∑_β (nβ − 1) [ ∑_{xβ} Qβ(xβ) − 1 ]

≥ −∑_α ∑_{β⊂α} ((nβ − 1)/nβ) log[ ∑_{Xα\β} Ψmax ] + ∑_β (nβ − 1) [ −log ∑_{xβ} Qβ(xβ) + ∑_{xβ} Qβ(xβ) − 1 ]

≥ −∑_α ∑_{β⊂α} ((nβ − 1)/nβ) log[ ∑_{Xα\β} Ψmax ],

where the first inequality follows by substituting the choice λαβ(xβ) = 0 for all α, β, and xβ in G(Qβ, λαβ); the second from the concavity of the function y^{(nβ−1)/nβ}; the third from the upper bound on the potentials; and the final step from log y ≤ y − 1.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Qβ. The first derivative yields

∂G/∂Qβ(xβ) = −∑_{α⊃β} Aαβ Q*α(xβ)/Qβ(xβ) + (nβ − 1),

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

Hββ'(xβ, x'β') ≡ ∂²G / ∂Qβ(xβ) ∂Qβ'(x'β'),


we make use of

∂Q*α(xβ)/∂Qβ'(x'β') = Aαβ' [ Q*α(xβ, x'β') − Q*α(xβ) Q*α(x'β') ] / Qβ'(x'β'),

where both β and β' should be a subset of α, and with conventions Q*α(xβ, xβ) = Q*α(xβ) and Q*α(xβ, x'β) = 0 if xβ ≠ x'β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization as in equation 5.6. Distinguishing between β = β' and β ≠ β', we then have

Hββ(xβ, x'β) = ∑_{α⊃β} Aαβ (1 − Aαβ) [Q*α(xβ)/Q²β(xβ)] δ_{xβ,x'β} + ∑_{α⊃β} A²αβ [Q*α(xβ) Q*α(x'β)] / [Qβ(xβ) Qβ(x'β)]

Hββ'(xβ, x'β') = −∑_{α⊃β,β'} Aαβ Aαβ' [Q*α(xβ, x'β') − Q*α(xβ) Q*α(x'β')] / [Qβ(xβ) Qβ'(x'β')] for β' ≠ β,

where δ_{xβ,x'β} = 1 if and only if xβ = x'β. Here it should be noted that both β and xβ play the role of indices; that is, xβ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λαβ and pseudomarginals Qβ.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters (Qβ, λαβ), that is, conditions that guarantee

K ≡ ∑_{β,β'} ∑_{xβ,x'β'} Sβ(xβ) Hββ'(xβ, x'β') Sβ'(x'β') ≥ 0

for any choice of the "vector" S with elements Sβ(xβ). Straightforward manipulations yield

∑_{β,β'} ∑_{xβ,x'β'} Sβ(xβ) Hββ'(xβ, x'β') Sβ'(x'β')  (K)

= ∑_α ∑_{β⊂α} ∑_{xβ} Aαβ (1 − Aαβ) Q*α(xβ) R²β(xβ)  (K1)

+ ∑_α ∑_{β,β'⊂α} ∑_{xβ,x'β'} Aαβ Aαβ' Q*α(xβ) Q*α(x'β') Rβ(xβ) Rβ'(x'β')  (K2)

− ∑_α ∑_{β,β'⊂α; β'≠β} ∑_{xβ,x'β'} Aαβ Aαβ' Q*α(xβ, x'β') Rβ(xβ) Rβ'(x'β'),  (K3)

where Rβ(xβ) ≡ Sβ(xβ)/Qβ(xβ).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

K2 = ∑_α [ ∑_{β⊂α} ∑_{xβ} Aαβ Q*α(xβ) Rβ(xβ) ]² ≥ 0

and,5

K3 = ∑_α ∑_{β,β'⊂α; β'≠β} ∑_{xβ,x'β'} Aαβ Aαβ' Q*α(xβ, x'β') × { (1/2)[Rβ(xβ) − Rβ'(x'β')]² − (1/2)R²β(xβ) − (1/2)R²β'(x'β') }

≥ −∑_α ∑_{β⊂α} ∑_{xβ} Aαβ ( ∑_{β'⊂α} Aαβ' − Aαβ ) Q*α(xβ) R²β(xβ),  (6.1)

we have

K = K1 + K2 + K3 ≥ ∑_α ∑_{β⊂α} ∑_{xβ} Aαβ ( 1 − ∑_{β'⊂α} Aαβ' ) Q*α(xβ) R²β(xβ).

That is, sufficient conditions for K to be nonnegative are

Aαβ ≥ 0 ∀α, β⊂α and ∑_{β⊂α} Aαβ ≤ 1 ∀α,

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case, and these fake interactions drop out as we would expect them to.

Suppose that we have a fake interaction Ψα(Xα) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q*α(Xα) factorizes:6

Q*α(xβ, x'β') = Q*α(xβ) Q*α(x'β') ∀β, β'⊂α, β'≠β.

5. This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6. The exact marginal Pexact(Xα) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:

K2 + K3 = ∑_α ∑_{β⊂α} ∑_{xβ,x'β} A²αβ Q*α(xβ) Q*α(x'β) Rβ(xβ) Rβ(x'β)  (K̃2)

− ∑_α ∑_{β,β'⊂α; β'≠β} ∑_{xβ,x'β'} Aαβ Aαβ' × [ Q*α(xβ, x'β') − Q*α(xβ) Q*α(x'β') ] Rβ(xβ) Rβ'(x'β').  (K̃3)

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 ∀β ⊂ α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

∑_{α'⊃β; α'≠α} Aα'β + 1 ≥ nβ − 1.

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials nβ by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K̃2 and K̃3, where, since K̃2 ≥ 0, we will concentrate on K̃3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K̃3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

Q*α(xβ, x'β') − Q*α(xβ) Q*α(x'β') ≤ σα Q*α(xβ, x'β') ∀β, β'⊂α, β'≠β, ∀xβ, x'β',  (7.1)

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

σα = 1 − exp(−ωα) with ωα ≡ max_{Xα,X̄α} [ ψα(Xα) + (nα − 1) ψα(X̄α) − ∑_{β⊂α} ψα(X̄α\β, xβ) ],  (7.2)

where nα ≡ ∑_{β⊂α} 1 and (X̄α\β, xβ) denotes X̄α with the value of node β replaced by the value xβ taken from Xα.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

Q*(X) / ∏_{β=1}^n Q*(xβ) = [ Ψ(X) ∏_β μβ(xβ) ] [ ∑_{X'} Ψ(X') ∏_β μβ(x'β) ]^{n−1} / ∏_β [ ∑_{X'\β} Ψ(X'\β, xβ) ∏_{β'≠β} μβ'(x'β') μβ(xβ) ]

= Ψ(X) [ ∑_{X'} Ψ(X') ∏_β μβ(x'β) ]^{n−1} / ∏_β [ ∑_{X'\β} Ψ(X'\β, xβ) ∏_{β'≠β} μβ'(x'β') ],  (7.3)

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

V(μ) = (n − 1) log[ ∑_X Ψ(X) ∏_{β=1}^n μβ(xβ) ] − ∑_{β=1}^n log[ ∑_{X\β} Ψ(X\β, x*β) ∏_{β'≠β} μβ'(xβ') ]

with respect to the messages μ, under constraints ∑_{xβ} μβ(xβ) = 1 for all β and μβ(xβ) ≥ 0 for all β and xβ, occurs at an extreme point: μβ(xβ) = δ_{xβ,x̄β} for some x̄β to be found.

Proof. Let us consider optimizing the message μ1(x1) with fixed messages μβ(xβ) for β > 1. The first and second derivatives are easily found to obey

∂V/∂μ1(x1) = (n − 1) Q(x1) − ∑_{β≠1} Q(x1|x*β)

∂²V/∂μ1(x1)∂μ1(x'1) = ∑_{β≠1} Q(x1|x*β) Q(x'1|x*β) − (n − 1) Q(x1) Q(x'1),

where

Q(X) ≡ Ψ(X) ∏_β μβ(xβ) / ∑_{X'} Ψ(X') ∏_β μβ(x'β).

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

(n − 1) Q(x1) − ∑_{β≠1} Q(x1|x*β) = λ,

with λ a Lagrange multiplier implementing the constraint ∑_{x1} μ1(x1) = 1. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix with second derivatives at such an extremum, we then have

∂²V/∂μ1(x1)∂μ1(x'1) = (1/(2(n − 1))) ∑_{β≠1} ∑_{β'≠1; β'≠β} [Q(x1|x*β) − Q(x1|x*β')] [Q(x'1|x*β) − Q(x'1|x*β')],

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μβ(xβ), β > 1, it follows by induction that the maximum with respect to all μβ(xβ) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by maximization over values X̄:

max_μ [ Q*(X) / ∏_β Q*(xβ) ] = max_{X̄} Ψ(X) [Ψ(X̄)]^{n−1} / ∏_β Ψ(X̄\β, xβ).

Next we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

1/(1 − σ) ≡ max_{X,μ} [ Q*(X) / ∏_β Q*(xβ) ] = max_{X,X̄} Ψ(X) [Ψ(X̄)]^{n−1} / ∏_β Ψ(X̄\β, xβ).  (7.4)


The inequality 7.1 then follows by summing out X\{β,β'} in

Q*(X) − ∏_β Q*(xβ) ≤ σ Q*(X).

The form of equation 7.2 then follows by rewriting equation 7.4 as

ω ≡ −log(1 − σ) = max_{X,X̄} W(X, X̄) with W(X, X̄) = ψ(X) + (n − 1) ψ(X̄) − ∑_β ψ(X̄\β, xβ),

where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

if Ψ̃(X) = Ψ(X) ∏_β μβ(xβ), then ω(Ψ̃) = ω(Ψ) for any choice of μ.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̄ that differ in fewer than two nodes. To see this, consider

W(x1, x2, x̄\12; x̄1, x̄2, x̄\12) = ψ(x1, x2, x̄\12) + ψ(x̄1, x̄2, x̄\12) − ψ(x1, x̄2, x̄\12) − ψ(x̄1, x2, x̄\12) = −W(x1, x̄2, x̄\12; x̄1, x2, x̄\12).

If now also x2 = x̄2, we get W(x1, x2, x̄\12; x̄1, x2, x̄\12) = −W(x1, x2, x̄\12; x̄1, x2, x̄\12) = 0. Furthermore, if W(x1, x2, x̄\12; x̄1, x̄2, x̄\12) ≤ 0, then it must be that W(x1, x̄2, x̄\12; x̄1, x2, x̄\12) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1||x2|(|x1| − 1)(|x2| − 1)/4 combinations. And indeed, for binary nodes x1,2 ∈ {0, 1}, we immediately obtain

ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.  (7.5)

Any pairwise binary potential can be written as a Boltzmann factor,

Ψ(x1, x2) ∝ exp[w x1 x2 + θ1 x1 + θ2 x2].

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models, there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
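The strength in equation 7.2 can be computed by direct enumeration over pairs (X, X̄); for a pairwise binary Boltzmann factor the result reduces to ω = |w|, independent of the thresholds, as stated above. A sketch (the function name and the particular numbers are our own):

```python
import math
from itertools import product

def strength_omega(psi, sizes):
    """Strength omega of eq. 7.2 by brute force: psi(X) is the log-potential
    over n nodes, sizes[i] the number of states of node i."""
    n = len(sizes)
    best = 0.0   # X = Xbar always yields W = 0, so omega >= 0
    for X in product(*[range(s) for s in sizes]):
        for Xbar in product(*[range(s) for s in sizes]):
            w_val = psi(X) + (n - 1) * psi(Xbar)
            for beta in range(n):
                Y = list(Xbar)
                Y[beta] = X[beta]        # Xbar with node beta taken from X
                w_val -= psi(tuple(Y))
            best = max(best, w_val)
    return best

# Pairwise binary Boltzmann factor: psi = w*x1*x2 + th1*x1 + th2*x2.
w, th1, th2 = 1.3, 0.7, -0.4
psi = lambda X: w * X[0] * X[1] + th1 * X[0] + th2 * X[1]
omega = strength_omega(psi, (2, 2))
sigma = 1 - math.exp(-omega)
print(omega, abs(w))   # omega equals |w|, independent of the thresholds
```

The double enumeration is exponential in the number of nodes per potential, which is fine here because potentials are local by construction.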

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix Aαβ between potentials α and nodes β with properties

1. Aαβ ≥ 0 ∀α, β⊂α (positivity),

2. (1 − σα) max_{β⊂α} Aαβ + σα ∑_{β⊂α} Aαβ ≤ 1 ∀α (sufficient amount of resources),

3. ∑_{α⊃β} Aαβ ≥ nβ − 1 ∀β (sufficient compensation),  (8.1)

with the strength σα a function of the potential Ψα(Xα) as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K̃2 + K̃3 ≥ 0 for any choice of Rβ(xβ).

Substituting the bound, equation 7.1, into the term K̃3, we obtain

K̃3 ≥ −∑_α ∑_{β,β'⊂α; β'≠β} ∑_{xβ,x'β'} Aαβ Aαβ' σα Q*α(xβ, x'β') Rβ(xβ) Rβ'(x'β')

≥ −∑_α σα ∑_{β⊂α} ∑_{xβ} Aαβ ( ∑_{β'⊂α; β'≠β} Aαβ' ) Q*α(xβ) R²β(xβ),

where in the last step we applied the same trick as in equation 6.1. Since K̃2 ≥ 0, combining K1 and (the above lower bound on) K̃3, we get

K = K1 + K̃2 + K̃3 ≥ ∑_α ∑_{β⊂α} ∑_{xβ} Aαβ [ 1 − Aαβ − σα ∑_{β'≠β} Aαβ' ] Q*α(xβ) R²β(xβ).

Sufficient conditions for K to be nonnegative are thus

(1 − σα) Aαβ + σα ∑_{β'⊂α} Aαβ' ≤ 1 ∀α, β⊂α,

which, in combination with Aαβ ≥ 0 and σα ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σα = 0, condition 2 becomes max_{β⊂α} Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3.
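Conditions 8.1 can be tested cheaply by restricting to a uniform allocation Aαβ = aα for all β ⊂ α: condition 2 then reads aα(1 + (nα − 1)σα) ≤ 1, so the best choice is aα = 1/(1 + (nα − 1)σα), and only condition 3 remains to be checked. This restriction may give up slack relative to the full nonlinear program, so the sketch below (our own simplification, not the article's algorithm) is a sufficient test only:

```python
def unique_fixed_point_uniform(factors, strengths, num_nodes):
    """Sufficient test for theorem 4 with a uniform allocation per factor:
    A_{alpha,beta} = a_alpha = 1/(1 + (n_alpha - 1)*sigma_alpha), the largest
    value allowed by condition 2; then verify condition 3 at every node."""
    a = [1.0 / (1.0 + (len(f) - 1) * s) for f, s in zip(factors, strengths)]
    for b in range(num_nodes):
        incident = [i for i, f in enumerate(factors) if b in f]
        if sum(a[i] for i in incident) < len(incident) - 1 - 1e-12:
            return False
    return True

# 3x3 Ising grid with toroidal boundary: every node has 4 incident pairwise
# factors, so the test succeeds exactly when 4/(1+sigma) >= 3, i.e. sigma <= 1/3.
edges = [(r * 3 + c, r * 3 + (c + 1) % 3) for r in range(3) for c in range(3)] + \
        [(r * 3 + c, ((r + 1) % 3) * 3 + c) for r in range(3) for c in range(3)]
print(unique_fixed_point_uniform(edges, [0.33] * 18, 9))  # True
print(unique_fixed_point_uniform(edges, [0.35] * 18, 9))  # False
```

The σ ≤ 1/3 threshold recovered here matches the toroidal-grid illustration of section 8.3, where the uniform choice A = 3/4 is in fact optimal by symmetry.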

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

Pexact(X) = (1/Z) ∏_α Ψα(Xα) ∏_β Ψβ(xβ),

to be compared with our equation 3.1, where there are no self-potentials Ψβ(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

∑_{α⊃β} ( max_{Xα} ψα(Xα) − min_{Xα} ψα(Xα) ) < 2 ∀β.  (8.2)

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

∑_{α⊃β} ωα < 4 ∀β,  (8.3)

with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψβ(xβ). In fact, it is valid for any choice

ψ̃α(Xα) = ψα(Xα) + ∑_{β⊂α} φαβ(xβ),

where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

min_{φ1,φ2} [ max_{x1,x2} ψ̃(x1, x2) − min_{x1,x2} ψ̃(x1, x2) ]

= min_{φ1,φ2} [ max_{x1,x2} {ψ(x1, x2) + φ1(x1) + φ2(x2)} − min_{x1,x2} {ψ(x1, x2) + φ1(x1) + φ2(x2)} ].

In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2, which yield the smallest gap, are such that

ψ(x1, x2) + φ1(x1) + φ2(x2) = ψ(x̄1, x̄2) + φ1(x̄1) + φ2(x̄2)

≥ ψ(x1, x̄2) + φ1(x1) + φ2(x̄2) = ψ(x̄1, x2) + φ1(x̄1) + φ2(x2),  (8.4)

for some x1, x2, x̄1, and x̄2 with x̄1 ≠ x1 and x̄2 ≠ x2. Solving for φ1 and φ2, we find

φ1(x1) − φ1(x̄1) = (1/2) [ ψ(x̄1, x2) − ψ(x1, x2) + ψ(x̄1, x̄2) − ψ(x1, x̄2) ]

φ2(x2) − φ2(x̄2) = (1/2) [ ψ(x1, x̄2) − ψ(x1, x2) + ψ(x̄1, x̄2) − ψ(x̄1, x2) ].

Substitution back into equation 8.4 yields

ψ(x1, x2) + φ1(x1) + φ2(x2) − ψ(x1, x̄2) − φ1(x1) − φ2(x̄2) = (1/2) [ ψ(x1, x2) + ψ(x̄1, x̄2) − ψ(x1, x̄2) − ψ(x̄1, x2) ],

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

min_{φ1,φ2} [ max_{x1,x2} ψ̃(x1, x2) − min_{x1,x2} ψ̃(x1, x2) ] = (1/2) |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)| = ω/2,

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
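The minimal-gap identity just derived can be checked numerically with a grid search over the two offset differences φ1(1) − φ1(0) and φ2(1) − φ2(0) (a sketch; the 2 × 2 log-potential is an arbitrary made-up example):

```python
import numpy as np

psi = np.array([[0.2, 1.1], [-0.5, 0.9]])   # arbitrary log-potential psi(x1, x2)
omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])   # eq. 7.5

best_gap = np.inf
for d1 in np.linspace(-3, 3, 241):        # phi1(1) - phi1(0)
    for d2 in np.linspace(-3, 3, 241):    # phi2(1) - phi2(0)
        t = psi + np.array([[0.0, d2], [d1, d1 + d2]])
        best_gap = min(best_gap, t.max() - t.min())

print(best_gap, omega / 2)   # the minimal gap matches omega / 2
```

Only the differences φβ(1) − φβ(0) matter, since a common shift of both values of φβ moves max and min together and leaves the gap unchanged.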

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

∑_{α⊃β} ωα ≤ 1 ∀β,  (8.5)

with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2)

(1 − σα)(1 − σα) + 2σα(1 − σα) = 1 − σ²α ≤ 1.

Substitution into condition 3 yields

∑_{α⊃β} (1 − σα) ≥ ∑_{α⊃β} 1 − 1, and thus ∑_{α⊃β} σα ≤ 1.  (8.6)

Since ωα = −log(1 − σα) ≥ σα, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002), for binary pairwise potentials and when strengthened as above, are at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

( α    1−α )
( 1−α  α   ).

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical αcritical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

ω = 2 log[α/(1 − α)] and thus σ = 1 − ((1 − α)/α)².


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

σ ≤ 1/3 and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

ω < 1 and thus α ≤ 1/(1 + e^{−1/2}) ≈ 0.62.

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\,\sigma + A \le 1.$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

$$\sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$\sigma \le \sqrt{1/2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
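The α thresholds quoted in this section all follow from inverting the relations between α, ω, and σ for the symmetric Ising potential. As a quick numeric check (a sketch; the helper names are ours, not the paper's):

```python
import math

# For the symmetric Ising potential of section 8.3:
#   omega = 2*log(alpha/(1-alpha))  and  sigma = 1 - ((1-alpha)/alpha)**2,
# so conditions of the form sigma <= s or omega < w invert to alpha thresholds.
def alpha_from_sigma(s):
    # sigma <= s  <=>  alpha <= 1/(1 + sqrt(1 - s))
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

def alpha_from_omega(w):
    # omega < w  <=>  alpha < 1/(1 + exp(-w/2))
    return 1.0 / (1.0 + math.exp(-w / 2.0))

print(f"{alpha_from_sigma(1/3):.3f}")             # toroidal grid, theorem 4: 0.551
print(f"{alpha_from_omega(1.0):.3f}")             # corollary 3: 0.622
print(f"{alpha_from_sigma(1/2):.3f}")             # aperiodic grid, theorem 4: 0.586
print(f"{alpha_from_sigma(math.sqrt(1/2)):.3f}")  # two-loop grid, theorem 4: 0.649
```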

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions are still much too strong to be anywhere near necessary and in that sense should be seen as no more than a first step. Nevertheless, they have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here milder and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_αβ) with respect to Q_β. But in fact, we need only Ḡ(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) to be convex, which is a weaker requirement. The Hessian of Ḡ(Q_β), however, appears to be more difficult to compute and analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_αβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \ne x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7. Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
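The qualitative behavior described above is easy to reproduce. The sketch below runs synchronously updated, damped pairwise loopy belief propagation on the four-node Boltzmann machine; the damping scheme (a convex combination of old and new messages with step size eps) is an assumption on our part and may differ in detail from the paper's equation 3.9:

```python
import math

# Damped loopy BP on the four-node Boltzmann machine from the text.
# Potentials: Psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj, exp(-w_ij/4) otherwise,
# with w = omega * W and W the weight matrix below.
W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_bp(omega, eps, iters=3000):
    n = len(W)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    psi = {(i, j): [[math.exp(omega * W[i][j] / 4.0 * (1 if xi == xj else -1))
                     for xj in (0, 1)] for xi in (0, 1)] for (i, j) in pairs}
    msg = {p: [0.7, 0.3] for p in pairs}   # asymmetric start, to break symmetry
    resid = 0.0
    for _ in range(iters):
        new, resid = {}, 0.0
        for (i, j), m in msg.items():
            # m_{i->j}(xj) proportional to
            #   sum_xi Psi_ij(xi, xj) * prod_{k not in {i,j}} m_{k->i}(xi)
            upd = [sum(psi[(i, j)][xi][xj]
                       * math.prod(msg[(k, i)][xi] for k in range(n) if k not in (i, j))
                       for xi in (0, 1)) for xj in (0, 1)]
            z = sum(upd)
            upd = [u / z for u in upd]
            resid = max(resid, abs(upd[0] - m[0]))   # BP residual before damping
            new[(i, j)] = [(1 - eps) * m[x] + eps * upd[x] for x in (0, 1)]
        msg = new
    belief = [math.prod(msg[(k, 0)][x] for k in range(1, n)) for x in (0, 1)]
    return belief[0] / sum(belief), resid

p, r = run_bp(omega=1.0, eps=0.5)   # small weights: the trivial fixed point
print(round(p, 3), r < 1e-8)        # -> 0.5 True
p, r = run_bp(omega=6.0, eps=0.9)   # large weights: the residual typically
print(r)                            # stays large (a limit cycle, cf. Figure 4)
```

This reproduces the dichotomy discussed in the text: well below the transition the damped updates settle on the trivial fixed point, while for large weight strengths the residual does not vanish.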



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P_1(x_1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.


Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Proof. First, we note that we do not have to worry about the energy terms, which are linear in Q_α. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

$$-S(Q) = -\left[\sum_\alpha S_\alpha(Q_\alpha) - \sum_\beta (n_\beta - 1)\,S_\beta(Q_\beta)\right],$$

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation: for each concave contribution (n_β − 1)S_β, we have to find convex contributions −S_α to compensate for it. Let A_αβ denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

$$
\begin{aligned}
-S(Q) &= -\left[\sum_\alpha S_\alpha - \sum_\beta (n_\beta-1)\,S_\beta\right]\\
&= -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta} + \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha
 - \sum_\beta \Bigl[-\sum_{\alpha\supset\beta} A_{\alpha\beta} + \sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta-1)\Bigr] S_\beta\\
&= -\sum_\alpha \Bigl(1 - \sum_{\beta\subset\alpha} A_{\alpha\beta}\Bigr) S_\alpha
 - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta}\,[S_\alpha - S_\beta]
 - \sum_\beta \Bigl[\sum_{\alpha\supset\beta} A_{\alpha\beta} - (n_\beta-1)\Bigr] S_\beta .
\end{aligned}
$$

Convexity of the first term is guaranteed if 1 − Σ_{β⊂α} A_αβ ≥ 0 (condition 2), of the second term if A_αβ ≥ 0 (condition 1 and lemma 1), and of the third term if Σ_{α⊃β} A_αβ − (n_β − 1) ≥ 0 (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.


Proof. The proof is by construction. Choose one of the leaf nodes as the root β* and define

A_αβ = 1 iff β ⊂ α and β is closer to the root β* than any other β′ ⊂ α;
A_αβ′ = 0 for all other β′.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β* there is just a single potential α ⊃ β that is closer to the root β* than β itself (see the illustration in Figure 2), and thus there are precisely n_β − 1 contributions A_αβ = 1. The root itself gets n_{β*} contributions A_αβ* = 1, which is even better. Hence condition 3 is also satisfied:

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall_{\beta\ne\beta^*} \quad\text{and}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.
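The "eating up resources toward the root" construction is mechanical enough to automate. A sketch for a small chain-structured factor graph (representing potentials as sets of node labels is our own encoding, not the paper's):

```python
from collections import deque

# Corollary 1's construction: pick a leaf node as the root, then let every
# potential allocate its unit of "resources" to the node closest to the root.
potentials = [{0, 1}, {1, 2}, {2, 3}]   # singly connected: a chain of 4 nodes
root = 3                                # a leaf node, chosen as root beta*

nodes = set().union(*potentials)
dist, frontier = {root: 0}, deque([root])
while frontier:                          # BFS distances to the root
    b = frontier.popleft()
    for alpha in potentials:
        if b in alpha:
            for b2 in alpha - {b}:
                if b2 not in dist:
                    dist[b2] = dist[b] + 1
                    frontier.append(b2)

# A_{alpha,beta} = 1 iff beta is the node of alpha closest to the root
A = [{b: int(b == min(alpha, key=dist.get)) for b in alpha} for alpha in potentials]

n = {b: sum(1 for alpha in potentials if b in alpha) for b in nodes}
assert all(v >= 0 for row in A for v in row.values())            # condition 1
assert all(sum(row.values()) <= 1 for row in A)                  # condition 2
assert all(sum(A[i][b] for i, a in enumerate(potentials) if b in a) >= n[b] - 1
           for b in nodes)                                       # condition 3
print("allocation:", A)
```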

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place; that is, remove one node β* from a potential α* such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β* as the root. The matrix A constructed in this way also works for the graph with the closed loop, since still

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\;\forall_{\beta\ne\beta^*} \quad\text{and now}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − n_β, with n_β the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with A_αβ = 1 if there is an arrow and A_αβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely n_β − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − n_β. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (Adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

$$\sum_{\beta\in B} (1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \ge 0, \tag{4.2}$$

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix A_αβ satisfying the conditions of equation 4.1. Then for any set B,

$$\sum_{\beta\in B}(n_\beta-1) \;\le\; \sum_{\beta\in B}\sum_{\alpha\supset\beta} A_{\alpha\beta} \;\le\; \sum_{\alpha\in\pi(B)}\sum_{\beta\subset\alpha} A_{\alpha\beta} \;\le\; \sum_{\alpha\in\pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
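Condition 4.2 is easy to check by brute force on small graphs. The following sketch (our own encoding, with potentials restricted to pairwise edges) confirms that the condition holds for a tree and a single loop but fails as soon as two loops are connected:

```python
from itertools import combinations

# Brute-force check of condition 4.2 over all node sets B, for pairwise graphs:
#   sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0,
# where pi(B) is the set of edges touching at least one node of B.
def bethe_convex(edges):
    nodes = sorted(set().union(*map(set, edges)))
    n = {b: sum(1 for e in edges if b in e) for b in nodes}
    for r in range(1, len(nodes) + 1):
        for B in combinations(nodes, r):
            parents = [e for e in edges if any(b in e for b in B)]
            if sum(1 - n[b] for b in B) + len(parents) < 0:
                return False
    return True

tree        = [(0, 1), (1, 2), (2, 3)]
single_loop = [(0, 1), (1, 2), (2, 0)]
two_loops   = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)]  # two triangles sharing edge (1,2)
print(bethe_convex(tree))         # True
print(bethe_convex(single_loop))  # True
print(bethe_convex(two_loops))    # False  (take B = all nodes: -6 + 5 < 0)
```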

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials Ψ_α(X_α) do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψ_α(X_α) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Q_α(x_β) = Q_β(x_β) to write the Bethe free energy in the "more convex" form

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha) - \sum_\beta\sum_{\alpha\supset\beta} A_{\alpha\beta}\sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta), \tag{5.1}$$

where the allocation matrix A_αβ can be any matrix that satisfies

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Q_α alone. This then yields

$$
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) ={}& -\sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha) + \sum_\alpha\sum_{X_\alpha} Q_\alpha(X_\alpha)\log Q_\alpha(X_\alpha)\\
&- \sum_\alpha\sum_{\beta\subset\alpha} A_{\alpha\beta}\sum_{x_\beta} Q_\alpha(x_\beta)\log Q_\beta(x_\beta)\\
&+ \sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta}\lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right]\\
&+ \sum_\alpha\lambda_\alpha\left[1-\sum_{X_\alpha} Q_\alpha(X_\alpha)\right] + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right].
\end{aligned} \tag{5.3}
$$

3. We would like to conjecture that this is not possible; that is, the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint Q_β(x_β) = Q_α(x_β), as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to Q_β. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in {Q_α, Q_β}, they are convex in Q_α and Q_β separately. Therefore, we can interchange the minimum over the pseudomarginals Q_α and the maximum over the Lagrange multipliers, as long as we leave the minimum over Q_β as the final operation:4

$$\min_{Q_\alpha, Q_\beta}\;\max_{\lambda_{\alpha\beta},\lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\;\max_{\lambda_{\alpha\beta},\lambda_\alpha}\;\min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).$$

Rewriting

$$\sum_\beta\sum_{\alpha\supset\beta}\sum_{x_\beta}\lambda_{\alpha\beta}(x_\beta)\left[\frac{1}{n_\beta-1}\sum_{\alpha'\supset\beta} A_{\alpha'\beta}\,Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] = -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta}\bar\lambda_{\alpha\beta}(x_\beta)\,Q_\alpha(x_\beta),$$

with

$$\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta-1}\sum_{\alpha'\supset\beta}\lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to Q_α:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right]. \tag{5.4}$$

4. In principle, we could also first take the minimum over Q_β and leave the minimum over Q_α, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$
\begin{aligned}
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) &\equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha)\\
&= -\sum_\alpha\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\lambda_\alpha - 1 + \sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right]\\
&\quad + \sum_\alpha\lambda_\alpha + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right].
\end{aligned} \tag{5.5}
$$

Next, we find for the maximum with respect to λ_α:

$$\exp\bigl[1-\lambda^*_\alpha\bigr] = \sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\bigl\{A_{\alpha\beta}\log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\bigr\}\right] \equiv Z^*_\alpha, \tag{5.6}$$

where we have to keep in mind that Z*_α by itself, like Q*_α, is a function of the remaining pseudomarginals Q_β and Lagrange multipliers λ_αβ. Substituting this solution into the dual, we arrive at

$$G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right]. \tag{5.7}$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Q_α, has a unique minimum in Q_α (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ*_α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ_αβ and then the minimum over the remaining pseudomarginals Q_β. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in


Q_β. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Q_α, Q_β} implies convexity of the dual, equation 5.7, in Q_β.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\bigl(x, \tfrac{1}{2}[y^*(x+\delta)+y^*(x-\delta)]\bigr) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Q_β when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Q_α, Q_β}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ_αβ, λ_α}. Next, we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from y*(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Q_β.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψ_α(X_α) ≤ Ψ_max for all α and X_α, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function Ḡ(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) is bounded from below for a particular choice of A_αβ satisfying equation 5.2. Considering A_αβ = (n_β − 1)/n_β, we then have

$$
\begin{aligned}
\bar G(Q_\beta) \ge{}& -\sum_\alpha \log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\left[\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log Q_\beta(x_\beta)\right] + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
\ge{}& -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\,Q_\beta(x_\beta) + \sum_\beta(n_\beta-1)\left[\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
\ge{}& -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\left[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\right] + \sum_\beta(n_\beta-1)\left[-\log\sum_{x_\beta}Q_\beta(x_\beta)+\sum_{x_\beta}Q_\beta(x_\beta)-1\right]\\
\ge{}& -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\left[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\right],
\end{aligned}
$$

where the first inequality follows by substituting the choice λ̄_αβ(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_αβ); the second from the concavity of the function y^{(n_β−1)/n_β}; and the third from the upper bound on the potentials.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta} A_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},$$


we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both β and β′ should be a subset of α, and with the convention Q*_α(x_β, x_β) = Q*_α(x_β) and Q*_α(x_β, x′_β) = 0 if x_β ≠ x′_β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

$$
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) &= \sum_{\alpha\supset\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\,\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta} + \sum_{\alpha\supset\beta} A^2_{\alpha\beta}\,\frac{Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\,Q_\beta(x'_\beta)}\\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) &= -\sum_{\alpha\supset\beta,\beta'} A_{\alpha\beta}A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\,Q_{\beta'}(x'_{\beta'})} \quad\text{for } \beta' \ne \beta,
\end{aligned}
$$

where δ_{x_β, x′_β} = 1 if and only if x_β = x′_β. Here it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_αβ and pseudomarginals Q_β.
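As a numeric sanity check on the first derivative above (and on equations 5.4 to 5.7), the sketch below builds the dual for a toy model: two binary nodes joined by two potentials, so that n_β = 2 and the uniform choice A_αβ = 1/2 satisfies equation 5.2. The particular numbers for Ψ, λ̄, and Q_β are arbitrary; the identity should hold for any of them:

```python
import math

psi = [[[1.0, 2.0], [0.5, 1.5]],   # potential a: psi[a][x0][x1]
       [[2.0, 0.3], [1.2, 0.8]]]
lam = [[[0.1, -0.2], [0.3, 0.0]],  # lambda-bar: lam[a][beta][x_beta]
       [[-0.1, 0.2], [0.0, 0.4]]]
A = 0.5                            # A_{alpha,beta} = 1/2 everywhere
Q = [[0.4, 0.7], [0.9, 0.2]]       # Q[beta][x], need not be normalized

def weights(a, Q):                 # unnormalized Q*_a(X), eq. 5.4 up to exp(lam_a - 1)
    return [[psi[a][x0][x1]
             * Q[0][x0] ** A * math.exp(lam[a][0][x0])
             * Q[1][x1] ** A * math.exp(lam[a][1][x1])
             for x1 in (0, 1)] for x0 in (0, 1)]

def G(Q):                          # the dual, eq. 5.7, with Z*_a from eq. 5.6
    val = -sum(math.log(sum(sum(row) for row in weights(a, Q))) for a in (0, 1))
    return val + sum((2 - 1) * (sum(Q[b]) - 1) for b in (0, 1))

def Qstar(a, b, x, Q):             # single-node marginal Q*_a(x_b)
    w = weights(a, Q)
    z = sum(sum(row) for row in w)
    return (sum(w[x]) if b == 0 else sum(row[x] for row in w)) / z

# analytic first derivative vs. a central finite difference, at (beta, x) = (0, 1)
b, x, h = 0, 1, 1e-6
analytic = -sum(A * Qstar(a, b, x, Q) for a in (0, 1)) / Q[b][x] + 1.0
Qp = [row[:] for row in Q]; Qp[b][x] += h
Qm = [row[:] for row in Q]; Qm[b][x] -= h
numeric = (G(Qp) - G(Qm)) / (2 * h)
print(abs(analytic - numeric) < 1e-6)   # True
```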

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Q_β, λ_αβ}, that is, conditions that guarantee

$$K \equiv \sum_{\beta,\beta'}\sum_{x_\beta, x_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta, x_{\beta'})\,S_{\beta'}(x_{\beta'}) \ge 0$$

for any choice of the "vector" S with elements S_β(x_β). Straightforward manipulations yield

$$
\begin{aligned}
\sum_{\beta,\beta'}\sum_{x_\beta,x_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta,x_{\beta'})\,S_{\beta'}(x_{\beta'})
&&& (K)\\
= \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\,Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta)
&&& (K_1)\\
+ \sum_\alpha\sum_{\beta,\beta'\subset\alpha}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'})
&&& (K_2)\\
- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\ne\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta,x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}),
&&& (K_3)
\end{aligned}
$$

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha\left[\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,R_\beta(x_\beta)\right]^2 \ge 0$$

and^5

$$
\begin{aligned}
K_3 &= \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\ne\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta,x'_{\beta'})\left\{\frac{1}{2}\bigl[R_\beta(x_\beta)-R_{\beta'}(x'_{\beta'})\bigr]^2 - \frac{1}{2}R^2_\beta(x_\beta) - \frac{1}{2}R^2_{\beta'}(x'_{\beta'})\right\}\\
&\ge -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\beta'\subset\alpha}A_{\alpha\beta'} - A_{\alpha\beta}\right) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta),
\end{aligned} \tag{6.1}
$$

we have

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(1 - \sum_{\beta'\subset\alpha}A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).$$

That is, sufficient conditions for K to be nonnegative are

$$A_{\alpha\beta} \ge 0 \;\;\forall_{\alpha,\,\beta\subset\alpha} \quad\text{and}\quad \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \;\;\forall_\alpha,$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψ_α(X_α) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q*_α(X_α) factorizes:6

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta,\beta'\subset\alpha}.$$

5. This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6. The exact marginal P_exact(X_α) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K_3 cancel with those in K_2, which is most easily seen when we combine K_2 and K_3 in a different way:

$$
\begin{aligned}
K_2 + K_3 &= \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta,x'_\beta} A^2_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)\,R_\beta(x_\beta)\,R_\beta(x'_\beta) && (K_2)\\
&\quad- \sum_\alpha\sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\ne\beta}}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\bigl[Q^*_\alpha(x_\beta,x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\bigr] R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}). && (K_3)
\end{aligned}
$$

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 ∀β⊂α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

$$\sum_{\substack{\alpha'\supset\beta\\ \alpha'\neq\alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1.$$

The net effect is equivalent to ignoring the interaction: the number of neighboring potentials nβ is reduced by 1 for all β that are part of the fake interaction α.

We have seen how we get milder, and thus better, conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since K2 ≥ 0, we will concentrate on K3.
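The recombination of K2 and K3 used above is a finite algebraic identity, so it can be checked directly. The sketch below (all variable names are hypothetical, not from the paper) draws a random pseudomarginal Q*α for a single potential over three binary nodes, random allocations Aβ, and arbitrary functions Rβ(xβ), and confirms that the explicit square K2 plus the compact form of K3 equals the diagonal-plus-covariance recombination:

```python
import itertools
import random

random.seed(0)

# One potential alpha over n = 3 binary nodes: a random joint pseudomarginal Q,
# random allocations A[beta], and arbitrary functions R[beta][x_beta].
n = 3
states = list(itertools.product([0, 1], repeat=n))
q = [random.random() for _ in states]
Q = {X: v / sum(q) for X, v in zip(states, q)}
A = [random.random() for _ in range(n)]
R = [[random.gauss(0, 1) for _ in range(2)] for _ in range(n)]

def marg1(b, x):           # single-node marginal Q*(x_beta)
    return sum(p for X, p in Q.items() if X[b] == x)

def marg2(b, x, b2, x2):   # two-node marginal Q*(x_beta, x'_beta')
    return sum(p for X, p in Q.items() if X[b] == x and X[b2] == x2)

# K2 as the explicit square.
K2 = sum(A[b] * marg1(b, x) * R[b][x] for b in range(n) for x in (0, 1)) ** 2

# K3 in compact form: the bracketed expression collapses via
# (1/2)(a - b)^2 - a^2/2 - b^2/2 = -ab.
K3 = -sum(A[b] * A[b2] * marg2(b, x, b2, x2) * R[b][x] * R[b2][x2]
          for b in range(n) for b2 in range(n) if b2 != b
          for x in (0, 1) for x2 in (0, 1))

# Recombined form: diagonal terms from K2 plus covariance-like terms from K3.
diag = sum(A[b] ** 2 * marg1(b, x) * marg1(b, x2) * R[b][x] * R[b][x2]
           for b in range(n) for x in (0, 1) for x2 in (0, 1))
cov = -sum(A[b] * A[b2] * (marg2(b, x, b2, x2) - marg1(b, x) * marg1(b2, x2))
           * R[b][x] * R[b2][x2]
           for b in range(n) for b2 in range(n) if b2 != b
           for x in (0, 1) for x2 in (0, 1))

assert abs((K2 + K3) - (diag + cov)) < 1e-12
```

Expanding the square in K2 gives all pairs (β, β′); its off-diagonal part combines with K3's joint marginals into the covariance-like term, which is exactly the recombination displayed above.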

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma; it bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\beta,\beta'\subset\alpha,\;\beta'\neq\beta}\;\forall_{x_\beta, x'_{\beta'}}, \tag{7.1}$$

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

$$\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad\text{with}\quad \omega_\alpha \equiv \max_{X_\alpha,\bar X_\alpha}\left[\psi_\alpha(X_\alpha) + (n_\alpha - 1)\,\psi_\alpha(\bar X_\alpha) - \sum_{\beta\subset\alpha}\psi_\alpha(\bar X_{\alpha\setminus\beta}, x_\beta)\right], \tag{7.2}$$

where nα ≡ ∑β⊂α 1.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X)\prod_\beta \mu_\beta(x_\beta)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta\left\{\left[\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x'_{\beta'})\right]\mu_\beta(x_\beta)\right\}} = \frac{\Psi(X)\left[\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta\sum_{X'_{\setminus\beta}}\Psi(X'_{\setminus\beta}, x_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x'_{\beta'})}, \tag{7.3}$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n-1)\log\left[\sum_X \Psi(X)\prod_{\beta=1}^n \mu_\beta(x_\beta)\right] - \sum_{\beta=1}^n \log\left[\sum_{X_{\setminus\beta}}\Psi(X_{\setminus\beta}, x^*_\beta)\prod_{\beta'\neq\beta}\mu_{\beta'}(x_{\beta'})\right]$$

with respect to the messages μ, under constraints ∑xβ μβ(xβ) = 1 for all β and μβ(xβ) ≥ 0 for all β and xβ, occurs at an extreme point μβ(xβ) = δxβ,x̄β for some x̄β to be found.

Proof. Let us consider optimizing the message μ1(x1) with fixed messages μβ(xβ) for β > 1. The first and second derivatives are easily found to obey

$$\frac{\partial V}{\partial\mu_1(x_1)} \propto (n-1)\,Q(x_1) - \sum_{\beta\neq 1} Q(x_1|x^*_\beta),$$
$$\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} \propto -(n-1)\,Q(x_1)\,Q(x'_1) + \sum_{\beta\neq 1} Q(x_1|x^*_\beta)\,Q(x'_1|x^*_\beta),$$


where

$$Q(X) \equiv \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)}.$$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

$$(n-1)\,Q(x_1) - \sum_{\beta\neq 1} Q(x_1|x^*_\beta) = \lambda,$$

with λ a Lagrange multiplier implementing the constraint ∑x1 μ1(x1) = 1. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix of second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} \propto \frac{1}{2(n-1)}\sum_{\beta\neq 1}\sum_{\substack{\beta'\neq 1\\ \beta'\neq\beta}}\left[Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'})\right]\left[Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'})\right],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μβ(xβ), β > 1, it follows by induction that the maximum with respect to all μβ(xβ) must be at an extreme point as well.
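Lemma 3 can be illustrated numerically: for a random potential, the value of V at any interior set of messages never exceeds the maximum of V over delta (extreme-point) messages. A minimal sketch (variable names hypothetical; requires Python 3.8+ for `math.prod`):

```python
import itertools
import math
import random

random.seed(1)
n = 3
states = list(itertools.product([0, 1], repeat=n))
psi = {X: random.uniform(-1, 1) for X in states}     # psi = log Psi
Psi = {X: math.exp(v) for X, v in psi.items()}
xstar = (0, 1, 0)                                    # the fixed values x*_beta

def V(mu):
    # (n-1) log sum_X Psi(X) prod_b mu_b(x_b)
    #   - sum_b log sum_{X minus b} Psi(X minus b, x*_b) prod_{b' != b} mu_b'(x_b')
    z = sum(Psi[X] * math.prod(mu[b][X[b]] for b in range(n)) for X in states)
    out = (n - 1) * math.log(z)
    for b in range(n):
        zb = sum(Psi[X] * math.prod(mu[b2][X[b2]] for b2 in range(n) if b2 != b)
                 for X in states if X[b] == xstar[b])
        out -= math.log(zb)
    return out

# Maximum over extreme points (delta messages), as lemma 3 predicts.
best_extreme = max(V([[1.0 if x == xb else 0.0 for x in (0, 1)] for xb in Xbar])
                   for Xbar in states)

# Random interior messages never exceed it.
for _ in range(2000):
    mu = [[p, 1 - p] for p in (random.uniform(0.01, 0.99) for _ in range(n))]
    assert V(mu) <= best_extreme + 1e-9
```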

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by maximization over values X̄:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar X} \frac{\Psi(X)\left[\Psi(\bar X)\right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus\beta}, x_\beta)}.$$

Next, we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

$$\frac{1}{1-\sigma} \equiv \max_{X,\mu}\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\bar X}\frac{\Psi(X)\left[\Psi(\bar X)\right]^{n-1}}{\prod_\beta\Psi(\bar X_{\setminus\beta}, x_\beta)}. \tag{7.4}$$

2402 T Heskes

The inequality 7.1 then follows by summing out X\{β,β′} in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1-\sigma) = \max_{X,\bar X} W(X,\bar X) \quad\text{with}\quad W(X,\bar X) = \psi(X) + (n-1)\,\psi(\bar X) - \sum_\beta \psi(\bar X_{\setminus\beta}, x_\beta),$$

where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$\text{if } \tilde\Psi(X) = \Psi(X)\prod_\beta\mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̄ that differ in fewer than two nodes. To see this, consider

$$W(x_1, x_2, \bar x_{\setminus 12};\; \bar x_1, \bar x_2, \bar x_{\setminus 12}) = \psi(x_1, x_2, \bar x_{\setminus 12}) + \psi(\bar x_1, \bar x_2, \bar x_{\setminus 12}) - \psi(x_1, \bar x_2, \bar x_{\setminus 12}) - \psi(\bar x_1, x_2, \bar x_{\setminus 12}) = -W(\bar x_1, x_2, \bar x_{\setminus 12};\; x_1, \bar x_2, \bar x_{\setminus 12}).$$

If now also x2 = x̄2, we get W(x1, x̄2, x̄\12; x̄1, x̄2, x̄\12) = −W(x̄1, x̄2, x̄\12; x1, x̄2, x̄\12) = 0. Furthermore, if W(x1, x2, x̄\12; x̄1, x̄2, x̄\12) ≤ 0, then it must be that W(x̄1, x2, x̄\12; x1, x̄2, x̄\12) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.

Uniqueness of Loopy Belief Propagation Fixed Points 2403

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1||x2|(|x1|−1)(|x2|−1)/4 combinations. And indeed, for binary nodes x1,2 ∈ {0, 1}, we immediately obtain

$$\omega = \left|\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)\right|. \tag{7.5}$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
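Since the definition of ω in equation 7.2 is a finite maximization, the properties above are easy to check by brute force. The sketch below (function and variable names are hypothetical) enumerates all pairs (X, X̄) and confirms both ω = |w| for a binary pairwise Boltzmann factor and the temperature scaling ω(T) = ω(1)/T:

```python
import itertools
import math

def strength(psi, nstates):
    """omega of equation 7.2 by brute-force enumeration; psi maps state tuples
    to log-potential values, nstates gives the number of states per node."""
    n = len(nstates)
    states = list(itertools.product(*[range(k) for k in nstates]))
    omega = 0.0
    for X in states:
        for Xbar in states:
            w = psi[X] + (n - 1) * psi[Xbar]
            for beta in range(n):
                mixed = list(Xbar)
                mixed[beta] = X[beta]       # Xbar with node beta set to X[beta]
                w -= psi[tuple(mixed)]
            omega = max(omega, w)
    return omega

# Binary pairwise Boltzmann factor: psi(x1, x2) = w x1 x2 + th1 x1 + th2 x2.
w, th1, th2 = 1.5, 0.3, -0.7
psi = {(x1, x2): w * x1 * x2 + th1 * x1 + th2 * x2
       for x1 in (0, 1) for x2 in (0, 1)}
omega = strength(psi, (2, 2))               # the thresholds drop out: omega = |w|
sigma = 1.0 - math.exp(-omega)

# Temperature scaling: omega(T) = omega(1) / T, here with T = 2.
omega_T2 = strength({X: v / 2.0 for X, v in psi.items()}, (2, 2))
```

The double loop over states is exponential in the number of nodes, which is fine here: the strength is a local quantity of a single potential.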

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix Aαβ between potentials α and nodes β with properties

$$\begin{aligned}
&1.\; A_{\alpha\beta} \geq 0 \;\forall_{\alpha,\beta\subset\alpha} &&\text{(positivity)}\\
&2.\; (1-\sigma_\alpha)\max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha\sum_{\beta\subset\alpha} A_{\alpha\beta} \leq 1 \;\forall_\alpha &&\text{(sufficient amount of resources)}\\
&3.\; \sum_{\alpha\supset\beta} A_{\alpha\beta} \geq n_\beta - 1 \;\forall_\beta &&\text{(sufficient compensation)}
\end{aligned} \tag{8.1}$$

with the strength σα a function of the potential Ψα(Xα), as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with


extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K2 + K3 ≥ 0 for any choice of Rβ(xβ).

Substituting the bound, equation 7.1, into the term K3, we obtain

$$K_3 \geq -\sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\neq\beta}} \sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,\sigma_\alpha\, Q^*_\alpha(x_\beta,x'_{\beta'})\, R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'})$$
$$\geq -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left(\sum_{\substack{\beta'\subset\alpha\\ \beta'\neq\beta}} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),$$

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, combining K1 and (the above lower bound on) K3, we get

$$K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\left[1 - A_{\alpha\beta} - \sigma_\alpha\sum_{\beta'\neq\beta} A_{\alpha\beta'}\right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).$$

This implies that K is nonnegative if

$$(1-\sigma_\alpha)A_{\alpha\beta} + \sigma_\alpha\sum_{\beta'\subset\alpha} A_{\alpha\beta'} \leq 1 \;\forall_{\alpha,\beta\subset\alpha},$$

which, in combination with Aαβ ≥ 0 and σα ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σα = 0, condition 2 becomes maxβ⊂α Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3.
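For pairwise potentials, checking theorem 4 for a candidate allocation is mechanical, since each of the three conditions is a finite inequality. A sketch (function and variable names hypothetical) that also covers the toroidal grid of section 8.3:

```python
def theorem4_holds(edges, sigma, A):
    """Check the three conditions of equation 8.1 for pairwise potentials.
    edges[a] is the node pair of potential a, sigma[a] its strength, and
    A[a] a dict mapping each of its nodes beta to the allocation A_{a,beta}."""
    nodes = {b for e in edges for b in e}
    # 1. positivity
    if any(A[a][b] < 0 for a in range(len(edges)) for b in edges[a]):
        return False
    # 2. (1 - sigma) max_beta A + sigma sum_beta A <= 1, per potential
    for a, e in enumerate(edges):
        vals = [A[a][b] for b in e]
        if (1 - sigma[a]) * max(vals) + sigma[a] * sum(vals) > 1 + 1e-12:
            return False
    # 3. sum over potentials containing beta of A >= n_beta - 1, per node
    for b in nodes:
        touching = [a for a, e in enumerate(edges) if b in e]
        if sum(A[a][b] for a in touching) < len(touching) - 1 - 1e-12:
            return False
    return True

# 3 x 3 toroidal Ising grid with the uniform allocation A = 3/4 of section 8.3.
edges = []
for i in range(3):
    for j in range(3):
        edges.append((3 * i + j, 3 * i + (j + 1) % 3))    # right neighbor
        edges.append((3 * i + j, 3 * ((i + 1) % 3) + j))  # down neighbor
A = [{b: 0.75 for b in e} for e in edges]

ok_third = theorem4_holds(edges, [1 / 3] * 18, A)  # sigma = 1/3: conditions hold
ok_forty = theorem4_holds(edges, [0.4] * 18, A)    # sigma = 0.4: condition 2 fails
```

At σ = 1/3, condition 2 is tight for the uniform allocation, (1 − 1/3)(3/4) + (1/3)(3/2) = 1, which is exactly the boundary value found in section 8.3.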

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs


with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha \Psi_\alpha(X_\alpha)\prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, where there are no self-potentials Ψβ(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\left(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\right) < 2 \;\forall_\beta. \tag{8.2}$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha < 4 \;\forall_\beta, \tag{8.3}$$

with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψβ(xβ). In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\phi_{\alpha\beta}(x_\beta),$$

where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as


well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$\min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\right\} = \min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\left[\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)\right] - \min_{x_1,x_2}\left[\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)\right]\right\}.$$

In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2, which yield the smallest gap, are such that

$$\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar x_1,\bar x_2) + \phi_1(\bar x_1) + \phi_2(\bar x_2) \geq \psi(x_1,\bar x_2) + \phi_1(x_1) + \phi_2(\bar x_2) = \psi(\bar x_1,x_2) + \phi_1(\bar x_1) + \phi_2(x_2) \tag{8.4}$$

for some x1, x2, x̄1, and x̄2, with x̄1 ≠ x1 and x̄2 ≠ x2. Solving for φ1 and φ2, we find

$$\phi_1(x_1) - \phi_1(\bar x_1) = \frac12\left[\psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2)\right],$$
$$\phi_2(x_2) - \phi_2(\bar x_2) = \frac12\left[\psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2)\right].$$

Substitution back into equation 8.4 yields

$$\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\bar x_2) - \phi_1(x_1) - \phi_2(\bar x_2) = \frac12\left[\psi(x_1,x_2) + \psi(\bar x_1,\bar x_2) - \psi(x_1,\bar x_2) - \psi(\bar x_1,x_2)\right],$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\phi_1,\phi_2}\left\{\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\right\} = \frac12\left|\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)\right| = \frac{\omega}{2},$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha \leq 1 \;\forall_\beta, \tag{8.5}$$

with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2)

$$(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha\supset\beta}(1-\sigma_\alpha) \geq \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta}\sigma_\alpha \leq 1. \tag{8.6}$$

Since ωα = −log(1 − σα) ≥ σα, condition 8.5 is weaker than condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002), valid for binary pairwise potentials and strengthened as above, are at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical αcritical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$\omega = 2\log\left[\frac{\alpha}{1-\alpha}\right] \quad\text{and thus}\quad \sigma = 1 - \left(\frac{1-\alpha}{\alpha}\right)^2.$$


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops (left). The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to Aαβ = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \leq \frac13 \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad\text{and thus}\quad \alpha < \frac{1}{1+e^{-1/2}} \approx 0.62.$$

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical αcritical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\sigma + \frac34 \leq 1 \quad\text{and}\quad \frac12\sigma + A \leq 1.$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

$$\sigma \leq \frac12 \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$\sigma \leq \sqrt{\frac12} \quad\text{and thus}\quad \alpha \leq \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3 and to be compared with the critical αcritical ≈ 0.88.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. Still, they have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of the potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlations of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter, and thus closer to necessary conditions.

• The conditions guarantee convexity of the dual G(Qβ, λαβ) with respect to Qβ. But in fact, we need only G(Qβ) ≡ maxλαβ G(Qβ, λαβ) to be convex, which is a weaker requirement. The Hessian of G(Qβ), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of Aαβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega\begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = e^{w_{ij}/4} \;\text{ if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = e^{-w_{ij}/4} \;\text{ if } x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point, with Pi(xi) = 0.5 for all nodes i and xi = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
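The weak-coupling regime of this experiment is easy to reproduce. The sketch below (all names hypothetical) runs damped sum-product on the four-node Boltzmann machine above; note that the damping here is applied directly to the normalized messages, which need not match the parameterization of equation 3.9, so the exact transition weights of Figure 4 are not expected to be reproduced:

```python
import math
import random

def loopy_bp(omega, step=0.5, iters=10000, tol=1e-10, seed=0):
    """Damped loopy BP on the four-node Boltzmann machine from the text."""
    rng = random.Random(seed)
    pattern = {(0, 1): 1, (0, 2): -1, (0, 3): -1, (1, 2): 1, (1, 3): -1, (2, 3): -1}
    # Psi_ij(xi, xj) = exp(w/4) if xi == xj else exp(-w/4), with w = omega * pattern.
    psi = {e: [[math.exp(omega * s / 4), math.exp(-omega * s / 4)],
               [math.exp(-omega * s / 4), math.exp(omega * s / 4)]]
           for e, s in pattern.items()}
    nbrs = {i: [j for j in range(4) if j != i] for i in range(4)}
    # Directed messages m[(i, j)](x_j), randomly initialized and normalized.
    m = {}
    for i in range(4):
        for j in nbrs[i]:
            v = [rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)]
            m[(i, j)] = [v[0] / sum(v), v[1] / sum(v)]
    delta = 1.0
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            pot = psi[(min(i, j), max(i, j))]   # symmetric, so order is irrelevant
            msg = [sum(pot[xi][xj] *
                       math.prod(m[(k, i)][xi] for k in nbrs[i] if k != j)
                       for xi in (0, 1))
                   for xj in (0, 1)]
            s = sum(msg)
            new[(i, j)] = [(1 - step) * m[(i, j)][x] + step * msg[x] / s
                           for x in (0, 1)]
        delta = max(abs(new[k][x] - m[k][x]) for k in m for x in (0, 1))
        m = new
        if delta < tol:
            break
    # Single-node beliefs P_i(x_i) from the incoming messages.
    beliefs = []
    for i in range(4):
        b = [math.prod(m[(k, i)][x] for k in nbrs[i]) for x in (0, 1)]
        beliefs.append([v / sum(b) for v in b])
    return beliefs, delta

# Weak couplings: converges to the trivial fixed point P_i = (0.5, 0.5).
beliefs, delta = loopy_bp(0.3)
```

For large ω (say, 6) with step size 1, the same routine typically fails to settle, mirroring the limit-cycle behavior described in the text; only the weak-coupling outcome is asserted here.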

7. Note that the conditions for guaranteed uniqueness imply ω < 4/3 for corollary 3 and ω ≤ log 2 ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical αcritical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.


Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Proof. The proof is by construction. Choose one of the leaf nodes as the root β* and define

Aαβ = 1 iff β ⊂ α and β is closer to the root β* than any other β′ ⊂ α;
Aαβ′ = 0 for all other β′.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β* there is just a single potential α ⊃ β that is closer to the root β* than β itself (see the illustration in Figure 2), and thus there are precisely nβ − 1 contributions Aαβ = 1. The root itself gets nβ* contributions Aαβ* = 1, which is even better. Hence, condition 3 is also satisfied:

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\forall_{\beta\neq\beta^*} \quad\text{and}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} > n_{\beta^*} - 1.$$

With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again, the proof is by construction. Break the loop at one particular place: that is, remove one node β* from a potential α* such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β* as the root. The matrix A constructed in this way also works for the graph with the closed loop, since still

$$\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1 \;\forall_{\beta\neq\beta^*}, \quad\text{and now}\quad \sum_{\alpha\supset\beta^*} A_{\alpha\beta^*} = n_{\beta^*} - 1.$$

It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions.

4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − nβ, with nβ the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with Aαβ = 1 if there is an arrow and Aαβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely nβ − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − nβ. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

$$\sum_{\beta\in B}(1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \geq 0, \tag{4.2}$$

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set $B$,
\[
\sum_{\beta \in B} (n_\beta - 1) \leq \sum_{\beta \in B} \sum_{\alpha \supset \beta} A_{\alpha\beta} \leq \sum_{\alpha \in \pi(B)} \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq \sum_{\alpha \in \pi(B)} 1 ,
\]
where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
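The condition of theorem 2 is straightforward to check mechanically on small graphs by enumerating all node subsets $B$. The sketch below (hypothetical helper name; pairwise potentials represented as tuples of node ids) illustrates how a single loop passes the condition while two loops sharing a node fail, matching the discussion above.

```python
import itertools

def bethe_convexity_condition(potentials, nodes):
    """Theorem 2: sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0 for every node set B."""
    nodes = list(nodes)
    # n_beta: number of potentials containing node beta
    n = {b: sum(b in a for a in potentials) for b in nodes}
    for r in range(1, len(nodes) + 1):
        for B in itertools.combinations(nodes, r):
            # pi(B): potentials that include at least one node of B
            parents = sum(1 for a in potentials if any(b in a for b in B))
            if sum(1 - n[b] for b in B) + parents < 0:
                return False
    return True

# A single loop satisfies the condition; two loops sharing a node violate it.
single_loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
two_loops = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
print(bethe_convexity_condition(single_loop, range(4)))   # True
print(bethe_convexity_condition(two_loops, range(5)))     # False
```

Note that the violating subset for the two-loop graph is exactly the full connected component, as used in the proof.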

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend only on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

\[
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) , \tag{5.1}
\]
where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies
\[
\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1 . \tag{5.2}
\]
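As a quick numerical sanity check (a sketch with a hypothetical two-potential chain, not code from the paper), the form in equation 5.1 agrees with the standard Bethe free energy $\sum_\alpha \sum_{X_\alpha} Q_\alpha \log(Q_\alpha/\Psi_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta \log Q_\beta$ whenever the pseudomarginals are consistent and $A_{\alpha\beta}$ satisfies equation 5.2:

```python
import itertools, math, random

random.seed(0)
vals = [0, 1]
# Two pair potentials sharing the middle node y: alpha1 = {x, y}, alpha2 = {y, z}
psi1 = {s: random.uniform(0.5, 2.0) for s in itertools.product(vals, vals)}
psi2 = {s: random.uniform(0.5, 2.0) for s in itertools.product(vals, vals)}

# Consistent pseudomarginals, obtained here as exact marginals of the joint
joint = {(x, y, z): psi1[x, y] * psi2[y, z]
         for x, y, z in itertools.product(vals, vals, vals)}
Z = sum(joint.values())
Q1 = {(x, y): sum(joint[x, y, z] for z in vals) / Z for x, y in itertools.product(vals, vals)}
Q2 = {(y, z): sum(joint[x, y, z] for x in vals) / Z for y, z in itertools.product(vals, vals)}
Qy = {y: sum(Q1[x, y] for x in vals) for y in vals}   # n_y = 2; all other n_beta = 1

# Standard Bethe free energy
F_std = (sum(q * math.log(q / psi1[s]) for s, q in Q1.items())
         + sum(q * math.log(q / psi2[s]) for s, q in Q2.items())
         - (2 - 1) * sum(q * math.log(q) for q in Qy.values()))

# Equation 5.1 with the allocation A_{alpha1,y} = A_{alpha2,y} = 1/2 (equation 5.2)
F_51 = (sum(q * math.log(q / psi1[s]) for s, q in Q1.items())
        + sum(q * math.log(q / psi2[s]) for s, q in Q2.items())
        - 0.5 * sum(sum(Q1[x, y] for x in vals) * math.log(Qy[y]) for y in vals)
        - 0.5 * sum(sum(Q2[y, z] for z in vals) * math.log(Qy[y]) for y in vals))

assert abs(F_std - F_51) < 1e-10
```

The two forms differ only when the pseudomarginals are inconsistent, which is exactly the freedom the allocation matrix exploits.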

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

\[
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = {} & -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) \\
& - \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) \\
& + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta} Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] \\
& + \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] .
\end{aligned} \tag{5.3}
\]

³ We would like to conjecture that this is not possible, that is, that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:⁴
\[
\min_{Q_\alpha, Q_\beta} \max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta} \max_{\lambda_{\alpha\beta}, \lambda_\alpha} \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) .
\]

Rewriting
\[
\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta} Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta) Q_\alpha(x_\beta) ,
\]
with
\[
\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1} \sum_{\alpha' \supset \beta} \lambda_{\alpha'\beta}(x_\beta) ,
\]

we can easily solve for the minimum with respect to $Q_\alpha$:

\[
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right] . \tag{5.4}
\]

⁴ In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual",

\[
\begin{aligned}
G(Q_\beta, \bar\lambda_{\alpha\beta}, \lambda_\alpha) & \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
& = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right] \\
& \quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] .
\end{aligned} \tag{5.5}
\]

Next, we find for the maximum with respect to $\lambda_\alpha$:

\[
\exp\left[ 1 - \lambda^*_\alpha \right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right] \equiv Z^*_\alpha , \tag{5.6}
\]
where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\bar\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

\[
G(Q_\beta, \bar\lambda_{\alpha\beta}) \equiv G(Q_\beta, \bar\lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] . \tag{5.7}
\]

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\bar\lambda_{\alpha\beta}$ and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \operatorname{argmin}_y f(x, y)$,
\[
f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \geq 2 f\left(x, \tfrac{1}{2}\left[y^*(x + \delta) + y^*(x - \delta)\right]\right) \geq 2 f(x, y^*(x)) ,
\]
where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\bar\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of such a convex-concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \operatorname{argmax}_y f(x, y)$,
\[
f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \geq f(x + \delta, y^*(x)) + f(x - \delta, y^*(x)) \geq 2 f(x, y^*(x)) ,
\]
where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$ and the second from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions of theorem 1. However, the inequalities in the above proof suggest a little space, which will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \leq \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\bar\lambda_{\alpha\beta}} G(Q_\beta, \bar\lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta - 1}{n_\beta}$, we then have
\[
\begin{aligned}
G(Q_\beta) & \geq -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
& \geq -\sum_\alpha \log \left[ \Psi_{\max} \prod_{\beta \subset \alpha} |x_\beta|^{1/n_\beta} \right] + \sum_\beta (n_\beta - 1) \left[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
& \geq -\sum_\alpha \log \left[ \Psi_{\max} \prod_{\beta \subset \alpha} |x_\beta|^{1/n_\beta} \right] ,
\end{aligned}
\]
where $|x_\beta|$ denotes the number of states of node $\beta$. The first inequality follows by substituting the choice $\bar\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \bar\lambda_{\alpha\beta})$; the second from the upper bound on the potentials in combination with the concavity of the function $y^{(n_\beta - 1)/n_\beta}$, which gives $\sum_{x_\beta} Q_\beta(x_\beta)^{(n_\beta - 1)/n_\beta} \leq |x_\beta|^{1/n_\beta} \left[\sum_{x_\beta} Q_\beta(x_\beta)\right]^{(n_\beta - 1)/n_\beta}$; and the third from $z - 1 - \log z \geq 0$, applied to $z = \sum_{x_\beta} Q_\beta(x_\beta)$.
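For intuition, the boundedness can be probed numerically on a small tree-structured example (a sketch; the chain of two random pair potentials sharing one node is an assumption of this illustration, not taken from the paper). The Bethe free energy of random locally consistent pseudomarginals never drops below a bound of the form $-\sum_\alpha \log[\Psi_{\max} \prod_{\beta \subset \alpha} |x_\beta|^{1/n_\beta}]$:

```python
import itertools, math, random

random.seed(0)
vals = [0, 1]
# Chain of two pair potentials sharing node y: alpha1 = {x, y}, alpha2 = {y, z}
psi1 = {s: random.uniform(0.5, 2.0) for s in itertools.product(vals, vals)}
psi2 = {s: random.uniform(0.5, 2.0) for s in itertools.product(vals, vals)}
psi_max = max(max(psi1.values()), max(psi2.values()))

# Each potential contributes log(psi_max * |x|^(1/1) * |y|^(1/2)) = log(psi_max * 2 * sqrt(2))
bound = -2 * math.log(psi_max * 2 * math.sqrt(2))

def bethe(Q1, Q2, Qy):
    return (sum(q * math.log(q / psi1[s]) for s, q in Q1.items())
            + sum(q * math.log(q / psi2[s]) for s, q in Q2.items())
            - sum(q * math.log(q) for q in Qy.values()))

for _ in range(200):
    # Random locally consistent pseudomarginals: marginals of a random positive joint
    joint = {s: random.uniform(0.01, 1.0) for s in itertools.product(vals, vals, vals)}
    Z = sum(joint.values())
    Q1 = {(x, y): sum(joint[x, y, z] for z in vals) / Z for x, y in itertools.product(vals, vals)}
    Q2 = {(y, z): sum(joint[x, y, z] for x in vals) / Z for y, z in itertools.product(vals, vals)}
    Qy = {y: sum(Q1[x, y] for x in vals) for y in vals}
    assert bethe(Q1, Q2, Qy) >= bound
```

On this tree the minimum of the Bethe free energy is $-\log Z$, which indeed lies above the bound.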

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

\[
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1) ,
\]

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

\[
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta) \, \partial Q_{\beta'}(x'_{\beta'})} ,
\]
we make use of

\[
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})} ,
\]
where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the convention $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \neq x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

\[
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) & = \sum_{\alpha \supset \beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)} \, \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta) Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta) Q_\beta(x'_\beta)} \\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) & = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta) Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta ,
\end{aligned}
\]
where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\bar\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \bar\lambda_{\alpha\beta}\}$, that is, conditions that guarantee

\[
K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta) H_{\beta\beta'}(x_\beta, x'_{\beta'}) S_{\beta'}(x'_{\beta'}) \geq 0
\]

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield
\[
\begin{aligned}
K & = \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta) H_{\beta\beta'}(x_\beta, x'_{\beta'}) S_{\beta'}(x'_{\beta'}) \\
& = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) && (K_1) \\
& \quad + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) && (K_2) \\
& \quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) , && (K_3)
\end{aligned}
\]
where $R_\beta(x_\beta) \equiv S_\beta(x_\beta) / Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

\[
K_2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} Q^*_\alpha(x_\beta) R_\beta(x_\beta) \right]^2 \geq 0
\]
and⁵

\[
\begin{aligned}
K_3 & = \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) \left\{ \frac{1}{2} \left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \\
& \geq -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) ,
\end{aligned} \tag{6.1}
\]

we have
\[
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) .
\]

That is, sufficient conditions for $K$ to be nonnegative are
\[
A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha ,
\]
precisely the conditions of theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction", such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶
\[
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha} .
\]

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:
\[
\begin{aligned}
K_2 + K_3 & = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_\beta) R_\beta(x_\beta) R_\beta(x'_\beta) && (\bar K_2) \\
& \quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) . && (\bar K_3)
\end{aligned}
\]

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \geq 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into
\[
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1 .
\]
The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder, and thus better, conditions when there is effectively no interaction. Motivated by this "success", we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $\bar K_2$ and $\bar K_3$, where, since $\bar K_2 \geq 0$, we will concentrate on $\bar K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $\bar K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound
\[
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \; \forall_{x_\beta, x'_{\beta'}} , \tag{7.1}
\]
with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:
\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha) , \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \bar X_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1) \psi_\alpha(\bar X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\bar X_{\alpha \setminus \beta}, x_\beta) \right] , \tag{7.2}
\]
where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.

Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:
\[
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \left[ \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \right] \mu_\beta(x_\beta)} = \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})} , \tag{7.3}
\]
where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times the incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function
\[
V(\mu) = (n - 1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right] - \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \right]
\]
with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \geq 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point, $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar x_\beta}$ for some $\bar x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. Up to positive factors, the first and second derivatives are easily found to obey
\[
\begin{aligned}
\frac{\partial V}{\partial \mu_1(x_1)} & \propto (n - 1) Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) \\
\frac{\partial^2 V}{\partial \mu_1(x_1) \, \partial \mu_1(x'_1)} & \propto -(n - 1) Q(x_1) Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) Q(x'_1 | x^*_\beta) ,
\end{aligned}
\]

where
\[
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)} .
\]

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey
\[
(n - 1) Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda ,
\]
with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix with second derivatives at such an extremum, we then have

\[
\frac{\partial^2 V}{\partial \mu_1(x_1) \, \partial \mu_1(x'_1)} \propto \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) Q(x'_1 | x^*_\beta) - \frac{1}{n - 1} \sum_{\beta \neq 1} \sum_{\beta' \neq 1} Q(x_1 | x^*_\beta) Q(x'_1 | x^*_{\beta'}) ,
\]
which is positive semidefinite (by the Cauchy-Schwarz inequality): the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\bar X$:
\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus\beta}, x_\beta)} .
\]

Next, we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through
\[
\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus\beta}, x_\beta)} . \tag{7.4}
\]


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta,\beta'\}}$ in
\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma Q^*(X) .
\]
The form of equation 7.2 then follows by rewriting equation 7.4 as
\[
\omega \equiv -\log(1 - \sigma) = \max_{X, \bar X} W(X, \bar X) , \quad \text{with} \quad W(X, \bar X) = \psi(X) + (n - 1) \psi(\bar X) - \sum_\beta \psi(\bar X_{\setminus\beta}, x_\beta) ,
\]
where we recall that $\psi(X) \equiv \log \Psi(X)$.
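Equation 7.4 can be probed numerically (a sketch with an arbitrary random three-node potential; variable names are illustrative): for any strictly positive messages, the pseudomarginal ratio stays below the maximum attained at the extreme points.

```python
import itertools, math, random

random.seed(0)
n = 3                                   # one potential over three binary nodes
states = list(itertools.product([0, 1], repeat=n))
Psi = {s: math.exp(random.uniform(-1.0, 1.0)) for s in states}

# Right-hand side of equation 7.4: maximum over X and X-bar
bound = max(
    Psi[X] * Psi[Xb] ** (n - 1)
    / math.prod(Psi[tuple(X[i] if i == b else Xb[i] for i in range(n))] for b in range(n))
    for X in states for Xb in states)

# Left-hand side for given strictly positive messages mu
def ratio(mu):
    Q = {s: Psi[s] * math.prod(mu[b][s[b]] for b in range(n)) for s in states}
    Z = sum(Q.values())
    Q = {s: q / Z for s, q in Q.items()}
    marg = [[sum(Q[s] for s in states if s[b] == v) for v in (0, 1)] for b in range(n)]
    return max(Q[s] / math.prod(marg[b][s[b]] for b in range(n)) for s in states)

for _ in range(500):
    mu = [[random.uniform(0.01, 1.0), random.uniform(0.01, 1.0)] for _ in range(n)]
    assert ratio(mu) <= bound + 1e-9
```

Random sampling of course does not prove lemma 3; it merely illustrates that the extreme-point maximum is never exceeded.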

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,
\[
\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta) , \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu .
\]
This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar X$ that differ in fewer than two nodes. To see this, consider
\[
\begin{aligned}
W(x_1, x_2, \bar X_{\setminus 12}; \bar x_1, \bar x_2, \bar X_{\setminus 12}) & = \psi(x_1, x_2, \bar X_{\setminus 12}) + \psi(\bar x_1, \bar x_2, \bar X_{\setminus 12}) - \psi(x_1, \bar x_2, \bar X_{\setminus 12}) - \psi(\bar x_1, x_2, \bar X_{\setminus 12}) \\
& = -W(\bar x_1, x_2, \bar X_{\setminus 12}; x_1, \bar x_2, \bar X_{\setminus 12}) .
\end{aligned}
\]
If now also $x_2 = \bar x_2$, we get $W(x_1, \bar x_2, \bar X_{\setminus 12}; \bar x_1, \bar x_2, \bar X_{\setminus 12}) = -W(\bar x_1, \bar x_2, \bar X_{\setminus 12}; x_1, \bar x_2, \bar X_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, \bar X_{\setminus 12}; \bar x_1, \bar x_2, \bar X_{\setminus 12}) \leq 0$, then it must be that $W(\bar x_1, x_2, \bar X_{\setminus 12}; x_1, \bar x_2, \bar X_{\setminus 12}) \geq 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \leq \omega < \infty$ and $0 \leq \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1| |x_2| (|x_1| - 1)(|x_2| - 1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain
\[
\omega = |\psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0)| . \tag{7.5}
\]
Any pairwise binary potential can be written as a Boltzmann factor,
\[
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2] .
\]
In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight". It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models, there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
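The strength in equation 7.2 is easy to compute by brute force for small potentials. The sketch below (illustrative names and an arbitrary choice of weight and thresholds) checks the pairwise binary shortcut $\omega = |w|$ of equation 7.5 against full enumeration:

```python
import itertools, math

def strength_omega(psi, n):
    """Equation 7.2 by enumeration: psi maps each joint state (tuple) to log Psi."""
    states = list(psi)
    return max(
        psi[X] + (n - 1) * psi[Xb]
        - sum(psi[tuple(X[i] if i == b else Xb[i] for i in range(n))] for b in range(n))
        for X in states for Xb in states)

w, th1, th2 = 1.3, -0.7, 0.4           # arbitrary weight and thresholds
psi = {(x1, x2): w * x1 * x2 + th1 * x1 + th2 * x2
       for x1, x2 in itertools.product([0, 1], repeat=2)}

omega = strength_omega(psi, 2)
sigma = 1 - math.exp(-omega)
print(round(omega, 10))                 # 1.3, i.e. |w|, independent of the thresholds
```

The same enumeration applies unchanged to potentials over more than two nodes or more than two states, just with more terms.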

8 Conditions for Uniqueness

81 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha, \beta \subset \alpha}$ (positivity),
2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \geq n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + \bar K_2 + \bar K_3 \geq 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $\bar K_3$, we obtain
\[
\begin{aligned}
\bar K_3 & \geq -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) \\
& \geq -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\substack{\beta' \subset \alpha \\ \beta' \neq \beta}} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) ,
\end{aligned}
\]
where in the last step we applied the same trick as in equation 6.1. Since $\bar K_2 \geq 0$, combining $K_1$ and (the above lower bound on) $\bar K_3$, we get

\[
K = K_1 + \bar K_2 + \bar K_3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \right] Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) .
\]

$K$ is thus guaranteed to be nonnegative if
\[
(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \leq 1 \quad \forall_{\alpha, \beta \subset \alpha} ,
\]
which in combination with $A_{\alpha\beta} \geq 0$ and $\sigma_\alpha \leq 1$ yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if we set $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
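Finding an allocation matrix satisfying equation 8.1 is in general a linear feasibility problem. A simple sufficient check can be coded directly (a sketch; the "egalitarian" allocation $A_{\alpha\beta} = (n_\beta - 1)/n_\beta$ satisfies conditions 1 and 3 by construction, so only condition 2 needs verification, and the ferromagnetic strength formula below anticipates the illustration in section 8.3):

```python
def uniqueness_sufficient(potentials, sigma):
    """Sufficient check of theorem 4 with the fixed allocation A = (n_b - 1)/n_b.
    potentials: list of tuples of node ids; sigma: strength per potential index."""
    n = {}
    for a in potentials:
        for b in a:
            n[b] = n.get(b, 0) + 1
    for i, a in enumerate(potentials):
        A = [(n[b] - 1) / n[b] for b in a]
        # Condition 2: (1 - sigma)*max(A) + sigma*sum(A) <= 1
        if (1 - sigma[i]) * max(A) + sigma[i] * sum(A) > 1 + 1e-12:
            return False
    return True

# 3x3 Ising grid with toroidal boundary conditions: every node in 4 pair potentials
edges = [((i, j), ((i + 1) % 3, j)) for i in range(3) for j in range(3)] + \
        [((i, j), (i, (j + 1) % 3)) for i in range(3) for j in range(3)]

def sigma_of(alpha):                    # uniform ferromagnetic pair potential
    return 1 - ((1 - alpha) / alpha) ** 2

print(uniqueness_sufficient(edges, [sigma_of(0.54)] * len(edges)))   # True
print(uniqueness_sufficient(edges, [sigma_of(0.62)] * len(edges)))   # False
```

A failure of this check does not contradict theorem 4, since a smarter allocation may still exist; solving the full linear program is needed for a definitive answer.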

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization
\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta) ,
\]
to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall_\beta . \tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta , \tag{8.3}
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice
\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta) ,
\]
where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right] = \min_{\phi_1, \phi_2} \left\{ \max_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] - \min_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] \right\} .
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that
\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar x_1, \bar x_2) + \phi_1(\bar x_1) + \phi_2(\bar x_2) \geq \psi(x_1, \bar x_2) + \phi_1(x_1) + \phi_2(\bar x_2) = \psi(\bar x_1, x_2) + \phi_1(\bar x_1) + \phi_2(x_2) \tag{8.4}
\]
for some $x_1$, $x_2$, $\bar x_1$, and $\bar x_2$, with $\bar x_1 \neq x_1$ and $\bar x_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\begin{aligned}
\phi_1(x_1) - \phi_1(\bar x_1) & = \frac{1}{2} \left[ \psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) \right] \\
\phi_2(x_2) - \phi_2(\bar x_2) & = \frac{1}{2} \left[ \psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2) \right] .
\end{aligned}
\]

Substitution back into equation 8.4 yields
\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \bar x_2) - \phi_1(x_1) - \phi_2(\bar x_2) = \frac{1}{2} \left[ \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) - \psi(\bar x_1, x_2) \right] ,
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find
\[
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right] = \frac{1}{2} |\psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0)| = \frac{\omega}{2} ,
\]
from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.


Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \quad \forall_\beta , \tag{8.5}
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)
\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma^2_\alpha \leq 1 .
\]
Substitution into condition 3 yields
\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \geq \sum_{\alpha \supset \beta} 1 - 1 , \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \leq 1 . \tag{8.6}
\]
Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002), for binary pairwise potentials and when strengthened as above, are at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to
\[
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix} .
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other one with "spins down".

In this symmetric problem, the strength of each potential is given by
\[
\omega = 2 \log \left[ \frac{\alpha}{1 - \alpha} \right] \quad \text{and thus} \quad \sigma = 1 - \left( \frac{1 - \alpha}{\alpha} \right)^2 .
\]


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal $3/4$ (not shown). (b) Aperiodic boundary conditions and (c) two loops (left). The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in (b) and $C = 1 - A$ in (c), the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields
\[
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55 .
\]

The critical value that follows from corollary 3 is in this case slightly better:
\[
\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + e^{-1/2}} \approx 0.62 .
\]
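The two critical values can be reproduced with a few lines (a sketch; the formulas are exactly those given in the text):

```python
import math

def omega(a):                   # strength of the uniform ferromagnetic pair potential
    return abs(2 * math.log(a / (1 - a)))

def sigma(a):
    return 1 - math.exp(-omega(a))

# Theorem 4 with the uniform allocation A = 3/4 requires sigma <= 1/3:
a_thm4 = 1 / (1 + math.sqrt(2 / 3))
print(round(a_thm4, 2), round(sigma(a_thm4), 4))    # 0.55 0.3333

# Corollary 3 (each node touches 4 potentials) requires omega < 1:
a_cor3 = 1 / (1 + math.exp(-0.5))
print(round(a_cor3, 2), round(omega(a_cor3), 4))    # 0.62 1.0
```

Both thresholds fall well below the empirical $\alpha_{\text{critical}} = 2/3$, consistent with the conditions being sufficient but far from necessary.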

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

(2minus 2A)σ + 34le 1 and

12σ + A le 1

The optimal choice for A is the one in which both conditions turn out to beidentical In this way we obtain A = 34 yielding

σ le 12

and thus α le 11+radic12

asymp 058

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$\sigma \leq \sqrt{1/2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
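The hand calculation for Figure 3b can be reproduced with a brute-force scan (a sketch; the grid resolution and variable names are ours): for each allocation value A, we take the tightest σ allowed by the two appearances of condition 2, and the best A is the one where both are equally binding.

```python
import math

def sigma_bound(A):
    """Tightest sigma allowed by the two instances of condition 2 for
    Figure 3b:  (2 - 2A) * sigma + 3/4 <= 1  and  sigma / 2 + A <= 1."""
    b1 = (1 - 3 / 4) / (2 - 2 * A)   # from the first condition
    b2 = 2 * (1 - A)                 # from the second condition
    return min(b1, b2)

# Scan A on a fine grid; the best allocation maximizes the allowed sigma.
best_A = max((a / 10000 for a in range(1, 10000)), key=sigma_bound)
best_sigma = sigma_bound(best_A)
best_alpha = 1 / (1 + math.sqrt(1 - best_sigma))

print(best_A)      # close to 3/4
print(best_sigma)  # close to 1/2
print(best_alpha)  # close to 0.58
```

The scan recovers A = 3/4 and σ ≤ 1/2, the values obtained by hand above.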

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_{αβ}) with respect to Q_β. But in fact we need only Ḡ(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) to be convex, which is a weaker requirement. The Hessian of Ḡ(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_{αβ}).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
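The kind of damped, parallel loopy belief propagation used in these simulations can be sketched as follows. This is our own minimal implementation with our own parameter names, not the author's code; we assume the damping amounts to a convex combination of old and new messages, in the spirit of equation 3.9.

```python
import math, random

def damped_bp(omega, step, iters=2000, seed=0):
    """Damped parallel loopy BP on the four-node Boltzmann machine with
    weights w = omega * W, zero thresholds, and pairwise potentials
    Psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj else exp(-w_ij/4).
    Returns the final single-node marginals P_i(x_i = 1)."""
    W = [[0, 1, -1, -1],
         [1, 0, 1, -1],
         [-1, 1, 0, -1],
         [-1, -1, -1, 0]]
    n = 4
    rng = random.Random(seed)
    # m[(i, j)][xj]: message from node i to node j, initialized near uniform
    m = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                p = rng.uniform(0.4, 0.6)
                m[(i, j)] = [p, 1 - p]
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            msg = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    sign = 1.0 if xi == xj else -1.0
                    psi = math.exp(sign * omega * W[i][j] / 4)
                    prod = 1.0
                    for k in range(n):
                        if k != i and k != j:
                            prod *= m[(k, i)][xi]
                    s += psi * prod
                msg.append(s)
            z = msg[0] + msg[1]
            # damped update: convex combination of old and new message
            new[(i, j)] = [(1 - step) * m[(i, j)][x] + step * msg[x] / z
                           for x in (0, 1)]
        m = new
    marginals = []
    for i in range(n):
        b = [1.0, 1.0]
        for k in range(n):
            if k != i:
                for x in (0, 1):
                    b[x] *= m[(k, i)][x]
        marginals.append(b[1] / (b[0] + b[1]))
    return marginals

# For weak weights, BP settles into the trivial fixed point P_i(x_i = 1) = 0.5.
print(damped_bp(omega=0.5, step=0.5))
```

For small ω this settles into the trivial fixed point with all marginals at 0.5; for large ω, varying `step` lets one probe the transition to oscillatory behavior described above.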

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation (horizontal axis, from 0 to 1) and the weight strength (vertical axis, from 3.5 to 6). Simulations on a four-node Boltzmann machine. The insets show the marginal P_1(x_1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.


Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers" for the Bethe free energy, 1 − n_β, with n_β the number of neighboring potentials. The arrows pointing from potentials α to nodes β visualize the allocation matrix A, with A_{αβ} = 1 if there is an arrow and A_{αβ} = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely n_β − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − n_β. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex for the set of constraints if for any set of nodes B we have

$$\sum_{\beta \in B} (1 - n_\beta) + \sum_{\alpha \in \pi(B)} 1 \geq 0, \tag{4.2}$$

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix A_{αβ} satisfying the conditions of equation 4.1. Then for any set B,

$$\sum_{\beta \in B} (n_\beta - 1) \;\leq\; \sum_{\beta \in B} \sum_{\alpha \supset \beta} A_{\alpha\beta} \;\leq\; \sum_{\alpha \in \pi(B)} \sum_{\beta \subset \alpha} A_{\alpha\beta} \;\leq\; \sum_{\alpha \in \pi(B)} 1,$$

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.
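Condition 4.2 can be checked by brute force on small graphs. The following is a sketch; the encoding of a factor graph as a list of factor scopes is our own convention.

```python
from itertools import combinations

def bethe_convexity_condition(n_nodes, factors):
    """Check theorem 2: for every nonempty node set B,
    sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0,
    where n_beta counts the factors containing node beta and pi(B) is the
    set of factors that include at least one node of B."""
    n_beta = [sum(1 for f in factors if b in f) for b in range(n_nodes)]
    for size in range(1, n_nodes + 1):
        for B in combinations(range(n_nodes), size):
            parents = sum(1 for f in factors if any(b in f for b in B))
            if sum(1 - n_beta[b] for b in B) + parents < 0:
                return False
    return True

# A chain (tree): pairwise factors {0,1}, {1,2}, {2,3}; the condition holds.
print(bethe_convexity_condition(4, [{0, 1}, {1, 2}, {2, 3}]))
# Two loops sharing an edge (triangle 0-1-2 plus edges 1-3, 2-3); it fails.
print(bethe_convexity_condition(4, [{0, 1}, {1, 2}, {2, 0}, {1, 3}, {2, 3}]))
```

A single loop (e.g., one triangle) still passes, matching the statement below that the conditions fail only once two connected loops appear.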

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials Ψ_α(X_α) do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψ_α(X_α) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Q_α(x_β) = Q_β(x_β) to write the Bethe free energy in the "more convex" form

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta), \tag{5.1}$$

where the allocation matrix A_{αβ} can be any matrix that satisfies

$$\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Q_α alone. This then yields

$$\begin{aligned} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = &-\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) \\ &- \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) \\ &+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] \\ &+ \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \end{aligned} \tag{5.3}$$

³ We would like to conjecture that this is not possible: the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint Q_β(x_β) = Q_α(x_β), as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to Q_β. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in {Q_α, Q_β}, they are convex in Q_α and Q_β separately. Therefore, we can interchange the minimum over the pseudomarginals Q_α and the maximum over the Lagrange multipliers, as long as we leave the minimum over Q_β as the final operation:⁴

$$\min_{Q_\alpha, Q_\beta}\; \max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\; \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\; \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).$$

Rewriting

$$\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \bar{\lambda}_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta),$$

with

$$\bar{\lambda}_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, \lambda_{\alpha'\beta}(x_\beta),$$

we can easily solve for the minimum with respect to Q_α:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar{\lambda}_{\alpha\beta}(x_\beta) \right\} \right]. \tag{5.4}$$

⁴ In principle, we could also first take the minimum over Q_β and leave the minimum over Q_α, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$\begin{aligned} G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) &\equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\ &= -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar{\lambda}_{\alpha\beta}(x_\beta) \right\} \right] \\ &\quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \end{aligned} \tag{5.5}$$

Next, we find for the maximum with respect to λ_α:

$$\exp\left[1 - \lambda^*_\alpha\right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \left\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar{\lambda}_{\alpha\beta}(x_\beta) \right\} \right] \equiv Z^*_\alpha, \tag{5.6}$$

where we have to keep in mind that Z^*_α by itself, like Q^*_α, is a function of the remaining pseudomarginals Q_β and Lagrange multipliers λ_{αβ}. Substituting this solution into the dual, we arrive at

$$G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.7}$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Q_α, has a unique minimum in Q_α (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ^*_α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λ_{αβ}, and then the minimum over the remaining pseudomarginals Q_β. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in


Q_β. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Q_α, Q_β} implies convexity of the dual, equation 5.7, in Q_β.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y*(x) ≡ argmin_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \geq 2 f\!\left(x, \tfrac{1}{2}\left[y^*(x+\delta) + y^*(x-\delta)\right]\right) \geq 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y*(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Q_β when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Q_α, Q_β}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λ_{αβ}, λ_α}. Next, we note that the maximum of a convex or concave function over its maximizing parameters is again convex: with y*(x) ≡ argmax_y f(x, y),

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \geq f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \geq 2 f(x, y^*(x)),$$

where the first inequality follows from y*(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Q_β.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψ_α(X_α) ≤ Ψ_max for all α and X_α, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function Ḡ(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) is bounded from below for a particular choice of A_{αβ} satisfying equation 5.2. Considering A_{αβ} = (n_β − 1)/n_β, we then have

$$\begin{aligned} \bar{G}(Q_\beta) &\geq -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\ &\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta) + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\ &\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log\left[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \right] + \sum_\beta (n_\beta - 1) \left[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\ &\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log\left[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \right], \end{aligned}$$

where the first inequality follows by substituting the choice λ_{αβ}(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_{αβ}); the second from the concavity of the function y^{(n_β−1)/n_β}; and the third from the upper bound on the potentials. The last step uses −log z + z − 1 ≥ 0 for z > 0.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},$$


we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both β and β′ should be a subset of α, and with the conventions Q^*_α(x_β, x_β) = Q^*_α(x_β) and Q^*_α(x_β, x'_β) = 0 if x_β ≠ x'_β. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

$$H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha \supset \beta} A_{\alpha\beta}(1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)},$$

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta,$$

where δ_{x_β, x'_β} = 1 if and only if x_β = x'_β. Here it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_{αβ} and pseudomarginals Q_β.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Q_β, λ_{αβ}}, that is, conditions that guarantee

$$K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \geq 0$$

for any choice of the "vector" S with elements S_β(x_β). Straightforward manipulations yield

$$\begin{aligned} \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \hspace{3cm} & (K) \\ = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) & (K_1) \\ + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) & (K_2) \\ - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), & (K_3) \end{aligned}$$

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta) \right]^2 \geq 0$$

and⁵

$$\begin{aligned} K_3 &= \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'}) \left\{ \frac{1}{2}\left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \\ &\geq -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \end{aligned} \tag{6.1}$$

we have

$$K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).$$

That is, sufficient conditions for K to be nonnegative are

$$A_{\alpha\beta} \geq 0 \quad \forall_{\alpha,\, \beta \subset \alpha} \qquad \text{and} \qquad \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \quad \forall_\alpha,$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction Ψ_α(X_α) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q^*_α(X_α) factorizes:⁶

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.$$

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal P_exact(X_α) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving α in K_3 cancel with those in K_2, which is most easily seen when we combine K_2 and K_3 in a different way:

$$\begin{aligned} K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) & (K_2) \\ &\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}). & (K_3) \end{aligned}$$

This leaves us with the weaker requirement (from K_1) A_{αβ}(1 − A_{αβ}) ≥ 0 for all β ⊂ α. The best choice is then to take A_{αβ} = 1, which turns condition 3 of equation 4.1 into

$$\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1.$$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials n_β by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K_2 and K_3, where, since K_2 ≥ 0, we will concentrate on K_3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K_3 such that we can again combine this bound with the (positive) term K_1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}}\; \forall_{x_\beta, x'_{\beta'}}, \tag{7.1}$$

with the "strength" σ_α a function of the potential ψ_α(X_α) ≡ log Ψ_α(X_α) only:

$$\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \hat{X}_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat{X}_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat{X}_{\alpha \setminus \beta}, x_\beta) \right], \tag{7.2}$$

where n_α ≡ ∑_{β⊂α} 1.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \left[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})\, \mu_\beta(x_\beta) \right]} = \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \left[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \right]}, \tag{7.3}$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times the incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over the messages μ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n-1) \log\left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right] - \sum_{\beta=1}^n \log\left[ \sum_{X_{\setminus \beta}} \Psi(X_{\setminus \beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \right]$$

with respect to the messages μ, under the constraints ∑_{x_β} μ_β(x_β) = 1 for all β and μ_β(x_β) ≥ 0 for all β and x_β, occurs at an extreme point μ_β(x_β) = δ_{x_β, x̂_β} for some x̂_β to be found.

Proof. Let us consider optimizing the message μ_1(x_1) with fixed messages μ_β(x_β) for β > 1. The first and second derivatives are easily found to obey

$$\frac{\partial V}{\partial \mu_1(x_1)} = (n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta),$$

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = -(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta),$$


where

$$Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.$$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ_1(x_1) > 0 for two or more values of x_1. At such an extremum, the first derivative should obey

$$(n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda,$$

with λ a Lagrange multiplier implementing the constraint ∑_{x_1} μ_1(x_1) = 1. Summing over x_1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix of second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = \frac{1}{2(n-1)} \sum_{\beta \neq 1} \sum_{\substack{\beta' \neq 1 \\ \beta' \neq \beta}} \left[ Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'}) \right] \left[ Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'}) \right],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μ_β(x_β), β > 1, it follows by induction that the maximum with respect to all μ_β(x_β) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by a maximization over values X̂:

\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\tilde X} \frac{\Psi(X) \left[\Psi(\tilde X)\right]^{n-1}}{\prod_\beta \Psi(\bar{\tilde X}_\beta, x_\beta)},
\]

with $(\bar{\tilde X}_\beta, x_\beta)$ shorthand for $\tilde X$ with its $\beta$th component replaced by $x_\beta$.

Next we take the maximum over $X$ as well, and define the "strength" $\sigma$ to be used in equation 7.1 through

\[
\frac{1}{1-\sigma} \equiv \max_{X,\mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\tilde X} \frac{\Psi(X) \left[\Psi(\tilde X)\right]^{n-1}}{\prod_\beta \Psi(\bar{\tilde X}_\beta, x_\beta)}. \tag{7.4}
\]

2402 T Heskes

The inequality 7.1 then follows by summing out all variables except $x_\beta$ and $x_{\beta'}$ in

\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma\, Q^*(X).
\]

The form of equation 7.2 then follows by rewriting equation 7.4 as

\[
\omega \equiv -\log(1-\sigma) = \max_{X,\tilde X} W(X, \tilde X) \quad \text{with} \quad W(X, \tilde X) = \psi(X) + (n-1)\psi(\tilde X) - \sum_\beta \psi(\bar{\tilde X}_\beta, x_\beta),
\]

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes, that is,

if $\tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta)$, then $\omega(\tilde\Psi) = \omega(\Psi)$ for any choice of $\mu$.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap, and dividing the other by the same term, does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\tilde X$ that differ in fewer than two nodes. To see this, consider combinations that share the values $\bar X_{12}$ of all nodes except nodes 1 and 2:

\[
W(x_1, x_2, \bar X_{12}; \tilde x_1, \tilde x_2, \bar X_{12}) = \psi(x_1, x_2, \bar X_{12}) + \psi(\tilde x_1, \tilde x_2, \bar X_{12}) - \psi(x_1, \tilde x_2, \bar X_{12}) - \psi(\tilde x_1, x_2, \bar X_{12}) = -W(x_1, \tilde x_2, \bar X_{12}; \tilde x_1, x_2, \bar X_{12}).
\]

If now also $\tilde x_2 = x_2$, we get $W(x_1, x_2, \bar X_{12}; \tilde x_1, x_2, \bar X_{12}) = -W(x_1, x_2, \bar X_{12}; \tilde x_1, x_2, \bar X_{12}) = 0$: combinations differing in a single node yield zero. Furthermore, if $W(x_1, x_2, \bar X_{12}; \tilde x_1, \tilde x_2, \bar X_{12}) \leq 0$, then it must be that $W(x_1, \tilde x_2, \bar X_{12}; \tilde x_1, x_2, \bar X_{12}) \geq 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \leq \omega < \infty$ and $0 \leq \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

\[
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \tag{7.5}
\]

Any pairwise binary potential can be written as a Boltzmann factor,

\[
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
\]

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
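These properties are easy to exercise in code. The sketch below is an illustration, not code from the article: `strength_omega` is a hypothetical helper that computes equation 7.2 by brute-force enumeration (assuming all nodes have the same number of states), and the checks confirm $\omega = |w|$ for a 0/1-coded pairwise binary Boltzmann factor, independent of the thresholds, as well as the $1/T$ temperature scaling.

```python
import itertools
import math
import numpy as np

def strength_omega(log_psi):
    """Brute-force strength omega = max_{X, Xt} W(X, Xt) of equation 7.2.
    log_psi: n-dimensional array of log-potential values psi(X),
    assuming every node has the same number of states."""
    n = log_psi.ndim
    states = list(itertools.product(range(log_psi.shape[0]), repeat=n))
    best = 0.0  # W vanishes whenever X and Xt differ in fewer than two nodes
    for X in states:
        for Xt in states:
            val = log_psi[X] + (n - 1) * log_psi[Xt]
            for b in range(n):  # Xt with its b-th component replaced by X[b]
                Xb = list(Xt)
                Xb[b] = X[b]
                val -= log_psi[tuple(Xb)]
            best = max(best, val)
    return best

# pairwise binary Boltzmann factor: psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = -1.7, 0.4, 2.3
log_psi = np.array([[0.0, th2], [th1, w + th1 + th2]])
omega = strength_omega(log_psi)
assert abs(omega - abs(w)) < 1e-12  # omega = |w|, independent of the thresholds
sigma = 1.0 - math.exp(-omega)
assert abs(strength_omega(log_psi / 2.0) - omega / 2.0) < 1e-12  # omega(T) = omega(1)/T
```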

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \geq 0 \quad \forall \alpha,\ \beta \subset \alpha$ (positivity)

2. $(1-\sigma_\alpha) \max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \leq 1 \quad \forall \alpha$ (sufficient amount of resources)

3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \geq n_\beta - 1 \quad \forall \beta$ (sufficient compensation) $\qquad$ (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \geq 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

\[
K_3 \geq -\sum_\alpha \sum_{\beta,\beta'\subset\alpha,\ \beta'\neq\beta} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'})
\]
\[
\geq -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left[\sum_{\beta'\subset\alpha,\ \beta'\neq\beta} A_{\alpha\beta'}\right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\]

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \geq 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

\[
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \left[1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'}\right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]

This lower bound is nonnegative for any choice of $R_\beta(x_\beta)$ if

\[
(1-\sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta'\subset\alpha} A_{\alpha\beta'} \leq 1 \quad \forall \alpha,\ \beta\subset\alpha,
\]

which in combination with $A_{\alpha\beta} \geq 0$ and $\sigma_\alpha \leq 1$ yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
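For a concrete graph and given strengths, verifying a candidate allocation matrix against conditions 8.1 is mechanical (finding one is a linear feasibility problem, since condition 2 can be imposed per node $\beta \subset \alpha$). The sketch below is illustrative, not code from the article; `theorem4_holds` is a hypothetical helper, exercised on a $3 \times 3$ toroidal grid of pairwise potentials with the uniform choice $A_{\alpha\beta} = 3/4$.

```python
def theorem4_holds(factors, sigma, A, tol=1e-12):
    """Check conditions (8.1) for a candidate allocation matrix.
    factors: list of node-index tuples, one per potential alpha
    sigma:   list of strengths sigma_alpha (equation 7.2)
    A:       dict (alpha, beta) -> A_{alpha beta}"""
    nodes = sorted({b for f in factors for b in f})
    n_b = {b: sum(b in f for f in factors) for b in nodes}
    for a, f in enumerate(factors):
        alloc = [A[(a, b)] for b in f]
        if min(alloc) < -tol:                                              # condition 1
            return False
        if (1 - sigma[a]) * max(alloc) + sigma[a] * sum(alloc) > 1 + tol:  # condition 2
            return False
    for b in nodes:                                                        # condition 3
        if sum(A[(a, b)] for a, f in enumerate(factors) if b in f) < n_b[b] - 1 - tol:
            return False
    return True

# 3x3 toroidal Ising grid: 9 nodes, 18 pairwise potentials, n_beta = 4 everywhere
edges = []
for i in range(3):
    for j in range(3):
        edges.append((3 * i + j, 3 * i + (j + 1) % 3))    # horizontal, wrapping around
        edges.append((3 * i + j, 3 * ((i + 1) % 3) + j))  # vertical, wrapping around
A = {(a, b): 0.75 for a, f in enumerate(edges) for b in f}
assert theorem4_holds(edges, [1 / 3] * 18, A)    # sigma = 1/3 is just feasible
assert not theorem4_holds(edges, [0.4] * 18, A)  # larger sigma violates condition 2
```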

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall \beta. \tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta} \omega_\alpha < 4 \quad \forall \beta, \tag{8.3}
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha} \phi_{\alpha\beta}(x_\beta),
\]

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1,\phi_2} \left[ \max_{x_1,x_2} \tilde\psi(x_1,x_2) - \min_{x_1,x_2} \tilde\psi(x_1,x_2) \right]
= \min_{\phi_1,\phi_2} \left[ \max_{x_1,x_2} \left[ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) \right] - \min_{x_1,x_2} \left[ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) \right] \right].
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1,x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that

\[
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\tilde x_1,\tilde x_2) + \phi_1(\tilde x_1) + \phi_2(\tilde x_2) \geq \psi(x_1,\tilde x_2) + \phi_1(x_1) + \phi_2(\tilde x_2) = \psi(\tilde x_1,x_2) + \phi_1(\tilde x_1) + \phi_2(x_2) \tag{8.4}
\]

for some $x_1$, $x_2$, $\tilde x_1$, and $\tilde x_2$ with $\tilde x_1 \neq x_1$ and $\tilde x_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\phi_1(x_1) - \phi_1(\tilde x_1) = \frac{1}{2}\left[ \psi(\tilde x_1,x_2) - \psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(x_1,\tilde x_2) \right]
\]
\[
\phi_2(x_2) - \phi_2(\tilde x_2) = \frac{1}{2}\left[ \psi(x_1,\tilde x_2) - \psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(\tilde x_1,x_2) \right].
\]

Substitution back into equation 8.4 yields

\[
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\tilde x_2) - \phi_1(x_1) - \phi_2(\tilde x_2) = \frac{1}{2}\left[ \psi(x_1,x_2) + \psi(\tilde x_1,\tilde x_2) - \psi(x_1,\tilde x_2) - \psi(\tilde x_1,x_2) \right],
\]

which has to be nonnegative. Of the four possible combinations of $\{x_1, \tilde x_1\}$ and $\{x_2, \tilde x_2\}$, two are valid and yield the same positive gap; the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1,\phi_2} \left[ \max_{x_1,x_2} \tilde\psi(x_1,x_2) - \min_{x_1,x_2} \tilde\psi(x_1,x_2) \right] = \frac{1}{2} \left| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \right| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next we derive the following weaker corollary of theorem 4.


Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha\supset\beta} \omega_\alpha \leq 1 \quad \forall \beta, \tag{8.5}
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)

\[
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha\supset\beta} (1-\sigma_\alpha) \geq \sum_{\alpha\supset\beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha\supset\beta} \sigma_\alpha \leq 1. \tag{8.6}
\]

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2\log\left[\frac{\alpha}{1-\alpha}\right] \quad \text{and thus} \quad \sigma = 1 - \left(\frac{1-\alpha}{\alpha}\right)^2.
\]


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops on the left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in (b) and $C = 1 - A$ in (c), the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1+\sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1+e^{-1/2}} \approx 0.62.
\]

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2-2A)\sigma + \frac{3}{4} \leq 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \leq 1.
\]

The optimal choice for $A$ is the one in which both conditions turn out to be identical. In this way, we obtain $A = 3/4$, yielding

\[
\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1+\sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

\[
\sigma \leq \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,
\]

better than the $\alpha < 0.62$ from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.
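The thresholds quoted in this section follow from inverting $\sigma(\alpha) = 1 - ((1-\alpha)/\alpha)^2$ for $\alpha > 1/2$. A quick arithmetic check (illustrative only; the helper names are ours):

```python
import math

def sigma_of(alpha):
    """Strength sigma of the symmetric ferromagnetic potential."""
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

def alpha_at_sigma(s):
    """Invert sigma(alpha) = s on alpha > 1/2."""
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

assert abs(alpha_at_sigma(1 / 3) - 0.5505) < 5e-4             # toroidal grid, theorem 4
assert abs(alpha_at_sigma(1 / 2) - 0.5858) < 5e-4             # aperiodic grid, theorem 4
assert abs(alpha_at_sigma(math.sqrt(1 / 2)) - 0.6489) < 5e-4  # two-loop graph, theorem 4
assert abs(1 / (1 + math.exp(-0.5)) - 0.6225) < 5e-4          # omega < 1, corollary 3
```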

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions, and in that sense should be seen as no more than a first step. These conditions have the following positive features:

• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter, and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we only need $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
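These simulations can be reproduced qualitatively in a few lines. The sketch below is an independent reimplementation, not the article's code; the linear-domain damping convention, the parallel update schedule, and the random initialization are assumptions on our part.

```python
import numpy as np

W0 = np.array([[0, 1, -1, -1],
               [1, 0, 1, -1],
               [-1, 1, 0, -1],
               [-1, -1, -1, 0]], dtype=float)

def run_lbp(omega, step, iters=3000, seed=0):
    """Damped parallel loopy BP on the four-node Boltzmann machine above.
    Damping (an assumption): new message = (1 - step) * old + step * update.
    Returns the trajectory of P_1(x_1 = 1) and the final maximum message change."""
    w = omega * W0
    n = w.shape[0]
    psi = {(i, j): np.array([[np.exp(w[i, j] / 4), np.exp(-w[i, j] / 4)],
                             [np.exp(-w[i, j] / 4), np.exp(w[i, j] / 4)]])
           for i in range(n) for j in range(n) if i != j}
    rng = np.random.default_rng(seed)
    m = {(i, j): rng.dirichlet(np.ones(2)) for i in range(n) for j in range(n) if i != j}
    traj = []
    for _ in range(iters):
        new = {}
        for i, j in m:
            prod = np.ones(2)
            for k in range(n):
                if k != i and k != j:
                    prod *= m[(k, i)]      # messages into i, except the one from j
            upd = psi[(i, j)].T @ prod     # sum over x_i of Psi(x_i, x_j) * incoming
            upd /= upd.sum()
            new[(i, j)] = (1 - step) * m[(i, j)] + step * upd
        diff = max(np.abs(new[e] - m[e]).max() for e in m)
        m = new
        belief = np.ones(2)
        for k in range(1, n):
            belief *= m[(k, 0)]
        traj.append(belief[1] / belief.sum())
    return np.array(traj), diff

traj_c, diff_c = run_lbp(omega=4.0, step=0.2)  # moderate weights, damped
traj_n, diff_n = run_lbp(omega=8.0, step=1.0)  # large weights, undamped
```

With the first setting, the messages settle into the trivial fixed point with $P_1(x_1 = 1) = 0.5$; with the second, they keep oscillating, consistent with the behavior described above.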

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength, for simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes $B$ we have

\[
\sum_{\beta\in B} (1 - n_\beta) + \sum_{\alpha\in\pi(B)} 1 \geq 0, \tag{4.2}
\]

where $\pi(B) \equiv \{\alpha : \exists \beta \in B,\ \beta \subset \alpha\}$ denotes the "parent" set of $B$, that is, those potential subsets that include at least one node in $B$.
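For small graphs, condition 4.2 can be checked by brute force over all node subsets $B$. The sketch below is illustrative (exponential in the number of nodes, so only for toy graphs; `bethe_convex` is a name of our choosing) and exhibits the failure mode of two connected loops:

```python
import itertools

def bethe_convex(factors):
    """Brute-force check of condition (4.2): for every nonempty node set B,
    sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0."""
    nodes = sorted({b for f in factors for b in f})
    n_b = {b: sum(b in f for f in factors) for b in nodes}
    for r in range(1, len(nodes) + 1):
        for B in itertools.combinations(nodes, r):
            parents = sum(1 for f in factors if any(b in f for b in B))
            if sum(1 - n_b[b] for b in B) + parents < 0:
                return False
    return True

# a single loop of four nodes: condition 4.2 holds (with equality for B = all nodes)
assert bethe_convex([(0, 1), (1, 2), (2, 3), (3, 0)])
# two loops sharing an edge: condition 4.2 fails
assert not bethe_convex([(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)])
```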

Proposition 1. The conditions in theorems 1 and 2 are equivalent.

Proof. Let us first suppose that there does exist an allocation matrix $A_{\alpha\beta}$ satisfying the conditions of equation 4.1. Then for any set $B$,

\[
\sum_{\beta\in B} (n_\beta - 1) \leq \sum_{\beta\in B} \sum_{\alpha\supset\beta} A_{\alpha\beta} \leq \sum_{\alpha\in\pi(B)} \sum_{\beta\subset\alpha} A_{\alpha\beta} \leq \sum_{\alpha\in\pi(B)} 1,
\]

where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2.

Next, let us suppose that the conditions in theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for $B$ the set of all nodes within such a component.

Since validity implies validity and violation implies violation, the conditions must be equivalent.

Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by "unwrapping" the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy.

In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops.

The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials $\Psi_\alpha(X_\alpha)$ do not play any role. These potentials appear only in the energy term, which is linear in the pseudomarginals and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

\[
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta \sum_{\alpha\supset\beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta), \tag{5.1}
\]

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

\[
\sum_{\alpha\supset\beta} A_{\alpha\beta} = n_\beta - 1. \tag{5.2}
\]

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

\[
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\, \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta)
\]
\[
+ \sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha'\supset\beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] + \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right] + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.3}
\]

3 We would like to conjecture that this is not possible: the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:4

\[
\min_{Q_\alpha, Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).
\]

Rewriting

\[
\sum_\beta \sum_{\alpha\supset\beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ \frac{1}{n_\beta - 1} \sum_{\alpha'\supset\beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] = -\sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} \tilde\lambda_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta),
\]

with

\[
\tilde\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1} \sum_{\alpha'\supset\beta} \lambda_{\alpha'\beta}(x_\beta),
\]

we can easily solve for the minimum with respect to $Q_\alpha$:

we can easily solve for the minimum with respect to Qα

Qlowastα(Xα) = α(Xα) exp

[λα minus 1+

sumβsubα

Aαβ log Qβ(xβ)+ λαβ(xβ)

] (54)

4 In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

\[
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta\subset\alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \tilde\lambda_{\alpha\beta}(x_\beta) \right) \right] + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.5}
\]

Next, we find for the maximum with respect to $\lambda_\alpha$:

\[
\exp\left[ 1 - \lambda^*_\alpha \right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta\subset\alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \tilde\lambda_{\alpha\beta}(x_\beta) \right) \right] \equiv Z^*_\alpha, \tag{5.6}
\]

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

\[
G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \tag{5.7}
\]

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$, and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \operatorname{argmin}_y f(x, y)$,

\[
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \geq 2 f\left(x, \tfrac{1}{2}\left[y^*(x+\delta) + y^*(x-\delta)\right]\right) \geq 2 f(x, y^*(x)),
\]

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$.

where the first inequality follows from the convexity of f in x yand the sec-ond inequality from ylowast(x) being the unique minimum of f (x y) Thereforethe dual equation 55 is convex in Qβ when the Lagrangian equation 53and thus the Bethe free energy equation 51 is convex in QαQβ Further-more from the duality theorem the dual equation 55 is concave in theLagrange multipliers λαβ λα Next we note that the maximum of a con-vex or concave function over its maximizing parameters is again convexwith ylowast(x) equiv argmax

yf (x y)

f (x+ δ ylowast(x+ δ))+ f (xminus δ ylowast(xminus δ)) ge f (x+ δ ylowast(x))+ f (xminus δ ylowast(x))ge 2 f (x ylowast(x))

where the first inequality follows from ylowast(xplusmnδ) being the unique maximumof f (x plusmn δ y) and the second inequality from the convexity of f (x y) in xHence the dual equation 57 must still be convex in Qβ
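Both inequalities are easy to probe numerically. A minimal sketch (the quadratic $f$ and the grid-based partial minimization are our own illustrative choices, not from the text):

```python
import numpy as np

# f(x, y) = x^2 + x*y + y^2 is jointly convex: its Hessian [[2, 1], [1, 2]]
# is positive definite.
def f(x, y):
    return x**2 + x*y + y**2

def g(x):
    """Partial minimum over y, approximated on a grid: g(x) = min_y f(x, y)."""
    ys = np.linspace(-5.0, 5.0, 2001)
    return float(np.min(f(x, ys)))

# Midpoint convexity of g at a few points, mirroring the first chain of
# inequalities in the proof:
for x, d in [(0.3, 0.2), (-1.0, 0.5), (1.5, 1.0)]:
    assert g(x + d) + g(x - d) >= 2 * g(x) - 1e-6
```

For this $f$, the exact partial minimum is $g(x) = 3x^2/4$, so the grid approximation can be checked against a closed form as well.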

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof leave a little room that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = (n_\beta - 1)/n_\beta$, we then have

$$\begin{aligned}
G(Q_\beta) &\ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\Bigl[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \Bigr] + \sum_\beta (n_\beta - 1) \Bigl[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Bigr] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) Q_\beta(x_\beta) + \sum_\beta (n_\beta - 1) \Bigl[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Bigr] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} + \sum_\beta (n_\beta - 1) \Bigl[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Bigr] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max},
\end{aligned}$$

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$; the second from the concavity of the function $y^{(n_\beta - 1)/n_\beta}$; the third from the upper bound on the potentials; and the last since $y - 1 \ge \log y$ makes the final bracket nonnegative.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},$$

we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the convention $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \neq x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

$$\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) &= \sum_{\alpha \supset \beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)} \\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) &= -\sum_{\alpha \supset \{\beta, \beta'\}} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta,
\end{aligned}$$

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta) H_{\beta\beta'}(x_\beta, x'_{\beta'}) S_{\beta'}(x'_{\beta'}) \ge 0,$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$\begin{aligned}
K &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) & (K_1) \\
&\quad + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) & (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) & (K_3)
\end{aligned}$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha \Bigl[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} Q^*_\alpha(x_\beta) R_\beta(x_\beta) \Bigr]^2 \ge 0$$

and⁵

$$\begin{aligned}
K_3 &= \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) \Bigl\{ \tfrac{1}{2} \bigl[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \bigr]^2 - \tfrac{1}{2} R^2_\beta(x_\beta) - \tfrac{1}{2} R^2_{\beta'}(x'_{\beta'}) \Bigr\} \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigl( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \Bigr) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta), \qquad (6.1)
\end{aligned}$$

we have

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigl( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \Bigr) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta).$$

That is, sufficient conditions for $K$ to be nonnegative are

$$A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha,$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.$$

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

$$\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_\beta) R_\beta(x_\beta) R_\beta(x'_\beta) & (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \bigl[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \bigr] R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) & (K_3)
\end{aligned}$$

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.$$

The net effect is equivalent to ignoring the interaction: the number of neighboring potentials $n_\beta$ is reduced by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \;\forall_{x_\beta, x'_{\beta'}}, \tag{7.1}$$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \hat X_\alpha} \Bigl[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat X_{\alpha \setminus \beta}, x_\beta) \Bigr], \tag{7.2}$$

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset and the product of its single-node marginals:

$$\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \Bigl[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Bigr]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \Bigl[ \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \Bigr] \mu_\beta(x_\beta)} = \frac{\Psi(X) \Bigl[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Bigr]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})}, \tag{7.3}$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over the messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n-1) \log \Bigl[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \Bigr] - \sum_{\beta=1}^n \log \Bigl[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \Bigr]$$

with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \hat x_\beta}$, for some $\hat x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey, up to positive factors,

$$\frac{\partial V}{\partial \mu_1(x_1)} \propto (n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)$$

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto -(n-1)\, Q(x_1) Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta),$$

where

$$Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$(n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda,$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). Substituting $(n-1) Q(x_1) = \sum_{\beta \neq 1} Q(x_1|x^*_\beta)$, the matrix of second derivatives at such an extremum obeys

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto \sum_{\beta \neq 1} \sum_{\substack{\beta' \neq 1 \\ \beta' \neq \beta}} \bigl[ Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'}) \bigr] \bigl[ Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'}) \bigr],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\hat X$:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat X} \frac{\Psi(X) \bigl[ \Psi(\hat X) \bigr]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)}.$$

Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \hat X} \frac{\Psi(X) \bigl[ \Psi(\hat X) \bigr]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)}. \tag{7.4}$$

The inequality 7.1 then follows by summing out $X_{\setminus \{\beta, \beta'\}}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1 - \sigma) = \max_{X, \hat X} W(X, \hat X) \quad \text{with} \quad W(X, \hat X) = \psi(X) + (n-1)\, \psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta),$$

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting.

bull The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

bull To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\hat X$ that differ in fewer than two nodes. To see this, consider combinations that differ only in the first two nodes:

$$W(x_1, x_2, X_{\setminus 12};\, \hat x_1, \hat x_2, X_{\setminus 12}) = \psi(x_1, x_2, X_{\setminus 12}) + \psi(\hat x_1, \hat x_2, X_{\setminus 12}) - \psi(x_1, \hat x_2, X_{\setminus 12}) - \psi(\hat x_1, x_2, X_{\setminus 12}) = -W(x_1, \hat x_2, X_{\setminus 12};\, \hat x_1, x_2, X_{\setminus 12}).$$

If now also $x_2 = \hat x_2$, we get $W(x_1, x_2, X_{\setminus 12};\, \hat x_1, x_2, X_{\setminus 12}) = -W(x_1, x_2, X_{\setminus 12};\, \hat x_1, x_2, X_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, X_{\setminus 12};\, \hat x_1, \hat x_2, X_{\setminus 12}) \le 0$, then it must be that $W(x_1, \hat x_2, X_{\setminus 12};\, \hat x_1, x_2, X_{\setminus 12}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

bull Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


bull With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1|\,|x_2|\,(|x_1| - 1)(|x_2| - 1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \tag{7.5}$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

bull In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

bull Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
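The definition in equation 7.2 is directly computable by brute-force enumeration, feasible for small potentials. A minimal sketch (the function names and the loop structure are our own; the check $\omega = |w|$ for a pairwise binary Boltzmann factor is from the text):

```python
import itertools
import math
import numpy as np

def strength_omega(psi):
    """omega of a log-potential psi, an n-dimensional array psi[x_1, ..., x_n]
    (equation 7.2): max over X, Xhat of
    psi(X) + (n - 1) psi(Xhat) - sum_beta psi(Xhat with node beta set to x_beta)."""
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    best = 0.0  # omega is nonnegative: X = Xhat gives W = 0
    for X in itertools.product(*states):
        for Xh in itertools.product(*states):
            W = psi[X] + (n - 1) * psi[Xh]
            for b in range(n):
                Xmix = Xh[:b] + (X[b],) + Xh[b + 1:]  # Xhat with node b from X
                W -= psi[Xmix]
            best = max(best, W)
    return best

def strength_sigma(psi):
    return 1.0 - math.exp(-strength_omega(psi))

# Pairwise binary Boltzmann factor psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2:
# its strength should be |w|, independent of the thresholds.
w, th1, th2 = 1.3, 0.7, -2.1
psi = np.array([[w * a * b + th1 * a + th2 * b for b in (0, 1)] for a in (0, 1)])
assert abs(strength_omega(psi) - abs(w)) < 1e-12
```

The enumeration does not exploit the symmetry reductions mentioned above, so it scales as the squared number of joint states.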

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha, \beta \subset \alpha}$ (positivity),

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), $\qquad (8.1)$

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$\begin{aligned}
K_3 &\ge -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) \\
&\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigl( \sum_{\substack{\beta' \subset \alpha \\ \beta' \neq \beta}} A_{\alpha\beta'} \Bigr) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta),
\end{aligned}$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigl[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \Bigr] Q^*_\alpha(x_\beta) R^2_\beta(x_\beta).$$

Nonnegativity of the bracketed terms requires

$$(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha, \beta \subset \alpha},$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, in which there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \Bigl( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Bigr) < 2 \quad \forall_\beta. \tag{8.2}$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$\min_{\phi_1, \phi_2} \Bigl[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \Bigr] = \min_{\phi_1, \phi_2} \Bigl[ \max_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)] - \min_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)] \Bigr].$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, those that yield the smallest gap, are such that

$$\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat x_1, \hat x_2) + \phi_1(\hat x_1) + \phi_2(\hat x_2) \ge \psi(x_1, \hat x_2) + \phi_1(x_1) + \phi_2(\hat x_2) = \psi(\hat x_1, x_2) + \phi_1(\hat x_1) + \phi_2(x_2), \tag{8.4}$$

for some $x_1$, $x_2$, $\hat x_1$, and $\hat x_2$ with $\hat x_1 \neq x_1$ and $\hat x_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$\begin{aligned}
\phi_1(x_1) - \phi_1(\hat x_1) &= \tfrac{1}{2} \bigl[ \psi(\hat x_1, x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(x_1, \hat x_2) \bigr] \\
\phi_2(x_2) - \phi_2(\hat x_2) &= \tfrac{1}{2} \bigl[ \psi(x_1, \hat x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(\hat x_1, x_2) \bigr].
\end{aligned}$$

Substitution back into equation 8.4 yields

$$\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \hat x_2) - \phi_1(x_1) - \phi_2(\hat x_2) = \tfrac{1}{2} \bigl[ \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(x_1, \hat x_2) - \psi(\hat x_1, x_2) \bigr],$$

which has to be nonnegative. Of all four possible combinations, two are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\phi_1, \phi_2} \Bigl[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \Bigr] = \tfrac{1}{2} \bigl| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \bigr| = \frac{\omega}{2},$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next we derive the following weaker corollary of theorem 4.

Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \tag{8.5}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha(1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \tag{8.6}$$

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6, and the corollary follows.

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.
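The union of the two corollaries can be sketched as a per-graph check (the function and its dictionary input are our own illustration):

```python
def certified_unique(incident_omegas, pairwise_binary=True):
    """Certify uniqueness via corollary 3 (sum of omegas < 4 at every node,
    binary pairwise potentials only) or corollary 4 (sum of omegas <= 1 at
    every node, any pairwise potentials).
    incident_omegas: dict node -> list of strengths omega_alpha of the
    potentials containing that node."""
    c3 = pairwise_binary and all(sum(ws) < 4 for ws in incident_omegas.values())
    c4 = all(sum(ws) <= 1 for ws in incident_omegas.values())
    return c3 or c4

# Three incident edges of strength 0.9 at a node: corollary 3 certifies
# uniqueness (2.7 < 4); corollary 4 alone would not (2.7 > 1).
assert certified_unique({0: [0.9, 0.9, 0.9]})
assert not certified_unique({0: [0.9, 0.9, 0.9]}, pairwise_binary=False)
```

For tree-like structures with few potentials per node, corollary 4 (or theorem 4 itself) takes over, as the illustration in the next section shows.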

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other with "spins down."

In this symmetric problem, the strength of each potential is given by

$$\omega = 2 \log\Bigl[ \frac{\alpha}{1-\alpha} \Bigr] \quad \text{and thus} \quad \sigma = 1 - \Bigl( \frac{1-\alpha}{\alpha} \Bigr)^2.$$

Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal $3/4$ (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad \text{and thus} \quad \alpha \le \frac{1}{1 + e^{-1/2}} \approx 0.62.$$

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2 - 2A)\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \le 1.$$

The optimal choice for $A$ is the one for which both conditions turn out to be identical. In this way we obtain $A = 3/4$, yielding

$$\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

$$\sigma \le \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,$$

better than the $\alpha < 0.62$ from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.
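The critical values quoted in this section follow by inverting the strength formulas; a quick numerical check (function names are ours):

```python
import math

# sigma = 1 - ((1 - alpha)/alpha)^2  inverts to  alpha = 1/(1 + sqrt(1 - sigma));
# omega = 2 log[alpha/(1 - alpha)]   inverts to  alpha = 1/(1 + exp(-omega/2)).
def alpha_from_sigma(s):
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

def alpha_from_omega(w):
    return 1.0 / (1.0 + math.exp(-w / 2.0))

assert abs(alpha_from_sigma(1/3) - 0.5505) < 5e-4             # toroidal, theorem 4
assert abs(alpha_from_omega(1.0) - 0.6225) < 5e-4             # corollary 3
assert abs(alpha_from_sigma(1/2) - 0.5858) < 5e-4             # aperiodic, theorem 4
assert abs(alpha_from_sigma(math.sqrt(1/2)) - 0.6488) < 5e-4  # two loops, theorem 4
```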

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:

bull They generalize the conditions for convexity of the Bethe free energy.

bull They incorporate the (local) strength of potentials.

bull They scale naturally as a function of the "temperature."

bull They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, this work also has more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and that, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

bull The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact we only need $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

bull It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log 2 \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential $\Psi_\alpha(X_\alpha) = 1$ can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,3 the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint $Q_\alpha(x_\beta) = Q_\beta(x_\beta)$ to write the Bethe free energy in the "more convex" form

$$
F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha)
+ \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha)
- \sum_\beta \sum_{\alpha \supset \beta} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta), \qquad (5.1)
$$

where the allocation matrix $A_{\alpha\beta}$ can be any matrix that satisfies

$$
\sum_{\alpha \supset \beta} A_{\alpha\beta} = n_\beta - 1. \qquad (5.2)
$$

And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals $Q_\alpha$ alone. This then yields

$$
\begin{aligned}
L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha)
= {} & -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha)\,\psi_\alpha(X_\alpha)
+ \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) \\
& - \sum_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \sum_{x_\beta} Q_\alpha(x_\beta) \log Q_\beta(x_\beta) \\
& + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)
\left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta} Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right] \\
& + \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right]
+ \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \qquad (5.3)
\end{aligned}
$$

3 We would like to conjecture that this is not possible: that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that, convexity by itself is sufficient, not necessary.

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:4

$$
\min_{Q_\alpha, Q_\beta} \max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha)
= \min_{Q_\beta} \max_{\lambda_{\alpha\beta}, \lambda_\alpha} \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha).
$$

Rewriting

$$
\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta)
\left[ \frac{1}{n_\beta - 1} \sum_{\alpha' \supset \beta} A_{\alpha'\beta} Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta) \right]
= -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta),
$$

with

$$
\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{A_{\alpha\beta}}{n_\beta - 1} \sum_{\alpha' \supset \beta} \lambda_{\alpha'\beta}(x_\beta),
$$

we can easily solve for the minimum with respect to $Q_\alpha$:

$$
Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right]. \qquad (5.4)
$$

4 In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$
\begin{aligned}
G(Q_\beta, \bar\lambda_{\alpha\beta}, \lambda_\alpha) & \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
& = -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right] \\
& \quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \qquad (5.5)
\end{aligned}
$$

Next, we find for the maximum with respect to $\lambda_\alpha$:

$$
\exp\left[ 1 - \lambda^*_\alpha \right]
= \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \left( A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta) \right) \right]
\equiv Z^*_\alpha, \qquad (5.6)
$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\bar\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

$$
G(Q_\beta, \bar\lambda_{\alpha\beta}) \equiv G(Q_\beta, \bar\lambda_{\alpha\beta}, \lambda^*_\alpha)
= -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right]. \qquad (5.7)
$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\bar\lambda_{\alpha\beta}$ and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \mathrm{argmin}_y f(x, y)$,

$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta))
\ge 2 f\bigl(x, [y^*(x+\delta) + y^*(x-\delta)]/2\bigr)
\ge 2 f(x, y^*(x)),
$$

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\bar\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \mathrm{argmax}_y f(x, y)$,

$$
f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta))
\ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x))
\ge 2 f(x, y^*(x)),
$$

where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$ and the second inequality from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\bar\lambda_{\alpha\beta}} G(Q_\beta, \bar\lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = (n_\beta - 1)/n_\beta$, we then have

$$
\begin{aligned}
G(Q_\beta) & \ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \right]
+ \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
& \ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta)
+ \sum_\beta (n_\beta - 1) \left[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
& \ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \left[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \right]
+ \sum_\beta (n_\beta - 1) \left[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \right] \\
& \ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \left[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \right],
\end{aligned}
$$

where the first inequality follows by substituting the choice $\bar\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \bar\lambda_{\alpha\beta})$, the second from the concavity of the function $y^{(n_\beta - 1)/n_\beta}$, and the third from the upper bound on the potentials. The last inequality uses $z - 1 \ge \log z$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian: the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},
$$


we make use of

$$
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})}
= A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
$$

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the conventions $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \ne x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \ne \beta'$, we then have

$$
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) & = \sum_{\alpha \supset \beta} A_{\alpha\beta}(1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta}
+ \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)} \\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) & = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'}
\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})}
\quad \text{for } \beta' \ne \beta,
\end{aligned}
$$

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\bar\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \bar\lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$
K \equiv \sum_{\beta\beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \ge 0
$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$
\begin{aligned}
K & = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) & (K_1) \\
& \quad + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) & (K_2) \\
& \quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), & (K_3)
\end{aligned}
$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$
K_2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta) \right]^2 \ge 0
$$

and5

$$
\begin{aligned}
K_3 & = \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})
\times \left\{ \frac{1}{2} \left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \\
& \ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \qquad (6.1)
\end{aligned}
$$

we have

$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

That is, sufficient conditions for $K$ to be nonnegative are

$$
A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha,
$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case, and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:6

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha,\; \beta' \ne \beta}.
$$

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

$$
\begin{aligned}
K_2 + K_3 & = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) & (K_2) \\
& \quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}
\left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}). & (K_3)
\end{aligned}
$$

This leaves us with the weaker requirement (from $K_1$): $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \ne \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.
$$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})
\quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \;\; \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1)
$$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$
\sigma_\alpha = 1 - \exp(-\omega_\alpha), \quad \text{with} \quad
\omega_\alpha \equiv \max_{X_\alpha, \bar X_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\bar X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\bar X_{\alpha \setminus \beta}, x_\beta) \right], \qquad (7.2)
$$

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

$$
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \left[ \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'}) \right] \mu_\beta(x_\beta)}
= \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'})}, \qquad (7.3)
$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$
V(\mu) = (n - 1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right]
- \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x_{\beta'}) \right]
$$

with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar x_\beta}$ for some $\bar x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey

$$
\frac{\partial V}{\partial \mu_1(x_1)} = (n - 1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)
$$

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = -(n - 1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta),
$$

where

$$
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$
(n - 1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta) = \lambda,
$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix with second derivatives at such an extremum, substituting $(n-1)Q(x_1) = \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)$, we then have

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)}
= \frac{1}{2(n-1)} \sum_{\beta \ne 1} \sum_{\substack{\beta' \ne 1 \\ \beta' \ne \beta}}
\left[ Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'}) \right] \left[ Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'}) \right],
$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\bar X$:

$$
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{\bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus\beta}, x_\beta)}.
$$

Next, we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$
\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{X, \bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus\beta}, x_\beta)}. \qquad (7.4)
$$


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta, \beta'\}}$ in

$$
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).
$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$
\omega \equiv -\log(1 - \sigma) = \max_{X, \bar X} W(X, \bar X), \quad \text{with} \quad
W(X, \bar X) = \psi(X) + (n - 1)\, \psi(\bar X) - \sum_\beta \psi(\bar X_{\setminus\beta}, x_\beta),
$$

where we recall that $\psi(X) \equiv \log \Psi(X)$.
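For small potentials, the strength in equation 7.2 can be computed by brute-force enumeration of all pairs $X, \bar X$. The helper below is an illustrative sketch (the function name and the example table are mine, not from the article):

```python
import itertools
import math
import numpy as np

def strength(psi):
    """Brute-force strength omega of a potential, equation 7.2.
    psi: n-dimensional array holding the log-potential psi(X) = log Psi(X)."""
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    omega = 0.0  # combinations differing in fewer than two nodes yield zero
    for X in itertools.product(*states):
        for Xbar in itertools.product(*states):
            w = psi[X] + (n - 1) * psi[Xbar]
            for beta in range(n):
                Xmix = list(Xbar)
                Xmix[beta] = X[beta]  # Xbar with node beta replaced by x_beta
                w -= psi[tuple(Xmix)]
            omega = max(omega, w)
    return omega

# example: a symmetric 2x2 potential; sigma follows as 1 - exp(-omega)
psi = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
sigma = 1.0 - math.exp(-strength(psi))
```

For a pairwise table this reproduces the closed form of equation 7.5 below, and adding a term that factorizes over the nodes leaves the result unchanged, in line with the invariance property of section 7.2.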

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$
\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.
$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar X$ that differ in fewer than two nodes. To see this, consider

$$
\begin{aligned}
W(x_1, x_2, X_{\setminus\{1,2\}};\, \bar x_1, \bar x_2, X_{\setminus\{1,2\}})
& = \psi(x_1, x_2, X_{\setminus\{1,2\}}) + \psi(\bar x_1, \bar x_2, X_{\setminus\{1,2\}})
- \psi(x_1, \bar x_2, X_{\setminus\{1,2\}}) - \psi(\bar x_1, x_2, X_{\setminus\{1,2\}}) \\
& = -W(x_1, \bar x_2, X_{\setminus\{1,2\}};\, \bar x_1, x_2, X_{\setminus\{1,2\}}).
\end{aligned}
$$

If now also $\bar x_2 = x_2$, we get $W(x_1, x_2, \cdot;\, \bar x_1, x_2, \cdot) = -W(x_1, x_2, \cdot;\, \bar x_1, x_2, \cdot) = 0$. Furthermore, if $W(x_1, x_2, \cdot;\, \bar x_1, \bar x_2, \cdot) \le 0$, then it must be that $W(x_1, \bar x_2, \cdot;\, \bar x_1, x_2, \cdot) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1| - 1)(|x_2| - 1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$
\omega = |\psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0)|. \qquad (7.5)
$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
$$

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
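The pairwise binary properties above are straightforward to verify numerically; the weight, thresholds, and temperature below are arbitrary illustrative values:

```python
import math

def omega_pairwise(psi):
    """Strength of a 2x2 table of log-potentials, equation 7.5."""
    return abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

w, th1, th2 = 0.8, -0.3, 1.1

# {0, 1} coding: psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2  ->  omega = |w|,
# independent of the thresholds
psi01 = [[w * a * b + th1 * a + th2 * b for b in (0, 1)] for a in (0, 1)]
assert abs(omega_pairwise(psi01) - abs(w)) < 1e-12

# {-1, +1} coding: psi(x1, x2) = w*x1*x2  ->  omega = 4|w|
psipm = [[w * a * b for b in (-1, 1)] for a in (-1, 1)]
assert abs(omega_pairwise(psipm) - 4 * abs(w)) < 1e-12

# temperature scaling: sigma(T) = 1 - (1 - sigma(1))**(1/T)
sigma1 = 1.0 - math.exp(-omega_pairwise(psi01))
T = 2.5
sigmaT = 1.0 - math.exp(-omega_pairwise(psi01) / T)
assert abs(sigmaT - (1.0 - (1.0 - sigma1) ** (1.0 / T))) < 1e-12
```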

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha, \beta \subset \alpha}$ (positivity),

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), $\qquad$ (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$
\begin{aligned}
K_3 & \ge -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
& \ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\substack{\beta' \subset \alpha \\ \beta' \ne \beta}} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned}
$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, and combining $K_1$ and (the above lower bound on) $K_3$, we get

$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \ne \beta} A_{\alpha\beta'} \right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
$$

This implies

$$
(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha, \beta \subset \alpha},
$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
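Verifying a candidate allocation matrix against conditions 1 to 3 of equation 8.1 is mechanical. The checker below is a sketch with an assumed data layout (dictionaries keyed by potentials and nodes), not code from the article:

```python
def uniqueness_guaranteed(A, sigma, members, eps=1e-12):
    """Verify the three conditions of theorem 4 (equation 8.1).
    members[a]: nodes beta contained in potential a,
    A[a][b]: allocation of potential a to node b, sigma[a]: strength."""
    nodes = {b for a in members for b in members[a]}
    # condition 1: positivity
    if any(A[a][b] < 0 for a in members for b in members[a]):
        return False
    # condition 2: sufficient amount of resources
    for a in members:
        row = [A[a][b] for b in members[a]]
        if (1 - sigma[a]) * max(row) + sigma[a] * sum(row) > 1 + eps:
            return False
    # condition 3: sufficient compensation (n_beta - 1 per node)
    for b in nodes:
        containing = [a for a in members if b in members[a]]
        if sum(A[a][b] for a in containing) < len(containing) - 1 - eps:
            return False
    return True

# single loop of three nodes, pairwise potentials, A = 1/2 everywhere
members = {0: (0, 1), 1: (1, 2), 2: (2, 0)}
A = {a: {b: 0.5 for b in members[a]} for a in members}
sigma = {a: 0.9 for a in members}
assert uniqueness_guaranteed(A, sigma, members)
```

For this single loop, condition 2 reduces to $(1 - \sigma)/2 + \sigma \le 1$, which holds for any $\sigma < 1$: the conditions only start to bite on denser graphs. Finding a feasible $A_{\alpha\beta}$ in general is a small linear feasibility problem, but checking a candidate is enough for the examples in section 8.3.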

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall_\beta. \qquad (8.2)
$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)
$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right]
= \min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right]
- \min_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] \right].
$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar x_1, \bar x_2) + \phi_1(\bar x_1) + \phi_2(\bar x_2)
\ge \psi(x_1, \bar x_2) + \phi_1(x_1) + \phi_2(\bar x_2) = \psi(\bar x_1, x_2) + \phi_1(\bar x_1) + \phi_2(x_2), \qquad (8.4)
$$

for some $x_1$, $x_2$, $\bar x_1$, and $\bar x_2$ with $\bar x_1 \ne x_1$ and $\bar x_2 \ne x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$
\begin{aligned}
\phi_1(x_1) - \phi_1(\bar x_1) & = \frac{1}{2} \left[ \psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) \right] \\
\phi_2(x_2) - \phi_2(\bar x_2) & = \frac{1}{2} \left[ \psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2) \right].
\end{aligned}
$$

Substitution back into equation 8.4 yields

$$
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \bar x_2) - \phi_1(x_1) - \phi_2(\bar x_2)
= \frac{1}{2} \left[ \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) - \psi(\bar x_1, x_2) \right],
$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right]
= \frac{1}{2} \left| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \right| = \frac{\omega}{2},
$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \qquad (8.5)
$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha(1 - \sigma_\alpha) = 1 - \sigma^2_\alpha \le 1.
$$

Substitution into condition 3 yields

$$
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \qquad (8.6)
$$

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002), for binary pairwise potentials and when strengthened as above, are at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.
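For a pairwise binary model, both corollaries reduce to per-node sums of strengths, so the union of the two conditions is a one-liner. The edge list and weights below are illustrative, with $\omega_\alpha = |w|$ in the 0/1 coding of section 7.2:

```python
def node_strength_sums(edges, weights):
    """Sum omega_alpha = |w| over the potentials containing each node."""
    sums = {}
    for (i, j), w in zip(edges, weights):
        for b in (i, j):
            sums[b] = sums.get(b, 0.0) + abs(w)
    return sums

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a four-cycle
weights = [0.6, 0.6, 0.6, 0.6]
s = node_strength_sums(edges, weights)

unique_cor3 = all(v < 4 for v in s.values())   # strengthened Tatikonda-Jordan
unique_cor4 = all(v <= 1 for v in s.values())  # weakened theorem 4
unique = unique_cor3 or unique_cor4            # union of both conditions
```

Here each node accumulates strength 1.2: corollary 3 applies, corollary 4 does not, and the union still guarantees a unique fixed point.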

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
$$

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$
\omega = 2 \log \left[ \frac{\alpha}{1 - \alpha} \right], \quad \text{and thus} \quad \sigma = 1 - \left( \frac{1 - \alpha}{\alpha} \right)^2.
$$


Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.

Uniqueness of Loopy Belief Propagation Fixed Points 2409

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55 .$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad \text{and thus} \quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62 .$$

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \le 1 .$$

The optimal choice for $A$ is the one in which both conditions turn out to be identical. In this way we obtain $A = 3/4$, yielding

$$\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58 ,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1-\sqrt{1/8}$, with

$$\sigma \le \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65 ,$$

better than the $\alpha < 0.62$ from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.
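These threshold values can be checked mechanically. Below is a minimal sketch (in Python; the helper names are ours, not from the article) that inverts $\sigma = 1 - ((1-\alpha)/\alpha)^2$, that is, $\alpha \le 1/(1+\sqrt{1-s})$ for a bound $\sigma \le s$, and reproduces the critical values quoted above:

```python
import math

# sigma(alpha) = 1 - ((1 - alpha)/alpha)**2 for the uniform ferromagnetic potential;
# inverting sigma <= s gives alpha <= 1/(1 + sqrt(1 - s)).
def alpha_max(s):
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

a_torus = alpha_max(1.0 / 3.0)          # theorem 4, toroidal grid (sigma <= 1/3)
a_cor3 = 1.0 / (1.0 + math.exp(-0.5))   # corollary 3: omega = 2 log[alpha/(1-alpha)] < 1
a_aper = alpha_max(1.0 / 2.0)           # theorem 4, aperiodic grid (sigma <= 1/2)
a_loops = alpha_max(math.sqrt(0.5))     # theorem 4, two-loop graph (sigma <= sqrt(1/2))
```

(The article rounds $1/(1+\sqrt{1/2}) \approx 0.586$ down to 0.58.)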

9 Discussion

In this article we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact we need only $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix} ,$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \ne x_j .$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
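The qualitative behavior described above is easy to reproduce. The following sketch (our own minimal implementation, not the author's code) runs damped sum-product on the four-node Boltzmann machine; for a small weight strength it settles into the trivial fixed point with all marginals equal to 0.5:

```python
import math
import random

def damped_bp(w, step=0.5, max_iter=5000, tol=1e-10):
    """Damped loopy belief propagation (sum-product) on a pairwise binary
    Boltzmann machine with zero thresholds:
    Psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj, exp(-w_ij/4) otherwise.
    Returns (single-node marginals, converged flag)."""
    n = len(w)
    nbrs = [[j for j in range(n) if j != i and w[i][j] != 0] for i in range(n)]

    def psi(i, j, xi, xj):
        return math.exp(w[i][j] / 4 if xi == xj else -w[i][j] / 4)

    rng = random.Random(0)
    m = {}
    for i in range(n):
        for j in nbrs[i]:
            raw = [rng.uniform(0.4, 0.6), rng.uniform(0.4, 0.6)]
            s = sum(raw)
            m[(i, j)] = [v / s for v in raw]  # message i -> j, near uniform

    converged = False
    for _ in range(max_iter):
        new = {}
        for (i, j), old in m.items():
            upd = []
            for xj in range(2):
                tot = 0.0
                for xi in range(2):
                    prod = 1.0
                    for k in nbrs[i]:
                        if k != j:
                            prod *= m[(k, i)][xi]
                    tot += psi(i, j, xi, xj) * prod
                upd.append(tot)
            s = sum(upd)
            upd = [v / s for v in upd]
            # damping: convex combination of old and updated message
            new[(i, j)] = [(1 - step) * o + step * u for o, u in zip(old, upd)]
        diff = max(abs(new[k][x] - m[k][x]) for k in m for x in range(2))
        m = new
        if diff < tol:
            converged = True
            break

    marginals = []
    for i in range(n):
        b = [1.0, 1.0]
        for xi in range(2):
            for k in nbrs[i]:
                b[xi] *= m[(k, i)][xi]
        s = sum(b)
        marginals.append([v / s for v in b])
    return marginals, converged

W = [[0, 1, -1, -1], [1, 0, 1, -1], [-1, 1, 0, -1], [-1, -1, -1, 0]]
omega = 1.0  # well inside the "convergent" regime
w = [[omega * W[i][j] for j in range(4)] for i in range(4)]
marg, ok = damped_bp(w)
```

With the weight strength raised toward 6 and the step size toward 1, the same routine no longer settles down, which is the kind of limit-cycle behavior shown in the upper right inset of Figure 4.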

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


$$\begin{aligned}
&+ \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[\frac{1}{n_\beta-1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] \\
&+ \sum_\alpha \lambda_\alpha \left[1 - \sum_{X_\alpha} Q_\alpha(X_\alpha)\right] + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right]. \qquad (5.3)
\end{aligned}$$

Note that the constraint $Q_\beta(x_\beta) = Q_\alpha(x_\beta)$, as well as its normalization, is no longer incorporated with Lagrange multipliers but follows when we take the minimum with respect to $Q_\beta$. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero.

Although the Bethe free energy, and thus the Lagrangian, equation 5.3, may not be convex in $\{Q_\alpha, Q_\beta\}$, they are convex in $Q_\alpha$ and $Q_\beta$ separately. Therefore, we can interchange the minimum over the pseudomarginals $Q_\alpha$ and the maximum over the Lagrange multipliers, as long as we leave the minimum over $Q_\beta$ as the final operation:4

$$\min_{Q_\alpha, Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) = \min_{Q_\beta}\ \max_{\lambda_{\alpha\beta}, \lambda_\alpha}\ \min_{Q_\alpha} L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) .$$

Rewriting

$$\sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[\frac{1}{n_\beta-1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, Q_{\alpha'}(x_\beta) - Q_\alpha(x_\beta)\right] = -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} \bar\lambda_{\alpha\beta}(x_\beta)\, Q_\alpha(x_\beta) ,$$

with

$$\bar\lambda_{\alpha\beta}(x_\beta) \equiv \lambda_{\alpha\beta}(x_\beta) - \frac{1}{n_\beta-1}\sum_{\alpha' \supset \beta} A_{\alpha'\beta}\, \lambda_{\alpha'\beta}(x_\beta) ,$$

we can easily solve for the minimum with respect to $Q_\alpha$:

$$Q^*_\alpha(X_\alpha) = \Psi_\alpha(X_\alpha) \exp\left[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\right\}\right]. \qquad (5.4)$$

4 In principle, we could also first take the minimum over $Q_\beta$ and leave the minimum over $Q_\alpha$, but this does not seem to lead to any useful results.


Plugging this into the Lagrangian, we obtain the "dual"

$$\begin{aligned}
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) &\equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \\
&= -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[\lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \left\{A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\right\}\right] \\
&\quad + \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right]. \qquad (5.5)
\end{aligned}$$

Next we find for the maximum with respect to $\lambda_\alpha$:

$$\exp\left[1 - \lambda^*_\alpha\right] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[\sum_{\beta \subset \alpha} \left\{A_{\alpha\beta} \log Q_\beta(x_\beta) + \bar\lambda_{\alpha\beta}(x_\beta)\right\}\right] \equiv Z^*_\alpha , \qquad (5.6)$$

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\bar\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

$$G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right]. \qquad (5.7)$$

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$ and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \operatorname{argmin}_y f(x, y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\bigl(x, [y^*(x+\delta) + y^*(x-\delta)]/2\bigr) \ge 2 f(x, y^*(x)) ,$$

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \operatorname{argmax}_y f(x, y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)) ,$$

where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$ and the second inequality from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.
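The first observation in the proof, that $g(x) \equiv \min_y f(x, y)$ inherits convexity from $f$, can also be illustrated numerically. The following sketch (ours, with an arbitrary jointly convex $f$) checks midpoint convexity of a partial minimum computed on a grid:

```python
import random

def f(x, y):
    return x * x + x * y + y * y  # jointly convex in (x, y)

def g(x, samples=2001):
    # partial minimum over y on a grid covering the interior minimizer y = -x/2
    return min(f(x, -2.0 + 4.0 * k / (samples - 1)) for k in range(samples))

rng = random.Random(0)
midpoint_convex = all(
    g(a) + g(b) >= 2.0 * g((a + b) / 2.0) - 1e-5
    for a, b in ((rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100))
)
```

Here $g(x) = 3x^2/4$ analytically, and the midpoint-convexity test passes up to the small grid discretization error.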

5.3 Boundedness of the Bethe Free Energy. For completeness and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta-1}{n_\beta}$, we then have

$$\begin{aligned}
G(Q_\beta) &\ge -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\left[\sum_{\beta \subset \alpha} \frac{n_\beta-1}{n_\beta} \log Q_\beta(x_\beta)\right] + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta) + \sum_\beta (n_\beta-1)\left[\sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} + \sum_\beta (n_\beta-1)\left[-\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1\right] \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} ,
\end{aligned}$$

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$, the second from the concavity of the function $y^{(n_\beta-1)/n_\beta}$, and the third from the upper bound on the potentials; the final inequality uses $\log t \le t - 1$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian: the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1) ,$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})} ,$$


we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})} ,$$

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the conventions $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x_\beta \ne x'_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \ne \beta'$, we then have

$$H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha \supset \beta} A_{\alpha\beta}(1-A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)} ,$$

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \ne \beta ,$$

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\bar\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \bar\lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \ge 0$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$\begin{aligned}
K &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) & (K_1) \\
&\quad + \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) & (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) , & (K_3)
\end{aligned}$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha \left[\sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta)\right]^2 \ge 0$$

and5

$$\begin{aligned}
K_3 &= \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'}) \left\{\frac{1}{2}\left[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'})\right\} \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left(\sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) , \qquad (6.1)
\end{aligned}$$

we have

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left(1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) .$$

That is, sufficient conditions for $K$ to be nonnegative are

$$A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha ,$$

precisely the conditions for theorem 1.
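The inequality chain above can be stress-tested numerically. The sketch below (our own, for a single potential over two binary nodes) draws random pseudomarginals, nonnegative allocations, and test vectors, and verifies that $K_1 + K_2 + K_3$ stays above the lower bound $\sum_\beta \sum_{x_\beta} A_{\alpha\beta}(1 - \sum_{\beta'} A_{\alpha\beta'})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta)$:

```python
import random

def k_terms(Q, A, R):
    """K1, K2, K3 of section 6.1 for one potential over two binary nodes.
    Q[(x1, x2)]: joint pseudomarginal; A = (A1, A2): allocations; R[b][x]: test vector."""
    marg = [
        [Q[(0, 0)] + Q[(0, 1)], Q[(1, 0)] + Q[(1, 1)]],  # Q*(x1)
        [Q[(0, 0)] + Q[(1, 0)], Q[(0, 1)] + Q[(1, 1)]],  # Q*(x2)
    ]
    K1 = sum(A[b] * (1 - A[b]) * marg[b][x] * R[b][x] ** 2
             for b in range(2) for x in range(2))
    K2 = sum(A[b] * marg[b][x] * R[b][x]
             for b in range(2) for x in range(2)) ** 2
    K3 = -2 * A[0] * A[1] * sum(Q[(x1, x2)] * R[0][x1] * R[1][x2]
                                for x1 in range(2) for x2 in range(2))
    return K1, K2, K3, marg

rng = random.Random(1)
holds = True
for _ in range(1000):
    raw = [rng.random() + 1e-6 for _ in range(4)]
    z = sum(raw)
    Q = {(x1, x2): raw[2 * x1 + x2] / z for x1 in range(2) for x2 in range(2)}
    A = (rng.random(), rng.random())  # nonnegative allocations
    R = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    K1, K2, K3, marg = k_terms(Q, A, R)
    bound = sum(A[b] * (1 - A[0] - A[1]) * marg[b][x] * R[b][x] ** 2
                for b in range(2) for x in range(2))
    holds = holds and (K1 + K2 + K3 >= bound - 1e-9)
```

The bound holds for every draw, including allocations with $A_{\alpha 1} + A_{\alpha 2} > 1$, for which the right-hand side simply becomes negative.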

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:6

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha} .$$

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

$$\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) & (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) . & (K_3)
\end{aligned}$$

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1-A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$\sum_{\substack{\alpha' \supset \beta \\ \alpha' \ne \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1 .$$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}}\ \forall_{x_\beta, x'_{\beta'}} , \qquad (7.1)$$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \hat X_\alpha}\left[\psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat X_{\alpha \setminus \beta}, x_\beta)\right], \qquad (7.2)$$

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta \left\{\sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'})\, \mu_\beta(x_\beta)\right\}} = \frac{\Psi(X) \left[\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)\right]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x'_{\beta'})} , \qquad (7.3)$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$V(\mu) = (n-1) \log\left[\sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta)\right] - \sum_{\beta=1}^n \log\left[\sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \ne \beta} \mu_{\beta'}(x_{\beta'})\right]$$

with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \hat x_\beta}$ for some $\hat x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey

$$\frac{\partial V}{\partial \mu_1(x_1)} = (n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta) ,$$

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = -(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta \ne 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta) ,$$


where

$$Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)} .$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$(n-1)\, Q(x_1) - \sum_{\beta \ne 1} Q(x_1 | x^*_\beta) = \lambda ,$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$.

Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix with second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = \frac{1}{2(n-1)} \sum_{\beta \ne 1} \sum_{\substack{\beta' \ne 1 \\ \beta' \ne \beta}} \left[Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'})\right] \left[Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'})\right] ,$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\hat X$:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat X} \frac{\Psi(X)\left[\Psi(\hat X)\right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)} .$$

Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$\frac{1}{1-\sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \hat X} \frac{\Psi(X)\left[\Psi(\hat X)\right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)} . \qquad (7.4)$$


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta,\beta'\}}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X) .$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1-\sigma) = \max_{X, \hat X} W(X, \hat X) \quad \text{with} \quad W(X, \hat X) = \psi(X) + (n-1)\, \psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta) ,$$

where we recall that $\psi(X) \equiv \log \Psi(X)$.
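Lemma 2 can be verified numerically for pairwise potentials. The sketch below (ours; $n = 2$, binary nodes) draws random potentials and messages, computes $\sigma$ by brute force from equation 7.2, and checks the correlation bound 7.1:

```python
import itertools
import math
import random

rng = random.Random(0)
bound_holds = True
for _ in range(200):
    # random positive pairwise potential on two binary nodes
    Psi = {(x1, x2): math.exp(rng.uniform(-2.0, 2.0))
           for x1 in range(2) for x2 in range(2)}

    # strength via eq. 7.2 with n = 2:
    # omega = max psi(x1,x2) + psi(y1,y2) - psi(x1,y2) - psi(y1,x2)
    omega = max(
        math.log(Psi[(x1, x2)]) + math.log(Psi[(y1, y2)])
        - math.log(Psi[(x1, y2)]) - math.log(Psi[(y1, x2)])
        for x1, x2, y1, y2 in itertools.product(range(2), repeat=4)
    )
    sigma = 1.0 - math.exp(-omega)

    # random incoming messages -> belief Q*(X) prop Psi(X) mu1(x1) mu2(x2)
    mu = [[rng.random() + 1e-3 for _ in range(2)] for _ in range(2)]
    q = {(x1, x2): Psi[(x1, x2)] * mu[0][x1] * mu[1][x2]
         for x1 in range(2) for x2 in range(2)}
    z = sum(q.values())
    q = {k: v / z for k, v in q.items()}
    m1 = [q[(0, 0)] + q[(0, 1)], q[(1, 0)] + q[(1, 1)]]
    m2 = [q[(0, 0)] + q[(1, 0)], q[(0, 1)] + q[(1, 1)]]

    for x1, x2 in itertools.product(range(2), repeat=2):
        covariance = q[(x1, x2)] - m1[x1] * m2[x2]
        bound_holds = bound_holds and covariance <= sigma * q[(x1, x2)] + 1e-12
```

The bound holds for every draw, as it must: by equation 7.4, $Q^*(X) \le Q^*(x_1) Q^*(x_2)/(1-\sigma)$ for any messages.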

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

if $\tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta)$, then $\omega(\tilde\Psi) = \omega(\Psi)$ for any choice of $\mu$.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\hat X$ that differ in fewer than two nodes. To see this, consider

$$W(x_1, x_2, \hat x_{\setminus 12};\ \hat x_1, \hat x_2, \hat x_{\setminus 12}) = \psi(x_1, x_2, \hat x_{\setminus 12}) + \psi(\hat x_1, \hat x_2, \hat x_{\setminus 12}) - \psi(x_1, \hat x_2, \hat x_{\setminus 12}) - \psi(\hat x_1, x_2, \hat x_{\setminus 12}) = -W(\hat x_1, x_2, \hat x_{\setminus 12};\ x_1, \hat x_2, \hat x_{\setminus 12}) .$$

If now also $x_2 = \hat x_2$, the four terms cancel pairwise and we get $W = -W = 0$. Furthermore, if $W(x_1, x_2, \hat x_{\setminus 12};\ \hat x_1, \hat x_2, \hat x_{\setminus 12}) \le 0$, then it must be that $W(\hat x_1, x_2, \hat x_{\setminus 12};\ x_1, \hat x_2, \hat x_{\setminus 12}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1| |x_2| (|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)| . \qquad (7.5)$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2] .$$

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1-\sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
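Two of the properties above are easy to confirm by brute force. The sketch below (our own helper, not from the article) evaluates $\omega$ of equation 7.2 for a pairwise Boltzmann factor, checking that it equals $|w|$ independently of the thresholds, together with the temperature scaling of $\sigma$:

```python
import itertools
import math

def strength_omega(psi, n_nodes=2, n_states=2):
    """Brute-force omega of eq. 7.2: max over X, Xhat of
    psi(X) + (n-1) psi(Xhat) - sum_b psi(Xhat with node b set to x_b)."""
    best = 0.0
    for X in itertools.product(range(n_states), repeat=n_nodes):
        for Xh in itertools.product(range(n_states), repeat=n_nodes):
            val = psi(X) + (n_nodes - 1) * psi(Xh)
            for b in range(n_nodes):
                Y = list(Xh)
                Y[b] = X[b]
                val -= psi(tuple(Y))
            best = max(best, val)
    return best

w, th1, th2 = 1.7, 0.3, -2.0  # arbitrary example weight and thresholds
omega = strength_omega(lambda X: w * X[0] * X[1] + th1 * X[0] + th2 * X[1])
# omega equals |w|; the thresholds drop out of the max

# temperature scaling: sigma(T) = 1 - exp(-omega/T) = 1 - (1 - sigma(1))**(1/T)
def sigma(T):
    return 1.0 - math.exp(-omega / T)

scaling_ok = all(abs(sigma(T) - (1.0 - (1.0 - sigma(1.0)) ** (1.0 / T))) < 1e-12
                 for T in (0.25, 0.5, 2.0, 4.0))
```

As $T \to 0$, `sigma(T)` approaches 1, consistent with the zero-temperature remark above.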

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha}$ (positivity),

2. $(1-\sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation), $\qquad (8.1)$

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$\begin{aligned}
K_3 &\ge -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \ne \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
&\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left(\sum_{\substack{\beta' \subset \alpha \\ \beta' \ne \beta}} A_{\alpha\beta'}\right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) ,
\end{aligned}$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, and combining $K_1$ and (the above lower bound on) $K_3$, we get

K = K1 + K2 + K3

gesumα

sumβsubα

sumxβ

Aαβ

[1minus Aαβ minus σα

sumβ prime =β

Aαβ prime

]Qlowastα(xβ)R

2β(xβ)

This implies

(1minus σα)Aαβ + σαsumβ primesubα

Aαβ prime le 1 forallαβsubα

which in combination with Aαβ ge 0 and σα le 1 yields condition 2 inequation 81 The equality constraint equation 52 that we started with canbe relaxed to the inequality condition 3 without any consequences

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \Big( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Big) < 2 \quad \forall_\beta. \quad (8.2)
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \quad (8.3)
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

\[
\tilde{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
\]

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1, \phi_2} \Big[ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \Big]
= \min_{\phi_1, \phi_2} \Big[ \max_{x_1, x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] - \min_{x_1, x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] \Big].
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

\[
\psi(\hat{x}_1, \hat{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\hat{x}_2) = \psi(\bar{x}_1, \bar{x}_2) + \phi_1(\bar{x}_1) + \phi_2(\bar{x}_2)
\geq \psi(\hat{x}_1, \bar{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\bar{x}_2) = \psi(\bar{x}_1, \hat{x}_2) + \phi_1(\bar{x}_1) + \phi_2(\hat{x}_2), \quad (8.4)
\]

for some $\hat{x}_1$, $\hat{x}_2$, $\bar{x}_1$, and $\bar{x}_2$ with $\bar{x}_1 \neq \hat{x}_1$ and $\bar{x}_2 \neq \hat{x}_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\phi_1(\hat{x}_1) - \phi_1(\bar{x}_1) = \frac{1}{2} \big[ \psi(\bar{x}_1, \bar{x}_2) - \psi(\hat{x}_1, \hat{x}_2) + \psi(\bar{x}_1, \hat{x}_2) - \psi(\hat{x}_1, \bar{x}_2) \big],
\]
\[
\phi_2(\hat{x}_2) - \phi_2(\bar{x}_2) = \frac{1}{2} \big[ \psi(\bar{x}_1, \bar{x}_2) - \psi(\hat{x}_1, \hat{x}_2) + \psi(\hat{x}_1, \bar{x}_2) - \psi(\bar{x}_1, \hat{x}_2) \big].
\]

Substitution back into equation 8.4 yields

\[
\psi(\hat{x}_1, \hat{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\hat{x}_2) - \psi(\hat{x}_1, \bar{x}_2) - \phi_1(\hat{x}_1) - \phi_2(\bar{x}_2)
= \frac{1}{2} \big[ \psi(\hat{x}_1, \hat{x}_2) + \psi(\bar{x}_1, \bar{x}_2) - \psi(\hat{x}_1, \bar{x}_2) - \psi(\bar{x}_1, \hat{x}_2) \big],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1, \phi_2} \Big[ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \Big]
= \frac{1}{2} \big| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \big| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
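The enumeration at the end of this proof can be verified by brute force: a coarse grid search over the self-potential differences should reproduce the gap $\omega/2$. A numeric sketch (Python; everything here is our own scaffolding):

```python
import random

def gap(psi, d1, d2):
    # max minus min of psi(x1, x2) + phi1(x1) + phi2(x2) over binary x1, x2,
    # parameterized by the (shift-invariant) differences d1 = phi1(1) - phi1(0)
    # and d2 = phi2(1) - phi2(0)
    vals = [psi[x1][x2] + d1 * x1 + d2 * x2 for x1 in (0, 1) for x2 in (0, 1)]
    return max(vals) - min(vals)

random.seed(0)
psi = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

grid = [i / 100.0 for i in range(-300, 301)]
best = min(gap(psi, d1, d2) for d1 in grid for d2 in grid)
# the grid minimum can only exceed the true minimum omega / 2 by the resolution
assert omega / 2 - 1e-9 <= best <= omega / 2 + 0.05
```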

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \quad \forall_\beta, \quad (8.5)
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)

\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \leq 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \geq \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \leq 1. \quad (8.6)
\]

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 is weaker than condition 8.6.
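The allocation $A_{\alpha\beta} = 1 - \sigma_\alpha$ used in this proof is easy to validate directly on a random pairwise graph: whenever the per-node strengths sum to at most one, all three conditions of theorem 4 hold. A standalone sketch (Python; the graph encoding is our own):

```python
import random

random.seed(1)
n_nodes = 6
edges = [(i, j) for i in range(n_nodes)
         for j in range(i + 1, n_nodes) if random.random() < 0.4]
# strengths small enough that sum_{alpha containing beta} sigma_alpha <= 1 everywhere
sigma = [random.uniform(0.0, 1.0 / n_nodes) for _ in edges]

n_beta = [sum(1 for e in edges if b in e) for b in range(n_nodes)]
for s in sigma:
    A = 1.0 - s                        # the proof's uniform allocation A_{alpha beta}
    assert A >= 0.0                                      # condition 1
    assert (1 - s) * A + s * 2 * A <= 1.0 + 1e-12        # condition 2: equals 1 - s**2
for b in range(n_nodes):
    compensation = sum(1.0 - s for s, e in zip(sigma, edges) if b in e)
    assert compensation >= n_beta[b] - 1 - 1e-12         # condition 3, since sum sigma <= 1
```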

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2 \log \Big[ \frac{\alpha}{1 - \alpha} \Big] \quad \text{and thus} \quad \sigma = 1 - \Big( \frac{1 - \alpha}{\alpha} \Big)^2.
\]

Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal $3/4$ (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + e^{-1/2}} \approx 0.62.
\]
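Both critical values follow mechanically by inverting $\sigma = 1 - ((1 - \alpha)/\alpha)^2$ and $\omega = 2 \log[\alpha/(1 - \alpha)]$. A quick check (Python; the helper names are ours):

```python
import math

def alpha_from_sigma(sigma_max):
    # invert sigma = 1 - ((1 - alpha)/alpha)**2: alpha = 1 / (1 + sqrt(1 - sigma))
    return 1.0 / (1.0 + math.sqrt(1.0 - sigma_max))

def alpha_from_omega(omega_max):
    # invert omega = 2 log[alpha/(1 - alpha)]: alpha = 1 / (1 + exp(-omega/2))
    return 1.0 / (1.0 + math.exp(-omega_max / 2.0))

assert abs(alpha_from_sigma(1.0 / 3.0) - 0.55) < 0.01   # theorem 4 bound above
assert abs(alpha_from_omega(1.0) - 0.62) < 0.01         # corollary 3 bound above
```

The same two inversions give the critical values quoted for the other grids of Figure 3.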

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2 - 2A)\sigma + \frac{3}{4} \leq 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \leq 1.
\]

The optimal choice for $A$ is the one in which both conditions turn out to be identical. In this way, we obtain $A = 3/4$, yielding

\[
\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

\[
\sigma \leq \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
\]

better than the $\alpha < 0.62$ from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:

• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we need only $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \ \text{if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \ \text{if } x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
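The qualitative behavior described above is easy to reproduce with a small simulation. The sketch below (Python) runs damped sum-product on the four-node Boltzmann machine; the damping form new = (1 − ε) · old + ε · update is one common choice and may differ in detail from the article's equation 3.9, and all function names are ours:

```python
import math, random

def run_lbp(W, eps, iters=5000, tol=1e-8, seed=0):
    """Damped loopy belief propagation on a binary Boltzmann machine.

    W is a symmetric weight matrix; Psi_ij(x_i, x_j) = exp(+w_ij/4) if x_i == x_j
    and exp(-w_ij/4) otherwise, as in the text. Returns the marginals
    P_i(x_i = 1) on convergence, or None if the message change never drops
    below tol (e.g., a limit cycle).
    """
    rng = random.Random(seed)
    n = len(W)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and W[i][j] != 0]
    # perturbed initialization: exactly uniform messages are already a fixed point
    msg = {}
    for e in edges:
        p = 0.5 + rng.uniform(-0.01, 0.01)
        msg[e] = [1.0 - p, p]
    psi = lambda i, j, xi, xj: math.exp(W[i][j] / 4.0 * (1 if xi == xj else -1))
    for _ in range(iters):
        delta, new = 0.0, {}
        for (i, j) in edges:
            m = [sum(psi(i, j, xi, xj)
                     * math.prod(msg[(k, l)][xi] for (k, l) in edges
                                 if l == i and k != j)
                     for xi in (0, 1))
                 for xj in (0, 1)]
            z = m[0] + m[1]
            m = [(1 - eps) * msg[(i, j)][x] + eps * m[x] / z for x in (0, 1)]
            delta = max(delta, abs(m[0] - msg[(i, j)][0]))
            new[(i, j)] = m
        msg = new
        if delta < tol:
            up = [math.prod(msg[(k, l)][1] for (k, l) in edges if l == i)
                  for i in range(n)]
            down = [math.prod(msg[(k, l)][0] for (k, l) in edges if l == i)
                    for i in range(n)]
            return [u / (u + d) for u, d in zip(up, down)]
    return None

W = [[0, 1, -1, -1], [1, 0, 1, -1], [-1, 1, 0, -1], [-1, -1, -1, 0]]
small = [[2.0 * w for w in row] for row in W]   # weight strength omega = 2
marg = run_lbp(small, eps=0.2)
assert marg is not None and all(abs(p - 0.5) < 1e-3 for p in marg)
large = [[8.0 * w for w in row] for row in W]   # omega = 8: "nonconvergent" regime
assert run_lbp(large, eps=1.0) is None
```

For small weights, the messages settle back to the trivial fixed point; for large weights, the damped updates keep oscillating, mirroring the two insets of Figure 4.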


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Plugging this into the Lagrangian, we obtain the "dual"

\[
G(Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha) \equiv L(Q^*_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha)
= -\sum_\alpha \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp \Big[ \lambda_\alpha - 1 + \sum_{\beta \subset \alpha} \big\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \lambda_{\alpha\beta}(x_\beta) \big\} \Big]
+ \sum_\alpha \lambda_\alpha + \sum_\beta (n_\beta - 1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big]. \quad (5.5)
\]

Next, we find for the maximum with respect to $\lambda_\alpha$:

\[
\exp\big[1 - \lambda^*_\alpha\big] = \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp \Big[ \sum_{\beta \subset \alpha} \big\{ A_{\alpha\beta} \log Q_\beta(x_\beta) + \lambda_{\alpha\beta}(x_\beta) \big\} \Big] \equiv Z^*_\alpha, \quad (5.6)
\]

where we have to keep in mind that $Z^*_\alpha$ by itself, like $Q^*_\alpha$, is a function of the remaining pseudomarginals $Q_\beta$ and Lagrange multipliers $\lambda_{\alpha\beta}$. Substituting this solution into the dual, we arrive at

\[
G(Q_\beta, \lambda_{\alpha\beta}) \equiv G(Q_\beta, \lambda_{\alpha\beta}, \lambda^*_\alpha) = -\sum_\alpha \log Z^*_\alpha + \sum_\beta (n_\beta - 1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big]. \quad (5.7)
\]

Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in $Q_\alpha$, has a unique minimum in $Q_\alpha$ (given all other parameters fixed), which is also the only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) $\lambda^*_\alpha$. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers $\lambda_{\alpha\beta}$ and then the minimum over the remaining pseudomarginals $Q_\beta$. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual $G$ is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in $Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \mathrm{argmin}_y f(x, y)$,

\[
f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \geq 2 f\big(x, [y^*(x + \delta) + y^*(x - \delta)]/2\big) \geq 2 f(x, y^*(x)),
\]

where the first inequality follows from the convexity of $f$ in $\{x, y\}$ and the second inequality from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next, we note that the maximum of a convex-concave function over its maximizing parameters is again convex: with $y^*(x) \equiv \mathrm{argmax}_y f(x, y)$,

\[
f(x + \delta, y^*(x + \delta)) + f(x - \delta, y^*(x - \delta)) \geq f(x + \delta, y^*(x)) + f(x - \delta, y^*(x)) \geq 2 f(x, y^*(x)),
\]

where the first inequality follows from $y^*(x \pm \delta)$ being the unique maximum of $f(x \pm \delta, y)$ and the second inequality from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \leq \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta - 1}{n_\beta}$, we then have

\[
G(Q_\beta) \geq -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp \Big[ \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log Q_\beta(x_\beta) \Big] + \sum_\beta (n_\beta - 1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big]
\]
\[
\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) Q_\beta(x_\beta) + \sum_\beta (n_\beta - 1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big]
\]
\[
\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \Big[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \Big] + \sum_\beta (n_\beta - 1) \Big[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big]
\]
\[
\geq -\sum_\alpha \sum_{\beta \subset \alpha} \frac{n_\beta - 1}{n_\beta} \log \Big[ \sum_{X_{\alpha \setminus \beta}} \Psi_{\max} \Big],
\]

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$; the second from the concavity of the function $y^{\frac{n_\beta - 1}{n_\beta}}$; and the third from the upper bound on the potentials.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

\[
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
\]

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

\[
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta) \partial Q_{\beta'}(x'_{\beta'})},
\]

we make use of

\[
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
\]

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with convention $Q^*_\alpha(x_\beta, x_\beta) = Q^*_\alpha(x_\beta)$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x'_\beta \neq x_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

\[
H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha \supset \beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)} \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta) Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta) Q_\beta(x'_\beta)},
\]
\[
H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta) Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta,
\]

where $\delta_{x_\beta, x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee

\[
K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta) H_{\beta\beta'}(x_\beta, x'_{\beta'}) S_{\beta'}(x'_{\beta'}) \geq 0,
\]

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

\[
K = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta) \quad (K_1)
\]
\[
+ \sum_\alpha \sum_{\beta, \beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) \quad (K_2)
\]
\[
- \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}), \quad (K_3)
\]

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

\[
K_2 = \sum_\alpha \Big[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} Q^*_\alpha(x_\beta) R_\beta(x_\beta) \Big]^2 \geq 0
\]

and⁵

\[
K_3 = \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} Q^*_\alpha(x_\beta, x'_{\beta'}) \Big\{ \frac{1}{2} \big[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \big]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \Big\}
\]
\[
\geq -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \Big) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta), \quad (6.1)
\]

we have

\[
K = K_1 + K_2 + K_3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \Big) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta).
\]

That is, sufficient conditions for $K$ to be nonnegative are

\[
A_{\alpha\beta} \geq 0 \quad \forall_{\alpha, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \quad \forall_\alpha,
\]

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case, and these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:⁶

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.
\]

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

⁶ The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

\[
K_2 + K_3 = \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta} Q^*_\alpha(x_\beta) Q^*_\alpha(x'_\beta) R_\beta(x_\beta) R_\beta(x'_\beta)
\]
\[
- \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \big[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \big] R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}).
\]

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \geq 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

\[
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1.
\]

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \geq 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta) Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \; \forall_{x_\beta, x'_{\beta'}}, \quad (7.1)
\]

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha), \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \bar{X}_\alpha} \Big[ \psi_\alpha(X_\alpha) + (n_\alpha - 1) \psi_\alpha(\bar{X}_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\bar{X}_{\alpha \setminus \beta}, x_\beta) \Big], \quad (7.2)
\]

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

\[
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \Big[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \Big] \mu_\beta(x_\beta)}
= \frac{\Psi(X) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})}, \quad (7.3)
\]

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

\[
V(\mu) = (n - 1) \log \Big[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \Big] - \sum_{\beta=1}^n \log \Big[ \sum_{X_{\setminus \beta}} \Psi(X_{\setminus \beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \Big],
\]

with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \geq 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar{x}_\beta}$ for some $\bar{x}_\beta$ to be found.

Proof Let us consider optimizing the messagemicro1(x1)with fixed messagesmicroβ(xβ) for β gt 1 The first and second derivatives are easily found to obey

partVpartmicro1(x1)

= (nminus 1)Q(x1)minussumβ =1

Q(x1|xlowastβ)

part2Vpartmicro1(x1)partmicro1(xprime1)

= (nminus 1)Q(x1)Q(xprime1)minussumβ =1

Q(x1|xlowastβ)Q(xprime1|xlowastβ)

Uniqueness of Loopy Belief Propagation Fixed Points 2401

where

$$Q(X) \equiv \frac{\Psi(X)\prod_\beta \mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta \mu_\beta(x'_\beta)}$$

and $Q(x_1|x^*_\beta)$ denotes the marginal of $Q$ over $x_1$ conditioned on $x_\beta = x^*_\beta$.

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$(n-1)Q(x_1) - \sum_{\beta\neq 1} Q(x_1|x^*_\beta) = \lambda\,\mu_1(x_1),$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1}\mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). Substituting $(n-1)Q(x_1) = \sum_{\beta\neq 1}Q(x_1|x^*_\beta)$ into the matrix of second derivatives at such an extremum, we then have

$$\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} \propto \frac{1}{2(n-1)}\sum_{\beta\neq 1}\ \sum_{\beta'\neq 1,\,\beta'\neq\beta}\bigl[Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'})\bigr]\bigl[Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'})\bigr],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\bar{X}$:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\bar{X}} \frac{\Psi(X)\bigl[\Psi(\bar{X})\bigr]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus\beta},x_\beta)}.$$

Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$\frac{1}{1-\sigma} \equiv \max_{X,\mu}\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\bar{X}}\frac{\Psi(X)\bigl[\Psi(\bar{X})\bigr]^{n-1}}{\prod_\beta \Psi(\bar{X}_{\setminus\beta},x_\beta)}. \tag{7.4}$$


The inequality 7.1 then follows by summing out $X_{\setminus\beta\beta'}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1-\sigma) = \max_{X,\bar{X}} W(X,\bar{X}) \quad\text{with}\quad W(X,\bar{X}) = \psi(X) + (n-1)\,\psi(\bar{X}) - \sum_\beta \psi(\bar{X}_{\setminus\beta},x_\beta),$$

where we recall that $\psi(X) \equiv \log\Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication by any term that factorizes over the nodes; that is,

$$\text{if } \tilde\Psi(X) = \Psi(X)\prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu.$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential by a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar{X}$ that differ in fewer than two nodes. To see this, consider two nodes 1 and 2 and combinations that agree outside them:

$$W(x_1,x_2,\bar{X}_{\setminus 12};\bar{x}_1,\bar{x}_2,\bar{X}_{\setminus 12}) = \psi(x_1,x_2,\bar{X}_{\setminus 12}) + \psi(\bar{x}_1,\bar{x}_2,\bar{X}_{\setminus 12}) - \psi(\bar{x}_1,x_2,\bar{X}_{\setminus 12}) - \psi(x_1,\bar{x}_2,\bar{X}_{\setminus 12}) = -W(x_1,\bar{x}_2,\bar{X}_{\setminus 12};\bar{x}_1,x_2,\bar{X}_{\setminus 12}).$$

If now also $x_2 = \bar{x}_2$, we get $W(x_1,\bar{x}_2,\bar{X}_{\setminus 12};\bar{x}_1,\bar{x}_2,\bar{X}_{\setminus 12}) = -W(x_1,\bar{x}_2,\bar{X}_{\setminus 12};\bar{x}_1,\bar{x}_2,\bar{X}_{\setminus 12}) = 0$. Furthermore, if $W(x_1,x_2,\bar{X}_{\setminus 12};\bar{x}_1,\bar{x}_2,\bar{X}_{\setminus 12}) \le 0$, then it must be that $W(x_1,\bar{x}_2,\bar{X}_{\setminus 12};\bar{x}_1,x_2,\bar{X}_{\setminus 12}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0,1\}$, we immediately obtain

$$\omega = \bigl|\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)\bigr|. \tag{7.5}$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1,x_2) \propto \exp[w\,x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight". It is indeed independent of (the size of) the thresholds. In the case of $\{-1,1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1-\sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
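The temperature scaling stated above is easy to check numerically with the pairwise binary formula, equation 7.5 (a small sketch of ours, not from the paper):

```python
import math

def omega_pairwise(psi):
    # Strength of a pairwise binary log-potential, equation 7.5
    return abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

psi = [[0.4, -0.3], [-1.1, 0.9]]       # arbitrary pairwise binary log-potential
omega1 = omega_pairwise(psi)
sigma1 = 1 - math.exp(-omega1)
for T in (0.5, 1.0, 2.0, 4.0):
    psiT = [[p / T for p in row] for row in psi]          # psi(X)/T
    omegaT = omega_pairwise(psiT)
    sigmaT = 1 - math.exp(-omegaT)
    assert abs(omegaT - omega1 / T) < 1e-12               # omega(T) = omega(1)/T
    assert abs(sigmaT - (1 - (1 - sigma1) ** (1 / T))) < 1e-12  # sigma(T) = 1-[1-sigma(1)]^(1/T)
```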

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall\alpha,\ \beta\subset\alpha$ (positivity),

2. $(1-\sigma_\alpha)\max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall\beta$ (sufficient compensation), $\qquad(8.1)$

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$K_3 \ge -\sum_\alpha \sum_{\beta,\beta'\subset\alpha,\,\beta'\neq\beta}\ \sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,\sigma_\alpha\, Q^*_\alpha(x_\beta,x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'})$$

$$\ge -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Bigl(\sum_{\beta'\subset\alpha,\,\beta'\neq\beta} A_{\alpha\beta'}\Bigr) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta),$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Bigl[1 - A_{\alpha\beta} - \sigma_\alpha\sum_{\beta'\neq\beta} A_{\alpha\beta'}\Bigr] Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).$$

This implies that $K \ge 0$ is guaranteed if

$$(1-\sigma_\alpha)A_{\alpha\beta} + \sigma_\alpha\sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad\forall\alpha,\ \beta\subset\alpha,$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha \Psi_\alpha(X_\alpha)\prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\Bigl(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\Bigr) < 2 \quad\forall\beta. \tag{8.2}$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha < 4 \quad\forall\beta, \tag{8.3}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\phi_{\alpha\beta}(x_\beta),$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$\min_{\phi_1,\phi_2}\Bigl[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Bigr] = \min_{\phi_1,\phi_2}\Bigl[\max_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr] - \min_{x_1,x_2}\bigl[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\bigr]\Bigr].$$

In the case of binary nodes (two-by-two matrices $\psi(x_1,x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that

$$\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2) = \psi(\bar{x}_1,\bar{x}_2)+\phi_1(\bar{x}_1)+\phi_2(\bar{x}_2) \ge \psi(x_1,\bar{x}_2)+\phi_1(x_1)+\phi_2(\bar{x}_2) = \psi(\bar{x}_1,x_2)+\phi_1(\bar{x}_1)+\phi_2(x_2) \tag{8.4}$$

for some $x_1$, $x_2$, $\bar{x}_1$, and $\bar{x}_2$ with $x_1 \neq \bar{x}_1$ and $x_2 \neq \bar{x}_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$\phi_1(x_1) - \phi_1(\bar{x}_1) = \tfrac{1}{2}\bigl[\psi(\bar{x}_1,x_2) - \psi(x_1,x_2) + \psi(\bar{x}_1,\bar{x}_2) - \psi(x_1,\bar{x}_2)\bigr]$$

$$\phi_2(x_2) - \phi_2(\bar{x}_2) = \tfrac{1}{2}\bigl[\psi(x_1,\bar{x}_2) - \psi(x_1,x_2) + \psi(\bar{x}_1,\bar{x}_2) - \psi(\bar{x}_1,x_2)\bigr].$$

Substitution back into equation 8.4 yields

$$\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2) - \psi(x_1,\bar{x}_2)-\phi_1(x_1)-\phi_2(\bar{x}_2) = \tfrac{1}{2}\bigl[\psi(x_1,x_2)+\psi(\bar{x}_1,\bar{x}_2)-\psi(x_1,\bar{x}_2)-\psi(\bar{x}_1,x_2)\bigr],$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\phi_1,\phi_2}\Bigl[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Bigr] = \tfrac{1}{2}\bigl|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)\bigr| = \frac{\omega}{2}$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
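The closed-form optimum in this proof can be spot-checked numerically: the $\phi_1$, $\phi_2$ built from the half-difference formulas achieve a gap of exactly $\omega/2$, and no other choice does better. A sketch of ours (it assumes $\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0) \ge 0$, so that $x_1 = x_2 = 0$, $\bar{x}_1 = \bar{x}_2 = 1$ is the valid assignment):

```python
import itertools
import random

psi = [[0.2, -0.7], [1.3, 0.5]]        # arbitrary pairwise binary log-potential
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])   # equation 7.5

def gap(phi1, phi2):
    vals = [psi[x1][x2] + phi1[x1] + phi2[x2]
            for x1, x2 in itertools.product((0, 1), repeat=2)]
    return max(vals) - min(vals)

# Optimal self-potentials from the half-difference formulas (phi(1) set to 0):
phi1 = [0.5 * (psi[1][0] - psi[0][0] + psi[1][1] - psi[0][1]), 0.0]
phi2 = [0.5 * (psi[0][1] - psi[0][0] + psi[1][1] - psi[1][0]), 0.0]
assert abs(gap(phi1, phi2) - omega / 2) < 1e-12

# Any other choice of phi can only increase the gap:
random.seed(0)
for _ in range(1000):
    r1 = [random.uniform(-2, 2), random.uniform(-2, 2)]
    r2 = [random.uniform(-2, 2), random.uniform(-2, 2)]
    assert gap(r1, r2) >= omega / 2 - 1e-9
```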

Next, we derive the following weaker corollary of theorem 4.


Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha \le 1 \quad\forall\beta, \tag{8.5}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha\supset\beta}(1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta}\sigma_\alpha \le 1. \tag{8.6}$$

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6, which completes the proof.
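For pairwise potentials in $\{0,1\}$ coding, where $\omega_\alpha = |w_\alpha|$ (section 7.2), corollary 4 reduces to a per-node sum of weight magnitudes. A minimal sketch (hypothetical helper names; note that a `False` answer only means that this sufficient condition fails, not that multiple fixed points exist):

```python
def unique_by_corollary4(weights, n_nodes):
    """Corollary 4: loopy BP has a unique fixed point if, for every node beta,
    the summed strengths of the potentials containing beta are at most 1.
    weights maps edges (i, j) to Boltzmann weights w_ij, so omega = |w_ij|."""
    for beta in range(n_nodes):
        total = sum(abs(w) for (i, j), w in weights.items() if beta in (i, j))
        if total > 1:
            return False     # condition violated at node beta; no guarantee
    return True

# A 4-cycle: every node touches two edges
weak = {(0, 1): 0.4, (1, 2): -0.4, (2, 3): 0.4, (3, 0): -0.4}     # 0.8 per node
strong = {(0, 1): 1.0, (1, 2): -1.0, (2, 3): 1.0, (3, 0): -1.0}   # 2.0 per node
```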

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials, and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a $3\times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix}\alpha & 1-\alpha\\ 1-\alpha & \alpha\end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima, one with "spins up" and the other one with "spins down".

In this symmetric problem, the strength of each potential is given by

$$\omega = 2\log\Bigl[\frac{\alpha}{1-\alpha}\Bigr] \quad\text{and thus}\quad \sigma = 1 - \Bigl(\frac{1-\alpha}{\alpha}\Bigr)^{2}.$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal $3/4$ (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in (b) and $C = 1 - A$ in (c), the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62.$$
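The two critical values above follow from inverting $\sigma(\alpha)$ and $\omega(\alpha)$ for the symmetric Ising potential; a quick numerical check (ours):

```python
import math

def omega(alpha):
    # Strength of the symmetric Ising potential (alpha, 1-alpha; 1-alpha, alpha)
    return 2 * math.log(alpha / (1 - alpha))

def sigma(alpha):
    return 1 - ((1 - alpha) / alpha) ** 2

# Theorem 4 with uniform allocation A = 3/4 requires sigma <= 1/3:
alpha_thm4 = 1 / (1 + math.sqrt(2 / 3))
assert abs(sigma(alpha_thm4) - 1 / 3) < 1e-12

# Corollary 3 on the toroidal grid (four potentials per node) requires omega < 1:
alpha_cor3 = 1 / (1 + math.exp(-0.5))
assert abs(omega(alpha_cor3) - 1.0) < 1e-12
```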

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{\sigma}{2} + A \le 1.$$

The optimal choice for $A$ is the one in which both conditions turn out to be identical. In this way, we obtain $A = 3/4$, yielding

$$\sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

$$\sigma \le \sqrt{\frac{1}{2}} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the $\alpha < 0.62$ from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature".

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms", similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here sharper and thus closer to necessary conditions.

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we only need $\tilde{G}(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $\tilde{G}(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega\begin{pmatrix} 0 & 1 & -1 & -1\\ 1 & 0 & 1 & -1\\ -1 & 1 & 0 & -1\\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i,x_j) = \exp[w_{ij}/4] \ \text{if}\ x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i,x_j) = \exp[-w_{ij}/4] \ \text{if}\ x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i \in \{0, 1\}$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
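The simulation described here is easy to reproduce in outline. The sketch below (our own implementation, not the paper's code) runs damped sum-product on the four-node Boltzmann machine; `step` plays the role of the damping step size of equation 3.9. For a weight strength well below the transition, it settles into the trivial fixed point with all marginals equal to 0.5:

```python
import numpy as np

def damped_bp(W, step=0.2, iters=5000, tol=1e-12, seed=0):
    """Damped loopy belief propagation (sum-product) on a Boltzmann machine with
    potentials Psi_ij = exp(+W_ij/4) if x_i == x_j and exp(-W_ij/4) otherwise."""
    n = W.shape[0]
    rng = np.random.default_rng(seed)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and W[i, j] != 0]
    m = {e: rng.dirichlet(np.ones(2)) for e in edges}   # random initial messages
    delta = np.inf
    for _ in range(iters):
        new = {}
        for (i, j) in edges:
            psi = np.exp(np.array([[W[i, j], -W[i, j]],
                                   [-W[i, j], W[i, j]]]) / 4)
            prod = np.ones(2)       # incoming messages to i, except the one from j
            for (k, l) in edges:
                if l == i and k != j:
                    prod = prod * m[(k, i)]
            msg = psi.T @ prod
            new[(i, j)] = (1 - step) * m[(i, j)] + step * msg / msg.sum()
        delta = max(np.abs(new[e] - m[e]).max() for e in edges)
        m = new
        if delta < tol:
            break
    beliefs = np.ones((n, 2))
    for (k, l) in edges:
        beliefs[l] *= m[(k, l)]
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    return beliefs, delta

base = np.array([[0, 1, -1, -1], [1, 0, 1, -1],
                 [-1, 1, 0, -1], [-1, -1, -1, 0]], dtype=float)
beliefs, delta = damped_bp(1.0 * base)   # weight strength 1, well below the transition
```

Raising the weight strength toward 5 or 6 (depending on `step`) reproduces the oscillatory behavior of the upper-right inset in Figure 4.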

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength, in simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and for step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


$Q_\beta$. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution.

Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in $\{Q_\alpha, Q_\beta\}$ implies convexity of the dual, equation 5.7, in $Q_\beta$.

Proof. First we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with $y^*(x) \equiv \text{argmin}_y f(x,y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge 2 f\bigl(x, [y^*(x+\delta)+y^*(x-\delta)]/2\bigr) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from the convexity of $f$ in $(x, y)$ and the second from $y^*(x)$ being the unique minimum of $f(x, y)$. Therefore, the dual, equation 5.5, is convex in $Q_\beta$ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in $\{Q_\alpha, Q_\beta\}$. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers $\{\lambda_{\alpha\beta}, \lambda_\alpha\}$. Next we note that the maximum of a function that is convex in $x$ over its maximizing parameters $y$ is again convex in $x$: with $y^*(x) \equiv \text{argmax}_y f(x,y)$,

$$f(x+\delta, y^*(x+\delta)) + f(x-\delta, y^*(x-\delta)) \ge f(x+\delta, y^*(x)) + f(x-\delta, y^*(x)) \ge 2 f(x, y^*(x)),$$

where the first inequality follows from $y^*(x\pm\delta)$ being the maximum of $f(x\pm\delta, y)$ and the second from the convexity of $f(x, y)$ in $x$. Hence, the dual, equation 5.7, must still be convex in $Q_\beta$.
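The first step of this proof, that partial minimization of a jointly convex function leaves a convex function, can be illustrated numerically (our own toy example):

```python
def f(x, y):
    # Jointly convex in (x, y): Hessian [[2, 1], [1, 2]] is positive definite
    return x * x + x * y + y * y

def g(x):
    # Crude grid minimization over y; the exact minimizer is y = -x/2
    return min(f(x, k / 1000.0) for k in range(-5000, 5001))

# Midpoint convexity of g on a few sample pairs:
for a, b in [(-2.0, 1.0), (0.5, 3.0), (-1.5, -0.5)]:
    assert g((a + b) / 2) <= (g(a) + g(b)) / 2 + 1e-6
```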

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness, and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, $\Psi_\alpha(X_\alpha) \le \Psi_{\max}$ for all $\alpha$ and $X_\alpha$, the Bethe free energy is bounded from below on the set of constraints.


Proof. It is sufficient to prove that the function $\tilde{G}(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ is bounded from below for a particular choice of $A_{\alpha\beta}$ satisfying equation 5.2. Considering $A_{\alpha\beta} = \frac{n_\beta - 1}{n_\beta}$, we then have

$$\tilde{G}(Q_\beta) \ge -\sum_\alpha \log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\exp\Bigl[\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log Q_\beta(x_\beta)\Bigr] + \sum_\beta (n_\beta-1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\sum_{X_\alpha}\Psi_\alpha(X_\alpha)\,Q_\beta(x_\beta) + \sum_\beta (n_\beta-1)\Bigl[\sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\Bigl[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\Bigr] + \sum_\beta (n_\beta-1)\Bigl[-\log\sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1\Bigr]$$

$$\ge -\sum_\alpha\sum_{\beta\subset\alpha}\frac{n_\beta-1}{n_\beta}\log\Bigl[\sum_{X_{\alpha\setminus\beta}}\Psi_{\max}\Bigr],$$

where the first inequality follows by substituting the choice $\lambda_{\alpha\beta}(x_\beta) = 0$ for all $\alpha$, $\beta$, and $x_\beta$ in $G(Q_\beta, \lambda_{\alpha\beta})$; the second from the concavity of the function $y^{(n_\beta-1)/n_\beta}$; the third from the upper bound on the potentials; and the last from $-\log z + z - 1 \ge 0$ for $z > 0$.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals $Q_\beta$. The first derivative yields

$$\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha\supset\beta} A_{\alpha\beta}\frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\,\partial Q_{\beta'}(x'_{\beta'})},$$

we make use of

$$\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},$$

where both $\beta$ and $\beta'$ should be a subset of $\alpha$, and with the convention $Q^*_\alpha(x_\beta, x'_\beta) = Q^*_\alpha(x_\beta)$ if $x'_\beta = x_\beta$ and $Q^*_\alpha(x_\beta, x'_\beta) = 0$ if $x'_\beta \neq x_\beta$. Here the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between $\beta = \beta'$ and $\beta \neq \beta'$, we then have

$$H_{\beta\beta}(x_\beta, x'_\beta) = \sum_{\alpha\supset\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\,\delta_{x_\beta,x'_\beta} + \sum_{\alpha\supset\beta} A^2_{\alpha\beta}\frac{Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\,Q_\beta(x'_\beta)}$$

$$H_{\beta\beta'}(x_\beta, x'_{\beta'}) = -\sum_{\alpha\supset\beta,\beta'} A_{\alpha\beta}A_{\alpha\beta'}\,\frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\,Q_{\beta'}(x'_{\beta'})} \quad\text{for}\ \beta' \neq \beta,$$

where $\delta_{x_\beta,x'_\beta} = 1$ if and only if $x_\beta = x'_\beta$. Here it should be noted that both $\beta$ and $x_\beta$ play the role of indices; that is, $x_\beta$ should not be mistaken for a variable or parameter. The parameters are still the (tables of) Lagrange multipliers $\lambda_{\alpha\beta}$ and pseudomarginals $Q_\beta$.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters $\{Q_\beta, \lambda_{\alpha\beta}\}$, that is, conditions that guarantee

$$K \equiv \sum_{\beta,\beta'}\sum_{x_\beta,x'_{\beta'}} S_\beta(x_\beta)\,H_{\beta\beta'}(x_\beta, x'_{\beta'})\,S_{\beta'}(x'_{\beta'}) \ge 0$$

for any choice of the "vector" $S$ with elements $S_\beta(x_\beta)$. Straightforward manipulations yield

$$K = \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}(1-A_{\alpha\beta})\,Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta) \qquad (K_1)$$

$$\quad + \sum_\alpha\sum_{\beta,\beta'\subset\alpha}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) \qquad (K_2)$$

$$\quad - \sum_\alpha\sum_{\beta,\beta'\subset\alpha,\,\beta'\neq\beta}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\,R_\beta(x_\beta)\,R_{\beta'}(x'_{\beta'}) \qquad (K_3)$$

where $R_\beta(x_\beta) \equiv S_\beta(x_\beta)/Q_\beta(x_\beta)$.


6.2 Recovering the Convexity Conditions. Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

$$K_2 = \sum_\alpha\Bigl[\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\,Q^*_\alpha(x_\beta)\,R_\beta(x_\beta)\Bigr]^2 \ge 0$$

and5

$$K_3 = \sum_\alpha\sum_{\beta,\beta'\subset\alpha,\,\beta'\neq\beta}\sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\,Q^*_\alpha(x_\beta, x'_{\beta'})\Bigl\{\tfrac{1}{2}\bigl[R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'})\bigr]^2 - \tfrac{1}{2}R^2_\beta(x_\beta) - \tfrac{1}{2}R^2_{\beta'}(x'_{\beta'})\Bigr\}$$

$$\ge -\sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Bigl(\sum_{\beta'\subset\alpha} A_{\alpha\beta'} - A_{\alpha\beta}\Bigr) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta), \tag{6.1}$$

we have

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Bigl(1 - \sum_{\beta'\subset\alpha} A_{\alpha\beta'}\Bigr) Q^*_\alpha(x_\beta)\,R^2_\beta(x_\beta).$$

That is, sufficient conditions for $K$ to be nonnegative are

$$A_{\alpha\beta} \ge 0 \quad\forall\alpha,\ \beta\subset\alpha \qquad\text{and}\qquad \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad\forall\alpha,$$

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction", such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:6

$$Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\,Q^*_\alpha(x'_{\beta'}) \quad\forall\,\beta,\beta'\subset\alpha.$$

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently the terms involving α in K3 cancel with those in K2 which ismost easily seen when we combine K2 and K3 in a different way

$$
\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) && (K_2)\\
&\quad - \sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha\\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) . && (K_3)
\end{aligned}
$$

This leaves us with the weaker requirement (from K1) A_αβ(1 − A_αβ) ≥ 0 for all β ⊂ α. The best choice is then to take A_αβ = 1, which turns condition 3 of equation 4.1 into

$$
\sum_{\substack{\alpha' \supset \beta\\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1 .
$$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials n_β by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since K2 ≥ 0, we will concentrate on K3.
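As a quick sanity check, the factorization behind the fake-interaction argument can be verified numerically: a pseudomarginal proportional to a constant potential times incoming messages is exactly rank one, so the correlation term in the regrouped K3 vanishes. A minimal sketch (illustrative code, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nodes with 3 states each; a "fake" (constant) potential Psi = 1.
psi = np.ones((3, 3))

# Random positive incoming messages, one per node.
mu1 = rng.random(3) + 0.1
mu2 = rng.random(3) + 0.1

# Pseudomarginal on the pair: potential times incoming messages, normalized.
q = psi * np.outer(mu1, mu2)
q /= q.sum()

# Single-node marginals of the pseudomarginal.
q1 = q.sum(axis=1)
q2 = q.sum(axis=0)

# For a constant potential the pseudomarginal factorizes exactly, so the
# correlation term Q*(x_b, x_b') - Q*(x_b) Q*(x_b') in K3 vanishes.
corr = q - np.outer(q1, q2)
print(np.abs(corr).max())  # ~0, up to roundoff
```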

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\substack{\beta,\beta' \subset \alpha\\ \beta' \neq \beta}}\; \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1)
$$

with the "strength" σ_α a function of the potential ψ_α(X_α) ≡ log Ψ_α(X_α) only:

$$
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad \omega_\alpha \equiv \max_{X_\alpha, \hat X_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat X_{\alpha \setminus \beta}, x_\beta) \right], \qquad (7.2)
$$

where n_α ≡ Σ_{β⊂α} 1, the number of nodes in α, and (X̂_{α\β}, x_β) denotes X̂_α with its β component replaced by x_β.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient between the loopy belief on the potential subset divided by the product of its single-node marginals:

$$
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \left[ \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \right] \mu_\beta(x_\beta)}
= \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})}, \qquad (7.3)
$$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$
V(\mu) = (n-1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right] - \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \right]
$$

with respect to the messages μ, under constraints Σ_{x_β} μ_β(x_β) = 1 for all β and μ_β(x_β) ≥ 0 for all β and x_β, occurs at an extreme point μ_β(x_β) = δ_{x_β, x̂_β} for some x̂_β to be found.

Proof. Let us consider optimizing the message μ1(x1) with fixed messages μ_β(x_β) for β > 1. The first and second derivatives are easily found to obey

$$
\frac{\partial V}{\partial \mu_1(x_1)} \propto (n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta)
$$

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto -(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta)\, Q(x'_1 \mid x^*_\beta),
$$


where

$$
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)} .
$$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, μ1(x1) > 0 for two or more values of x1. At such an extremum, the first derivative should obey

$$
(n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 \mid x^*_\beta) = \lambda,
$$

with λ a Lagrange multiplier implementing the constraint Σ_{x1} μ1(x1) = 1. Summing over x1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix of second derivatives at such an extremum, we then have

$$
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} \propto \frac{1}{2(n-1)} \sum_{\beta \neq 1} \sum_{\substack{\beta' \neq 1\\ \beta' \neq \beta}} \left[ Q(x_1 \mid x^*_\beta) - Q(x_1 \mid x^*_{\beta'}) \right] \left[ Q(x'_1 \mid x^*_\beta) - Q(x'_1 \mid x^*_{\beta'}) \right],
$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μ_β(x_β), β > 1, it follows by induction that the maximum with respect to all μ_β(x_β) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by a maximization over values X̂:

$$
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat X} \frac{\Psi(X) \left[ \Psi(\hat X) \right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)} .
$$

Next, we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

$$
\frac{1}{1-\sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X, \hat X} \frac{\Psi(X) \left[ \Psi(\hat X) \right]^{n-1}}{\prod_\beta \Psi(\hat X_{\setminus\beta}, x_\beta)} . \qquad (7.4)
$$


The inequality 7.1 then follows by summing out X_{\{β,β′}} in

$$
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X) .
$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$
\omega \equiv -\log(1-\sigma) = \max_{X, \hat X} W(X, \hat X) \quad \text{with} \quad W(X, \hat X) = \psi(X) + (n-1)\, \psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta),
$$

where we recall that ψ(X) ≡ log Ψ(X).
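For small discrete potentials, the strength defined through equation 7.4 can be computed by brute-force enumeration, and the resulting σ checked against the bound of equation 7.1 for random messages. A minimal sketch (the function name and test setup are ours, not from the text):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def strength(psi):
    """Brute-force sigma from equation 7.4 for an n-node discrete potential
    (psi is an n-dimensional array of positive potential values)."""
    n = psi.ndim
    states = [range(s) for s in psi.shape]
    best = 1.0  # the ratio equals 1 at X = X-hat, so the maximum is >= 1
    for X in itertools.product(*states):        # candidate X
        for Xh in itertools.product(*states):   # candidate X-hat
            denom = 1.0
            for b in range(n):
                Xmix = list(Xh)
                Xmix[b] = X[b]                  # X-hat with node b replaced
                denom *= psi[tuple(Xmix)]
            best = max(best, psi[X] * psi[Xh] ** (n - 1) / denom)
    return 1.0 - 1.0 / best                     # sigma = 1 - exp(-omega)

# Random positive pairwise potential and random messages.
psi = rng.random((3, 3)) + 0.1
sigma = strength(psi)

mu1, mu2 = rng.random(3) + 0.1, rng.random(3) + 0.1
q = psi * np.outer(mu1, mu2)
q /= q.sum()
q1, q2 = q.sum(axis=1), q.sum(axis=0)

# Bound 7.1: Q*(x1, x2) - Q*(x1) Q*(x2) <= sigma * Q*(x1, x2).
assert np.all(q - np.outer(q1, q2) <= sigma * q + 1e-12)
```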

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$
\text{if } \tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde\Psi) = \omega(\Psi) \text{ for any choice of } \mu .
$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider

$$
\begin{aligned}
W(x_1, x_2, \hat X_{\setminus 12};\, \hat x_1, \hat x_2, \hat X_{\setminus 12}) &= \psi(x_1, x_2, \hat X_{\setminus 12}) + \psi(\hat x_1, \hat x_2, \hat X_{\setminus 12}) - \psi(x_1, \hat x_2, \hat X_{\setminus 12}) - \psi(\hat x_1, x_2, \hat X_{\setminus 12})\\
&= -W(x_1, \hat x_2, \hat X_{\setminus 12};\, \hat x_1, x_2, \hat X_{\setminus 12}) .
\end{aligned}
$$

If now also x2 = x̂2, we get W = −W = 0. Furthermore, if W(x1, x2, X̂_{\12}; x̂1, x̂2, X̂_{\12}) ≤ 0, then it must be that W(x1, x̂2, X̂_{\12}; x̂1, x2, X̂_{\12}) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1| |x2| (|x1| − 1)(|x2| − 1)/4 combinations. And indeed, for binary nodes x_{1,2} ∈ {0, 1}, we immediately obtain

$$
\omega = \left| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \right| . \qquad (7.5)
$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$
\Psi(x_1, x_2) \propto \exp\left[ w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2 \right] .
$$

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models, there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
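For binary pairwise potentials, equation 7.5, the ω = |w| identity, and the temperature scaling σ(T) = 1 − [1 − σ(1)]^{1/T} are easy to verify directly. A minimal sketch with arbitrary illustrative parameter values:

```python
import math

def omega_pairwise_binary(psi):
    """Equation 7.5: strength of a 2x2 log-potential table psi[x1][x2]."""
    return abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

# Boltzmann factor psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2 for 0/1 nodes.
w, th1, th2 = -1.7, 0.4, -2.3   # arbitrary illustrative values
psi = [[w * a * b + th1 * a + th2 * b for b in (0, 1)] for a in (0, 1)]

omega = omega_pairwise_binary(psi)
print(omega)   # ~|w| = 1.7, independent of the thresholds

# Temperature scaling: omega(T) = omega(1)/T, so
# sigma(T) = 1 - exp(-omega/T) = 1 - (1 - sigma(1))**(1/T).
sigma1 = 1.0 - math.exp(-omega)
for T in (0.5, 1.0, 2.0):
    sigmaT = 1.0 - math.exp(-omega / T)
    assert abs(sigmaT - (1.0 - (1.0 - sigma1) ** (1.0 / T))) < 1e-12
```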

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix A_αβ between potentials α and nodes β with properties

$$
\begin{aligned}
&1.\;\; A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha} && \text{(positivity)}\\
&2.\;\; (1-\sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha && \text{(sufficient amount of resources)}\\
&3.\;\; \sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta && \text{(sufficient compensation)}
\end{aligned}
\qquad (8.1)
$$

with the strength σ_α a function of the potential Ψ_α(X_α), as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K2 + K3 ≥ 0 for any choice of R_β(x_β).

Substituting the bound, equation 7.1, into the term K3, we obtain

$$
\begin{aligned}
K_3 &\ge -\sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha\\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'})\\
&\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big( \sum_{\substack{\beta' \subset \alpha\\ \beta' \neq \beta}} A_{\alpha\beta'} \Big)\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned}
$$

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, combining K1 and (the above lower bound on) K3, we get

$$
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \Big[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \Big]\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) .
$$

This implies

$$
(1-\sigma_\alpha)\, A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\, \beta \subset \alpha},
$$

which, in combination with A_αβ ≥ 0 and σ_α ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σ_α = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σ_α = 0, condition 2 becomes max_{β⊂α} A_αβ ≤ 1, suggesting the choice A_αβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials n_β in condition 3.
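The three conditions of equation 8.1 are simple to test mechanically once an allocation matrix is proposed. A minimal checker (our own illustrative code, with a tree-structured example in which A_αβ = 1 works):

```python
import numpy as np

def uniqueness_certificate(A, sigma, members):
    """Check the three conditions of theorem 4 (equation 8.1).

    A[a][b]    : allocation between potential a and node b (0 where b not in a)
    sigma[a]   : strength of potential a
    members[a] : tuple of nodes contained in potential a
    """
    A = np.asarray(A, dtype=float)
    n_nodes = A.shape[1]
    # n_beta: number of potentials containing each node.
    n_beta = np.zeros(n_nodes)
    for nodes in members:
        for b in nodes:
            n_beta[b] += 1
    cond1 = all(A[a, b] >= 0 for a, nodes in enumerate(members) for b in nodes)
    cond2 = all(
        (1 - sigma[a]) * max(A[a, b] for b in nodes)
        + sigma[a] * sum(A[a, b] for b in nodes) <= 1 + 1e-12
        for a, nodes in enumerate(members)
    )
    cond3 = all(A[:, b].sum() >= n_beta[b] - 1 - 1e-12 for b in range(n_nodes))
    return cond1 and cond2 and cond3

# Pairwise chain 0-1-2 (a tree): two potentials, A = 1 everywhere.
members = [(0, 1), (1, 2)]
A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
print(uniqueness_certificate(A, sigma=[0.0, 0.0], members=members))  # True
```

With strong potentials (large σ), the same allocation fails condition 2, so a less greedy allocation must be sought.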

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
$$

to be compared with our equation 3.1, where there are no self-potentials Ψ_β(x_β). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall_\beta . \qquad (8.2)
$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)
$$

with ω_α defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψ_β(x_β). In fact, it is valid for any choice

$$
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
$$

where ψ_α(X_α) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$
\min_{\phi_1, \phi_2} \left[ \max_{x_1,x_2} \tilde\psi(x_1,x_2) - \min_{x_1,x_2} \tilde\psi(x_1,x_2) \right]
= \min_{\phi_1, \phi_2} \Big[ \max_{x_1,x_2} \left\{ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) \right\} - \min_{x_1,x_2} \left\{ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) \right\} \Big] .
$$

In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2, those that yield the smallest gap, are such that

$$
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat x_1, \hat x_2) + \phi_1(\hat x_1) + \phi_2(\hat x_2)
\ge \psi(x_1, \hat x_2) + \phi_1(x_1) + \phi_2(\hat x_2) = \psi(\hat x_1, x_2) + \phi_1(\hat x_1) + \phi_2(x_2) \qquad (8.4)
$$

for some x1, x2, x̂1, and x̂2, with x̂1 ≠ x1 and x̂2 ≠ x2. Solving for φ1 and φ2, we find

$$
\begin{aligned}
\phi_1(x_1) - \phi_1(\hat x_1) &= \tfrac{1}{2} \left[ \psi(\hat x_1, x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(x_1, \hat x_2) \right]\\
\phi_2(x_2) - \phi_2(\hat x_2) &= \tfrac{1}{2} \left[ \psi(x_1, \hat x_2) - \psi(x_1, x_2) + \psi(\hat x_1, \hat x_2) - \psi(\hat x_1, x_2) \right] .
\end{aligned}
$$

Substitution back into equation 8.4 yields

$$
\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\hat x_2) - \phi_1(x_1) - \phi_2(\hat x_2)
= \tfrac{1}{2} \left[ \psi(x_1,x_2) + \psi(\hat x_1,\hat x_2) - \psi(x_1,\hat x_2) - \psi(\hat x_1,x_2) \right],
$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$
\min_{\phi_1,\phi_2} \left[ \max_{x_1,x_2} \tilde\psi(x_1,x_2) - \min_{x_1,x_2} \tilde\psi(x_1,x_2) \right] = \tfrac{1}{2} \left| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \right| = \frac{\omega}{2},
$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.


Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$
\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \qquad (8.5)
$$

with ω_α defined in equation 7.2.

Proof. Consider the allocation matrix with components A_αβ = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

$$
(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1 .
$$

Substitution into condition 3 yields

$$
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1 . \qquad (8.6)
$$

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 implies condition 8.6.
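The computation in the proof can be spelled out numerically: with A_αβ = 1 − σ_α and two nodes per potential, the left-hand side of condition 2 equals 1 − σ_α², which never exceeds 1. A one-line check (illustrative, not from the text):

```python
import numpy as np

# For a pairwise potential (two nodes per alpha) the choice A = 1 - sigma
# gives, in condition 2 of theorem 4,
#   (1-sigma)*max A + sigma*sum A = (1-sigma)^2 + 2*sigma*(1-sigma) = 1 - sigma^2.
for sigma in np.linspace(0.0, 1.0, 11):
    A = 1.0 - sigma
    lhs = (1.0 - sigma) * A + sigma * 2.0 * A
    assert abs(lhs - (1.0 - sigma ** 2)) < 1e-12
    assert lhs <= 1.0 + 1e-12
print("condition 2 holds with value 1 - sigma^2")
```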

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$
\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix} .
$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3, we find two minima: one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$
\omega = 2 \log\left[ \frac{\alpha}{1-\alpha} \right] \quad \text{and thus} \quad \sigma = 1 - \left( \frac{1-\alpha}{\alpha} \right)^2 .
$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$
\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55 .
$$

The critical value that follows from corollary 3 is in this case slightly better:

$$
\omega < 1 \quad \text{and thus} \quad \alpha < \frac{1}{1 + e^{-1/2}} \approx 0.62 .
$$

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we could solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$
(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\sigma + A \le 1 .
$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

$$
\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$
\sigma \le \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
$$

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
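The bounds for Figures 3b and 3c follow the same inversion of σ(α); the sketch below reproduces the quoted values and checks that the two instances of condition 2 indeed coincide at A = 3/4 (illustrative code, not from the text):

```python
import math

def alpha_bound(sigma_max):
    """Invert sigma = 1 - ((1-a)/a)^2 <= sigma_max for the ferromagnetic grid."""
    return 1.0 / (1.0 + math.sqrt(1.0 - sigma_max))

# Figure 3b (aperiodic boundaries): the two conditions
# (2-2A)s + 3/4 <= 1 and s/2 + A <= 1 coincide at A = 3/4, giving sigma <= 1/2.
A, s = 0.75, 0.5
assert abs(((2 - 2 * A) * s + 0.75) - (s / 2 + A)) < 1e-12   # both equal 1
print(round(alpha_bound(0.5), 3))             # 0.586 (the text's ~0.58)

# Figure 3c (two loops): sigma <= sqrt(1/2).
print(round(alpha_bound(math.sqrt(0.5)), 3))  # 0.649 (the text's ~0.65)
```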

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here sharper and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_αβ) with respect to Q_β. But in fact, we only need Ḡ(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) to be convex, which is a weaker requirement. The Hessian of Ḡ(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_αβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
$$

zero thresholds, and potentials

$$
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j .
$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
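The damped update used in these simulations can be sketched as standard sum-product message passing with a convex-combination step (our stand-in for the damping of equation 3.9; the implementation details are ours, not from the text). Well below the transition, the iteration settles at the trivial fixed point:

```python
import numpy as np

def damped_bp(omega, step, iters=5000, tol=1e-10, seed=0):
    """Damped loopy BP (sum-product) on the four-node Boltzmann machine of
    this section: Psi_ij(xi, xj) = exp(+w_ij/4) if xi == xj, else exp(-w_ij/4).
    `step` is the damping step size; step = 1 is undamped BP."""
    W = omega * np.array([[0, 1, -1, -1],
                          [1, 0, 1, -1],
                          [-1, 1, 0, -1],
                          [-1, -1, -1, 0]], float)
    n = 4
    rng = np.random.default_rng(seed)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    psi = {(i, j): np.array([[np.exp(W[i, j] / 4), np.exp(-W[i, j] / 4)],
                             [np.exp(-W[i, j] / 4), np.exp(W[i, j] / 4)]])
           for (i, j) in edges}
    # random (non-uniform) initial messages m_{i->j}(x_j)
    m = {e: (lambda v: v / v.sum())(rng.random(2) + 0.1) for e in edges}
    diff = np.inf
    for _ in range(iters):
        new = {}
        for (i, j) in edges:
            prod = np.ones(2)           # incoming messages to i, except from j
            for k in range(n):
                if k != i and k != j:
                    prod *= m[(k, i)]
            msg = psi[(i, j)].T @ prod  # sum over x_i
            msg /= msg.sum()
            new[(i, j)] = (1 - step) * m[(i, j)] + step * msg
        diff = max(np.abs(new[e] - m[e]).max() for e in edges)
        m = new
        if diff < tol:
            break
    belief = np.ones(2)                 # single-node belief of node 0
    for k in range(1, n):
        belief *= m[(k, 0)]
    belief /= belief.sum()
    return belief, diff

# Small weight strength: damped BP settles at the trivial fixed point (0.5, 0.5).
belief, diff = damped_bp(omega=1.0, step=0.5)
print(diff < 1e-10, np.round(belief, 3))
```

For larger ω and step sizes, the same iteration fails to settle, mirroring the step-size-dependent transition described above.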

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


Proof. It is sufficient to prove that the function Ḡ(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) is bounded from below for a particular choice of A_αβ satisfying equation 5.2. Considering A_αβ = (n_β − 1)/n_β, we then have

$$
\begin{aligned}
\bar G(Q_\beta) \ge\;& -\sum_\alpha \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha) \exp\Big[ \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log Q_\beta(x_\beta) \Big] + \sum_\beta (n_\beta-1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big] \\
\ge\;& -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \sum_{X_\alpha} \Psi_\alpha(X_\alpha)\, Q_\beta(x_\beta) + \sum_\beta (n_\beta-1) \Big[ \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big] \\
\ge\;& -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \Big[ \sum_{X_{\alpha\setminus\beta}} \Psi_{\max} \Big] + \sum_\beta (n_\beta-1) \Big[ -\log \sum_{x_\beta} Q_\beta(x_\beta) + \sum_{x_\beta} Q_\beta(x_\beta) - 1 \Big] \\
\ge\;& -\sum_\alpha \sum_{\beta\subset\alpha} \frac{n_\beta-1}{n_\beta} \log \Big[ \sum_{X_{\alpha\setminus\beta}} \Psi_{\max} \Big],
\end{aligned}
$$

where the first inequality follows by substituting the choice λ_αβ(x_β) = 0 for all α, β, and x_β in G(Q_β, λ_αβ), the second from the concavity of the function y^{(n_β−1)/n_β}, and the third from the upper bound on the potentials.

6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, that is, the second derivative of the dual with respect to the pseudomarginals Q_β. The first derivative yields

$$
\frac{\partial G}{\partial Q_\beta(x_\beta)} = -\sum_{\alpha \supset \beta} A_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)}{Q_\beta(x_\beta)} + (n_\beta - 1),
$$

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives,

$$
H_{\beta\beta'}(x_\beta, x'_{\beta'}) \equiv \frac{\partial^2 G}{\partial Q_\beta(x_\beta)\, \partial Q_{\beta'}(x'_{\beta'})},
$$


we make use of

$$
\frac{\partial Q^*_\alpha(x_\beta)}{\partial Q_{\beta'}(x'_{\beta'})} = A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_{\beta'}(x'_{\beta'})},
$$

where both β and β′ should be a subset of α, and with conventions Q*_α(x_β, x_β) = Q*_α(x_β) and Q*_α(x_β, x′_β) = 0 if x_β ≠ x′_β. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization, as in equation 5.6. Distinguishing between β = β′ and β ≠ β′, we then have

$$
\begin{aligned}
H_{\beta\beta}(x_\beta, x'_\beta) &= \sum_{\alpha \supset \beta} A_{\alpha\beta} (1 - A_{\alpha\beta}) \frac{Q^*_\alpha(x_\beta)}{Q^2_\beta(x_\beta)}\, \delta_{x_\beta, x'_\beta} + \sum_{\alpha \supset \beta} A^2_{\alpha\beta} \frac{Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)}{Q_\beta(x_\beta)\, Q_\beta(x'_\beta)}\\
H_{\beta\beta'}(x_\beta, x'_{\beta'}) &= -\sum_{\alpha \supset \beta, \beta'} A_{\alpha\beta} A_{\alpha\beta'} \frac{Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})}{Q_\beta(x_\beta)\, Q_{\beta'}(x'_{\beta'})} \quad \text{for } \beta' \neq \beta,
\end{aligned}
$$

where δ_{x_β, x′_β} = 1 if and only if x_β = x′_β. Here it should be noted that both β and x_β play the role of indices; that is, x_β should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λ_αβ and pseudomarginals Q_β.

The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters Q_β, λ_αβ, that is, conditions that guarantee

$$
K \equiv \sum_{\beta, \beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) \ge 0
$$

for any choice of the "vector" S with elements S_β(x_β). Straightforward manipulations yield

$$
\begin{aligned}
&\sum_{\beta,\beta'} \sum_{x_\beta, x'_{\beta'}} S_\beta(x_\beta)\, H_{\beta\beta'}(x_\beta, x'_{\beta'})\, S_{\beta'}(x'_{\beta'}) && (K)\\
&= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}(1 - A_{\alpha\beta})\, Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta) && (K_1)\\
&\quad + \sum_\alpha \sum_{\beta,\beta' \subset \alpha} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) && (K_2)\\
&\quad - \sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha\\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}), && (K_3)
\end{aligned}
$$

where R_β(x_β) ≡ S_β(x_β)/Q_β(x_β).

2398 T Heskes

6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

\[
K_2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta) \right]^2 \ge 0
\]

and^5

\[
\begin{aligned}
K_3 &= \sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})
\left\{ \frac{1}{2}\left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \\
&\ge -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \qquad (6.1)
\end{aligned}
\]

we have

\[
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]

That is, sufficient conditions for $K$ to be nonnegative are

\[
A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha,\, \beta \subset \alpha} \quad \text{and} \quad \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha,
\]

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:^6

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \quad \forall_{\beta, \beta' \subset \alpha}.
\]

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.

6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.

Uniqueness of Loopy Belief Propagation Fixed Points 2399

Consequently, the terms involving $\alpha$ in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

\[
\begin{aligned}
K_2 + K_3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) && (K_2) \\
&\quad - \sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) && (K_3)
\end{aligned}
\]

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

\[
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \ge n_\beta - 1.
\]

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

\[
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})
\quad \forall_{\substack{\beta,\beta' \subset \alpha \\ \beta' \neq \beta}}\; \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1)
\]

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad \text{with} \quad
\omega_\alpha \equiv \max_{X_\alpha, \bar X_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\,\psi_\alpha(\bar X_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\bar X_{\alpha \setminus \beta}, x_\beta) \right], \qquad (7.2)
\]

where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$ and $(\bar X_{\alpha \setminus \beta}, x_\beta)$ denotes $\bar X_\alpha$ with its component at node $\beta$ replaced by $x_\beta$.

Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

\[
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \left[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})\, \mu_\beta(x_\beta) \right]}
= \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})}, \qquad (7.3)
\]

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems to be difficult to compute, but the following intermediate lemma helps us out.

where we substituted the properly normalized version of equation 35 aloopy belief pseudomarginal is proportional to the potential times incomingmessages The goal is now to find the maximum of the above expressionover all possible messages and all values of X Especially the maximum overmessagesmicro seems to be difficult to compute but the following intermediatelemma helps us out

Lemma 3. The maximum of the function

\[
V(\mu) = (n-1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right]
- \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus \beta}} \Psi(X_{\setminus \beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \right]
\]

with respect to the messages $\mu$, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \ge 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar x_\beta}$ for some $\bar x_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey

\[
\frac{\partial V}{\partial \mu_1(x_1)} = (n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)
\]

\[
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = -(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta),
\]

where

\[
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
\]

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

\[
(n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda,
\]

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix with second derivatives at such an extremum, we then have

\[
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)}
= \frac{1}{2(n-1)} \sum_{\beta \neq 1} \sum_{\substack{\beta' \neq 1 \\ \beta' \neq \beta}}
\left[ Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'}) \right]
\left[ Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'}) \right],
\]

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by maximization over values $\bar X$:

\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{\bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus \beta}, x_\beta)}.
\]

Next, we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

\[
\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{X, \bar X} \frac{\Psi(X) \left[ \Psi(\bar X) \right]^{n-1}}{\prod_\beta \Psi(\bar X_{\setminus \beta}, x_\beta)}. \qquad (7.4)
\]

The inequality 7.1 then follows by summing out $X_{\setminus \{\beta, \beta'\}}$ in

\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).
\]

The form of equation 7.2 then follows by rewriting equation 7.4 as

\[
\omega \equiv -\log(1 - \sigma) = \max_{X, \bar X} W(X, \bar X) \quad \text{with} \quad
W(X, \bar X) = \psi(X) + (n-1)\, \psi(\bar X) - \sum_\beta \psi(\bar X_{\setminus \beta}, x_\beta),
\]

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

if $\tilde\Psi(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta)$, then $\omega(\tilde\Psi) = \omega(\Psi)$ for any choice of $\mu$.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\bar X$ that differ in fewer than two nodes. To see this, consider combinations that differ only in the first two nodes:

\[
W(x_1, x_2, \bar x_{\setminus 12};\, \bar x_1, \bar x_2, \bar x_{\setminus 12})
= \psi(x_1, x_2, \bar x_{\setminus 12}) + \psi(\bar x_1, \bar x_2, \bar x_{\setminus 12})
- \psi(\bar x_1, x_2, \bar x_{\setminus 12}) - \psi(x_1, \bar x_2, \bar x_{\setminus 12})
= -W(\bar x_1, x_2, \bar x_{\setminus 12};\, x_1, \bar x_2, \bar x_{\setminus 12}).
\]

If now also $\bar x_2 = x_2$, the four terms cancel pairwise and $W = 0$: combinations differing in fewer than two nodes contribute nothing. Furthermore, if $W(x_1, x_2, \bar x_{\setminus 12}; \bar x_1, \bar x_2, \bar x_{\setminus 12}) \le 0$, then it must be that $W(\bar x_1, x_2, \bar x_{\setminus 12}; x_1, \bar x_2, \bar x_{\setminus 12}) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

\[
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \qquad (7.5)
\]

Any pairwise binary potential can be written as a Boltzmann factor,

\[
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
\]

In this notation, we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
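The enumeration of $W(X, \bar X)$ described above is easy to implement by brute force. The sketch below (a hypothetical helper, `strength_omega`, not from the paper; log-potentials are passed as a Python function over configuration tuples) evaluates equation 7.2 directly and illustrates the pairwise binary identities $\omega = |w|$ and, for $\{-1,1\}$ coding, $\omega = 4|w|$:

```python
import itertools

def strength_omega(psi, domains):
    """Brute-force strength: omega = max over (X, Xbar) of
    W(X, Xbar) = psi(X) + (n-1)*psi(Xbar) - sum_b psi(Xbar with node b set to x_b)."""
    n = len(domains)
    best = 0.0  # W = 0 whenever X = Xbar, so omega >= 0
    for X in itertools.product(*domains):
        for Xbar in itertools.product(*domains):
            W = psi(X) + (n - 1) * psi(Xbar)
            for b in range(n):
                Xmix = list(Xbar)
                Xmix[b] = X[b]      # replace node b of Xbar by x_b
                W -= psi(tuple(Xmix))
            best = max(best, W)
    return best

# pairwise binary potential psi = w*x1*x2 + th1*x1 + th2*x2 on {0,1}
w, th1, th2 = 0.7, 0.3, -0.2
omega01 = strength_omega(lambda X: w * X[0] * X[1] + th1 * X[0] + th2 * X[1],
                         [(0, 1), (0, 1)])
# same weight with {-1,+1} coding and zero thresholds
omega_pm = strength_omega(lambda X: w * X[0] * X[1], [(-1, 1), (-1, 1)])
```

With these definitions, `omega01` recovers $|w|$ independently of the thresholds, and `omega_pm` recovers $4|w|$, in line with the text.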

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\, \beta \subset \alpha}$ (positivity)

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources)

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation)

(8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.
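The three conditions of equation 8.1 are mechanical to verify for a given allocation matrix. A minimal sketch (hypothetical function name and data layout, chosen for illustration):

```python
def theorem4_holds(A, members, sigma, n_nodes, eps=1e-12):
    """Check conditions 1-3 of equation 8.1.
    A[a][b]: allocation of potential a to node b; members[a]: nodes of potential a;
    sigma[a]: strength of potential a."""
    for a, nodes in enumerate(members):
        row = [A[a][b] for b in nodes]
        if min(row) < -eps:                       # condition 1: positivity
            return False
        lhs = (1 - sigma[a]) * max(row) + sigma[a] * sum(row)
        if lhs > 1 + eps:                         # condition 2: resources
            return False
    for b in range(n_nodes):
        parents = [a for a, nodes in enumerate(members) if b in nodes]
        if sum(A[a][b] for a in parents) < len(parents) - 1 - eps:
            return False                          # condition 3: compensation
    return True

# triangle of three pairwise potentials, uniform strength,
# with the allocation A = 1 - sigma used later in corollary 4
members = [(0, 1), (1, 2), (0, 2)]
A_ok = [[0.7 if b in nodes else 0.0 for b in range(3)] for nodes in members]
ok = theorem4_holds(A_ok, members, [0.3, 0.3, 0.3], 3)
A_bad = [[0.1 if b in nodes else 0.0 for b in range(3)] for nodes in members]
bad = theorem4_holds(A_bad, members, [0.9, 0.9, 0.9], 3)
```

Here `ok` holds because each node's two allocations sum to 1.4, comfortably above $n_\beta - 1 = 1$, while `bad` fails condition 3.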

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

\[
\begin{aligned}
K_3 &\ge -\sum_\alpha \sum_{\substack{\beta,\beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
&\ge -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\substack{\beta' \subset \alpha \\ \beta' \neq \beta}} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{aligned}
\]

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

\[
K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]

This implies that sufficient conditions are

\[
(1 - \sigma_\alpha)\, A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\, \beta \subset \alpha},
\]

which in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$ yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \quad \forall_\beta. \qquad (8.2)
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3)
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
\]

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right]
= \min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)]
- \min_{x_1, x_2} [\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2)] \right].
\]

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\bar x_1, \bar x_2) + \phi_1(\bar x_1) + \phi_2(\bar x_2)
\ge \psi(x_1, \bar x_2) + \phi_1(x_1) + \phi_2(\bar x_2) = \psi(\bar x_1, x_2) + \phi_1(\bar x_1) + \phi_2(x_2) \qquad (8.4)
\]

for some $x_1, x_2, \bar x_1$, and $\bar x_2$ with $\bar x_1 \neq x_1$ and $\bar x_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

\[
\begin{aligned}
\phi_1(x_1) - \phi_1(\bar x_1) &= \frac{1}{2} \left[ \psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) \right] \\
\phi_2(x_2) - \phi_2(\bar x_2) &= \frac{1}{2} \left[ \psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2) \right].
\end{aligned}
\]

Substitution back into equation 8.4 yields

\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \bar x_2) - \phi_1(x_1) - \phi_2(\bar x_2)
= \frac{1}{2} \left[ \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) - \psi(\bar x_1, x_2) \right],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde\psi(x_1, x_2) - \min_{x_1, x_2} \tilde\psi(x_1, x_2) \right]
= \frac{1}{2} |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.

Next, we derive the following weaker corollary of theorem 4.

Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall_\beta, \qquad (8.5)
\]

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \qquad (8.6)
\]

Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6.
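The two small facts used in this proof — that the condition-2 value of the allocation $A_{\alpha\beta} = 1 - \sigma_\alpha$ on a pairwise potential equals $1 - \sigma_\alpha^2 \le 1$, and that $\omega_\alpha = -\log(1 - \sigma_\alpha)$ dominates $\sigma_\alpha$ — can be checked numerically:

```python
import math

for sigma in [0.0, 0.1, 0.5, 0.9, 0.99]:
    # condition 2 with A = 1 - sigma on a pairwise potential (two member nodes)
    cond2 = (1 - sigma) * (1 - sigma) + sigma * 2 * (1 - sigma)
    assert abs(cond2 - (1 - sigma ** 2)) < 1e-12 and cond2 <= 1
    # omega >= sigma, so sum(omega) <= 1 implies sum(sigma) <= 1 (8.5 => 8.6)
    assert -math.log(1 - sigma) >= sigma
```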

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2 \log\left[ \frac{\alpha}{1 - \alpha} \right] \quad \text{and thus} \quad \sigma = 1 - \left( \frac{1 - \alpha}{\alpha} \right)^2.
\]

Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \le \frac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad \text{and thus} \quad \alpha < \frac{1}{1 + e^{-1/2}} \approx 0.62.
\]
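Inverting $\sigma(\alpha) = 1 - ((1-\alpha)/\alpha)^2$ and $\omega(\alpha) = 2\log[\alpha/(1-\alpha)]$ at the two thresholds reproduces the quoted values; a quick numerical check (a sketch, using the grid's four-neighbor structure):

```python
import math

# theorem 4 on the toroidal grid: sigma <= 1/3  =>  (1-alpha)/alpha >= sqrt(2/3)
alpha_thm4 = 1.0 / (1.0 + math.sqrt(2.0 / 3.0))
# corollary 3: each node has four neighbors, so 4*omega < 4, i.e. omega < 1
alpha_cor3 = 1.0 / (1.0 + math.exp(-0.5))

# evaluate the strength formulas back at these points
sigma_at = 1.0 - ((1.0 - alpha_thm4) / alpha_thm4) ** 2
omega_at = 2.0 * math.log(alpha_cor3 / (1.0 - alpha_cor3))
```

Here `sigma_at` evaluates to 1/3 and `omega_at` to 1, and the two bounds round to 0.55 and 0.62.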

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2 - 2A)\,\sigma + \frac{3}{4} \le 1 \quad \text{and} \quad \frac{1}{2}\,\sigma + A \le 1.
\]

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

\[
\sigma \le \frac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

still slightly worse than the condition from corollary 3An example in which the condition obtained with theorem 4 is better

than the one from corollary 3 is given in Figure 3c Straightforward analysisfollowing the same recipe as for Figure 3b yields A = 1minusradic18 with

σ leradic

12

and thus α le 1

1+radic

1minusradic12asymp 065

better than theα lt 062 from corollary 3 and to be compared with the criticalαcritical asymp 088

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:

• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. And the second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we need only $\hat G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $\hat G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.

Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad \text{and} \quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
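The experiment can be reproduced with a few lines of damped sum-product. The sketch below is a hypothetical implementation (damping is applied to the messages in the linear domain, which may differ in detail from equation 3.9); it tracks $P_1(x_1 = 1)$ over the iterations:

```python
import numpy as np

def damped_bp(W, eta=0.2, iters=3000, seed=0):
    """Damped loopy BP on a binary Boltzmann machine with zero thresholds.
    Potentials: exp(+w_ij/4) if x_i == x_j, exp(-w_ij/4) otherwise."""
    n = W.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    rng = np.random.default_rng(seed)
    msg = {}
    for e in edges:                       # mildly random initial messages
        m = rng.uniform(0.25, 0.75, size=2)
        msg[e] = m / m.sum()
    trace = []
    for _ in range(iters):
        new = {}
        for (i, j) in edges:
            a, d = np.exp(W[i, j] / 4), np.exp(-W[i, j] / 4)
            psi = np.array([[a, d], [d, a]])   # psi[x_i, x_j]
            inc = np.ones(2)
            for k in range(n):
                if k != i and k != j:
                    inc *= msg[(k, i)]         # messages into i, except from j
            m = psi.T @ inc
            m /= m.sum()
            new[(i, j)] = (1 - eta) * msg[(i, j)] + eta * m
        msg = new
        b = np.prod([msg[(k, 0)] for k in range(1, n)], axis=0)
        trace.append(b[1] / b.sum())           # marginal P_1(x_1 = 1)
    return np.array(trace)

omega = 1.0
W = omega * np.array([[0, 1, -1, -1], [1, 0, 1, -1],
                      [-1, 1, 0, -1], [-1, -1, -1, 0]], dtype=float)
trace = damped_bp(W, eta=0.2)
```

For this weight strength, well below the empirical transition (around 4 at this step size in Figure 4), the trace settles at the trivial marginal 0.5; raising `omega` toward the transition produces the oscillations described in the text.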

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log(2) \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical values of $\alpha_{\text{critical}}$ in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.

Uniqueness of Loopy Belief Propagation Fixed Points 2397

we make use of

partQlowastα(xβ)partQβ prime(xprimeβ prime)

= Aαβ primeQlowastα(xβ xprimeβ prime)minusQlowastα(xβ)Qlowastα(xprimeβ prime)

Qβ prime(xprimeβ prime)

where both β and β prime should be a subset of α and with convention Qlowastα(xβ xβ)= Qlowastα(xβ) and Qlowastα(xβ xprimeβ) = 0 if xβ = xprimeβ Here the first term follows from thedifferentation of equation 54 and the second term from the normalizationas in equation 56 Distinguishing between β = β prime and β = β prime we then have

Hββ(xβ xprimeβ) =sumαsupβ

Aαβ(1minus Aαβ)Qlowastα(xβ)Q2β(xβ)

δxβ xprimeβ

+sumαsupβ

A2αβ

Qlowastα(xβ)Qlowastα(xprimeβ)Qβ(xβ)Qβ(xprimeβ)

Hββ prime(xβ xprimeβ prime) = minussum

αsupββ primeAαβAαβ prime

Qlowastα(xβ xprimeβ prime)minusQlowastα(xβ)Qlowastα(xprimeβ prime)Qβ(xβ)Qβ prime(xprimeβ prime)

for β prime = β

where δxβ xprimeβ = 1 if and only if xβ = xprimeβ Here it should be noted that bothβ and xβ play the role of indices that is xβ should not be mistaken for avariable or parameter The parameters are still the (tables with) Lagrangemultipliers λαβ and pseudomarginals Qβ

The goal is now to find conditions under which this Hessian is positive(semi) definite for any setting of the parameters Qβ λαβ that is conditionsthat guarantee

K equivsumββ prime

sumxβ xβprime

Sβ(xβ)Hββ prime(xβ xβ prime)Sβ prime(xβ prime) ge 0

for any choice of the ldquovectorrdquo S with elements Sβ(xβ) Straightforward ma-nipulations yieldsum

ββ prime

sumxβ xβprime

Sβ(xβ)Hββ prime(xβ xβ prime)Sβ prime(xβ prime) (K)

=sumα

sumβsubα

sumxβ

Aαβ(1minus Aαβ)Qlowastα(xβ)R2β(xβ) (K1)

+sumα

sumββ primesubα

sumxβ xprime

βprime

AαβAαβ primeQlowastα(xβ)Qlowastα(xprimeβ prime)Rβ(xβ)Rβ prime(x

primeβ prime) (K2)

minussumα

sumββprime subαβprime =β

sumxβ xprime

βprime

AαβAαβ primeQlowastα(xβ xprimeβ prime)Rβ(xβ)Rβ prime(xprimeβ prime) (K3)

where Rβ(xβ) equiv Sβ(xβ)Qβ(xβ)

2398 T Heskes

6.2 Recovering the Convexity Conditions. Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since
\[
K2 = \sum_\alpha \left[ \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, R_\beta(x_\beta) \right]^2 \geq 0
\]

and^5
\begin{align}
K3 &= \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, Q^*_\alpha(x_\beta, x'_{\beta'})
\left\{ \frac{1}{2}\left[ R_\beta(x_\beta) - R_{\beta'}(x'_{\beta'}) \right]^2 - \frac{1}{2} R^2_\beta(x_\beta) - \frac{1}{2} R^2_{\beta'}(x'_{\beta'}) \right\} \notag \\
&\geq -\sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\beta' \subset \alpha} A_{\alpha\beta'} - A_{\alpha\beta} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta), \tag{6.1}
\end{align}

we have
\[
K = K1 + K2 + K3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( 1 - \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]
That is, sufficient conditions for $K$ to be nonnegative are
\[
A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha, \beta \subset \alpha}
\quad \text{and} \quad
\sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha,
\]
precisely the conditions of theorem 1.
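This chain of inequalities is easy to sanity-check numerically. The sketch below is our own illustration (none of the names come from the article): it draws a random joint pseudomarginal $Q^*_\alpha$ for a single potential over two nodes, picks nonnegative allocations with $A_{\alpha 1} + A_{\alpha 2} \leq 1$, and verifies $K1 + K2 + K3 \geq 0$ for random vectors $R$:

```python
import random

def check_K(num_states=3, trials=200, seed=0):
    """Verify K1 + K2 + K3 >= 0 for one potential over two nodes whenever
    the allocations are nonnegative and sum to at most one."""
    rng = random.Random(seed)
    n = num_states
    for _ in range(trials):
        # random joint pseudomarginal Q*(x1, x2) and its single-node marginals
        q = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(n)]
        z = sum(map(sum, q))
        q = [[v / z for v in row] for row in q]
        q1 = [sum(row) for row in q]
        q2 = [sum(q[i][j] for i in range(n)) for j in range(n)]
        # allocations satisfying A >= 0 and A1 + A2 <= 1
        a1 = rng.uniform(0, 1)
        a2 = rng.uniform(0, 1 - a1)
        # arbitrary vectors R_beta(x_beta)
        r1 = [rng.uniform(-1, 1) for _ in range(n)]
        r2 = [rng.uniform(-1, 1) for _ in range(n)]
        k1 = a1 * (1 - a1) * sum(q1[i] * r1[i] ** 2 for i in range(n)) \
           + a2 * (1 - a2) * sum(q2[j] * r2[j] ** 2 for j in range(n))
        k2 = (a1 * sum(q1[i] * r1[i] for i in range(n))
              + a2 * sum(q2[j] * r2[j] for j in range(n))) ** 2
        # K3 runs over the two ordered pairs (beta, beta'), hence the factor 2
        k3 = -2 * a1 * a2 * sum(q[i][j] * r1[i] * r2[j]
                                for i in range(n) for j in range(n))
        assert k1 + k2 + k3 >= -1e-9
    return True
```

The check exercises only the two-node case; the general statement follows the same pattern, with one pair of sums per pair $\beta, \beta' \subset \alpha$.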

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case: these fake interactions drop out, as we would expect them to.

Suppose that we have a fake interaction $\Psi_\alpha(X_\alpha) = 1$. From the solution, equation 5.4, it follows that the pseudomarginal $Q^*_\alpha(X_\alpha)$ factorizes:^6
\[
Q^*_\alpha(x_\beta, x'_{\beta'}) = Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \;\; \forall_{\beta, \beta' \subset \alpha}.
\]

^5 This step is in fact equivalent to the Gershgorin theorem for bounding the eigenvalues of a matrix.

^6 The exact marginal $P_{\text{exact}}(X_\alpha)$ need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.


Consequently, the terms involving $\alpha$ in K3 cancel with those in K2, which is most easily seen when we combine K2 and K3 in a different way:
\begin{align}
K2 + K3 &= \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta, x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) \tag{K2} \\
&\quad - \sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \left[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \right] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}). \tag{K3}
\end{align}
This leaves us with the weaker requirement (from K1) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \geq 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into
\[
\sum_{\substack{\alpha' \supset \beta \\ \alpha' \neq \alpha}} A_{\alpha'\beta} + 1 \geq n_\beta - 1.
\]
The net effect is equivalent to ignoring the interaction: the number of neighboring potentials $n_\beta$ is reduced by 1 for all $\beta$ that are part of the fake interaction $\alpha$.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K2 and K3, where, since $K2 \geq 0$, we will concentrate on K3.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K3 such that we can again combine this bound with the (positive) term K1. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound
\[
Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \leq \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})
\quad \forall_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}}\; \forall_{x_\beta, x'_{\beta'}},
\tag{7.1}
\]
with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:
\[
\sigma_\alpha = 1 - \exp(-\omega_\alpha)
\quad \text{with} \quad
\omega_\alpha \equiv \max_{X_\alpha, \hat{X}_\alpha} \left[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat{X}_\alpha) - \sum_{\beta \subset \alpha} \psi_\alpha(\hat{X}_{\alpha \setminus \beta}, x_\beta) \right],
\tag{7.2}
\]
where $n_\alpha \equiv \sum_{\beta \subset \alpha} 1$ is the number of nodes contained in $\alpha$.


Proof. For convenience and without loss of generality, we omit $\alpha$ from our notation and renumber the nodes that are contained in $\alpha$ from 1 to $n$. We consider the quotient of the loopy belief on the potential subset and the product of its single-node marginals:
\begin{align}
\frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)}
&= \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \left[ \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'}) \right] \mu_\beta(x_\beta)} \notag \\
&= \frac{\Psi(X) \left[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \right]^{n-1}}
{\prod_\beta \sum_{X'_{\setminus \beta}} \Psi(X'_{\setminus \beta}, x_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x'_{\beta'})},
\tag{7.3}
\end{align}
where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of $X$. Especially the maximum over messages $\mu$ seems difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function
\[
V(\mu) = (n-1) \log \left[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \right]
- \sum_{\beta=1}^n \log \left[ \sum_{X_{\setminus \beta}} \Psi(X_{\setminus \beta}, x^*_\beta) \prod_{\beta' \neq \beta} \mu_{\beta'}(x_{\beta'}) \right]
\]
with respect to the messages $\mu$, under the constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all $\beta$ and $\mu_\beta(x_\beta) \geq 0$ for all $\beta$ and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta, \bar{x}_\beta}$ for some $\bar{x}_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey
\begin{align*}
\frac{\partial V}{\partial \mu_1(x_1)} &= (n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) \\
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} &= \sum_{\beta \neq 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta) - (n-1)\, Q(x_1)\, Q(x'_1),
\end{align*}
where
\[
Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}.
\]
Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum the first derivative should obey
\[
(n-1)\, Q(x_1) - \sum_{\beta \neq 1} Q(x_1 | x^*_\beta) = \lambda,
\]
with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix of second derivatives at such an extremum, we then have
\[
\frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)}
= \frac{1}{2(n-1)} \sum_{\beta \neq 1} \sum_{\substack{\beta' \neq 1 \\ \beta' \neq \beta}}
\left[ Q(x_1 | x^*_\beta) - Q(x_1 | x^*_{\beta'}) \right]
\left[ Q(x'_1 | x^*_\beta) - Q(x'_1 | x^*_{\beta'}) \right],
\]
which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\hat{X}$:
\[
\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{\hat{X}} \frac{\Psi(X) \left[ \Psi(\hat{X}) \right]^{n-1}}{\prod_\beta \Psi(\hat{X}_{\setminus \beta}, x_\beta)}.
\]
Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through
\[
\frac{1}{1 - \sigma} \equiv \max_{X, \mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)}
= \max_{X, \hat{X}} \frac{\Psi(X) \left[ \Psi(\hat{X}) \right]^{n-1}}{\prod_\beta \Psi(\hat{X}_{\setminus \beta}, x_\beta)}.
\tag{7.4}
\]


The inequality 7.1 then follows by summing out $X_{\setminus \beta\beta'}$ in
\[
Q^*(X) - \prod_\beta Q^*(x_\beta) \leq \sigma\, Q^*(X).
\]
The form of equation 7.2 then follows by rewriting equation 7.4 as
\[
\omega \equiv -\log(1 - \sigma) = \max_{X, \hat{X}} W(X, \hat{X})
\quad \text{with} \quad
W(X, \hat{X}) = \psi(X) + (n-1)\, \psi(\hat{X}) - \sum_\beta \psi(\hat{X}_{\setminus \beta}, x_\beta),
\]
where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,
\[
\text{if } \tilde{\Psi}(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde{\Psi}) = \omega(\Psi) \text{ for any choice of } \mu.
\]
This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\hat{X}$ that differ in fewer than two nodes. To see this, consider
\begin{align*}
W(x_1, x_2, \hat{x}_{\setminus 12};\, \hat{x}_1, \hat{x}_2, \hat{x}_{\setminus 12})
&= \psi(x_1, x_2, \hat{x}_{\setminus 12}) + \psi(\hat{x}_1, \hat{x}_2, \hat{x}_{\setminus 12})
- \psi(\hat{x}_1, x_2, \hat{x}_{\setminus 12}) - \psi(x_1, \hat{x}_2, \hat{x}_{\setminus 12}) \\
&= -W(x_1, \hat{x}_2, \hat{x}_{\setminus 12};\, \hat{x}_1, x_2, \hat{x}_{\setminus 12}).
\end{align*}
If now also $x_2 = \hat{x}_2$, the expression equals its own negation and therefore vanishes: combinations that differ in just a single node yield $W = 0$. Furthermore, if $W(x_1, x_2, \hat{x}_{\setminus 12}; \hat{x}_1, \hat{x}_2, \hat{x}_{\setminus 12}) \leq 0$, then it must be that $W(x_1, \hat{x}_2, \hat{x}_{\setminus 12}; \hat{x}_1, x_2, \hat{x}_{\setminus 12}) \geq 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \leq \omega < \infty$ and $0 \leq \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1|\, |x_2|\, (|x_1| - 1)(|x_2| - 1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain
\[
\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|.
\tag{7.5}
\]
Any pairwise binary potential can be written as a Boltzmann factor,
\[
\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].
\]
In this notation we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
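For small potentials, equation 7.2 can be evaluated by brute force. The sketch below is our own illustration; it enumerates all pairs $(X, \hat{X})$ and confirms that, for a binary Boltzmann factor, the strength equals $|w|$ independent of the thresholds:

```python
import itertools
import math

def strength(psi, sizes):
    """omega = max over (X, Xhat) of
    psi(X) + (n-1)*psi(Xhat) - sum_b psi(Xhat with node b replaced by x_b)."""
    n = len(sizes)
    states = [range(s) for s in sizes]
    omega = 0.0  # the combination X == Xhat always yields zero
    for X in itertools.product(*states):
        for Xh in itertools.product(*states):
            val = psi(X) + (n - 1) * psi(Xh)
            for b in range(n):
                Y = list(Xh)
                Y[b] = X[b]
                val -= psi(tuple(Y))
            omega = max(omega, val)
    return omega

# pairwise binary Boltzmann factor: psi = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = -1.7, 0.4, -2.3
omega = strength(lambda x: w * x[0] * x[1] + th1 * x[0] + th2 * x[1], (2, 2))
assert abs(omega - abs(w)) < 1e-9   # omega = |w|, the thresholds drop out
sigma = 1.0 - math.exp(-omega)      # 0 <= sigma < 1 for finite potentials
```

The double enumeration is exponential in the number of nodes per potential, so this is only practical for the small cliques that loopy belief propagation typically deals with.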

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with properties

1. $A_{\alpha\beta} \geq 0 \;\; \forall_{\alpha, \beta \subset \alpha}$ (positivity),

2. $(1 - \sigma_\alpha) \max_{\beta \subset \alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta \subset \alpha} A_{\alpha\beta} \leq 1 \;\; \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha \supset \beta} A_{\alpha\beta} \geq n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K1 + K2 + K3 \geq 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term K3, we obtain
\begin{align*}
K3 &\geq -\sum_\alpha \sum_{\substack{\beta, \beta' \subset \alpha \\ \beta' \neq \beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}) \\
&\geq -\sum_\alpha \sigma_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left( \sum_{\substack{\beta' \subset \alpha \\ \beta' \neq \beta}} A_{\alpha\beta'} \right) Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),
\end{align*}
where in the last step we applied the same trick as in equation 6.1. Since $K2 \geq 0$, combining K1 and (the above lower bound on) K3, we get
\[
K = K1 + K2 + K3 \geq \sum_\alpha \sum_{\beta \subset \alpha} \sum_{x_\beta} A_{\alpha\beta} \left[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta' \neq \beta} A_{\alpha\beta'} \right] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).
\]
This is nonnegative for any choice of $R_\beta(x_\beta)$ if
\[
(1 - \sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta' \subset \alpha} A_{\alpha\beta'} \leq 1 \;\; \forall_{\alpha, \beta \subset \alpha},
\]
which, in combination with $A_{\alpha\beta} \geq 0$ and $\sigma_\alpha \leq 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta \subset \alpha} A_{\alpha\beta} \leq 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
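Given strengths and a candidate allocation matrix, the three conditions of theorem 4 can be verified mechanically. The helper below is our own sketch (the data layout, one dict of allocations per potential, is an assumption of this illustration):

```python
def check_theorem4(A, sigma, factors, n_nodes, tol=1e-12):
    """Verify conditions 1-3 of equation 8.1.

    factors[a] -- tuple of node indices covered by potential a
    sigma[a]   -- strength of potential a (equation 7.2)
    A[a][b]    -- allocation of potential a to node b, for b in factors[a]
    """
    for a, nodes in enumerate(factors):
        row = [A[a][b] for b in nodes]
        if min(row) < -tol:                                            # condition 1
            return False
        if (1 - sigma[a]) * max(row) + sigma[a] * sum(row) > 1 + tol:  # condition 2
            return False
    for b in range(n_nodes):                                           # condition 3
        alphas = [a for a, nodes in enumerate(factors) if b in nodes]
        if sum(A[a][b] for a in alphas) < len(alphas) - 1 - tol:
            return False
    return True

# three nodes in a cycle, each pairwise potential of strength 0.3;
# the choice A = 1 - sigma passes all three conditions
factors = [(0, 1), (1, 2), (2, 0)]
sigma = [0.3, 0.3, 0.3]
A = [{b: 1 - sigma[a] for b in factors[a]} for a in range(3)]
assert check_theorem4(A, sigma, factors, 3)
# with strong potentials the same recipe no longer certifies uniqueness
assert not check_theorem4([{b: 0.2 for b in f} for f in factors], [0.8] * 3, factors, 3)
```

A failed check only means that this particular allocation is no certificate; finding the best allocation is the (non)linear programming problem discussed in section 8.3.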

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization
\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]
to be compared with our equation 3.1, in which there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \left( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \right) < 2 \;\; \forall_\beta.
\tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3 (an improvement of theorem 5 for pairwise binary potentials). Loopy belief propagation on pairwise binary potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \;\; \forall_\beta,
\tag{8.3}
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice
\[
\tilde{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
\]
where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have
\begin{align*}
\min_{\phi_1, \phi_2} &\left[ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \right] \\
&= \min_{\phi_1, \phi_2} \left\{ \max_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] - \min_{x_1, x_2} \left[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \right] \right\}.
\end{align*}
In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$, which yield the smallest gap, are such that
\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat{x}_1, \hat{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\hat{x}_2)
\geq \psi(x_1, \hat{x}_2) + \phi_1(x_1) + \phi_2(\hat{x}_2) = \psi(\hat{x}_1, x_2) + \phi_1(\hat{x}_1) + \phi_2(x_2),
\tag{8.4}
\]
for some $x_1$, $x_2$, $\hat{x}_1$, and $\hat{x}_2$ with $x_1 \neq \hat{x}_1$ and $x_2 \neq \hat{x}_2$. Solving for $\phi_1$ and $\phi_2$, we find
\begin{align*}
\phi_1(x_1) - \phi_1(\hat{x}_1) &= \frac{1}{2} \left[ \psi(\hat{x}_1, \hat{x}_2) - \psi(x_1, x_2) + \psi(\hat{x}_1, x_2) - \psi(x_1, \hat{x}_2) \right] \\
\phi_2(x_2) - \phi_2(\hat{x}_2) &= \frac{1}{2} \left[ \psi(\hat{x}_1, \hat{x}_2) - \psi(x_1, x_2) - \psi(\hat{x}_1, x_2) + \psi(x_1, \hat{x}_2) \right].
\end{align*}
Substitution back into equation 8.4 yields
\[
\psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1, \hat{x}_2) - \phi_1(x_1) - \phi_2(\hat{x}_2)
= \frac{1}{2} \left[ \psi(x_1, x_2) + \psi(\hat{x}_1, \hat{x}_2) - \psi(x_1, \hat{x}_2) - \psi(\hat{x}_1, x_2) \right],
\]
which has to be nonnegative. Of all four possible combinations, two are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find
\[
\min_{\phi_1, \phi_2} \left[ \max_{x_1, x_2} \tilde{\psi}(x_1, x_2) - \min_{x_1, x_2} \tilde{\psi}(x_1, x_2) \right]
= \frac{1}{2} \left| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \right| = \frac{\omega}{2},
\]
from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
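For binary nodes, the optimized self-potentials and the resulting gap of $\omega/2$ can be checked directly. The sketch below is our own; it tries both pairings of "high" corners, mirroring the four combinations enumerated above:

```python
def omega_pairwise(psi):
    # equation 7.5 for a binary pairwise log-potential given as a 2x2 table
    return abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

def optimized_spread(psi):
    """Smallest max-min spread of psi(x1,x2) + phi1(x1) + phi2(x2)
    over self-potentials phi1, phi2, using the closed-form solution."""
    best = None
    for (x1, x2), (y1, y2) in [((0, 0), (1, 1)), ((0, 1), (1, 0))]:
        # phi differences solved from the equalization conditions (8.4)
        a = 0.5 * (psi[y1][y2] - psi[x1][x2] + psi[y1][x2] - psi[x1][y2])
        b = 0.5 * (psi[y1][y2] - psi[x1][x2] - psi[y1][x2] + psi[x1][y2])
        phi1 = {x1: a, y1: 0.0}
        phi2 = {x2: b, y2: 0.0}
        vals = [psi[u][v] + phi1[u] + phi2[v] for u in (0, 1) for v in (0, 1)]
        spread = max(vals) - min(vals)
        best = spread if best is None else min(best, spread)
    return best

psi = [[0.3, -1.1], [2.0, 0.7]]
assert abs(optimized_spread(psi) - 0.5 * omega_pairwise(psi)) < 1e-9
```

With either pairing, the shifted table has two equal "high" and two equal "low" entries, so its spread is $|{\rm gap}| = \omega/2$; the minimum over both pairings therefore reproduces the value used in the proof.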

Next we derive the following weaker corollary of theorem 4.

Corollary 4 (a weaker version of theorem 4 for pairwise potentials). Loopy belief propagation on pairwise potentials has a unique fixed point if
\[
\sum_{\alpha \supset \beta} \omega_\alpha \leq 1 \;\; \forall_\beta,
\tag{8.5}
\]
with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \leq 1$ and (condition 2)
\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma^2_\alpha \leq 1.
\]
Substitution into condition 3 yields
\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \geq \sum_{\alpha \supset \beta} 1 - 1
\quad \text{and thus} \quad
\sum_{\alpha \supset \beta} \sigma_\alpha \leq 1.
\tag{8.6}
\]
Since $\omega_\alpha = -\log(1 - \sigma_\alpha) \geq \sigma_\alpha$, condition 8.5 implies condition 8.6, which completes the proof.

Summarizing: the conditions in Tatikonda and Jordan (2002), which apply to binary pairwise potentials, are, when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a $3 \times 3$ Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to
\[
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
\]
The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to $(0.5, 0.5)$. With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$, we find two minima: one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by
\[
\omega = 2 \log \left[ \frac{\alpha}{1 - \alpha} \right]
\quad \text{and thus} \quad
\sigma = 1 - \left( \frac{1 - \alpha}{\alpha} \right)^2.
\]


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix are equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2 - 2A$ in b and $C = 1 - A$ in c, the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields
\[
\sigma \leq \frac{1}{3} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
\]
The critical value that follows from corollary 3 is in this case slightly better:
\[
\omega < 1 \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + e^{-1/2}} \approx 0.62.
\]
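Both thresholds are quick to reproduce numerically. The script below is our own cross-check; it only restates the closed-form expressions above:

```python
import math

def sigma_of(alpha):
    # strength of the symmetric 2x2 potential (alpha, 1-alpha; 1-alpha, alpha)
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

# theorem 4 with uniform allocation A = 3/4 requires sigma <= 1/3
alpha_thm4 = 1.0 / (1.0 + math.sqrt(2.0 / 3.0))
assert abs(sigma_of(alpha_thm4) - 1.0 / 3.0) < 1e-9
# corollary 3: every node has four neighboring potentials, so 4*omega < 4, i.e. omega < 1
alpha_cor3 = 1.0 / (1.0 + math.exp(-0.5))
omega = 2.0 * math.log(alpha_cor3 / (1.0 - alpha_cor3))
assert abs(omega - 1.0) < 1e-9
print(round(alpha_thm4, 2), round(alpha_cor3, 2))  # 0.55 0.62
```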

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:
\[
(2 - 2A)\, \sigma + \frac{3}{4} \leq 1
\quad \text{and} \quad
\frac{1}{2}\, \sigma + A \leq 1.
\]
The optimal choice for $A$ is the one in which both conditions turn out to be identical. In this way we obtain $A = 3/4$, yielding
\[
\sigma \leq \frac{1}{2} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with
\[
\sigma \leq \sqrt{\frac{1}{2}} \quad \text{and thus} \quad \alpha \leq \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
\]
better than the $\alpha < 0.62$ from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. They do have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact we only need $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and analyze in general, but it may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights
\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]
zero thresholds, and potentials
\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j
\quad \text{and} \quad
\Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \neq x_j.
\]
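A minimal simulation of this setup is sketched below. It is our own code, not the article's: it assumes the damping of equation 3.9 acts linearly on the normalized messages (one common choice) and only reproduces the small-weight regime, where damped sum-product settles on the trivial fixed point:

```python
import math
import random

W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_lbp(omega, step=0.5, iters=2000, seed=0):
    """Damped loopy belief propagation on the four-node Boltzmann machine.
    Returns the single-node marginals P_i(x_i = 1)."""
    rng = random.Random(seed)
    n = len(W)
    def pot(i, j, xi, xj):  # Psi_ij(x_i, x_j) from the text
        return math.exp(omega * W[i][j] / 4.0 if xi == xj
                        else -omega * W[i][j] / 4.0)
    # messages m[i][j] = m_{i->j}(.), randomly perturbed around uniform
    m = [[[0.5 + 0.1 * rng.uniform(-1, 1) for _ in range(2)] if i != j else None
          for j in range(n)] for i in range(n)]
    for _ in range(iters):
        new = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                msg = []
                for xj in range(2):
                    s = 0.0
                    for xi in range(2):
                        prod = 1.0
                        for k in range(n):
                            if k != i and k != j:
                                prod *= m[k][i][xi]
                        s += pot(i, j, xi, xj) * prod
                    msg.append(s)
                z = msg[0] + msg[1]
                # damped update, linear in the normalized messages
                new[i][j] = [(1 - step) * m[i][j][x] + step * msg[x] / z
                             for x in range(2)]
        m = new
    marginals = []
    for i in range(n):
        b = [1.0, 1.0]
        for k in range(n):
            if k != i:
                b = [b[x] * m[k][i][x] for x in range(2)]
        marginals.append(b[1] / (b[0] + b[1]))
    return marginals

# small weights: convergence to the trivial fixed point P_i(x_i) = 0.5
assert all(abs(p - 0.5) < 1e-6 for p in run_lbp(1.0))
```

Near the transition reported below, the behavior of such a sketch becomes sensitive to the damping scheme and step size, which is exactly the point the experiment illustrates.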

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

^7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log 2 \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.

2398 T Heskes

62 Recovering the Convexity Conditions (2) Let us first see how weget back the conditions for convexity of the Bethe free energy equation 51Since

K2 =sumα

[sumβsubα

sumxβ

AαβQlowastα(xβ)Rβ(xβ)

]2

ge 0

and5

K3 =sumα

sumββprime subαβprime =β

sumxβ xprime

βprime

AαβAαβ primeQlowastα(xβ xprimeβ prime)

times

12

[Rβ(xβ)minus Rβ prime(xprimeβ prime)

]2 minus 12

R2β(xβ)minus

12

R2β prime(x

primeβ prime)

gesumα

sumβsubα

sumxβ

Aαβ

(sumβ primesubα

Aαβ prime minus Aαβ

)Qlowastα(xβ)R

2β(xβ) (61)

we have

K = K1 + K2 + K3 gesumα

sumβsubα

sumxβ

Aαβ

(1minus

sumβ primesubα

Aαβ prime

)Qlowastα(xβ)R

2β(xβ)

That is sufficient conditions for K to be nonnegative are

Aαβ ge 0 forallαβsubα andsumβsubα

Aαβ le 1 forallα

precisely the conditions for theorem 1

63 Fake Interactions While discussing the conditions for convexity ofthe Bethe free energy we noticed that adding a ldquofake interactionrdquo such as aconstant potential can change the validity of the conditions We will see thathere this is not the case and these fake interactions drop out as we wouldexpect them to

Suppose that we have a fake interaction α(Xα) = 1 From the solutionequation 54 it follows that the pseudomarginal Qlowastα(Xα) factorizes6

Qlowastα(xβ xprimeβ prime) = Qlowastα(xβ)Qlowastα(xprimeβ prime) forallββ primesubα

5 This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenval-ues of a matrix

6 The exact marginal Pexact(Xα) need not factorize This is really a consequence of thelocality assumptions behind loopy belief propagation and the Bethe free energy

Uniqueness of Loopy Belief Propagation Fixed Points 2399

Consequently, the terms involving α in $K_3$ cancel with those in $K_2$, which is most easily seen when we combine $K_2$ and $K_3$ in a different way:

$$ K_2 + K_3 = \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta,x'_\beta} A^2_{\alpha\beta}\, Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_\beta)\, R_\beta(x_\beta)\, R_\beta(x'_\beta) \qquad (K_2) $$
$$ \phantom{K_2 + K_3 =} - \sum_\alpha \sum_{\beta,\beta'\subset\alpha;\,\beta'\ne\beta} \sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \big[ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \big] R_\beta(x_\beta)\, R_{\beta'}(x'_{\beta'}). \qquad (K_3) $$

This leaves us with the weaker requirement (from $K_1$) $A_{\alpha\beta}(1 - A_{\alpha\beta}) \ge 0$ for all $\beta \subset \alpha$. The best choice is then to take $A_{\alpha\beta} = 1$, which turns condition 3 of equation 4.1 into

$$ \sum_{\alpha'\supset\beta;\,\alpha'\ne\alpha} A_{\alpha'\beta} + 1 \ge n_\beta - 1. $$

The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials $n_\beta$ by 1 for all β that are part of the fake interaction α.

We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this "success," we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in $K_2$ and $K_3$, where, since $K_2 \ge 0$, we will concentrate on $K_3$.

7 The Strength of a Potential

7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in $K_3$ such that we can again combine this bound with the (positive) term $K_1$. However, before we get to that, we take some time to introduce and derive properties of the "strength" of a potential.

Lemma 2. Two-node correlations of loopy belief marginals obey the bound

$$ Q^*_\alpha(x_\beta, x'_{\beta'}) - Q^*_\alpha(x_\beta)\, Q^*_\alpha(x'_{\beta'}) \le \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'}) \quad \forall_{\beta,\beta'\subset\alpha;\,\beta'\ne\beta}\;\; \forall_{x_\beta, x'_{\beta'}}, \qquad (7.1) $$

with the "strength" $\sigma_\alpha$ a function of the potential $\psi_\alpha(X_\alpha) \equiv \log \Psi_\alpha(X_\alpha)$ only:

$$ \sigma_\alpha = 1 - \exp(-\omega_\alpha) \quad\text{with}\quad \omega_\alpha \equiv \max_{X_\alpha, \hat{X}_\alpha} \Big[ \psi_\alpha(X_\alpha) + (n_\alpha - 1)\, \psi_\alpha(\hat{X}_\alpha) - \sum_{\beta\subset\alpha} \psi_\alpha(\hat{X}_{\alpha\setminus\beta}, x_\beta) \Big], \qquad (7.2) $$

where $n_\alpha \equiv \sum_{\beta\subset\alpha} 1$.


Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes that are contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

$$ \frac{Q^*(X)}{\prod_{\beta=1}^n Q^*(x_\beta)} = \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \Big[ \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta'\ne\beta} \mu_{\beta'}(x'_{\beta'}) \Big] \mu_\beta(x_\beta)} = \frac{\Psi(X) \Big[ \sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta) \Big]^{n-1}}{\prod_\beta \sum_{X'_{\setminus\beta}} \Psi(X'_{\setminus\beta}, x_\beta) \prod_{\beta'\ne\beta} \mu_{\beta'}(x'_{\beta'})}, \qquad (7.3) $$

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages μ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

$$ V(\mu) = (n-1) \log \Big[ \sum_X \Psi(X) \prod_{\beta=1}^n \mu_\beta(x_\beta) \Big] - \sum_{\beta=1}^n \log \Big[ \sum_{X_{\setminus\beta}} \Psi(X_{\setminus\beta}, x^*_\beta) \prod_{\beta'\ne\beta} \mu_{\beta'}(x_{\beta'}) \Big] $$

with respect to the messages μ, under constraints $\sum_{x_\beta} \mu_\beta(x_\beta) = 1$ for all β and $\mu_\beta(x_\beta) \ge 0$ for all β and $x_\beta$, occurs at an extreme point $\mu_\beta(x_\beta) = \delta_{x_\beta,\hat{x}_\beta}$ for some $\hat{x}_\beta$ to be found.

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for β > 1. The first and second derivatives are easily found to obey, up to overall positive factors that are irrelevant for the argument,

$$ \frac{\partial V}{\partial \mu_1(x_1)} = (n-1)\, Q(x_1) - \sum_{\beta\ne 1} Q(x_1 | x^*_\beta) $$
$$ \frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = -(n-1)\, Q(x_1)\, Q(x'_1) + \sum_{\beta\ne 1} Q(x_1 | x^*_\beta)\, Q(x'_1 | x^*_\beta), $$

where

$$ Q(X) \equiv \frac{\Psi(X) \prod_\beta \mu_\beta(x_\beta)}{\sum_{X'} \Psi(X') \prod_\beta \mu_\beta(x'_\beta)}. $$

Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$ (n-1)\, Q(x_1) - \sum_{\beta\ne 1} Q(x_1 | x^*_\beta) = \lambda, $$

with λ a Lagrange multiplier implementing the constraint $\sum_{x_1} \mu_1(x_1) = 1$. Summing over $x_1$, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix of second derivatives at such an extremum, we then have

$$ \frac{\partial^2 V}{\partial \mu_1(x_1)\, \partial \mu_1(x'_1)} = \frac{1}{2(n-1)} \sum_{\beta\ne 1} \sum_{\beta'\ne 1;\,\beta'\ne\beta} \big[ Q(x_1|x^*_\beta) - Q(x_1|x^*_{\beta'}) \big] \big[ Q(x'_1|x^*_\beta) - Q(x'_1|x^*_{\beta'}) \big], $$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, β > 1, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by a maximization over values $\hat{X}$:

$$ \max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat{X}} \frac{\Psi(X) \big[ \Psi(\hat{X}) \big]^{n-1}}{\prod_\beta \Psi(\hat{X}_{\setminus\beta}, x_\beta)}. $$

Next, we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

$$ \frac{1}{1-\sigma} \equiv \max_{X,\mu} \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\hat{X}} \frac{\Psi(X) \big[ \Psi(\hat{X}) \big]^{n-1}}{\prod_\beta \Psi(\hat{X}_{\setminus\beta}, x_\beta)}. \qquad (7.4) $$


The inequality 7.1 then follows by summing out $X_{\setminus\beta\beta'}$ in

$$ Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X). $$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$ \omega \equiv -\log(1-\sigma) = \max_{X,\hat{X}} W(X, \hat{X}) \quad\text{with}\quad W(X, \hat{X}) = \psi(X) + (n-1)\, \psi(\hat{X}) - \sum_\beta \psi(\hat{X}_{\setminus\beta}, x_\beta), $$

where we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes, that is,

$$ \text{if } \tilde{\Psi}(X) = \Psi(X) \prod_\beta \mu_\beta(x_\beta), \text{ then } \omega(\tilde{\Psi}) = \omega(\Psi) \text{ for any choice of } \mu. $$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and $\hat{X}$ that differ in fewer than two nodes. To see this, consider

$$ W(x_1, x_2, X_{\setminus 12};\, \hat{x}_1, \hat{x}_2, X_{\setminus 12}) = \psi(x_1, x_2, X_{\setminus 12}) + \psi(\hat{x}_1, \hat{x}_2, X_{\setminus 12}) - \psi(x_1, \hat{x}_2, X_{\setminus 12}) - \psi(\hat{x}_1, x_2, X_{\setminus 12}) = -W(x_1, \hat{x}_2, X_{\setminus 12};\, \hat{x}_1, x_2, X_{\setminus 12}). $$

If now also $\hat{x}_2 = x_2$, we get $W(x_1, x_2, X_{\setminus 12}; \hat{x}_1, x_2, X_{\setminus 12}) = -W(x_1, x_2, X_{\setminus 12}; \hat{x}_1, x_2, X_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, X_{\setminus 12}; \hat{x}_1, \hat{x}_2, X_{\setminus 12}) \le 0$, then it must be that $W(x_1, \hat{x}_2, X_{\setminus 12}; \hat{x}_1, x_2, X_{\setminus 12}) \ge 0$, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$ \omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \qquad (7.5) $$

Any pairwise binary potential can be written as a Boltzmann factor,

$$ \Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2]. $$

In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1 - \sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
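The enumeration described above is straightforward to implement. The sketch below is our own illustration, not code from the article: it computes ω by brute force from equation 7.2 and checks two of the listed properties on a binary Boltzmann factor, namely ω = |w| independent of the thresholds and the temperature scaling ω(T) = ω(1)/T.

```python
import itertools
import math

import numpy as np

def strength_omega(psi):
    """Brute-force strength omega of a potential (equation 7.2).
    psi: n-dimensional array holding the log-potential psi(X)."""
    n = psi.ndim
    states = [range(k) for k in psi.shape]
    best = 0.0  # X = Xhat yields zero, so omega is at least 0
    for X in itertools.product(*states):
        for Xhat in itertools.product(*states):
            term = psi[X] + (n - 1) * psi[Xhat]
            for beta in range(n):
                mixed = list(Xhat)
                mixed[beta] = X[beta]  # Xhat with node beta replaced by x_beta
                term -= psi[tuple(mixed)]
            best = max(best, term)
    return best

# binary Boltzmann factor: psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = 1.3, 0.7, -0.4
x = np.array([0.0, 1.0])
psi = w * np.outer(x, x) + th1 * x[:, None] + th2 * x[None, :]

omega = strength_omega(psi)          # equals |w|, independent of thresholds
sigma = 1.0 - math.exp(-omega)
omega_T = strength_omega(psi / 2.0)  # temperature T = 2: omega(1) / 2
```

For the pairwise binary case the double loop reduces to the four-term expression of equation 7.5; the function above also handles potentials on more than two nodes, at exponential cost in the number of nodes.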

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials α and nodes β with properties

1. $A_{\alpha\beta} \ge 0 \;\; \forall_{\alpha,\,\beta\subset\alpha}$ (positivity),
2. $(1-\sigma_\alpha) \max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \;\; \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \;\; \forall_\beta$ (sufficient compensation), $\qquad$ (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$ as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$ K_3 \ge -\sum_\alpha \sum_{\beta,\beta'\subset\alpha;\,\beta'\ne\beta} \sum_{x_\beta,x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'} \sigma_\alpha Q^*_\alpha(x_\beta, x'_{\beta'}) R_\beta(x_\beta) R_{\beta'}(x'_{\beta'}) \ge -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \Big( \sum_{\beta'\subset\alpha;\,\beta'\ne\beta} A_{\alpha\beta'} \Big) Q^*_\alpha(x_\beta) R^2_\beta(x_\beta), $$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

$$ K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \Big[ 1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta'\ne\beta} A_{\alpha\beta'} \Big] Q^*_\alpha(x_\beta) R^2_\beta(x_\beta). $$

This implies

$$ (1-\sigma_\alpha)\, A_{\alpha\beta} + \sigma_\alpha \sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\,\beta\subset\alpha}, $$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials α. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all β ⊂ α, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
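The three conditions of equation 8.1 are easy to check mechanically for a candidate allocation matrix. Below is a small sketch (the function name and data layout are ours, not from the article) that tests them for the 3 × 3 toroidal Ising grid discussed later in section 8.3, where every node sits in four pair potentials and the uniform choice $A_{\alpha\beta} = 3/4$ works exactly up to σ = 1/3.

```python
def conditions_hold(nodes_of, sigma, A, tol=1e-12):
    """Check conditions 8.1 of theorem 4.
    nodes_of[a]: the nodes beta contained in potential a
    sigma[a]:    strength of potential a (equation 7.2)
    A[a][b]:     candidate allocation matrix entries."""
    for a, nodes in nodes_of.items():
        row = [A[a][b] for b in nodes]
        if any(r < -tol for r in row):  # condition 1: positivity
            return False
        # condition 2: (1-sigma) max_b A + sigma sum_b A <= 1
        if (1 - sigma[a]) * max(row) + sigma[a] * sum(row) > 1 + tol:
            return False
    all_nodes = {b for nodes in nodes_of.values() for b in nodes}
    for b in all_nodes:  # condition 3: sum_{a containing b} A >= n_b - 1
        pots = [a for a, nodes in nodes_of.items() if b in nodes]
        if sum(A[a][b] for a in pots) < len(pots) - 1 - tol:
            return False
    return True

# 3x3 Ising grid with toroidal boundary conditions: 18 pair potentials
idx = lambda i, j: 3 * i + j
pots = []
for i in range(3):
    for j in range(3):
        pots.append((idx(i, j), idx((i + 1) % 3, j)))   # vertical edge
        pots.append((idx(i, j), idx(i, (j + 1) % 3)))   # horizontal edge
nodes_of = {p: list(p) for p in pots}
A = {p: {b: 0.75 for b in p} for p in pots}

ok = conditions_hold(nodes_of, {p: 1.0 / 3.0 for p in pots}, A)
too_strong = conditions_hold(nodes_of, {p: 0.35 for p in pots}, A)
```

With σ = 1/3 the uniform allocation saturates condition 2 exactly; any larger strength violates it, which is the bound α ≤ 1/(1 + √(2/3)) ≈ 0.55 derived in section 8.3.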

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$ P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta), $$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$ \sum_{\alpha\supset\beta} \Big( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Big) < 2 \quad \forall_\beta. \qquad (8.2) $$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$ \sum_{\alpha\supset\beta} \omega_\alpha < 4 \quad \forall_\beta, \qquad (8.3) $$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$ \tilde{\psi}_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha} \phi_{\alpha\beta}(x_\beta), $$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$ \min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} \tilde{\psi}(x_1,x_2) - \min_{x_1,x_2} \tilde{\psi}(x_1,x_2) \Big] = \min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} [\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)] - \min_{x_1,x_2} [\psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2)] \Big]. $$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

$$ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) = \psi(\hat{x}_1,\hat{x}_2) + \phi_1(\hat{x}_1) + \phi_2(\hat{x}_2) \ge \psi(x_1,\hat{x}_2) + \phi_1(x_1) + \phi_2(\hat{x}_2) = \psi(\hat{x}_1,x_2) + \phi_1(\hat{x}_1) + \phi_2(x_2) \qquad (8.4) $$

for some $x_1$, $x_2$, $\hat{x}_1$, and $\hat{x}_2$ with $\hat{x}_1 \ne x_1$ and $\hat{x}_2 \ne x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$ \phi_1(x_1) - \phi_1(\hat{x}_1) = \tfrac{1}{2} \big[ \psi(\hat{x}_1,x_2) - \psi(x_1,x_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(x_1,\hat{x}_2) \big] $$
$$ \phi_2(x_2) - \phi_2(\hat{x}_2) = \tfrac{1}{2} \big[ \psi(x_1,\hat{x}_2) - \psi(x_1,x_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(\hat{x}_1,x_2) \big]. $$

Substitution back into equation 8.4 yields

$$ \psi(x_1,x_2) + \phi_1(x_1) + \phi_2(x_2) - \psi(x_1,\hat{x}_2) - \phi_1(x_1) - \phi_2(\hat{x}_2) = \tfrac{1}{2} \big[ \psi(x_1,x_2) + \psi(\hat{x}_1,\hat{x}_2) - \psi(x_1,\hat{x}_2) - \psi(\hat{x}_1,x_2) \big], $$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$ \min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} \tilde{\psi}(x_1,x_2) - \min_{x_1,x_2} \tilde{\psi}(x_1,x_2) \Big] = \tfrac{1}{2} \big| \psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0) \big| = \frac{\omega}{2}, $$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
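The optimization in this proof is easy to check numerically. In the sketch below (an illustration of ours, not code from the article), the shifts solving equation 8.4 collapse the four values of ψ + φ₁ + φ₂ to two levels separated by ω/2, and no random choice of shifts does better:

```python
import random

random.seed(0)
# a random pairwise binary log-potential psi[x1][x2]
psi = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
s, t, u, v = psi[0][0], psi[0][1], psi[1][0], psi[1][1]
omega = abs(s + v - t - u)  # equation 7.5

def gap(a, b):
    """max - min of psi + phi1 + phi2; only the differences
    a = phi1(1) - phi1(0) and b = phi2(1) - phi2(0) matter."""
    vals = [s, t + b, u + a, v + a + b]
    return max(vals) - min(vals)

# shifts solving equation 8.4 (for x1 = x2 = 0 and xhat1 = xhat2 = 1)
a_opt = 0.5 * (s - u + t - v)
b_opt = 0.5 * (s - t + u - v)

best = gap(a_opt, b_opt)  # equals omega / 2
worst = min(gap(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
            for _ in range(1000))  # never falls below omega / 2
```

That the gap can never drop below ω/2 follows because the combination (val₁ + val₄) − (val₂ + val₃) is independent of the shifts, and max − min of four numbers is at least half the absolute value of that combination.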

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$ \sum_{\alpha\supset\beta} \omega_\alpha \le 1 \quad \forall_\beta, \qquad (8.5) $$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$ (1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma^2_\alpha \le 1. $$

Substitution into condition 3 yields

$$ \sum_{\alpha\supset\beta} (1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1 \quad\text{and thus}\quad \sum_{\alpha\supset\beta} \sigma_\alpha \le 1. \qquad (8.6) $$

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6.

Summarizing: the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$ \begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}. $$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For α > 2/3 we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$ \omega = 2 \log\Big[ \frac{\alpha}{1-\alpha} \Big] \quad\text{and thus}\quad \sigma = 1 - \Big( \frac{1-\alpha}{\alpha} \Big)^2. $$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and $1 - \sqrt{1/8}$, respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$ \sigma \le \frac{1}{3} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55. $$

The critical value that follows from corollary 3 is in this case slightly better:

$$ \omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62. $$
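Both critical values follow directly from the strength formulas above; a quick numeric check (our own sketch, with self-explanatory names):

```python
import math

def sigma(alpha):
    """Strength of the ferromagnetic pair potential (alpha, 1-alpha)."""
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

def omega(alpha):
    return 2.0 * math.log(alpha / (1.0 - alpha))

# theorem 4 with uniform allocation A = 3/4: sigma <= 1/3
alpha_thm4 = 1.0 / (1.0 + math.sqrt(2.0 / 3.0))
# corollary 3 with four potentials per node: 4 * omega < 4, i.e. omega < 1
alpha_cor3 = 1.0 / (1.0 + math.exp(-0.5))
```

Both values saturate their respective conditions, σ(α) = 1/3 and ω(α) = 1, giving α ≈ 0.55 and α ≈ 0.62.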

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$ (2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\,\sigma + A \le 1. $$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

$$ \sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58, $$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1 - \sqrt{1/8}$, with

$$ \sigma \le \sqrt{\frac{1}{2}} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65, $$

better than the α < 0.62 from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here sharper and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(\{Q_\beta\}, \{\lambda_{\alpha\beta}\})$ with respect to $Q_\beta$. But in fact, we need only $G(\{Q_\beta\}) \equiv \max_{\lambda_{\alpha\beta}} G(\{Q_\beta\}, \{\lambda_{\alpha\beta}\})$ to be convex, which is a weaker requirement. The Hessian of $G(\{Q_\beta\})$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$ w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix}, $$

zero thresholds, and potentials

$$ \Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{ if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{ if } x_i \ne x_j. $$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes i and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.^7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
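These simulations are easy to reproduce in outline. The sketch below is our own reconstruction: the exact damping scheme of equation 3.9 is not reproduced here, so we assume simple linear damping of the messages with step size η, and all function and variable names are ours.

```python
import numpy as np

def run_lbp(weight_strength, eta, n_iter=2000, seed=0):
    """Damped parallel sum-product on the four-node Boltzmann machine
    from the text: zero thresholds, Psi_ij = exp(+w_ij/4) if x_i == x_j
    and exp(-w_ij/4) otherwise. Damping (an assumption standing in for
    equation 3.9): m_new = (1 - eta) * m_old + eta * m_proposed."""
    w = weight_strength * np.array([[0., 1., -1., -1.],
                                    [1., 0., 1., -1.],
                                    [-1., 1., 0., -1.],
                                    [-1., -1., -1., 0.]])
    n = 4
    psi = {(i, j): np.array([[np.exp(w[i, j] / 4), np.exp(-w[i, j] / 4)],
                             [np.exp(-w[i, j] / 4), np.exp(w[i, j] / 4)]])
           for i in range(n) for j in range(n) if i != j}
    rng = np.random.default_rng(seed)
    m = {}
    for key in psi:  # messages m[(i, j)](x_j), initialized near uniform
        v = 0.5 + 0.01 * rng.random(2)
        m[key] = v / v.sum()
    residual = np.inf
    for _ in range(n_iter):
        new = {}
        for (i, j) in m:
            prod = np.ones(2)
            for k in range(n):
                if k != i and k != j:
                    prod = prod * m[(k, i)]
            prop = psi[(i, j)].T @ prod  # sum over x_i
            prop = prop / prop.sum()
            upd = (1 - eta) * m[(i, j)] + eta * prop
            new[(i, j)] = upd / upd.sum()
        residual = max(np.abs(new[k] - m[k]).max() for k in m)
        m = new
    P = np.zeros((n, 2))  # single-node marginals from incoming messages
    for i in range(n):
        prod = np.ones(2)
        for k in range(n):
            if k != i:
                prod = prod * m[(k, i)]
        P[i] = prod / prod.sum()
    return P, residual

P_small, res_small = run_lbp(3.0, 0.2)  # "convergent" regime
P_large, res_large = run_lbp(6.0, 0.6)  # regime of the upper right inset
```

The first call settles at the trivial fixed point $P_i(x_i) = 0.5$; the second uses the step size and weight strength for which the article reports a limit cycle (the upper right inset of Figure 4).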

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7. Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical $\alpha_{\text{critical}}$'s in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.

Uniqueness of Loopy Belief Propagation Fixed Points 2399

Consequently the terms involving α in K3 cancel with those in K2 which ismost easily seen when we combine K2 and K3 in a different way

K2 + K3 =sumα

sumβsubα

sumxβ xprime

βprime

A2αβQlowastα(xβ)Q

lowastα(xprimeβ)Rβ(xβ)Rβ(x

primeβ) (K2)

minussumα

sumββprime subαβprime =β

sumxβ xprime

βprime

AαβAαβ prime

times [Qlowastα(xβ xprimeβ prime)minusQlowastα(xβ)Qlowastα(xprimeβ prime)]Rβ(xβ)Rβ prime(x

primeβ prime) (K3)

This leaves us with the weaker requirement (from K1) Aαβ(1minusAαβ) ge 0 forall β sub α The best choice is then to take Aαβ = 1 which turns condition 3of equation 41 intosum

αprimesupβαprime =α

Aαprimeβ + 1 ge nβ minus 1

The net effect is equivalent to ignoring the interaction reducing the numberof neighboring potentials nβ by 1 for all β that are part of the fake interactionα

We have seen how we get milder and thus better conditions when thereis effectively no interaction Motivated by this ldquosuccessrdquo we will work to-ward conditions that take into account the strength of the interactions Ourstarting point will be the above decomposition in K2 and K3 where sinceK2 ge 0 we will concentrate on K3

7 The Strength of a Potential

71 Bounding the Correlations The crucial observation which will al-low us to obtain milder and thus better conditions for the uniqueness of afixed point is the following lemma It bounds the term between brackets inK3 such that we can again combine this bound with the (positive) term K1However before we get to that we take some time to introduce and deriveproperties of the ldquostrengthrdquo of a potential

Lemma 2 Two-node correlations of loopy belief marginals obey the bound

Qlowastα(xβ xprimeβ prime)minusQlowastα(xβ)Qlowastα(xprimeβ prime) le σαQlowastα(xβ xprimeβ prime) forall ββprime subα

βprime =βforallxβ xprime

βprime (71)

with the ldquostrengthrdquo σα a function of the potential ψα(Xα) equiv logα(Xα) only

σα = 1minus exp(minusωα) with

ωα equiv maxXαXα

[ψα(Xα)+ (nα minus 1)ψα(Xα)minus

sumβsubα

ψα(Xαβ xβ)

] (72)

where nα equivsum

βsubα 1

2400 T Heskes

Proof For convenience and without loss of generality we omit α fromour notation and renumber the nodes that are contained in α from 1 to nWe consider the quotient between the loopy belief on the potential subsetdivided by the product of its single-node marginals

Qlowast(X)nprodβ=1

Qlowast(xβ)=(X)

prodβ

microβ(xβ)

[sumXprime(Xprime)

prodβ

microβ(xprimeβ)

]nminus1

prodβ

sum

Xprimeβ

(Xprimeβ xβ)prodβ prime =β

microβ prime(xprimeβ prime)microβ(xβ)

=(X)

[sumXprime(Xprime)

prodβ

microβ(xprimeβ)

]nminus1

prodβ

sum

Xprimeβ

(Xprimeβ xβ)prodβ prime =β

microβ prime(xprimeβ prime)

(73)

where we substituted the properly normalized version of equation 35 aloopy belief pseudomarginal is proportional to the potential times incomingmessages The goal is now to find the maximum of the above expressionover all possible messages and all values of X Especially the maximum overmessagesmicro seems to be difficult to compute but the following intermediatelemma helps us out

Lemma 3 The maximum of the function

V(micro) = (nminus 1) log

[sumX

(X)nprodβ=1

microβ(xβ)

]

minusnsumβ=1

log

sum

(Xβ xlowastβ)prodβ prime =β

microβ prime(xβ prime)

with respect to the messages micro under constraintssum

xβ microβ(xβ) = 1 for all β andmicroβ(xβ) ge 0 for all β and xβ occurs at an extreme point microβ(xβ) = δxβ xβ for somexβ to be found

Proof. Let us consider optimizing the message $\mu_1(x_1)$ with fixed messages $\mu_\beta(x_\beta)$ for $\beta > 1$. The first and second derivatives are easily found to obey, up to positive factors that do not affect the argument,

$$\frac{\partial V}{\partial\mu_1(x_1)} = (n-1)Q(x_1) - \sum_{\beta\neq 1} Q(x_1|x^*_\beta)$$

$$\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} = -(n-1)Q(x_1)Q(x'_1) + \sum_{\beta\neq 1} Q(x_1|x^*_\beta)\,Q(x'_1|x^*_\beta),$$

where

$$Q(X) \equiv \frac{\Psi(X)\prod_\beta\mu_\beta(x_\beta)}{\sum_{X'}\Psi(X')\prod_\beta\mu_\beta(x'_\beta)}.$$

Now suppose that $V$ has a regular extremum (maximum or minimum) not at an extreme point, that is, $\mu_1(x_1) > 0$ for two or more values of $x_1$. At such an extremum, the first derivative should obey

$$(n-1)Q(x_1) - \sum_{\beta\neq 1} Q(x_1|x^*_\beta) = \lambda,$$

with $\lambda$ a Lagrange multiplier implementing the constraint $\sum_{x_1}\mu_1(x_1)=1$. Summing over $x_1$, we obtain $\lambda = 0$ (in fact, $V$ is indifferent to any multiplicative scaling of $\mu$). For the matrix of second derivatives at such an extremum we then have, substituting $(n-1)Q(x_1) = \sum_{\beta\neq 1}Q(x_1|x^*_\beta)$,

$$\frac{\partial^2 V}{\partial\mu_1(x_1)\,\partial\mu_1(x'_1)} = \frac{1}{2(n-1)}\sum_{\beta\neq 1}\sum_{\substack{\beta'\neq 1\\ \beta'\neq\beta}}\big[Q(x_1|x^*_\beta)-Q(x_1|x^*_{\beta'})\big]\big[Q(x'_1|x^*_\beta)-Q(x'_1|x^*_{\beta'})\big],$$

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of $\mu_\beta(x_\beta)$, $\beta > 1$, it follows by induction that the maximum with respect to all $\mu_\beta(x_\beta)$ must be at an extreme point as well.

The function $V(\mu)$ is, up to a term independent of $\mu$, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages $\mu$ by a maximization over values $\hat X$:

$$\max_\mu \frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{\hat X} \frac{\Psi(X)\big[\Psi(\hat X)\big]^{n-1}}{\prod_\beta\Psi(\hat X_{\setminus\beta}, x_\beta)}.$$

Next we take the maximum over $X$ as well and define the "strength" $\sigma$, to be used in equation 7.1, through

$$\frac{1}{1-\sigma} \equiv \max_{X,\mu}\frac{Q^*(X)}{\prod_\beta Q^*(x_\beta)} = \max_{X,\hat X}\frac{\Psi(X)\big[\Psi(\hat X)\big]^{n-1}}{\prod_\beta\Psi(\hat X_{\setminus\beta}, x_\beta)}. \tag{7.4}$$

The inequality 7.1 then follows by summing out $X_{\setminus\beta\beta'}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1-\sigma) = \max_{X,\hat X} W(X,\hat X), \quad\text{with}\quad W(X,\hat X) = \psi(X) + (n-1)\,\psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta),$$

where we recall that $\psi(X) \equiv \log\Psi(X)$.

7.2 Some Properties. In the following, we will refer to both $\omega$ and $\sigma$ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication by any term that factorizes over the nodes; that is,

if $\tilde\Psi(X) = \Psi(X)\prod_\beta\mu_\beta(x_\beta)$, then $\omega(\tilde\Psi) = \omega(\Psi)$ for any choice of $\mu$.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential by a term that depends only on the overlap, and dividing the other by the same term, does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\hat X$ that differ in fewer than two nodes. To see this, consider the pairwise case:

$$W(x_1, x_2; \hat x_1, \hat x_2) = \psi(x_1,x_2) + \psi(\hat x_1,\hat x_2) - \psi(x_1,\hat x_2) - \psi(\hat x_1,x_2) = -W(\hat x_1, x_2; x_1, \hat x_2).$$

If now also $\hat x_2 = x_2$, the terms cancel pairwise and $W(x_1,x_2;\hat x_1,x_2) = 0$. Furthermore, if $W(x_1,x_2;\hat x_1,\hat x_2) \le 0$, then it must be that $W(\hat x_1,x_2;x_1,\hat x_2) \ge 0$, and vice versa. So $\omega$, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0,1\}$, we immediately obtain

$$\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \tag{7.5}$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1,x_2) \propto \exp[w\,x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight". It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - [1-\sigma(1)]^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
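The properties above are easy to verify by brute-force enumeration of equation 7.2. The following sketch is our own code (the function name `strength` and the concrete parameter values are not from the paper); it checks $\omega = |w|$ for a pairwise binary Boltzmann factor, independence of the thresholds, and the temperature scaling $\sigma(T) = 1 - [1-\sigma(1)]^{1/T}$:

```python
import itertools, math

def strength(psi, n_states):
    """Brute-force omega of equation 7.2.
    psi: dict mapping a state tuple X to the log-potential psi(X);
    n_states: number of states per node."""
    n = len(n_states)
    states = list(itertools.product(*[range(k) for k in n_states]))
    return max(psi[x] + (n - 1) * psi[xh]
               - sum(psi[xh[:b] + (x[b],) + xh[b + 1:]] for b in range(n))
               for x in states for xh in states)

# Pairwise binary Boltzmann factor: psi(x1,x2) = w*x1*x2 + th1*x1 + th2*x2.
w, th1, th2 = -1.7, 0.4, -2.3
psi = {(x1, x2): w * x1 * x2 + th1 * x1 + th2 * x2
       for x1 in (0, 1) for x2 in (0, 1)}
omega = strength(psi, (2, 2))
assert abs(omega - abs(w)) < 1e-12     # omega = |w|, independent of the thresholds

# Temperature scaling: psi/T gives omega(T) = omega(1)/T.
T = 0.5
psi_T = {s: v / T for s, v in psi.items()}
assert abs(strength(psi_T, (2, 2)) - omega / T) < 1e-12
sigma = 1.0 - math.exp(-omega)
sigma_T = 1.0 - math.exp(-strength(psi_T, (2, 2)))
assert abs(sigma_T - (1.0 - (1.0 - sigma) ** (1.0 / T))) < 1e-12
print("omega =", omega, "sigma =", sigma)
```

The enumeration is exponential in the number of nodes in the potential, which is harmless for the small subsets considered here.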

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials $\alpha$ and nodes $\beta$ with the properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\beta\subset\alpha}$ (positivity),
2. $(1-\sigma_\alpha)\max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha\sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),
3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation), $\qquad(8.1)$

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$K_3 \ge -\sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha\\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta}A_{\alpha\beta'}\sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta)R_{\beta'}(x'_{\beta'})
\ge -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Big[\sum_{\substack{\beta'\subset\alpha\\ \beta'\neq\beta}} A_{\alpha\beta'}\Big] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, combining $K_1$ and (the above lower bound on) $K_3$, we get

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha\sum_{\beta\subset\alpha}\sum_{x_\beta} A_{\alpha\beta}\Big[1 - A_{\alpha\beta} - \sigma_\alpha\sum_{\beta'\neq\beta}A_{\alpha\beta'}\Big] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).$$

The right-hand side is guaranteed to be nonnegative if

$$(1-\sigma_\alpha)A_{\alpha\beta} + \sigma_\alpha\sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\beta\subset\alpha},$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials $\alpha$. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
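Once an allocation matrix is proposed, the three conditions of equation 8.1 can be checked mechanically. The sketch below is our own code (the data structures are assumptions; the theorem does not prescribe any); the example is the fully connected four-node graph used in the simulations of section 9, for which the uniform allocation $A_{\alpha\beta} = 2/3$ certifies uniqueness exactly when $\sigma_\alpha \le 1/2$:

```python
import itertools

def unique_fixed_point(factors, sigma, A, tol=1e-12):
    """Check the three conditions of theorem 4 (equation 8.1).
    factors: dict alpha -> list of member nodes beta
    sigma:   dict alpha -> strength sigma_alpha in [0, 1]
    A:       dict (alpha, beta) -> allocation A_{alpha beta}"""
    nodes = {b for nb in factors.values() for b in nb}
    # condition 1: positivity
    if any(A[a, b] < -tol for a, nb in factors.items() for b in nb):
        return False
    # condition 2: sufficient amount of resources, per potential
    for a, nb in factors.items():
        amount = ((1 - sigma[a]) * max(A[a, b] for b in nb)
                  + sigma[a] * sum(A[a, b] for b in nb))
        if amount > 1 + tol:
            return False
    # condition 3: sufficient compensation, per node
    for b in nodes:
        member = [a for a, nb in factors.items() if b in nb]
        if sum(A[a, b] for a in member) < len(member) - 1 - tol:
            return False
    return True

# Fully connected four-node graph: n_beta = 3, so condition 3 forces A >= 2/3 on average.
nodes = ["a", "b", "c", "d"]
factors = {i + j: [i, j] for i, j in itertools.combinations(nodes, 2)}
A = {(f, b): 2.0 / 3.0 for f, nb in factors.items() for b in nb}
ok = unique_fixed_point(factors, {f: 0.4 for f in factors}, A)
assert ok                                                    # sigma <= 1/2 works
assert not unique_fixed_point(factors, {f: 0.6 for f in factors}, A)
print("theorem 4 conditions checked")
```

With $A = 2/3$, condition 2 reads $(1-\sigma)\tfrac{2}{3} + \sigma\tfrac{4}{3} \le 1$, that is, $\sigma \le 1/2$, matching the value $\omega = \log 2$ quoted for this graph in the footnote of section 9.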

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha \Psi_\alpha(X_\alpha)\prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\Big(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\Big) < 2 \quad \forall_\beta. \tag{8.2}$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\varphi_{\alpha\beta}(x_\beta),$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well).

We can then optimize this choice to obtain milder, and thus better, conditions. Omitting $\alpha$ and renumbering the nodes from 1 to 2, we have

$$\min_{\varphi_1,\varphi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big]
= \min_{\varphi_1,\varphi_2}\Big[\max_{x_1,x_2}[\psi(x_1,x_2)+\varphi_1(x_1)+\varphi_2(x_2)] - \min_{x_1,x_2}[\psi(x_1,x_2)+\varphi_1(x_1)+\varphi_2(x_2)]\Big].$$

In the case of binary nodes (two-by-two matrices $\psi(x_1,x_2)$), it is easy to check that the optimal $\varphi_1$ and $\varphi_2$ that yield the smallest gap are such that

$$\psi(x_1,x_2)+\varphi_1(x_1)+\varphi_2(x_2) = \psi(\hat x_1,\hat x_2)+\varphi_1(\hat x_1)+\varphi_2(\hat x_2)
\ge \psi(x_1,\hat x_2)+\varphi_1(x_1)+\varphi_2(\hat x_2) = \psi(\hat x_1,x_2)+\varphi_1(\hat x_1)+\varphi_2(x_2) \tag{8.4}$$

for some $x_1$, $x_2$, $\hat x_1$, and $\hat x_2$ with $\hat x_1 \neq x_1$ and $\hat x_2 \neq x_2$. Solving for $\varphi_1$ and $\varphi_2$, we find

$$\varphi_1(x_1)-\varphi_1(\hat x_1) = \frac{1}{2}\big[\psi(\hat x_1,x_2)-\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(x_1,\hat x_2)\big]$$

$$\varphi_2(x_2)-\varphi_2(\hat x_2) = \frac{1}{2}\big[\psi(x_1,\hat x_2)-\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(\hat x_1,x_2)\big].$$

Substitution back into equation 8.4 yields

$$\psi(x_1,x_2)+\varphi_1(x_1)+\varphi_2(x_2) - \psi(x_1,\hat x_2)-\varphi_1(x_1)-\varphi_2(\hat x_2)
= \frac{1}{2}\big[\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(x_1,\hat x_2)-\psi(\hat x_1,x_2)\big],$$

which has to be nonnegative. Of all four possible combinations, two are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\varphi_1,\varphi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big]
= \frac{1}{2}\,|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)| = \frac{\omega}{2},$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
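The claim that the optimal self-potentials halve the range can be checked numerically: the only relevant degrees of freedom are the differences $\varphi_b(1)-\varphi_b(0)$, and a grid search over them should never beat $\omega/2$ and should come within grid resolution of it. A sketch, ours, with a random $\psi$ and a grid range that is an assumption (wide enough for log-potentials in $[-1, 1]$):

```python
import random

random.seed(1)
psi = {(x1, x2): random.uniform(-1.0, 1.0) for x1 in (0, 1) for x2 in (0, 1)}
omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])   # equation 7.5

def gap(d1, d2):
    """Range of psi(x1,x2) + phi1(x1) + phi2(x2) with phi1 = (0, d1), phi2 = (0, d2)."""
    vals = [psi[x1, x2] + d1 * x1 + d2 * x2 for x1 in (0, 1) for x2 in (0, 1)]
    return max(vals) - min(vals)

# Grid search over the two free differences phi_b(1) - phi_b(0).
grid = [i / 100.0 for i in range(-300, 301)]
best = min(gap(d1, d2) for d1 in grid for d2 in grid)
assert best >= omega / 2.0 - 1e-9        # grid search cannot beat the true minimum
assert abs(best - omega / 2.0) < 5e-2    # ... and gets close, up to grid spacing
print("omega/2 =", omega / 2.0, "best gap found =", best)
```

The first assertion reflects the analytical result; the second only verifies that the optimum is attainable, up to the (hypothetical) grid spacing of 0.01.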

Next we derive the following weaker corollary of theorem 4.

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha \le 1 \quad \forall_\beta, \tag{8.5}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1 - \sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha\supset\beta}(1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta}\sigma_\alpha \le 1. \tag{8.6}$$

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 implies condition 8.6, and is thus the weaker (more conservative) statement.

Summarizing: the conditions in Tatikonda and Jordan (2002), which apply to binary pairwise potentials, are, when strengthened as above, at most a constant (a factor of 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.
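Taking the union of both conditions can be sketched as follows. This is our own helper (not from the paper), and it uses corollary 4 as a stand-in for theorem 4, which can certify strictly more cases with a well-chosen allocation matrix:

```python
def certified_unique(factors, omega):
    """Union of two sufficient conditions for pairwise potentials:
    corollary 3 (strengthened Tatikonda & Jordan): sum over factors at each node < 4;
    corollary 4 (weakened theorem 4):              sum over factors at each node <= 1."""
    nodes = {b for nb in factors.values() for b in nb}
    load = {b: sum(omega[a] for a, nb in factors.items() if b in nb) for b in nodes}
    cor3 = all(l < 4.0 for l in load.values())
    cor4 = all(l <= 1.0 for l in load.values())
    return cor3 or cor4

# 3-cycle with omega = 1.2 per factor: load 2.4 per node, so corollary 3 applies.
factors = {"ab": ["a", "b"], "bc": ["b", "c"], "ca": ["c", "a"]}
assert certified_unique(factors, {a: 1.2 for a in factors})
assert not certified_unique(factors, {a: 2.5 for a in factors})  # load 5.0: neither applies
print("union of corollaries 3 and 4 checked")
```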

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small $\alpha$, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For $\alpha > 2/3$ we find two minima, one with "spins up" and the other one with "spins down".

In this symmetric problem, the strength of each potential is given by

$$\omega = 2\log\Big[\frac{\alpha}{1-\alpha}\Big], \quad\text{and thus}\quad \sigma = 1 - \Big(\frac{1-\alpha}{\alpha}\Big)^2.$$

Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops (left). The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With $B = 2-2A$ in (b) and $C = 1-A$ in (c), the optimal settings for the single remaining variable $A$ then boil down to $3/4$ and $1-\sqrt{1/8}$, respectively. See the text for further explanation.

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to $A = 3/4$ for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3}, \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1, \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62.$$

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for $\alpha < 0.62$. Theorem 4 can be exploited to shift resources a little. In principle, we could solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\sigma + A \le 1.$$

The optimal choice for $A$ is the one for which both conditions turn out to be identical. In this way we obtain $A = 3/4$, yielding

$$\sigma \le \frac{1}{2}, \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1-\sqrt{1/8}$, with

$$\sigma \le \sqrt{\frac{1}{2}}, \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the $\alpha < 0.62$ from corollary 3, and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and, in that sense, should be seen as no more than a first step. The conditions do have the following positive features:


• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature".

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms", similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright, Jaakkola, and Willsky (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and that, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter, and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact, we need only $G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega\begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \text{ if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \text{ if } x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i = 0, 1$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply $\omega = 4/3$ for corollary 3 and $\omega = \log 2 \approx 0.69$ for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
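The convergent regime of this simulation can be reproduced in a few lines. The sketch below is our own re-implementation (the message schedule, the random initialization, and the interpretation of the damping in equation 3.9 as a convex combination of old and new normalized messages are assumptions); at weight strength $\omega = 1$, well inside the uniqueness region, damped loopy belief propagation settles on the trivial fixed point:

```python
import math, random

random.seed(0)
W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_bp(strength, step, iters=2000):
    """Damped loopy BP on the four-node Boltzmann machine; returns P_i(x_i = 1)."""
    n = len(W)
    pair = lambda i, j, xi, xj: math.exp(
        (1 if xi == xj else -1) * strength * W[i][j] / 4.0)
    m = {}                                  # m[i, j]: message i -> j over x_j
    for i in range(n):
        for j in range(n):
            if i != j:
                p = random.uniform(0.4, 0.6)
                m[i, j] = [p, 1.0 - p]
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            msg = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    prod = math.prod(m[k, i][xi] for k in range(n) if k not in (i, j))
                    s += pair(i, j, xi, xj) * prod
                msg.append(s)
            z = msg[0] + msg[1]
            new[i, j] = [(1 - step) * m[i, j][x] + step * msg[x] / z for x in (0, 1)]
        m = new
    P = []
    for i in range(n):
        b = [math.prod(m[k, i][xi] for k in range(n) if k != i) for xi in (0, 1)]
        P.append(b[1] / (b[0] + b[1]))
    return P

P = run_bp(strength=1.0, step=0.5)
assert all(abs(p - 0.5) < 1e-4 for p in P)   # trivial fixed point at weak coupling
print("marginals at omega = 1:", [round(p, 4) for p in P])
```

Rerunning with a large strength (e.g., 6) and step size close to 1 exhibits the oscillatory behavior described above; we do not assert that here, since the transition point depends on the step size.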

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation (horizontal axis, 0 to 1) and the weight strength (vertical axis, 3.5 to 6). Simulations on a four-node Boltzmann machine. The insets show the marginal $P_1(x_1 = 1)$ as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical values $\alpha_{\text{critical}}$ in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359-366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313-320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498-519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157-224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140-152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275-300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362-369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467-475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493-500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953-960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536-543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1-41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173-2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19-50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689-695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691-1722.

Received December 2, 2003; accepted April 29, 2004.

2400 T Heskes

Proof For convenience and without loss of generality we omit α fromour notation and renumber the nodes that are contained in α from 1 to nWe consider the quotient between the loopy belief on the potential subsetdivided by the product of its single-node marginals

Qlowast(X)nprodβ=1

Qlowast(xβ)=(X)

prodβ

microβ(xβ)

[sumXprime(Xprime)

prodβ

microβ(xprimeβ)

]nminus1

prodβ

sum

Xprimeβ

(Xprimeβ xβ)prodβ prime =β

microβ prime(xprimeβ prime)microβ(xβ)

=(X)

[sumXprime(Xprime)

prodβ

microβ(xprimeβ)

]nminus1

prodβ

sum

Xprimeβ

(Xprimeβ xβ)prodβ prime =β

microβ prime(xprimeβ prime)

(73)

where we substituted the properly normalized version of equation 35 aloopy belief pseudomarginal is proportional to the potential times incomingmessages The goal is now to find the maximum of the above expressionover all possible messages and all values of X Especially the maximum overmessagesmicro seems to be difficult to compute but the following intermediatelemma helps us out

Lemma 3 The maximum of the function

V(micro) = (nminus 1) log

[sumX

(X)nprodβ=1

microβ(xβ)

]

minusnsumβ=1

log

sum

(Xβ xlowastβ)prodβ prime =β

microβ prime(xβ prime)

with respect to the messages micro under constraintssum

xβ microβ(xβ) = 1 for all β andmicroβ(xβ) ge 0 for all β and xβ occurs at an extreme point microβ(xβ) = δxβ xβ for somexβ to be found

Proof Let us consider optimizing the messagemicro1(x1)with fixed messagesmicroβ(xβ) for β gt 1 The first and second derivatives are easily found to obey

partVpartmicro1(x1)

= (nminus 1)Q(x1)minussumβ =1

Q(x1|xlowastβ)

part2Vpartmicro1(x1)partmicro1(xprime1)

= (nminus 1)Q(x1)Q(xprime1)minussumβ =1

Q(x1|xlowastβ)Q(xprime1|xlowastβ)

Uniqueness of Loopy Belief Propagation Fixed Points 2401

where

Q(X) equiv (X)prodβ microβ(xβ)sum

Xprime (Xprime)prodβ microβ(x

primeβ)

Now suppose that V has a regular extremum (maximum or minimum) notat an extreme point that is micro1(x1) gt 0 for two or more values of x1 At suchan extremum the first derivative should obey

(nminus 1)Q(x1)minussumβ =1

Q(x1|xlowastβ) = λ

with λ a Lagrange multiplier implementing the constraintsum

x1micro1(x1) = 1

Summing over x1 we obtain λ = 0 (in fact V is indifferent to any multi-plicative scaling of micro) For the matrix with second derivatives at such anextremum we then have

part2Vpartmicro1(x1)partmicro1(xprime1)

=sumβ =1

sumβprime =1βprime =β

Q(x1|xlowastβ)Q(xprime1|xlowastβ)

which is positive semidefinite the extremum cannot be a maximum Con-sequently any maximum must be at the boundary of the domain Sincethis holds for any choice of microβ(xβ) β gt 1 it follows by induction that themaximum with respect to all microβ(xβ) must be at an extreme point as well

The function V(micro) is up to a term independent of micro the logarithm ofequation 73 So the intermediate lemma 3 tells us that we can replace themaximization over messages micro by maximization over values X

maxmicro

Qlowast(X)prodβ

Qlowast(xβ)= max

X

(X)[(X)

]nminus1

prodβ

(Xβ xβ)

Next we take the maximum over X as well and define the ldquostrengthrdquo σ tobe used in equation 71 through

11minus σ equiv max

Xmicro

Qlowast(X)prodβ

Qlowast(xβ)= max

XX

(X)[(X)

]nminus1

prodβ

(Xβ xβ) (74)

2402 T Heskes

The inequality 71 then follows by summing out Xββ prime in

Qlowast(X)minusprodβ

Qlowast(xβ) le σQlowast(X)

The form of equation 72 then follows by rewriting equation 74 as

ω equiv minus log(1minus σ) = maxXX

W(X X) with

W(X X) =[ψ(X)+ (nminus 1)ψ(X)minus

sumβ

ψ(Xβ xβ)

]

where we recall that ψ(X) equiv log(X)

72 Some Properties In the following we will refer to both ω and σ asthe strength of the potential There are several properties worth noting

bull The strength of a potential is indifferent to multiplication with anyterm that factorizes over the nodes that is

if (X) = (X)prodβ

microβ(xβ) then ω() = ω() for any choice of micro

This property relates to the arbitrariness in the definition of equa-tion 31 if two potentials overlap then multiplying one potential witha term that only depends on the overlap and dividing the other by thesame term does not change the distribution Luckily it also does notchange the strength of those potentials

bull To compute the strength we can enumerate all possible combinationsHowever we can neglect all combinations X and X that differ in fewerthan two nodes To see this consider

W(x1 x2 x12 x1 x2 x12) = ψ(x1 x2 x12)+ ψ(x1 x2 x12)minus ψ(x1 x2 x12)minus ψ(x1 x2 x12)

= minusW(x1 x2 x12 x1 x2 x12)

If now also x2 = x2 we get W(x1 x1 x1 x1) = minusW(x1 x1 x1 x1) =0 Furthermore if W(x1 x2 x12 x1 x2 x12) le 0 then it must be thatW(x1 x2 x12 x1 x2 x12) ge 0 and vice versa So ω the maximumover all combinations must be nonnegative and we can indeed neglectall combinations that by definition yield zero

bull Thus for finite potentials 0 le ω ltinfin and 0 le σ lt 1

Uniqueness of Loopy Belief Propagation Fixed Points 2403

bull With pairwise potentials the above symmetries can be used to reducethe number of evaluations to |x1||x2|(|x1|minus1)(|x2|minus1)4 combinationsAnd indeed for binary nodes x12 isin 0 1 we immediately obtain

ω = |ψ(0 0)+ ψ(1 1)minus ψ(0 1)minus ψ(1 0)| (75)

Any pairwise binary potential can be written as a Boltzmann factor

(x1 x2) prop exp[wx1x2 + θ1x1 + θ2x2]

In this notation we find the simple and intuitive expression ω = |w|the strength is the absolute value of the ldquoweightrdquo It is indeed inde-pendent of (the size of) the thresholds In the case of minus1 1 codingthe relationship is ω = 4|w|

bull In some models there is the notion of a ldquotemperaturerdquo T that is(X) propexp[ψ(X)T] where ψ(X) is considered constant In obvious notationwe then have ω(T) = ω(1)T and thus σ(T) = 1 minus exp[minusω(1)T] =1minus [1(1minus σ(1))]1T

bull Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum product) Morespecifically we get the belief revision updates if we imagine runningloopy belief propagation on potentials that are scaled with tempera-ture T and then take the limit T to zero Consequently when analyzingconditions for uniqueness of loopy belief revision fixed points we cantake σ(0) = 0 if σ(1) = 0 (fake interaction) yet σ(0) = 1 wheneverσ(1) gt 0

8 Conditions for Uniqueness

81 Main Result

Theorem 4 Loopy belief propagation has a unique fixed point if there exists anallocation matrix Aαβ between potentials α and nodes β with properties

1 Aαβ ge 0 forallαβsubα (positivity)

2 (1minus σα)maxβsubα

Aαβ + σαsumβsubα

Aαβ le 1 forallα (sufficient amount of resources)

3sumαsupβ

Aαβ ge nβ minus 1 forallβ (sufficient compensation)

(81)

with the strength σα a function of the potentialα(Xα) as defined in equation 72

Proof For completeness we first summarize our line of reasoning Fixedpoints of loopy belief propagation are in one-to-one correspondence with

2404 T Heskes

extrema of the dual equation 55 This dual has a unique extremum if itis convexconcave Concavity is guaranteed so we focus on conditionsfor convexity that is for positive (semi)definiteness of the correspondingHessian This then boils down to conditions that ensure K = K1+K2+K3 ge 0for any choice of Rβ(xβ)

Substituting the bound, equation 7.1, into the term K3, we obtain

K3 ≥ −Σ_α Σ_{β,β′⊂α, β′≠β} Σ_{x_β, x′_β′} A_αβ A_αβ′ σ_α Q*_α(x_β, x′_β′) R_β(x_β) R_β′(x′_β′)

   ≥ −Σ_α σ_α Σ_{β⊂α} Σ_{x_β} A_αβ [Σ_{β′⊂α, β′≠β} A_αβ′] Q*_α(x_β) R_β²(x_β),

where in the last step we applied the same trick as in equation 6.1. Since K2 ≥ 0, combining K1 and (the above lower bound on) K3, we get

K = K1 + K2 + K3 ≥ Σ_α Σ_{β⊂α} Σ_{x_β} A_αβ [1 − A_αβ − σ_α Σ_{β′≠β} A_αβ′] Q*_α(x_β) R_β²(x_β).

This is nonnegative for any choice of R_β(x_β) if

(1 − σ_α) A_αβ + σ_α Σ_{β′⊂α} A_αβ′ ≤ 1 ∀α, β ⊂ α,

which, in combination with A_αβ ≥ 0 and σ_α ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σ_α = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σ_α = 0, condition 2 becomes max_{β⊂α} A_αβ ≤ 1, suggesting the choice A_αβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials n_β in condition 3.
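For a given allocation matrix, the three conditions of theorem 4 can be checked mechanically. A sketch (data structures and helper names are ours), using the 3 × 3 toroidal Ising grid analyzed in section 8.3, where A_αβ = 3/4 and σ_α = 1/3 satisfy all three conditions:

```python
TOL = 1e-12

def check_theorem4(potentials, sigma, A):
    """potentials: list of tuples of node indices (the betas contained in each alpha);
    sigma: strength per potential; A: dict mapping (alpha, beta) to the allocation."""
    nodes = sorted({b for pot in potentials for b in pot})
    n_beta = {b: sum(b in pot for pot in potentials) for b in nodes}
    cond1 = all(v >= 0 for v in A.values())
    cond2 = all((1 - sigma[a]) * max(A[(a, b)] for b in pot)
                + sigma[a] * sum(A[(a, b)] for b in pot) <= 1 + TOL
                for a, pot in enumerate(potentials))
    cond3 = all(sum(A[(a, b)] for a, pot in enumerate(potentials) if b in pot)
                >= n_beta[b] - 1 - TOL for b in nodes)
    return cond1 and cond2 and cond3

# 3 x 3 Ising grid with toroidal boundary conditions: 18 pairwise potentials
edges = [(3 * r + c, 3 * r + (c + 1) % 3) for r in range(3) for c in range(3)]
edges += [(3 * r + c, 3 * ((r + 1) % 3) + c) for r in range(3) for c in range(3)]
A = {(a, b): 0.75 for a, pot in enumerate(edges) for b in pot}
print(check_theorem4(edges, [1 / 3] * 18, A))   # True
print(check_theorem4(edges, [0.4] * 18, A))     # False: condition 2 is violated
```

Each node has four neighboring potentials here, so condition 3 requires 4 · (3/4) = 3 ≥ n_β − 1 = 3, and condition 2 holds with equality at σ = 1/3.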

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

P_exact(X) = (1/Z) Π_α Ψ_α(X_α) Π_β Ψ_β(x_β),

to be compared with our equation 3.1, where there are no self-potentials Ψ_β(x_β). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} (max_{X_α} ψ_α(X_α) − min_{X_α} ψ_α(X_α)) < 2 ∀β.  (8.2)

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

Σ_{α⊃β} ω_α < 4 ∀β,  (8.3)

with ω_α defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψ_β(x_β). In fact, it is valid for any choice

ψ̃_α(X_α) = ψ_α(X_α) + Σ_{β⊂α} φ_αβ(x_β),

where ψ_α(X_α) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting α and renumbering the nodes from 1 to 2, we have

min_{φ1,φ2} [max_{x1,x2} ψ̃(x_1, x_2) − min_{x1,x2} ψ̃(x_1, x_2)]
  = min_{φ1,φ2} [max_{x1,x2} [ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2)] − min_{x1,x2} [ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2)]].

In the case of binary nodes (two-by-two matrices ψ(x_1, x_2)), it is easy to check that the optimal φ_1 and φ_2 that yield the smallest gap are such that

ψ(x̂_1, x_2) + φ_1(x̂_1) + φ_2(x_2) = ψ(x_1, x̂_2) + φ_1(x_1) + φ_2(x̂_2)
  ≥ ψ(x_1, x_2) + φ_1(x_1) + φ_2(x_2) = ψ(x̂_1, x̂_2) + φ_1(x̂_1) + φ_2(x̂_2)  (8.4)

for some x_1, x_2, x̂_1, and x̂_2, with x̂_1 ≠ x_1 and x̂_2 ≠ x_2. Solving for φ_1 and φ_2, we find

φ_1(x̂_1) − φ_1(x_1) = (1/2)[ψ(x_1, x_2) − ψ(x̂_1, x_2) + ψ(x_1, x̂_2) − ψ(x̂_1, x̂_2)]
φ_2(x̂_2) − φ_2(x_2) = (1/2)[ψ(x_1, x_2) − ψ(x_1, x̂_2) + ψ(x̂_1, x_2) − ψ(x̂_1, x̂_2)].

Substitution back into equation 8.4 yields

ψ(x̂_1, x_2) + φ_1(x̂_1) + φ_2(x_2) − ψ(x_1, x_2) − φ_1(x_1) − φ_2(x_2)
  = (1/2)[ψ(x̂_1, x_2) + ψ(x_1, x̂_2) − ψ(x_1, x_2) − ψ(x̂_1, x̂_2)],

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

min_{φ1,φ2} [max_{x1,x2} ψ̃(x_1, x_2) − min_{x1,x2} ψ̃(x_1, x_2)] = (1/2)|ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)| = ω/2,

with ω from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
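The closed-form optimum can be checked numerically: for an arbitrary binary pairwise log potential, the self-potentials above shrink the gap to exactly ω/2. The following check is our own; x̂ denotes the complementary binary state:

```python
import numpy as np

rng = np.random.default_rng(1)
psi = rng.normal(size=(2, 2))        # arbitrary binary pairwise log potential
omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])

# pick the valid combination: the resulting gap
# (1/2)[psi(xh1,x2) + psi(x1,xh2) - psi(x1,x2) - psi(xh1,xh2)] must be nonnegative
if psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0] >= 0:
    (x1, x2), (xh1, xh2) = (0, 1), (1, 0)
else:
    (x1, x2), (xh1, xh2) = (0, 0), (1, 1)

phi1 = np.zeros(2)
phi2 = np.zeros(2)
phi1[xh1] = 0.5 * (psi[x1, x2] - psi[xh1, x2] + psi[x1, xh2] - psi[xh1, xh2])
phi2[xh2] = 0.5 * (psi[x1, x2] - psi[x1, xh2] + psi[xh1, x2] - psi[xh1, xh2])

adjusted = psi + phi1[:, None] + phi2[None, :]
gap = adjusted.max() - adjusted.min()
assert abs(gap - omega / 2) < 1e-12  # optimal self-potentials leave a gap of omega/2
```

The two maxima of the adjusted table coincide, as do the two minima, exactly as in equation 8.4.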

Next, we derive the following weaker corollary of theorem 4.

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} ω_α ≤ 1 ∀β,  (8.5)

with ω_α defined in equation 7.2.

Proof. Consider the allocation matrix with components A_αβ = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

(1 − σ_α)(1 − σ_α) + 2σ_α(1 − σ_α) = 1 − σ_α² ≤ 1.

Substitution into condition 3 yields

Σ_{α⊃β} (1 − σ_α) ≥ Σ_{α⊃β} 1 − 1, and thus Σ_{α⊃β} σ_α ≤ 1.  (8.6)

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 implies condition 8.6, and therefore also guarantees a unique fixed point; stated in terms of ω_α, the corollary is somewhat weaker than condition 8.6 itself.

Summarizing: the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.
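Both per-node criteria are trivial to evaluate on a pairwise model. A small comparison sketch (function name and example values are ours):

```python
import math

def per_node_strength_sums(edges, omega):
    """Sum the strengths omega_alpha of all potentials containing each node."""
    nodes = {b for e in edges for b in e}
    return {b: sum(w for e, w in zip(edges, omega) if b in e) for b in nodes}

# 3 x 3 toroidal Ising grid with uniform potentials, alpha = 0.6
edges = [(3 * r + c, 3 * r + (c + 1) % 3) for r in range(3) for c in range(3)]
edges += [(3 * r + c, 3 * ((r + 1) % 3) + c) for r in range(3) for c in range(3)]
alpha = 0.6
omega = [2 * math.log(alpha / (1 - alpha))] * len(edges)

sums = per_node_strength_sums(edges, omega)
print(all(s < 4 for s in sums.values()))   # True: the strengthened corollary 3 applies
print(all(s <= 1 for s in sums.values()))  # False: corollary 4 is not satisfied
```

At α = 0.6, each per-node sum is 8 log(1.5) ≈ 3.24: inside the corollary 3 bound but outside the corollary 4 bound, illustrating the factor-4 gap.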

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

( α      1 − α )
( 1 − α  α     ).

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3 we find two minima, one with "spins up" and the other one with "spins down".

In this symmetric problem, the strength of each potential is given by

ω = 2 log[α/(1 − α)], and thus σ = 1 − ((1 − α)/α)².

Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

σ ≤ 1/3, and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

ω < 1, and thus α ≤ 1/(1 + e^(−1/2)) ≈ 0.62.
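Both numerical bounds can be reproduced directly from the closed forms above (this check is ours, not from the article):

```python
import math

# theorem 4 on the torus: sigma <= 1/3, with sigma = 1 - ((1 - alpha)/alpha)**2
alpha_thm4 = 1 / (1 + math.sqrt(2 / 3))
# corollary 3 on the torus: omega < 1, with omega = 2*log(alpha/(1 - alpha))
alpha_cor3 = 1 / (1 + math.exp(-0.5))
print(round(alpha_thm4, 2), round(alpha_cor3, 2))  # 0.55 0.62

# sanity check: at the theorem 4 bound the strength is exactly 1/3
sigma = 1 - ((1 - alpha_thm4) / alpha_thm4) ** 2
assert abs(sigma - 1 / 3) < 1e-12
```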

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

(2 − 2A)σ + 3/4 ≤ 1 and (1/2)σ + A ≤ 1.

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

σ ≤ 1/2, and thus α ≤ 1/(1 + √(1/2)) ≈ 0.58,

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

σ ≤ √(1/2), and thus α ≤ 1/(1 + √(1 − √(1/2))) ≈ 0.65,

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
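Again, the bound follows from the closed form (our check):

```python
import math

# Figure 3c: sigma <= sqrt(1/2) under theorem 4 with A = 1 - sqrt(1/8)
alpha_fig3c = 1 / (1 + math.sqrt(1 - math.sqrt(0.5)))
print(round(alpha_fig3c, 2))  # 0.65

# sanity check: at this alpha the strength is exactly sqrt(1/2)
sigma = 1 - ((1 - alpha_fig3c) / alpha_fig3c) ** 2
assert abs(sigma - math.sqrt(0.5)) < 1e-12
```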

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature".

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms", similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter, and thus closer to necessary conditions.

• The conditions guarantee convexity of the dual G(Q_β, λ_αβ) with respect to Q_β. But in fact we need only G(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_αβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

w = ω ( 0  1 −1 −1 )
      ( 1  0  1 −1 )
      (−1  1  0 −1 )
      (−1 −1 −1  0 ),

zero thresholds, and potentials

Ψ_ij(x_i, x_j) = exp[w_ij/4] if x_i = x_j and Ψ_ij(x_i, x_j) = exp[−w_ij/4] if x_i ≠ x_j.

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
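A minimal damped loopy belief propagation loop for this four-node Boltzmann machine is sketched below; we run it only in the weak-coupling regime (ω = 1), where it settles on the trivial fixed point. The damping convention (linear interpolation of messages) is our assumption for equation 3.9:

```python
import numpy as np

omega = 1.0  # weight strength; the paper reports a transition near 5.8 at small step sizes
W = omega * np.array([[0, 1, -1, -1],
                      [1, 0, 1, -1],
                      [-1, 1, 0, -1],
                      [-1, -1, -1, 0]], dtype=float)
n = len(W)

def psi(i, j):
    # Psi_ij(xi, xj) = exp(w_ij/4) if xi == xj, exp(-w_ij/4) otherwise
    a, b = np.exp(W[i, j] / 4), np.exp(-W[i, j] / 4)
    return np.array([[a, b], [b, a]])

rng = np.random.default_rng(0)
m = {(i, j): rng.random(2) + 0.5 for i in range(n) for j in range(n) if i != j}
for k in m:
    m[k] /= m[k].sum()

step = 0.2  # damping step size: new = (1 - step) * old + step * bp_update
for _ in range(2000):
    new = {}
    for (i, j), old in m.items():
        incoming = np.prod([m[(k, i)] for k in range(n) if k not in (i, j)], axis=0)
        upd = psi(i, j).T @ incoming     # sum over xi of Psi(xi, xj) * incoming(xi)
        upd /= upd.sum()
        new[(i, j)] = (1 - step) * old + step * upd
    m = new

belief = np.prod([m[(k, 0)] for k in range(1, n)], axis=0)
belief /= belief.sum()
print(np.round(belief, 3))  # approaches [0.5, 0.5]: the trivial fixed point
```

Raising omega toward the values discussed above, and varying step, reproduces the qualitative convergent/nonconvergent behavior described in the text.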

7. Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log 2 ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P_1(x_1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


where

Q(X) ≡ Ψ(X) Π_β μ_β(x_β) / Σ_{X′} Ψ(X′) Π_β μ_β(x′_β).

Now suppose that V has a regular extremum (maximum or minimum), not at an extreme point, that is, μ_1(x_1) > 0 for two or more values of x_1. At such an extremum, the first derivative should obey

(n − 1)Q(x_1) − Σ_{β≠1} Q(x_1|x*_β) = λ,

with λ a Lagrange multiplier implementing the constraint Σ_{x_1} μ_1(x_1) = 1. Summing over x_1, we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of μ). For the matrix with second derivatives at such an extremum, we then have

∂²V / ∂μ_1(x_1)∂μ_1(x′_1) = Σ_{β≠1} Σ_{β′≠1, β′≠β} Q(x_1|x*_β) Q(x′_1|x*_β′),

which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of μ_β(x_β), β > 1, it follows by induction that the maximum with respect to all μ_β(x_β) must be at an extreme point as well.

The function V(μ) is, up to a term independent of μ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the maximization over messages μ by maximization over values X̂:

max_μ [Q*(X) / Π_β Q*(x_β)] = max_X̂ [Ψ(X) [Ψ(X̂)]^(n−1) / Π_β Ψ(X̂_\β, x_β)],

with Ψ(X̂_\β, x_β) the potential evaluated at X̂ with the value of node β replaced by x_β.

Next, we take the maximum over X as well and define the "strength" σ, to be used in equation 7.1, through

1/(1 − σ) ≡ max_{X,μ} [Q*(X) / Π_β Q*(x_β)] = max_{X,X̂} [Ψ(X) [Ψ(X̂)]^(n−1) / Π_β Ψ(X̂_\β, x_β)].  (7.4)

The inequality 7.1 then follows by summing out X_\ββ′ in

Q*(X) − Π_β Q*(x_β) ≤ σ Q*(X).

The form of equation 7.2 then follows by rewriting equation 7.4 as

ω ≡ −log(1 − σ) = max_{X,X̂} W(X, X̂), with

W(X, X̂) = ψ(X) + (n − 1)ψ(X̂) − Σ_β ψ(X̂_\β, x_β),

where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes, that is,

if Ψ̃(X) = Ψ(X) Π_β μ_β(x_β), then ω(Ψ̃) = ω(Ψ) for any choice of μ.

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider (with x_\12 the values of the remaining nodes)

W(x_1, x_2, x_\12; x̂_1, x̂_2, x_\12) = ψ(x_1, x_2, x_\12) + ψ(x̂_1, x̂_2, x_\12) − ψ(x̂_1, x_2, x_\12) − ψ(x_1, x̂_2, x_\12)
  = −W(x̂_1, x_2, x_\12; x_1, x̂_2, x_\12).

If now also x̂_2 = x_2, we get W(x_1, x_2, x_\12; x̂_1, x_2, x_\12) = −W(x̂_1, x_2, x_\12; x_1, x_2, x_\12) = 0. Furthermore, if W(x_1, x_2, x_\12; x̂_1, x̂_2, x_\12) ≤ 0, then it must be that W(x̂_1, x_2, x_\12; x_1, x̂_2, x_\12) ≥ 0, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x_1||x_2|(|x_1| − 1)(|x_2| − 1)/4 combinations. And indeed, for binary nodes x_1, x_2 ∈ {0, 1}, we immediately obtain

ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.  (7.5)
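Equation 7.2 also suggests a brute-force computation of the strength for a general (small) potential by enumerating all pairs X, X̂. A sketch with names of our own choosing:

```python
import itertools

def strength(log_psi, cards):
    """omega = max over X, Xhat of psi(X) + (n-1)*psi(Xhat) - sum_b psi(Xhat with node b set to x_b).

    log_psi: function from a tuple of node values to the log potential psi.
    cards: cardinalities of the nodes covered by the potential."""
    n = len(cards)
    states = list(itertools.product(*(range(c) for c in cards)))
    omega = 0.0  # the maximum is nonnegative (X = Xhat yields zero)
    for X in states:
        for Xh in states:
            w = log_psi(X) + (n - 1) * log_psi(Xh)
            for b in range(n):
                Y = list(Xh)
                Y[b] = X[b]
                w -= log_psi(tuple(Y))
            omega = max(omega, w)
    return omega

# pairwise check against equation 7.5: psi(x1, x2) = w * x1 * x2 gives omega = |w|
w = 0.9
print(strength(lambda x: w * x[0] * x[1], (2, 2)))  # 0.9
```

Adding threshold terms θ_1 x_1 + θ_2 x_2 to the log potential leaves the result unchanged, in line with the factorizing-term invariance above.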


2410 T Heskes

bull Generalize the conditions for convexity of the Bethe free energy

bull Incorporate the (local) strength of potentials

bull Scale naturally as a function of the ldquotemperaturerdquo

bull Are invariant to arbitrary definitions of potentials and self-interactions

Although the analysis that led to these conditions may seem quite involvedit basically consists of a relatively straightforward combination of two ob-servations The first observation is that we can exploit the arbitrariness inthe definition of the Bethe free energy when we incorporate the constraintsThis forms the basis of the resource allocation argument And the secondobservation concerns the bound on the correlation of a loopy belief propa-gation marginal that leads to the introduction of the strength of a potential

Besides its theoretical usefulness there are more practical uses Firstalgorithms for guaranteed convergence explicitly minimize the Bethe freeenergy They can be considered ldquobound optimization algorithmsrdquo similarto expectation maximization and iterative proportional fitting in the innerloop they minimize a bound on the Bethe free energy which is then up-dated in the outer loop In practice it appears that the tighter the boundthe faster the convergence (see eg Heskes et al 2003) Instead of a boundthat is convex (Yuille 2002) or convex over the set of constraints (Teh ampWelling 2002 Heskes et al 2003) we might relax the convexity conditionand choose a tighter bound that still has a unique minimum thereby speed-ing up the convergence Second in Wainwright et al (2002) a convexifiedBethe free energy is proposed The arguments for this class of free energiesare twofold they yield a bound on the partition function (instead of justan approximation as the standard Bethe free energy) and have a uniqueminimum Focusing on the second argument the conditions in this articlecan be used to construct Bethe free energies that may not be convex (overthe set of constraints) but do have a unique minimum and being closer tothe standard Bethe free energy may yield better approximations

We can think of the following opportunities to make the sufficient con-ditions derived here stricter and thus closer to necessary conditions

bull The conditions guarantee convexity of the dual G(Qβ λαβ) with re-spect to Qβ But in fact we need only G(Qβ) equiv maxλαβ G(Qβ λαβ) to beconvex which is a weaker requirement The Hessian of G(Qβ) how-ever appears to be more difficult to compute and to analyze in generalbut may lead to stronger results in specific cases (eg only pairwiseinteractions or substituting a particular choice of Aαβ )

bull It may be possible to strengthen the bound equation 71 on loopybelief correlations especially for interactions that involve more thantwo nodes

An important question is how the uniqueness of loopy belief propaga-tion fixed points relates to the convergence of loopy belief propagation

Uniqueness of Loopy Belief Propagation Fixed Points 2411

Intuitively one might expect that if loopy belief propagation has a uniquefixed point it will also converge to it This also seems to be the argumenta-tion in Tatikonda and Jordan (2002) However to the best of our knowledgethere is no proof of such correspondence Furthermore the following set ofsimulations does seem to suggest otherwise

We consider a Boltzmann machine with four binary nodes weights

w = ω

0 1 minus1 minus11 0 1 minus1minus1 1 0 minus1minus1 minus1 minus1 0

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 58 loopy belief propagation failed toconverge to the trivial fixed point even for very small step sizes Howeverrunning a convergent double-loop algorithm from many different initialconditions and many weight strengths considerably larger than 58 we al-ways ended up in the trivial fixed point and never in another one We foundsimilar behavior for a three-node Boltzmann machine (same weight matrixas above except for the fourth node) for very large weights loopy beliefpropagation ends up in a limit cycle whereas a convergent double-loopalgorithm converges to the trivial fixed point which here by corollary 2is guaranteed to be unique In future work we hope to elaborate on theseissues

7 Note that the conditions for guaranteed uniqueness imply ω = 43 for corollary 3and ω = log(2) asymp 069 for theorem 4 both far below the weight strengths where ldquonon-convergentrdquo behavior sets in

2412 T Heskes

0 02 04 06 08 135

4

45

5

55

6

step size

wei

ght s

tren

gth

0 2000495

0505

0 1000

1

Figure 4 The transition between ldquoconvergentrdquo and ldquononconvergentrdquo behavioras a function of the step size used for damping loopy belief propagation andthe weight strength Simulations on a four-node Boltzmann machine The insetsshow the marginal P1(x1 = 1) as a function of the number of loopy belief itera-tions for step size 02 and strength 4 (lower left) and step size 06 and strength6 (upper right) See the text for further detail

Acknowledgments

This work has been supported in part by the Dutch Technology FoundationSTW I thank the anonymous reviewers for their constructive comments andJoris Mooij for computing the critical αcriticalrsquos in section 83

References

Heskes T (2002) Stable fixed points of loopy belief propagation are minima ofthe Bethe free energy In S Becker S Thrun amp K Obermayer (Eds) Advancesin neural information processing systems 15 (pp 359ndash366) Cambridge MA MITPress

Heskes T Albers K amp Kappen B (2003) Approximate inference and con-strained optimization In Uncertainty in artificial intelligence Proceedings of theNineteenth Conference (UAI-2003) (pp 313ndash320) San Francisco Morgan Kauf-mann

Kschischang F Frey B amp Loeliger H (2001) Factor graphs and the sum-product algorithm IEEE Transactions on Information Theory 47(2) 498ndash519

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen S amp Spiegelhalter D (1988) Local computations with probabilitieson graphical structures and their application to expert systems Journal of theRoyal Statistics Society B 50 157ndash224

Luenberger D (1984) Linear and nonlinear programming Reading MA Addison-Wesley

McEliece R MacKay D amp Cheng J (1998) Turbo decoding as an instanceof Pearlrsquos ldquobelief propagationrdquo algorithm IEEE Journal on Selected Areas inCommunication 16(2) 140ndash152

McEliece R amp Yildirim M (2003) Belief propagation on partially ordered setsIn D Gilliam amp J Rosenthal (Eds) Mathematical systems theory in biologycommunications computation and finance (pp 275ndash300) New York Springer

Minka T (2001) Expectation propagation for approximate Bayesian inferenceIn J Breese amp D Koller (Eds) Uncertainty in artificial intelligence Proceedingsof the Seventeenth Conference (UAI-2001) (pp 362ndash369) San Francisco MorganKaufmann

Murphy K Weiss Y amp Jordan M (1999) Loopy belief propagation for ap-proximate inference An empirical study In K Laskey amp H Prade (Eds)Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence(pp 467ndash475) San Francisco Morgan Kaufmann

Pakzad P amp Anantharam V (2002) Belief propagation and statistical physicsIn 2002 Conference on Information Sciences and Systems Princeton NJ PrincetonUniversity

Pearl J (1988) Probabilistic reasoning in intelligent systems Networks of plausibleinference San Francisco Morgan Kaufmann

Tatikonda S amp Jordan M (2002) Loopy belief propagation and Gibbs mea-sures In A Darwiche amp N Friedman (Eds) Uncertainty in artificial intelli-gence Proceedings of the Eighteenth Conference (UAI-2002) (pp 493ndash500) SanFrancisco Morgan Kaufmann

Teh Y amp Welling M (2002) The unified propagation and scaling algorithm InT Dietterich S Becker amp Z Ghahramani (Eds) Advances in neural informationprocessing systems 14 (pp 953ndash960) Cambridge MA MIT Press

Wainwright M Jaakkola T amp Willsky A (2002) A new class of upper boundson the log partition function In A Darwiche amp N Friedman (Eds) Uncer-tainty in artificial intelligence Proceedings of the Eighteenth Conference (UAI-2002)(pp 536ndash543) San Francisco Morgan Kaufmann

Weiss Y (2000) Correctness of local probability propagation in graphical modelswith loops Neural Computation 12(1) 1ndash41

Weiss Y amp Freeman W (2001) Correctness of belief propagation in graphicalmodels with arbitrary topology Neural Computation 13(10) 2173ndash2200

Welling M amp Teh Y (2003) Approximate inference in Boltzmann machinesArtificial Intelligence 143(1) 19ndash50

Yedidia J Freeman W amp Weiss Y (2001) Generalized belief propagation InT Leen T Dietterich amp V Tresp (Eds) Advances in neural information processingsystems 13 (pp 689ndash695) Cambridge MA MIT Press

Yuille A (2002) CCCP algorithms to minimize the Bethe and Kikuchi free ener-gies Convergent alternatives to belief propagation Neural Computation 141691ndash1722

Received December 2 2003 accepted April 29 2004


The inequality 7.1 then follows by summing out $X_{\setminus\{\beta,\beta'\}}$ in

$$Q^*(X) - \prod_\beta Q^*(x_\beta) \le \sigma\, Q^*(X).$$

The form of equation 7.2 then follows by rewriting equation 7.4 as

$$\omega \equiv -\log(1-\sigma) = \max_{X,\hat X} W(X,\hat X), \quad\text{with}\quad W(X,\hat X) = \psi(X) + (n-1)\,\psi(\hat X) - \sum_\beta \psi(\hat X_{\setminus\beta}, x_\beta),$$

where $n$ is the number of nodes in the potential, $(\hat X_{\setminus\beta}, x_\beta)$ denotes $\hat X$ with its component $\hat x_\beta$ replaced by $x_\beta$, and we recall that $\psi(X) \equiv \log \Psi(X)$.

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting.

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is,

$$\text{if}\quad \tilde\Psi(X) = \Psi(X)\prod_\beta \mu_\beta(x_\beta), \quad\text{then}\quad \omega(\tilde\Psi) = \omega(\Psi) \ \text{for any choice of } \mu.$$

This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that only depends on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations $X$ and $\hat X$ that differ in fewer than two nodes. To see this, consider

$$W(x_1, x_2, X_{\setminus 12};\, \hat x_1, \hat x_2, X_{\setminus 12}) = \psi(x_1, x_2, X_{\setminus 12}) + \psi(\hat x_1, \hat x_2, X_{\setminus 12}) - \psi(\hat x_1, x_2, X_{\setminus 12}) - \psi(x_1, \hat x_2, X_{\setminus 12})$$
$$= -W(\hat x_1, x_2, X_{\setminus 12};\, x_1, \hat x_2, X_{\setminus 12}).$$

If now also $\hat x_2 = x_2$, we get $W(x_1, x_2, X_{\setminus 12};\, \hat x_1, x_2, X_{\setminus 12}) = -W(\hat x_1, x_2, X_{\setminus 12};\, x_1, x_2, X_{\setminus 12}) = 0$. Furthermore, if $W(x_1, x_2, X_{\setminus 12};\, \hat x_1, \hat x_2, X_{\setminus 12}) \le 0$, then it must be that $W(\hat x_1, x_2, X_{\setminus 12};\, x_1, \hat x_2, X_{\setminus 12}) \ge 0$, and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, $0 \le \omega < \infty$ and $0 \le \sigma < 1$.


• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to $|x_1||x_2|(|x_1|-1)(|x_2|-1)/4$ combinations. And indeed, for binary nodes $x_{1,2} \in \{0, 1\}$, we immediately obtain

$$\omega = |\psi(0,0) + \psi(1,1) - \psi(0,1) - \psi(1,0)|. \tag{7.5}$$

Any pairwise binary potential can be written as a Boltzmann factor,

$$\Psi(x_1, x_2) \propto \exp[w x_1 x_2 + \theta_1 x_1 + \theta_2 x_2].$$

In this notation we find the simple and intuitive expression $\omega = |w|$: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of $\{-1, 1\}$ coding, the relationship is $\omega = 4|w|$.

• In some models there is the notion of a "temperature" $T$, that is, $\Psi(X) \propto \exp[\psi(X)/T]$, where $\psi(X)$ is considered constant. In obvious notation, we then have $\omega(T) = \omega(1)/T$ and thus $\sigma(T) = 1 - \exp[-\omega(1)/T] = 1 - (1-\sigma(1))^{1/T}$.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature $T$ and then take the limit of $T$ to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take $\sigma(0) = 0$ if $\sigma(1) = 0$ (fake interaction), yet $\sigma(0) = 1$ whenever $\sigma(1) > 0$.
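For pairwise binary potentials, these properties are easy to verify numerically. The sketch below is our own (the function names are not from the article); it computes ω via equation 7.5 and checks both the threshold independence ω = |w| and the temperature scaling of σ:

```python
import math

def strength_omega(psi):
    """Strength of a pairwise binary potential via equation 7.5:
    omega = |psi(0,0) + psi(1,1) - psi(0,1) - psi(1,0)|,
    where psi is the log-potential given as a 2x2 table psi[x1][x2]."""
    return abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

def strength_sigma(omega):
    """The equivalent strength measure sigma = 1 - exp(-omega)."""
    return 1.0 - math.exp(-omega)

# Boltzmann-factor parametrization psi(x1,x2) = w*x1*x2 + th1*x1 + th2*x2.
w, th1, th2 = -1.7, 0.4, -2.3
psi = [[w * x1 * x2 + th1 * x1 + th2 * x2 for x2 in (0, 1)] for x1 in (0, 1)]
omega = strength_omega(psi)
assert abs(omega - abs(w)) < 1e-12   # omega = |w|, independent of the thresholds

# Temperature scaling: dividing psi by T multiplies omega by 1/T, so
# sigma(T) = 1 - (1 - sigma(1))**(1/T).
T = 0.5
psi_T = [[v / T for v in row] for row in psi]
assert abs(strength_sigma(strength_omega(psi_T))
           - (1 - (1 - strength_sigma(omega)) ** (1 / T))) < 1e-12
```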

8 Conditions for Uniqueness

8.1 Main Result

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix $A_{\alpha\beta}$ between potentials α and nodes β with properties

1. $A_{\alpha\beta} \ge 0 \quad \forall_{\alpha,\,\beta\subset\alpha}$ (positivity),

2. $(1-\sigma_\alpha)\max_{\beta\subset\alpha} A_{\alpha\beta} + \sigma_\alpha \sum_{\beta\subset\alpha} A_{\alpha\beta} \le 1 \quad \forall_\alpha$ (sufficient amount of resources),

3. $\sum_{\alpha\supset\beta} A_{\alpha\beta} \ge n_\beta - 1 \quad \forall_\beta$ (sufficient compensation), (8.1)

with the strength $\sigma_\alpha$ a function of the potential $\Psi_\alpha(X_\alpha)$, as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex-concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure $K = K_1 + K_2 + K_3 \ge 0$ for any choice of $R_\beta(x_\beta)$.

Substituting the bound, equation 7.1, into the term $K_3$, we obtain

$$K_3 \ge -\sum_\alpha \sum_{\substack{\beta,\beta'\subset\alpha \\ \beta'\neq\beta}} \sum_{x_\beta, x'_{\beta'}} A_{\alpha\beta} A_{\alpha\beta'}\, \sigma_\alpha\, Q^*_\alpha(x_\beta, x'_{\beta'})\, R_\beta(x_\beta) R_{\beta'}(x'_{\beta'})$$

$$\ge -\sum_\alpha \sigma_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigg[\sum_{\substack{\beta'\subset\alpha \\ \beta'\neq\beta}} A_{\alpha\beta'}\Bigg] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta),$$

where in the last step we applied the same trick as in equation 6.1. Since $K_2 \ge 0$, and combining $K_1$ and (the above lower bound on) $K_3$, we get

$$K = K_1 + K_2 + K_3 \ge \sum_\alpha \sum_{\beta\subset\alpha} \sum_{x_\beta} A_{\alpha\beta} \Bigg[1 - A_{\alpha\beta} - \sigma_\alpha \sum_{\beta'\neq\beta} A_{\alpha\beta'}\Bigg] Q^*_\alpha(x_\beta)\, R^2_\beta(x_\beta).$$

This implies

$$(1-\sigma_\alpha) A_{\alpha\beta} + \sigma_\alpha \sum_{\beta'\subset\alpha} A_{\alpha\beta'} \le 1 \quad \forall_{\alpha,\,\beta\subset\alpha},$$

which, in combination with $A_{\alpha\beta} \ge 0$ and $\sigma_\alpha \le 1$, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if $\sigma_\alpha = 1$ for all potentials α. Furthermore, "fake interactions" play no role: with $\sigma_\alpha = 0$, condition 2 becomes $\max_{\beta\subset\alpha} A_{\alpha\beta} \le 1$, suggesting the choice $A_{\alpha\beta} = 1$ for all $\beta \subset \alpha$, which then effectively reduces the number of neighboring potentials $n_\beta$ in condition 3.
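To make the three conditions concrete, here is a small checker, a sketch of our own rather than code from the article (the graph encoding and function names are ours). It verifies theorem 4 on the 3 × 3 toroidal Ising grid of section 8.3, where the uniform allocation A = 3/4 works exactly up to σ = 1/3:

```python
def theorem4_holds(potentials, sigma, A, tol=1e-12):
    """Check conditions 1-3 of theorem 4 (equation 8.1).

    potentials: dict alpha -> list of member nodes beta
    sigma:      dict alpha -> strength sigma_alpha in [0, 1)
    A:          dict (alpha, beta) -> allocation A_alpha_beta
    """
    nodes = {b for betas in potentials.values() for b in betas}
    # Condition 1: positivity.
    if any(A[a, b] < -tol for a, betas in potentials.items() for b in betas):
        return False
    # Condition 2: sufficient amount of resources, for every potential alpha.
    for a, betas in potentials.items():
        alloc = [A[a, b] for b in betas]
        if (1 - sigma[a]) * max(alloc) + sigma[a] * sum(alloc) > 1 + tol:
            return False
    # Condition 3: sufficient compensation, for every node beta.
    for b in nodes:
        neighbors = [a for a, betas in potentials.items() if b in betas]
        if sum(A[a, b] for a in neighbors) < len(neighbors) - 1 - tol:
            return False
    return True

# 3x3 Ising grid with toroidal boundaries: 9 nodes, 18 edges, every node a
# member of 4 pairwise potentials.
edges = [(i, (i + 1) % 3 + 3 * (i // 3)) for i in range(9)] + \
        [(i, (i + 3) % 9) for i in range(9)]
potentials = {idx: list(e) for idx, e in enumerate(edges)}
A = {(a, b): 0.75 for a, betas in potentials.items() for b in betas}

assert theorem4_holds(potentials, {a: 1 / 3 for a in potentials}, A)
assert not theorem4_holds(potentials, {a: 0.34 for a in potentials}, A)
```

With A = 3/4 everywhere, condition 2 reads (1 − σ)(3/4) + σ(3/2) ≤ 1, which binds exactly at σ = 1/3, matching the illustration in section 8.3.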

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

$$P_{\text{exact}}(X) = \frac{1}{Z}\prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),$$

to be compared with our equation 3.1, where there are no self-potentials $\Psi_\beta(x_\beta)$. With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\Big(\max_{X_\alpha}\psi_\alpha(X_\alpha) - \min_{X_\alpha}\psi_\alpha(X_\alpha)\Big) < 2 \quad \forall_\beta. \tag{8.2}$$

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary, and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha < 4 \quad \forall_\beta, \tag{8.3}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials $\Psi_\beta(x_\beta)$. In fact, it is valid for any choice

$$\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta\subset\alpha}\phi_{\alpha\beta}(x_\beta),$$

where $\psi_\alpha(X_\alpha)$ is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

$$\min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2) - \min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big]
= \min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\big[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\big] - \min_{x_1,x_2}\big[\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2)\big]\Big].$$

In the case of binary nodes (two-by-two matrices $\psi(x_1, x_2)$), it is easy to check that the optimal $\phi_1$ and $\phi_2$ that yield the smallest gap are such that

$$\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2) = \psi(\hat x_1,\hat x_2)+\phi_1(\hat x_1)+\phi_2(\hat x_2)$$
$$\ge \psi(\hat x_1,x_2)+\phi_1(\hat x_1)+\phi_2(x_2) = \psi(x_1,\hat x_2)+\phi_1(x_1)+\phi_2(\hat x_2), \tag{8.4}$$

for some $x_1$, $x_2$, $\hat x_1$, and $\hat x_2$, with $\hat x_1 \neq x_1$ and $\hat x_2 \neq x_2$. Solving for $\phi_1$ and $\phi_2$, we find

$$\phi_1(x_1)-\phi_1(\hat x_1) = \frac{1}{2}\big[\psi(\hat x_1,x_2)-\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(x_1,\hat x_2)\big],$$
$$\phi_2(x_2)-\phi_2(\hat x_2) = \frac{1}{2}\big[\psi(x_1,\hat x_2)-\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(\hat x_1,x_2)\big].$$

Substitution back into equation 8.4 yields

$$\psi(x_1,x_2)+\phi_1(x_1)+\phi_2(x_2) - \psi(\hat x_1,x_2)-\phi_1(\hat x_1)-\phi_2(x_2) = \frac{1}{2}\big[\psi(x_1,x_2)+\psi(\hat x_1,\hat x_2)-\psi(\hat x_1,x_2)-\psi(x_1,\hat x_2)\big],$$

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

$$\min_{\phi_1,\phi_2}\Big[\max_{x_1,x_2}\tilde\psi(x_1,x_2)-\min_{x_1,x_2}\tilde\psi(x_1,x_2)\Big] = \frac{1}{2}\,\big|\psi(0,0)+\psi(1,1)-\psi(0,1)-\psi(1,0)\big| = \frac{\omega}{2},$$

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
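The claim that the optimal self-potentials halve the gap can be checked by brute force. The following sketch is ours, not from the article: it parameterizes $\phi_1$ and $\phi_2$ by their only relevant degrees of freedom, the two offsets, and grid-searches the gap for a random binary log-potential:

```python
import itertools, random

random.seed(0)
# Random 2x2 log-potential psi[x1][x2].
psi = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]

def gap(a, b):
    # Shifted log-potential psi(x1,x2) + phi1(x1) + phi2(x2); only the offsets
    # a = phi1(1) - phi1(0) and b = phi2(1) - phi2(0) can change the gap.
    vals = [psi[x1][x2] + a * x1 + b * x2
            for x1, x2 in itertools.product((0, 1), repeat=2)]
    return max(vals) - min(vals)

# The strength of equation 7.5 is invariant under the shifts.
omega = abs(psi[0][0] + psi[1][1] - psi[0][1] - psi[1][0])

grid = [-4 + 0.04 * i for i in range(201)]   # offsets in [-4, 4], step 0.04
best = min(gap(a, b) for a in grid for b in grid)

assert best >= omega / 2 - 1e-9   # omega/2 lower-bounds the gap for any shift
assert best <= omega / 2 + 0.1    # and the grid search comes close to it
```

The lower bound holds because the combination $v_{00}+v_{11}-v_{01}-v_{10}$ of the shifted values is invariant and at most twice the gap.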

Next we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

$$\sum_{\alpha\supset\beta}\omega_\alpha \le 1 \quad \forall_\beta, \tag{8.5}$$

with $\omega_\alpha$ defined in equation 7.2.

Proof. Consider the allocation matrix with components $A_{\alpha\beta} = 1-\sigma_\alpha$ for all $\beta \subset \alpha$. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) $\sigma_\alpha \le 1$ and (condition 2)

$$(1-\sigma_\alpha)(1-\sigma_\alpha) + 2\sigma_\alpha(1-\sigma_\alpha) = 1-\sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha\supset\beta}(1-\sigma_\alpha) \ge \sum_{\alpha\supset\beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha\supset\beta}\sigma_\alpha \le 1. \tag{8.6}$$

Since $\omega_\alpha = -\log(1-\sigma_\alpha) \ge \sigma_\alpha$, condition 8.5 is weaker than condition 8.6.

Summarizing, the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials and, when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical $\alpha_{\text{critical}} = 2/3 \approx 0.67$. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$\omega = 2\log\Big[\frac{\alpha}{1-\alpha}\Big] \quad\text{and thus}\quad \sigma = 1 - \Big(\frac{1-\alpha}{\alpha}\Big)^2.$$
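Inverting these formulas turns a bound on σ into a bound on α. The sketch below is our own (helper names are hypothetical); it reproduces the two thresholds derived below for the toroidal grid, α ≈ 0.55 from theorem 4 with uniform allocation A = 3/4, and α ≈ 0.62 from corollary 3:

```python
import math

def sigma_of_alpha(alpha):
    # sigma = 1 - ((1 - alpha)/alpha)**2 for the uniform ferromagnetic potential.
    return 1 - ((1 - alpha) / alpha) ** 2

def alpha_of_sigma(sigma):
    # Inverse of the above on alpha in [1/2, 1).
    return 1 / (1 + math.sqrt(1 - sigma))

# Theorem 4 with uniform allocation A = 3/4 binds at sigma = 1/3.
alpha_thm4 = alpha_of_sigma(1 / 3)
# Corollary 3 with four neighboring potentials per node: 4*omega < 4,
# i.e. omega = 2*log(alpha/(1-alpha)) < 1.
alpha_cor3 = 1 / (1 + math.exp(-1 / 2))

assert abs(alpha_thm4 - 1 / (1 + math.sqrt(2 / 3))) < 1e-12
assert round(alpha_thm4, 2) == 0.55
assert round(alpha_cor3, 2) == 0.62
assert abs(sigma_of_alpha(alpha_thm4) - 1 / 3) < 1e-12
```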


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions, and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in (b) and C = 1 − A in (c), the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1+e^{-1/2}} \approx 0.62.$$

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical $\alpha_{\text{critical}} \approx 0.79$. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2-2A)\,\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\,\sigma + A \le 1.$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

$$\sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields $A = 1-\sqrt{1/8}$, with

$$\sigma \le \sqrt{1/2} \quad\text{and thus}\quad \alpha \le \frac{1}{1+\sqrt{1-\sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3 and to be compared with the critical $\alpha_{\text{critical}} \approx 0.88$.
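The by-hand optimization for Figures 3b and 3c can be double-checked numerically. In the sketch below (ours, not from the article), we take the two appearances of condition 2 as given in the text; for Figure 3c we assume, by analogy with 3b, that the binding condition is ½σ + A ≤ 1, which indeed reproduces σ = √(1/2):

```python
import math

# Figure 3b: the two appearances of condition 2 read
#     (2 - 2A)*sigma + 3/4 <= 1   and   (1/2)*sigma + A <= 1.
# Making both bind simultaneously gives sigma = 2*(1 - A) and
# 4*(1 - A)**2 = 1/4, hence A = 3/4 and sigma_max = 1/2.
A_b = 3 / 4
sigma_b = 2 * (1 - A_b)
assert abs((2 - 2 * A_b) * sigma_b + 3 / 4 - 1) < 1e-12   # first condition binds
assert abs(sigma_b / 2 + A_b - 1) < 1e-12                 # second condition binds
alpha_b = 1 / (1 + math.sqrt(1 - sigma_b))
assert 0.58 < alpha_b < 0.59

# Figure 3c: A = 1 - sqrt(1/8); the assumed binding condition
# (1/2)*sigma + A <= 1 then gives sigma_max = 2*(1 - A) = sqrt(1/2).
A_c = 1 - math.sqrt(1 / 8)
sigma_c = 2 * (1 - A_c)
assert abs(sigma_c - math.sqrt(1 / 2)) < 1e-12
alpha_c = 1 / (1 + math.sqrt(1 - sigma_c))
assert 0.64 < alpha_c < 0.66
```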

9 Discussion

In this article we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions, and in that sense they should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to sharpen the sufficient conditions derived here, bringing them closer to necessary conditions:

• The conditions guarantee convexity of the dual $G(Q_\beta, \lambda_{\alpha\beta})$ with respect to $Q_\beta$. But in fact we need only $\hat G(Q_\beta) \equiv \max_{\lambda_{\alpha\beta}} G(Q_\beta, \lambda_{\alpha\beta})$ to be convex, which is a weaker requirement. The Hessian of $\hat G(Q_\beta)$, however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of $A_{\alpha\beta}$).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \ \text{if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \ \text{if } x_i \neq x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with $P_i(x_i) = 0.5$ for all nodes $i$ and $x_i \in \{0, 1\}$, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
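This experiment is easy to reproduce in outline. The sketch below is our own minimal implementation, not the author's code: it runs parallel sum-product with plain linear message damping as a stand-in for the damping of equation 3.9 (whose exact form is not repeated here), starting from slightly perturbed uniform messages. The exact location of the transition depends on the damping form; we only check the qualitative behavior at clearly small and clearly large weights:

```python
import math, random

# Mixed attractive/repulsive couplings from the text: w = omega * M.
M = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_bp(omega, step, iters=3000, seed=1):
    """Damped parallel sum-product on the four-node Boltzmann machine.
    Returns the trajectory of the marginal P_1(x_1 = 1) (node index 0 here)."""
    n = 4
    rng = random.Random(seed)
    # Slightly perturbed uniform messages m_{i->j}(x_j); a perfectly uniform
    # start would sit on the trivial fixed point forever by symmetry.
    msg = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                p = 0.5 + rng.uniform(-0.01, 0.01)
                msg[i, j] = [p, 1 - p]
    traj = []
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = [0.0, 0.0]
            for xj in (0, 1):
                for xi in (0, 1):
                    psi = math.exp(omega * M[i][j] / 4 * (1 if xi == xj else -1))
                    incoming = 1.0
                    for k in range(n):
                        if k != i and k != j:
                            incoming *= msg[k, i][xi]
                    out[xj] += psi * incoming
            z = out[0] + out[1]
            # Linear damping: new = (1 - step)*old + step*(normalized update).
            new[i, j] = [(1 - step) * msg[i, j][x] + step * out[x] / z
                         for x in (0, 1)]
        msg = new
        belief = [math.prod(msg[k, 0][x] for k in range(1, n)) for x in (0, 1)]
        traj.append(belief[1] / (belief[0] + belief[1]))
    return traj

small = run_bp(omega=1.0, step=0.5)
assert abs(small[-1] - 0.5) < 1e-4       # small weights: trivial fixed point

large = run_bp(omega=6.0, step=0.6)
tail = large[-200:]
assert max(tail) - min(tail) > 0.01      # large weights: limit cycle, no convergence
```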

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work we hope to elaborate on these issues.

7 Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


[Figure 4 appears here: weight strength (from 3.5 to 6) plotted against step size (from 0 to 1), with two insets plotting P1(x1 = 1) against the iteration number.]

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen S amp Spiegelhalter D (1988) Local computations with probabilitieson graphical structures and their application to expert systems Journal of theRoyal Statistics Society B 50 157ndash224

Luenberger D (1984) Linear and nonlinear programming Reading MA Addison-Wesley

McEliece R MacKay D amp Cheng J (1998) Turbo decoding as an instanceof Pearlrsquos ldquobelief propagationrdquo algorithm IEEE Journal on Selected Areas inCommunication 16(2) 140ndash152

McEliece R amp Yildirim M (2003) Belief propagation on partially ordered setsIn D Gilliam amp J Rosenthal (Eds) Mathematical systems theory in biologycommunications computation and finance (pp 275ndash300) New York Springer

Minka T (2001) Expectation propagation for approximate Bayesian inferenceIn J Breese amp D Koller (Eds) Uncertainty in artificial intelligence Proceedingsof the Seventeenth Conference (UAI-2001) (pp 362ndash369) San Francisco MorganKaufmann

Murphy K Weiss Y amp Jordan M (1999) Loopy belief propagation for ap-proximate inference An empirical study In K Laskey amp H Prade (Eds)Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence(pp 467ndash475) San Francisco Morgan Kaufmann

Pakzad P amp Anantharam V (2002) Belief propagation and statistical physicsIn 2002 Conference on Information Sciences and Systems Princeton NJ PrincetonUniversity

Pearl J (1988) Probabilistic reasoning in intelligent systems Networks of plausibleinference San Francisco Morgan Kaufmann

Tatikonda S amp Jordan M (2002) Loopy belief propagation and Gibbs mea-sures In A Darwiche amp N Friedman (Eds) Uncertainty in artificial intelli-gence Proceedings of the Eighteenth Conference (UAI-2002) (pp 493ndash500) SanFrancisco Morgan Kaufmann

Teh Y amp Welling M (2002) The unified propagation and scaling algorithm InT Dietterich S Becker amp Z Ghahramani (Eds) Advances in neural informationprocessing systems 14 (pp 953ndash960) Cambridge MA MIT Press

Wainwright M Jaakkola T amp Willsky A (2002) A new class of upper boundson the log partition function In A Darwiche amp N Friedman (Eds) Uncer-tainty in artificial intelligence Proceedings of the Eighteenth Conference (UAI-2002)(pp 536ndash543) San Francisco Morgan Kaufmann

Weiss Y (2000) Correctness of local probability propagation in graphical modelswith loops Neural Computation 12(1) 1ndash41

Weiss Y amp Freeman W (2001) Correctness of belief propagation in graphicalmodels with arbitrary topology Neural Computation 13(10) 2173ndash2200

Welling M amp Teh Y (2003) Approximate inference in Boltzmann machinesArtificial Intelligence 143(1) 19ndash50

Yedidia J Freeman W amp Weiss Y (2001) Generalized belief propagation InT Leen T Dietterich amp V Tresp (Eds) Advances in neural information processingsystems 13 (pp 689ndash695) Cambridge MA MIT Press

Yuille A (2002) CCCP algorithms to minimize the Bethe and Kikuchi free ener-gies Convergent alternatives to belief propagation Neural Computation 141691ndash1722

Received December 2 2003 accepted April 29 2004

Uniqueness of Loopy Belief Propagation Fixed Points 2403

• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x₁||x₂|(|x₁| − 1)(|x₂| − 1)/4 combinations. And indeed, for binary nodes x_{1,2} ∈ {0, 1}, we immediately obtain

ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.   (7.5)

Any pairwise binary potential can be written as a Boltzmann factor,

Ψ(x₁, x₂) ∝ exp[w x₁ x₂ + θ₁ x₁ + θ₂ x₂].

In this notation we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ(X)/T], where ψ(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief-revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.
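As a quick numerical check of equation 7.5 and the temperature scaling above, the following Python sketch computes the strength of a binary pairwise potential; the helper name `strength` and the example numbers are ours, not the paper's.

```python
import numpy as np

def strength(psi):
    """omega = |psi(0,0) + psi(1,1) - psi(0,1) - psi(1,0)| for a 2 x 2
    log-potential psi, and sigma = 1 - exp(-omega)."""
    omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])
    return omega, 1.0 - np.exp(-omega)

# Boltzmann factor in 0/1 coding: psi(x1, x2) = w*x1*x2 + th1*x1 + th2*x2
w, th1, th2 = 1.7, 0.3, -0.8
x = np.array([0.0, 1.0])
psi = w * np.outer(x, x) + th1 * x[:, None] + th2 * x[None, :]
omega, sigma = strength(psi)
assert np.isclose(omega, abs(w))          # omega = |w|, thresholds drop out

# temperature scaling: sigma(T) = 1 - [1 - sigma(1)]**(1/T)
T = 2.5
omega_T, sigma_T = strength(psi / T)
assert np.isclose(omega_T, omega / T)
assert np.isclose(sigma_T, 1 - (1 - sigma) ** (1 / T))
```

The thresholds θ₁, θ₂ cancel in the alternating sum, which is exactly why ω depends on the "weight" only.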

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix A_{αβ} between potentials α and nodes β with properties

1. A_{αβ} ≥ 0  ∀_{α, β⊂α}  (positivity)

2. (1 − σ_α) max_{β⊂α} A_{αβ} + σ_α Σ_{β⊂α} A_{αβ} ≤ 1  ∀_α  (sufficient amount of resources)

3. Σ_{α⊃β} A_{αβ} ≥ n_β − 1  ∀_β  (sufficient compensation)   (8.1)

with the strength σ_α a function of the potential Ψ_α(X_α), as defined in equation 7.2.
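The three conditions of theorem 4 are easy to check mechanically for a given graph and allocation matrix. A minimal sketch (function and variable names are ours; the grid example anticipates section 8.3):

```python
import numpy as np

def theorem4_holds(A, sigma, nodes_of, n_nodes):
    """Check the three conditions of equation 8.1.

    A[a]        dict node -> allocation A_{alpha,beta} for potential a
    sigma[a]    strength of potential a
    nodes_of[a] tuple of nodes beta contained in potential a
    """
    # condition 1: positivity
    if any(A[a][b] < 0 for a in range(len(A)) for b in nodes_of[a]):
        return False
    # condition 2: sufficient amount of resources, per potential alpha
    for a in range(len(A)):
        vals = [A[a][b] for b in nodes_of[a]]
        if (1 - sigma[a]) * max(vals) + sigma[a] * sum(vals) > 1 + 1e-12:
            return False
    # condition 3: sufficient compensation, per node beta
    for b in range(n_nodes):
        neigh = [a for a in range(len(A)) if b in nodes_of[a]]
        if sum(A[a][b] for a in neigh) < len(neigh) - 1 - 1e-12:
            return False
    return True

# example: 3 x 3 Ising grid with toroidal boundaries (section 8.3)
edges = []
for i in range(3):
    for j in range(3):
        edges.append((3 * i + j, 3 * i + (j + 1) % 3))    # horizontal
        edges.append((3 * i + j, 3 * ((i + 1) % 3) + j))  # vertical
A = [{b: 0.75 for b in e} for e in edges]
assert theorem4_holds(A, [1 / 3] * 18, edges, 9)
assert not theorem4_holds(A, [0.4] * 18, edges, 9)
```

With uniform allocations A = 3/4 on the toroidal grid, condition 2 holds exactly at σ = 1/3, matching the illustration worked out in section 8.3.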

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex/concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K₁ + K₂ + K₃ ≥ 0 for any choice of R_β(x_β).

Substituting the bound, equation 7.1, into the term K₃, we obtain

K₃ ≥ −Σ_α Σ_{β,β′⊂α; β′≠β} Σ_{x_β, x′_{β′}} A_{αβ} A_{αβ′} σ_α Q*_α(x_β, x′_{β′}) R_β(x_β) R_{β′}(x′_{β′})

  ≥ −Σ_α σ_α Σ_{β⊂α} Σ_{x_β} A_{αβ} [ Σ_{β′⊂α; β′≠β} A_{αβ′} ] Q*_α(x_β) R_β(x_β)²,

where in the last step we applied the same trick as in equation 6.1. Since K₂ ≥ 0, and combining K₁ and (the above lower bound on) K₃, we get

K = K₁ + K₂ + K₃ ≥ Σ_α Σ_{β⊂α} Σ_{x_β} A_{αβ} [ 1 − A_{αβ} − σ_α Σ_{β′≠β} A_{αβ′} ] Q*_α(x_β) R_β(x_β)².

This implies

(1 − σ_α) A_{αβ} + σ_α Σ_{β′⊂α} A_{αβ′} ≤ 1  ∀_{α, β⊂α},

which in combination with A_{αβ} ≥ 0 and σ_α ≤ 1 yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences.

We get back the stricter conditions of theorem 1 if σ_α = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σ_α = 0, condition 2 becomes max_{β⊂α} A_{αβ} ≤ 1, suggesting the choice A_{αβ} = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials n_β in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

P_exact(X) = (1/Z) Π_α Ψ_α(X_α) Π_β Ψ_β(x_β),

to be compared with our equation 3.1, where there are no self-potentials Ψ_β(x_β). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} ( max_{X_α} ψ_α(X_α) − min_{X_α} ψ_α(X_α) ) < 2  ∀_β.   (8.2)

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

Σ_{α⊃β} ω_α < 4  ∀_β,   (8.3)

with ω_α defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψ_β(x_β). In fact, it is valid for any choice

ψ̃_α(X_α) = ψ_α(X_α) + Σ_{β⊂α} φ_{αβ}(x_β),

where ψ_α(X_α) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder and thus better conditions. Omitting α and renumbering the nodes from 1 to 2, we have

min_{φ₁,φ₂} [ max_{x₁,x₂} ψ̃(x₁, x₂) − min_{x₁,x₂} ψ̃(x₁, x₂) ]
 = min_{φ₁,φ₂} [ max_{x₁,x₂} [ψ(x₁, x₂) + φ₁(x₁) + φ₂(x₂)] − min_{x₁,x₂} [ψ(x₁, x₂) + φ₁(x₁) + φ₂(x₂)] ].

In the case of binary nodes (two-by-two matrices ψ(x₁, x₂)), it is easy to check that the optimal φ₁ and φ₂ that yield the smallest gap are such that

ψ̃(x̂₁, x̂₂) = ψ̃(x̌₁, x̌₂) ≥ ψ̃(x̂₁, x̌₂) = ψ̃(x̌₁, x̂₂),   (8.4)

for some x̂₁, x̂₂, x̌₁, and x̌₂ with x̌₁ ≠ x̂₁ and x̌₂ ≠ x̂₂. Solving for φ₁ and φ₂, we find

φ₁(x̂₁) − φ₁(x̌₁) = ½ [ψ(x̌₁, x̂₂) − ψ(x̂₁, x̂₂) + ψ(x̌₁, x̌₂) − ψ(x̂₁, x̌₂)]
φ₂(x̂₂) − φ₂(x̌₂) = ½ [ψ(x̂₁, x̌₂) − ψ(x̂₁, x̂₂) + ψ(x̌₁, x̌₂) − ψ(x̌₁, x̂₂)].

Substitution back into equation 8.4 yields

ψ̃(x̂₁, x̂₂) − ψ̃(x̂₁, x̌₂) = ψ(x̂₁, x̂₂) + φ₂(x̂₂) − ψ(x̂₁, x̌₂) − φ₂(x̌₂)
 = ½ [ψ(x̂₁, x̂₂) + ψ(x̌₁, x̌₂) − ψ(x̂₁, x̌₂) − ψ(x̌₁, x̂₂)],

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid, since they yield the same negative gap. Enumerating these combinations, we find

min_{φ₁,φ₂} [ max_{x₁,x₂} ψ̃(x₁, x₂) − min_{x₁,x₂} ψ̃(x₁, x₂) ] = ½ |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)| = ω/2,

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
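The optimized gap in this proof can be checked by brute force: for a random 2×2 log-potential ψ, a grid search over the only degrees of freedom that matter, the differences a = φ₁(1) − φ₁(0) and b = φ₂(1) − φ₂(0), should approach ω/2 from above. A sketch (helper names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=(2, 2))
omega = abs(psi[0, 0] + psi[1, 1] - psi[0, 1] - psi[1, 0])

def gap(a, b):
    """max - min of psi(x1,x2) + a*x1 + b*x2 over the four binary states."""
    shifted = psi + a * np.array([[0.0], [1.0]]) + b * np.array([[0.0, 1.0]])
    return shifted.max() - shifted.min()

# coarse grid search over the shifts a and b
grid = np.linspace(-5, 5, 401)
best = min(gap(a, b) for a in grid for b in grid)
assert best >= omega / 2 - 1e-9     # omega/2 is a lower bound on any gap ...
assert best <= omega / 2 + 0.1      # ... and the grid search gets close to it
```

The lower bound follows directly from |s(0,0) + s(1,1) − s(0,1) − s(1,0)| ≤ 2(max s − min s) for the shifted matrix s, in which a and b cancel.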

Next we derive the following weaker corollary of theorem 4.

Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} ω_α ≤ 1  ∀_β,   (8.5)

with ω_α defined in equation 7.2.

Proof. Consider the allocation matrix with components A_{αβ} = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

(1 − σ_α)(1 − σ_α) + 2σ_α(1 − σ_α) = 1 − σ_α² ≤ 1.

Substitution into condition 3 yields

Σ_{α⊃β} (1 − σ_α) ≥ Σ_{α⊃β} 1 − 1, and thus Σ_{α⊃β} σ_α ≤ 1.   (8.6)

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 is weaker than condition 8.6.
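A small numerical sanity check of this proof: the choice A_{αβ} = 1 − σ_α satisfies conditions 1 and 2 for any σ_α ∈ [0, 1], and ω = −log(1 − σ) ≥ σ holds throughout (variable names ours):

```python
import numpy as np

# pairwise potentials: two nodes per potential, allocation A = 1 - sigma
for sigma in np.linspace(0.0, 1.0, 101):
    A = 1.0 - sigma
    assert A >= 0.0                                  # condition 1
    lhs = (1 - sigma) * A + sigma * 2 * A            # condition 2
    assert np.isclose(lhs, 1 - sigma ** 2) and lhs <= 1 + 1e-12

# omega >= sigma, so sum(omega) <= 1 implies sum(sigma) <= 1 (equation 8.6)
s = np.linspace(0.0, 0.999, 500)
assert np.all(-np.log(1 - s) >= s - 1e-12)
```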

Summarizing, the conditions in Tatikonda and Jordan (2002) are for binary pairwise potentials, when strengthened as above, at most a constant (factor 4) less strict, and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

( α      1 − α )
( 1 − α    α   ).

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3, we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

ω = 2 log[α/(1 − α)], and thus σ = 1 − ((1 − α)/α)².

Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

σ ≤ 1/3, and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

ω < 1, and thus α ≤ 1/(1 + e^{−1/2}) ≈ 0.62.

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand, with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

(2 − 2A)σ + 3/4 ≤ 1 and ½σ + A ≤ 1.

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

σ ≤ 1/2, and thus α ≤ 1/(1 + √(1/2)) ≈ 0.58,

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis following the same recipe as for Figure 3b yields A = 1 − √(1/8), with

σ ≤ √(1/2), and thus α ≤ 1/(1 + √(1 − √(1/2))) ≈ 0.65,

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
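The critical values of α quoted in this section follow from inverting σ = 1 − ((1 − α)/α)², which gives α = 1/(1 + √(1 − σ)). A quick numerical check (helper name ours):

```python
import numpy as np

def alpha_max(sigma_max):
    """Invert sigma = 1 - ((1-alpha)/alpha)**2 at sigma = sigma_max."""
    return 1.0 / (1.0 + np.sqrt(1.0 - sigma_max))

# toroidal grid: theorem 4 gives sigma <= 1/3, corollary 3 gives omega < 1
assert np.isclose(alpha_max(1 / 3), 0.55, atol=0.005)
assert np.isclose(1 / (1 + np.exp(-0.5)), 0.62, atol=0.005)
# aperiodic grid (Figure 3b): sigma <= 1/2
assert np.isclose(alpha_max(1 / 2), 0.586, atol=0.005)
# two-loop graph (Figure 3c): sigma <= sqrt(1/2)
assert np.isclose(alpha_max(np.sqrt(1 / 2)), 0.649, atol=0.005)
```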

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:

• They generalize the conditions for convexity of the Bethe free energy.

• They incorporate the (local) strength of potentials.

• They scale naturally as a function of the "temperature."

• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_{αβ}) with respect to Q_β. But in fact we need only G(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_{αβ}).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

w = ω (  0   1  −1  −1
         1   0   1  −1
        −1   1   0  −1
        −1  −1  −1   0 ),

zero thresholds, and potentials

Ψ_ij(x_i, x_j) = exp[w_ij/4] if x_i = x_j and Ψ_ij(x_i, x_j) = exp[−w_ij/4] if x_i ≠ x_j.

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i ∈ {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
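The experiment just described can be sketched as follows: damped sum-product updates on the four-node Boltzmann machine, with old and new messages mixed according to the step size. This is our own minimal reimplementation, not the author's code; for a small weight strength it settles on the trivial fixed point P_i(x_i) = 0.5.

```python
import numpy as np

def damped_lbp(w, omega, step, iters=500, seed=0):
    """Damped loopy belief propagation on a Boltzmann machine with zero
    thresholds: Psi_ij = exp(+w_ij/4) if x_i = x_j, exp(-w_ij/4) otherwise.
    Returns the belief P_i(x_i = 1) for every node."""
    n = w.shape[0]
    W = omega * w
    # pairwise potentials psi[i, j, x_i, x_j]
    psi = np.empty((n, n, 2, 2))
    for xi in (0, 1):
        for xj in (0, 1):
            psi[:, :, xi, xj] = np.exp((1 if xi == xj else -1) * W / 4)
    rng = np.random.default_rng(seed)
    m = rng.uniform(0.4, 0.6, size=(n, n, 2))   # messages m[i, j, x_j]
    m /= m.sum(axis=2, keepdims=True)
    for _ in range(iters):
        new = m.copy()
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                prod = np.ones(2)               # incoming to i, excluding j
                for k in range(n):
                    if k != i and k != j:
                        prod *= m[k, i]
                msg = psi[i, j].T @ prod        # sum over x_i
                new[i, j] = msg / msg.sum()
        m = (1 - step) * m + step * new         # damping with the step size
    belief = np.ones((n, 2))
    for i in range(n):
        for k in range(n):
            if k != i:
                belief[i] *= m[k, i]
    return belief[:, 1] / belief.sum(axis=1)

w = np.array([[0, 1, -1, -1],
              [1, 0, 1, -1],
              [-1, 1, 0, -1],
              [-1, -1, -1, 0]], dtype=float)
p = damped_lbp(w, omega=1.0, step=0.5)
assert np.allclose(p, 0.5, atol=1e-4)   # small weights: trivial fixed point
```

Raising ω toward the transition region of Figure 4 makes the same iteration fail to settle, consistent with the behavior reported in the text.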

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work we hope to elaborate on these issues.

⁷ Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


[Figure 4 here: weight strength (3.5 to 6) on the vertical axis versus step size (0 to 1) on the horizontal axis, with two insets showing P₁(x₁ = 1) over the loopy belief iterations.]

Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P₁(x₁ = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


σ le 12

and thus α le 11+radic12

asymp 058

still slightly worse than the condition from corollary 3An example in which the condition obtained with theorem 4 is better

than the one from corollary 3 is given in Figure 3c Straightforward analysisfollowing the same recipe as for Figure 3b yields A = 1minusradic18 with

σ leradic

12

and thus α le 1

1+radic

1minusradic12asymp 065

better than theα lt 062 from corollary 3 and to be compared with the criticalαcritical asymp 088

9 Discussion

In this article we derived sufficient conditions for loopy belief propagationto have just a single fixed point These conditions remain much too strong tobe anywhere near the necessary conditions and in that sense should be seenas no more than a first step These conditions have the following positivefeatures

2410 T Heskes

bull Generalize the conditions for convexity of the Bethe free energy

bull Incorporate the (local) strength of potentials

bull Scale naturally as a function of the ldquotemperaturerdquo

bull Are invariant to arbitrary definitions of potentials and self-interactions

Although the analysis that led to these conditions may seem quite involvedit basically consists of a relatively straightforward combination of two ob-servations The first observation is that we can exploit the arbitrariness inthe definition of the Bethe free energy when we incorporate the constraintsThis forms the basis of the resource allocation argument And the secondobservation concerns the bound on the correlation of a loopy belief propa-gation marginal that leads to the introduction of the strength of a potential

Besides its theoretical usefulness there are more practical uses Firstalgorithms for guaranteed convergence explicitly minimize the Bethe freeenergy They can be considered ldquobound optimization algorithmsrdquo similarto expectation maximization and iterative proportional fitting in the innerloop they minimize a bound on the Bethe free energy which is then up-dated in the outer loop In practice it appears that the tighter the boundthe faster the convergence (see eg Heskes et al 2003) Instead of a boundthat is convex (Yuille 2002) or convex over the set of constraints (Teh ampWelling 2002 Heskes et al 2003) we might relax the convexity conditionand choose a tighter bound that still has a unique minimum thereby speed-ing up the convergence Second in Wainwright et al (2002) a convexifiedBethe free energy is proposed The arguments for this class of free energiesare twofold they yield a bound on the partition function (instead of justan approximation as the standard Bethe free energy) and have a uniqueminimum Focusing on the second argument the conditions in this articlecan be used to construct Bethe free energies that may not be convex (overthe set of constraints) but do have a unique minimum and being closer tothe standard Bethe free energy may yield better approximations

We can think of the following opportunities to make the sufficient con-ditions derived here stricter and thus closer to necessary conditions

bull The conditions guarantee convexity of the dual G(Qβ λαβ) with re-spect to Qβ But in fact we need only G(Qβ) equiv maxλαβ G(Qβ λαβ) to beconvex which is a weaker requirement The Hessian of G(Qβ) how-ever appears to be more difficult to compute and to analyze in generalbut may lead to stronger results in specific cases (eg only pairwiseinteractions or substituting a particular choice of Aαβ )

bull It may be possible to strengthen the bound equation 71 on loopybelief correlations especially for interactions that involve more thantwo nodes

An important question is how the uniqueness of loopy belief propaga-tion fixed points relates to the convergence of loopy belief propagation

Uniqueness of Loopy Belief Propagation Fixed Points 2411

Intuitively one might expect that if loopy belief propagation has a uniquefixed point it will also converge to it This also seems to be the argumenta-tion in Tatikonda and Jordan (2002) However to the best of our knowledgethere is no proof of such correspondence Furthermore the following set ofsimulations does seem to suggest otherwise

We consider a Boltzmann machine with four binary nodes weights

w = ω

0 1 minus1 minus11 0 1 minus1minus1 1 0 minus1minus1 minus1 minus1 0

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 58 loopy belief propagation failed toconverge to the trivial fixed point even for very small step sizes Howeverrunning a convergent double-loop algorithm from many different initialconditions and many weight strengths considerably larger than 58 we al-ways ended up in the trivial fixed point and never in another one We foundsimilar behavior for a three-node Boltzmann machine (same weight matrixas above except for the fourth node) for very large weights loopy beliefpropagation ends up in a limit cycle whereas a convergent double-loopalgorithm converges to the trivial fixed point which here by corollary 2is guaranteed to be unique In future work we hope to elaborate on theseissues

7 Note that the conditions for guaranteed uniqueness imply ω = 43 for corollary 3and ω = log(2) asymp 069 for theorem 4 both far below the weight strengths where ldquonon-convergentrdquo behavior sets in

2412 T Heskes

0 02 04 06 08 135

4

45

5

55

6

step size

wei

ght s

tren

gth

0 2000495

0505

0 1000

1

Figure 4 The transition between ldquoconvergentrdquo and ldquononconvergentrdquo behavioras a function of the step size used for damping loopy belief propagation andthe weight strength Simulations on a four-node Boltzmann machine The insetsshow the marginal P1(x1 = 1) as a function of the number of loopy belief itera-tions for step size 02 and strength 4 (lower left) and step size 06 and strength6 (upper right) See the text for further detail

Acknowledgments

This work has been supported in part by the Dutch Technology FoundationSTW I thank the anonymous reviewers for their constructive comments andJoris Mooij for computing the critical αcriticalrsquos in section 83

References

Heskes T (2002) Stable fixed points of loopy belief propagation are minima ofthe Bethe free energy In S Becker S Thrun amp K Obermayer (Eds) Advancesin neural information processing systems 15 (pp 359ndash366) Cambridge MA MITPress

Heskes T Albers K amp Kappen B (2003) Approximate inference and con-strained optimization In Uncertainty in artificial intelligence Proceedings of theNineteenth Conference (UAI-2003) (pp 313ndash320) San Francisco Morgan Kauf-mann

Kschischang F Frey B amp Loeliger H (2001) Factor graphs and the sum-product algorithm IEEE Transactions on Information Theory 47(2) 498ndash519

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen S amp Spiegelhalter D (1988) Local computations with probabilitieson graphical structures and their application to expert systems Journal of theRoyal Statistics Society B 50 157ndash224

Luenberger D (1984) Linear and nonlinear programming Reading MA Addison-Wesley

McEliece R MacKay D amp Cheng J (1998) Turbo decoding as an instanceof Pearlrsquos ldquobelief propagationrdquo algorithm IEEE Journal on Selected Areas inCommunication 16(2) 140ndash152

McEliece R amp Yildirim M (2003) Belief propagation on partially ordered setsIn D Gilliam amp J Rosenthal (Eds) Mathematical systems theory in biologycommunications computation and finance (pp 275ndash300) New York Springer

Minka T (2001) Expectation propagation for approximate Bayesian inferenceIn J Breese amp D Koller (Eds) Uncertainty in artificial intelligence Proceedingsof the Seventeenth Conference (UAI-2001) (pp 362ndash369) San Francisco MorganKaufmann

Murphy K Weiss Y amp Jordan M (1999) Loopy belief propagation for ap-proximate inference An empirical study In K Laskey amp H Prade (Eds)Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence(pp 467ndash475) San Francisco Morgan Kaufmann

Pakzad P amp Anantharam V (2002) Belief propagation and statistical physicsIn 2002 Conference on Information Sciences and Systems Princeton NJ PrincetonUniversity

Pearl J (1988) Probabilistic reasoning in intelligent systems Networks of plausibleinference San Francisco Morgan Kaufmann

Tatikonda S amp Jordan M (2002) Loopy belief propagation and Gibbs mea-sures In A Darwiche amp N Friedman (Eds) Uncertainty in artificial intelli-gence Proceedings of the Eighteenth Conference (UAI-2002) (pp 493ndash500) SanFrancisco Morgan Kaufmann

Teh Y amp Welling M (2002) The unified propagation and scaling algorithm InT Dietterich S Becker amp Z Ghahramani (Eds) Advances in neural informationprocessing systems 14 (pp 953ndash960) Cambridge MA MIT Press

Wainwright M Jaakkola T amp Willsky A (2002) A new class of upper boundson the log partition function In A Darwiche amp N Friedman (Eds) Uncer-tainty in artificial intelligence Proceedings of the Eighteenth Conference (UAI-2002)(pp 536ndash543) San Francisco Morgan Kaufmann

Weiss Y (2000) Correctness of local probability propagation in graphical modelswith loops Neural Computation 12(1) 1ndash41

Weiss Y amp Freeman W (2001) Correctness of belief propagation in graphicalmodels with arbitrary topology Neural Computation 13(10) 2173ndash2200

Welling M amp Teh Y (2003) Approximate inference in Boltzmann machinesArtificial Intelligence 143(1) 19ndash50

Yedidia J Freeman W amp Weiss Y (2001) Generalized belief propagation InT Leen T Dietterich amp V Tresp (Eds) Advances in neural information processingsystems 13 (pp 689ndash695) Cambridge MA MIT Press

Yuille A (2002) CCCP algorithms to minimize the Bethe and Kikuchi free ener-gies Convergent alternatives to belief propagation Neural Computation 141691ndash1722

Received December 2 2003 accepted April 29 2004


with a single loop, and Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems, based on the concept of computation trees, are derived for pairwise potentials, it should be possible to extend them to more general factor graphs.

The setup in Tatikonda and Jordan (2002) is slightly different: it is based on the factorization

\[
P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha) \prod_\beta \Psi_\beta(x_\beta),
\]

to be compared with our equation 3.1, where there are no self-potentials Ψβ(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \Big( \max_{X_\alpha} \psi_\alpha(X_\alpha) - \min_{X_\alpha} \psi_\alpha(X_\alpha) \Big) < 2 \quad \forall \beta. \tag{8.2}
\]

To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and the condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This then leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha < 4 \quad \forall \beta, \tag{8.3}
\]

with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials Ψβ(xβ). In fact, it is valid for any choice

\[
\tilde\psi_\alpha(X_\alpha) = \psi_\alpha(X_\alpha) + \sum_{\beta \subset \alpha} \phi_{\alpha\beta}(x_\beta),
\]

where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting α and renumbering the nodes from 1 to 2, we have

\[
\min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} \tilde\psi(x_1, x_2) - \min_{x_1,x_2} \tilde\psi(x_1, x_2) \Big]
= \min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] - \min_{x_1,x_2} \big[ \psi(x_1, x_2) + \phi_1(x_1) + \phi_2(x_2) \big] \Big].
\]

In the case of binary nodes (two-by-two matrices ψ(x1, x2)), it is easy to check that the optimal φ1 and φ2 that yield the smallest gap are such that

\[
\tilde\psi(x_1, x_2) = \tilde\psi(\bar x_1, \bar x_2) \ge \tilde\psi(x_1, \bar x_2) = \tilde\psi(\bar x_1, x_2), \tag{8.4}
\]

for some x1, x2, x̄1, and x̄2 with x̄1 ≠ x1 and x̄2 ≠ x2. Solving for φ1 and φ2, we find

\[
\phi_1(x_1) - \phi_1(\bar x_1) = \tfrac{1}{2} \big[ \psi(\bar x_1, x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) \big]
\]
\[
\phi_2(x_2) - \phi_2(\bar x_2) = \tfrac{1}{2} \big[ \psi(x_1, \bar x_2) - \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(\bar x_1, x_2) \big].
\]

Substitution back into equation 8.4 yields

\[
\tilde\psi(x_1, x_2) - \tilde\psi(x_1, \bar x_2) = \tfrac{1}{2} \big[ \psi(x_1, x_2) + \psi(\bar x_1, \bar x_2) - \psi(x_1, \bar x_2) - \psi(\bar x_1, x_2) \big],
\]

which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find

\[
\min_{\phi_1,\phi_2} \Big[ \max_{x_1,x_2} \tilde\psi(x_1, x_2) - \min_{x_1,x_2} \tilde\psi(x_1, x_2) \Big]
= \tfrac{1}{2} \big| \psi(0, 0) + \psi(1, 1) - \psi(0, 1) - \psi(1, 0) \big| = \frac{\omega}{2},
\]

from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3.
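The closed-form optimum is easy to check numerically. The following sketch (our illustration, not from the paper; the potential values are arbitrary) builds the gap for a generic pair of self-potentials φ1, φ2, plugs in the differences derived in the proof, and confirms that the resulting gap equals ω/2:

```python
import itertools

# An arbitrary 2x2 potential psi(x1, x2) on binary nodes (hypothetical values).
psi = {(0, 0): 1.3, (0, 1): 0.2, (1, 0): -0.4, (1, 1): 0.9}

def gap(a, b):
    """max - min of psi~(x1, x2) = psi(x1, x2) + phi1(x1) + phi2(x2).
    Only the differences a = phi1(1) - phi1(0) and b = phi2(1) - phi2(0)
    matter: constant offsets shift max and min alike."""
    vals = [psi[(x1, x2)] + a * x1 + b * x2
            for x1, x2 in itertools.product((0, 1), repeat=2)]
    return max(vals) - min(vals)

# The differences derived in the proof, taking x1 = x2 = 0 and xbar1 = xbar2 = 1.
a_opt = -0.5 * (psi[(1, 0)] - psi[(0, 0)] + psi[(1, 1)] - psi[(0, 1)])
b_opt = -0.5 * (psi[(0, 1)] - psi[(0, 0)] + psi[(1, 1)] - psi[(1, 0)])

omega = abs(psi[(0, 0)] + psi[(1, 1)] - psi[(0, 1)] - psi[(1, 0)])
print(gap(0.0, 0.0))      # gap without self-potentials: approximately 1.7
print(gap(a_opt, b_opt))  # minimal gap: approximately 1.2
print(omega / 2)          # also approximately 1.2
```

Note that the combination ψ̃(0,0) + ψ̃(1,1) − ψ̃(0,1) − ψ̃(1,0) is invariant under any choice of φ1, φ2, which is why ω/2 is a true lower bound on the gap.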

Next, we derive the following weaker corollary of theorem 4.


Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

\[
\sum_{\alpha \supset \beta} \omega_\alpha \le 1 \quad \forall \beta, \tag{8.5}
\]

with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2)

\[
(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2 \sigma_\alpha (1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.
\]

Substitution into condition 3 yields

\[
\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad \text{and thus} \quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \tag{8.6}
\]

Since ωα = −log(1 − σα) ≥ σα, condition 8.5 implies condition 8.6.
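Two facts carry this proof: the algebraic identity behind condition 2 and the inequality ω = −log(1 − σ) ≥ σ. A quick numerical sanity check (our sketch, not part of the paper):

```python
import math

for k in range(1, 100):
    sigma = k / 100.0
    # Condition 2 with A = 1 - sigma: (1 - s)^2 + 2 s (1 - s) = 1 - s^2 <= 1.
    lhs = (1 - sigma) ** 2 + 2 * sigma * (1 - sigma)
    assert abs(lhs - (1 - sigma ** 2)) < 1e-12
    assert lhs <= 1.0
    # omega = -log(1 - sigma) >= sigma, so sum(omega) <= 1 forces sum(sigma) <= 1.
    assert -math.log(1 - sigma) >= sigma
print("identities hold for sigma = 0.01, ..., 0.99")
```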

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict, and thus better, than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

\[
\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.
\]

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical αcritical = 2/3 ≈ 0.67. For α > 2/3 we find two minima, one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

\[
\omega = 2 \log \Big[ \frac{\alpha}{1 - \alpha} \Big] \quad \text{and thus} \quad \sigma = 1 - \Big( \frac{1 - \alpha}{\alpha} \Big)^2.
\]
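As a consistency check (our observation, not spelled out at this point in the text), these two expressions agree with the relation ωα = −log(1 − σα) used in the proof of corollary 4:

```latex
e^{-\omega} = e^{-2\log[\alpha/(1-\alpha)]}
            = \left(\frac{1-\alpha}{\alpha}\right)^{2}
            = 1 - \sigma ,
\qquad\text{so}\qquad
\omega = -\log(1-\sigma).
```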


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

\[
\sigma \le \tfrac{1}{3} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.
\]

The critical value that follows from corollary 3 is in this case slightly better:

\[
\omega < 1 \quad \text{and thus} \quad \alpha \le \frac{1}{1 + e^{-1/2}} \approx 0.62.
\]

Next, we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically, we find a critical αcritical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

\[
(2 - 2A)\sigma + \tfrac{3}{4} \le 1 \quad \text{and} \quad \tfrac{1}{2}\sigma + A \le 1.
\]

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

\[
\sigma \le \tfrac{1}{2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,
\]

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis following the same recipe as for Figure 3b yields A = 1 − √(1/8), with

\[
\sigma \le \sqrt{1/2} \quad \text{and thus} \quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,
\]

better than the α < 0.62 from corollary 3 and to be compared with the critical αcritical ≈ 0.88.
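The α thresholds quoted in this section all come from inverting σ(α) = 1 − ((1 − α)/α)². The sketch below (our recomputation for illustration, not code from the paper) reproduces the four bounds:

```python
import math

def alpha_from_sigma(s):
    # Invert sigma = 1 - ((1 - alpha)/alpha)^2  =>  alpha = 1/(1 + sqrt(1 - sigma)).
    return 1.0 / (1.0 + math.sqrt(1.0 - s))

print(alpha_from_sigma(1.0 / 3.0))       # toroidal grid, theorem 4: ~0.55
print(1.0 / (1.0 + math.exp(-0.5)))      # corollary 3, omega < 1: ~0.62
print(alpha_from_sigma(0.5))             # aperiodic grid, theorem 4: ~0.58
print(alpha_from_sigma(math.sqrt(0.5)))  # two-loop graph, theorem 4: ~0.65
```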

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints. This forms the basis of the resource allocation argument. And the second observation concerns the bound on the correlation of a loopy belief propagation marginal that leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy), and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here sharper and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Qβ, λαβ) with respect to Qβ. But in fact, we need only G(Qβ) ≡ max_{λαβ} G(Qβ, λαβ) to be convex, which is a weaker requirement. The Hessian of G(Qβ), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of Aαβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation. Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

\[
w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},
\]

zero thresholds, and potentials

\[
\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \;\text{if}\; x_i = x_j
\quad\text{and}\quad
\Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \;\text{if}\; x_i \neq x_j.
\]

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with Pi(xi) = 0.5 for all nodes i and xi = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
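The qualitative behavior described here can be reproduced with a short simulation. The sketch below is a hypothetical re-implementation (not the author's code): damped loopy belief propagation on the four-node Boltzmann machine, with messages kept in log-odds form; damping is applied in the log domain, which may shift the exact transition compared with equation 3.9.

```python
import math
import random

W = [[0, 1, -1, -1],
     [1, 0, 1, -1],
     [-1, 1, 0, -1],
     [-1, -1, -1, 0]]

def run_bp(omega, step, iters=2000, seed=0):
    """Damped loopy BP with weights omega * W; returns the trajectory of P(x1 = 1).
    Messages u[(i, j)] are log-odds sent from node i to node j."""
    rng = random.Random(seed)
    n = len(W)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    # Tiny random initialization, so an unstable trivial fixed point reveals itself.
    u = {e: 0.01 * rng.uniform(-1.0, 1.0) for e in edges}
    traj = []
    for _ in range(iters):
        new = {}
        for i, j in edges:
            h = sum(u[(k, i)] for k in range(n) if k != i and k != j)  # cavity field
            J = omega * W[i][j] / 4.0  # psi_ij = exp(+-w_ij/4) as a spin coupling
            upd = 2.0 * math.atanh(math.tanh(J) * math.tanh(h / 2.0))
            new[(i, j)] = (1.0 - step) * u[(i, j)] + step * upd  # damped update
        u = new
        h1 = sum(u[(k, 0)] for k in range(1, n))
        traj.append(1.0 / (1.0 + math.exp(-h1)))  # belief P(x1 = 1)
    return traj

weak = run_bp(omega=1.0, step=0.2)    # small weights: settles at the trivial P = 0.5
strong = run_bp(omega=6.0, step=0.6)  # the regime where Figure 4 reports a limit cycle
```

For small ω the messages contract to zero and the belief settles at 0.5; for large ω and large step size the trajectory typically fails to settle, mirroring the step-size-dependent transition in Figure 4.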

⁷ Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.



Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical αcritical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.

2406 T Heskes

well) We can then optimize this choice to obtain milder and thus betterconditions Omitting α and renumbering the nodes from 1 to 2 we have

minφ1φ2

maxx1x2

ψ(x1 x2)minusminx1x2

ψ(x1 x2)

= minφ1φ2

maxx1x2

[ψ(x1 x2)+ φ1(x1)+ φ2(x2)]

minus minx1x2

[ψ(x1 x2)+ φ1(x1)+ φ2(x2)]

In the case of binary nodes (two-by-two matrices ψ(x1 x2)) it is easy tocheck that the optimal φ1 and φ2 that yield the smallest gap are such that

ψ(x1 x2)+ φ1(x1)+ φ2(x2) = ψ(x1 x2)+ φ1(x1)+ φ2(x2)

ge ψ(x1 x2)+ φ1(x1)+ φ2(x2) = ψ(x1 x2)+ φ1(x1)+ φ2(x2) (84)

for some x1 x2 x1 and x2 with x1 = x1 and x2 = x2 Solving for φ1 and φ2we find

φ1(x1)minus φ1(x1) = 12

[ψ(x1 x2)minus ψ(x1 x2)+ ψ(x1 x2)minus ψ(x1 x2)

]φ2(x2)minus φ2(x2) = 1

2

[ψ(x1 x2)minus ψ(x1 x2)+ ψ(x1 x2)minus ψ(x1 x2)

]

Substitution back into equation 84 yields

ψ(x1 x2)+ φ1(x1)+ φ2(x2)minus ψ(x1 x2)minus φ1(x1)minus φ2(x2)

= 12

[ψ(x1 x2)+ ψ(x1 x2)minus ψ(x1 x2)minus ψ(x1 x2)

]

which has to be nonnegative Of all four possible combinations two of themare valid and yield the same positive gap and the other two are invalid sincethey yield the same negative gap Enumerating these combinations we find

minφ1φ2

maxx1x2

ψ(x1 x2)minusminx1x2

ψ(x1 x2)

= 12|ψ(0 0)+ ψ(1 1)minus ψ(0 1)minus ψ(1 0)| = ω

2

from equation 75 Substitution into the condition 82 then yields equa-tion 83

Next we derive the following weaker corollary of theorem 4

Uniqueness of Loopy Belief Propagation Fixed Points 2407

Corollary 4 This is a weaker version of theorem 4 for pairwise potentials Loopybelief propagation on pairwise potentials has a unique fixed point ifsum

αsupβωα le 1 forallβ (85)

with ωα defined in equation 72

Proof. Consider the allocation matrix with components A_{αβ} = 1 − σ_α for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σ_α ≤ 1 and (condition 2)

$$(1 - \sigma_\alpha)(1 - \sigma_\alpha) + 2\sigma_\alpha(1 - \sigma_\alpha) = 1 - \sigma_\alpha^2 \le 1.$$

Substitution into condition 3 yields

$$\sum_{\alpha \supset \beta} (1 - \sigma_\alpha) \ge \sum_{\alpha \supset \beta} 1 - 1, \quad\text{and thus}\quad \sum_{\alpha \supset \beta} \sigma_\alpha \le 1. \tag{8.6}$$

Since ω_α = −log(1 − σ_α) ≥ σ_α, condition 8.5 implies condition 8.6, which proves the corollary.
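The final step rests on the elementary inequality −log(1 − σ) ≥ σ for σ ∈ [0, 1); a one-line numerical check (a sketch, not from the paper):

```python
import numpy as np

# omega = -log(1 - sigma) dominates sigma on [0, 1), so the omega-based
# condition 8.5 implies the sigma-based condition 8.6.
sigma = np.linspace(0.0, 0.999, 1000)
omega = -np.log(1.0 - sigma)
print(bool(np.all(omega >= sigma)))  # True
```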

Summarizing, the conditions in Tatikonda and Jordan (2002), which apply to binary pairwise potentials, are, when strengthened as above, at most a constant (a factor of 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions, as in Figure 3a, and uniform ferromagnetic potentials proportional to

$$\begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \alpha & \alpha \end{pmatrix}.$$

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical α_critical = 2/3 ≈ 0.67. For α > 2/3, we find two minima: one with "spins up" and the other one with "spins down."

In this symmetric problem, the strength of each potential is given by

$$\omega = 2 \log\left[\frac{\alpha}{1 - \alpha}\right] \quad\text{and thus}\quad \sigma = 1 - \left(\frac{1 - \alpha}{\alpha}\right)^2.$$


Figure 3: Three Ising grids in factor-graph notation; circles denote nodes, boxes interactions. (a) Toroidal boundary conditions; all elements of the allocation matrix equal 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in (b) and C = 1 − A in (c), the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − √(1/8), respectively. See the text for further explanation.


The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

$$\sigma \le \frac{1}{3} \quad\text{and thus}\quad \alpha \le \frac{1}{1 + \sqrt{2/3}} \approx 0.55.$$

The critical value that follows from corollary 3 is in this case slightly better:

$$\omega < 1 \quad\text{and thus}\quad \alpha \le \frac{1}{1 + e^{-1/2}} \approx 0.62.$$
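These two thresholds follow directly by inverting the σ(α) relation above; a short check (a sketch using plain numpy; variable names are mine):

```python
import numpy as np

def sigma(alpha):
    # sigma = 1 - exp(-omega) with omega = 2*log(alpha/(1-alpha))
    return 1.0 - ((1.0 - alpha) / alpha) ** 2

# Theorem 4 with A = 3/4 requires sigma <= 1/3  =>  alpha <= 1/(1+sqrt(2/3)).
alpha_thm4 = 1.0 / (1.0 + np.sqrt(2.0 / 3.0))
# Corollary 3 requires omega < 1  =>  alpha < 1/(1+exp(-1/2)).
alpha_cor3 = 1.0 / (1.0 + np.exp(-0.5))

print(round(alpha_thm4, 2), round(alpha_cor3, 2))  # 0.55 0.62
```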

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

$$(2 - 2A)\sigma + \frac{3}{4} \le 1 \quad\text{and}\quad \frac{1}{2}\sigma + A \le 1.$$

The optimal choice for A is the one in which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

$$\sigma \le \frac{1}{2} \quad\text{and thus}\quad \alpha \le \frac{1}{1 + \sqrt{1/2}} \approx 0.58,$$

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis, following the same recipe as for Figure 3b, yields A = 1 − √(1/8), with

$$\sigma \le \sqrt{\frac{1}{2}} \quad\text{and thus}\quad \alpha \le \frac{1}{1 + \sqrt{1 - \sqrt{1/2}}} \approx 0.65,$$

better than the α < 0.62 from corollary 3 and to be compared with the critical α_critical ≈ 0.88.
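The same inversion of σ(α) reproduces the Figure 3b and 3c bounds, and one can also confirm that A = 3/4 makes the two appearances of condition 2 coincide (again a sketch; names are mine):

```python
import numpy as np

def alpha_bound(s_max):
    # Invert sigma(alpha) = 1 - ((1-alpha)/alpha)^2 <= s_max for alpha.
    return 1.0 / (1.0 + np.sqrt(1.0 - s_max))

# Figure 3b: at A = 3/4 both versions of condition 2 reduce to sigma <= 1/2.
A, s = 0.75, 0.5
print((2 - 2 * A) * s + 0.75 <= 1, 0.5 * s + A <= 1)  # True True
print(alpha_bound(0.5))              # ~0.586
# Figure 3c: A = 1 - sqrt(1/8) gives sigma <= sqrt(1/2).
print(alpha_bound(np.sqrt(0.5)))     # ~0.649
```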

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002), a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and they have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to sharpen the sufficient conditions derived here, bringing them closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_{αβ}) with respect to Q_β. In fact, we need only G(Q_β) ≡ max_{λ_{αβ}} G(Q_β, λ_{αβ}) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_{αβ}).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations seems to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

$$w = \omega \begin{pmatrix} 0 & 1 & -1 & -1 \\ 1 & 0 & 1 & -1 \\ -1 & 1 & 0 & -1 \\ -1 & -1 & -1 & 0 \end{pmatrix},$$

zero thresholds, and potentials

$$\Psi_{ij}(x_i, x_j) = \exp[w_{ij}/4] \ \text{if } x_i = x_j \quad\text{and}\quad \Psi_{ij}(x_i, x_j) = \exp[-w_{ij}/4] \ \text{if } x_i \ne x_j.$$

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).
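The qualitative behavior described above is easy to reproduce with a small simulation. The sketch below runs loopy belief propagation on the four-node Boltzmann machine, damping messages by a generic convex combination of old and new messages; the paper's equation 3.9 may use a different damping scheme, so this is an illustration rather than an exact replication. For small ω, it settles at the trivial fixed point:

```python
import itertools
import numpy as np

def run_bp(omega, step=0.2, iters=5000, tol=1e-12, seed=0):
    # Pairwise potentials of the four-node Boltzmann machine above:
    # Psi_ij = exp(+w_ij/4) if x_i == x_j, exp(-w_ij/4) otherwise.
    w = omega * np.array([[0, 1, -1, -1],
                          [1, 0, 1, -1],
                          [-1, 1, 0, -1],
                          [-1, -1, -1, 0]], dtype=float)
    n = 4
    psi = {(i, j): np.array([[np.exp(w[i, j] / 4), np.exp(-w[i, j] / 4)],
                             [np.exp(-w[i, j] / 4), np.exp(w[i, j] / 4)]])
           for i, j in itertools.permutations(range(n), 2)}
    rng = np.random.default_rng(seed)
    msg = {e: rng.uniform(0.4, 0.6, 2) for e in psi}     # messages m_{i->j}(x_j)
    msg = {e: v / v.sum() for e, v in msg.items()}
    for _ in range(iters):
        new = {}
        for i, j in msg:
            incoming = np.ones(2)
            for k in range(n):
                if k not in (i, j):
                    incoming *= msg[(k, i)]              # messages into i, except from j
            m = psi[(i, j)].T @ incoming                 # sum over x_i
            new[(i, j)] = m / m.sum()
        diff = max(np.abs(new[e] - msg[e]).max() for e in msg)
        # damped update: convex combination of old and new messages
        msg = {e: (1 - step) * msg[e] + step * new[e] for e in msg}
        if diff < tol:
            break
    belief = np.ones(2)
    for k in range(1, n):
        belief *= msg[(k, 0)]                            # marginal of node 0
    return belief / belief.sum()

print(run_bp(1.0))  # small weights: converges near the trivial fixed point [0.5 0.5]
```

Larger ω combined with larger step sizes can produce the oscillatory, limit-cycle behavior reported here instead of convergence.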

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7. Note that the conditions for guaranteed uniqueness yield ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P_1(x_1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.


Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.


0 1 minus1 minus11 0 1 minus1minus1 1 0 minus1minus1 minus1 minus1 0

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 58 loopy belief propagation failed toconverge to the trivial fixed point even for very small step sizes Howeverrunning a convergent double-loop algorithm from many different initialconditions and many weight strengths considerably larger than 58 we al-ways ended up in the trivial fixed point and never in another one We foundsimilar behavior for a three-node Boltzmann machine (same weight matrixas above except for the fourth node) for very large weights loopy beliefpropagation ends up in a limit cycle whereas a convergent double-loopalgorithm converges to the trivial fixed point which here by corollary 2is guaranteed to be unique In future work we hope to elaborate on theseissues

7 Note that the conditions for guaranteed uniqueness imply ω = 43 for corollary 3and ω = log(2) asymp 069 for theorem 4 both far below the weight strengths where ldquonon-convergentrdquo behavior sets in

2412 T Heskes

0 02 04 06 08 135

4

45

5

55

6

step size

wei

ght s

tren

gth

0 2000495

0505

0 1000

1

Figure 4 The transition between ldquoconvergentrdquo and ldquononconvergentrdquo behavioras a function of the step size used for damping loopy belief propagation andthe weight strength Simulations on a four-node Boltzmann machine The insetsshow the marginal P1(x1 = 1) as a function of the number of loopy belief itera-tions for step size 02 and strength 4 (lower left) and step size 06 and strength6 (upper right) See the text for further detail

Acknowledgments

This work has been supported in part by the Dutch Technology FoundationSTW I thank the anonymous reviewers for their constructive comments andJoris Mooij for computing the critical αcriticalrsquos in section 83

References

Heskes T (2002) Stable fixed points of loopy belief propagation are minima ofthe Bethe free energy In S Becker S Thrun amp K Obermayer (Eds) Advancesin neural information processing systems 15 (pp 359ndash366) Cambridge MA MITPress

Heskes T Albers K amp Kappen B (2003) Approximate inference and con-strained optimization In Uncertainty in artificial intelligence Proceedings of theNineteenth Conference (UAI-2003) (pp 313ndash320) San Francisco Morgan Kauf-mann

Kschischang F Frey B amp Loeliger H (2001) Factor graphs and the sum-product algorithm IEEE Transactions on Information Theory 47(2) 498ndash519

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen S amp Spiegelhalter D (1988) Local computations with probabilitieson graphical structures and their application to expert systems Journal of theRoyal Statistics Society B 50 157ndash224

Luenberger D (1984) Linear and nonlinear programming Reading MA Addison-Wesley

McEliece R MacKay D amp Cheng J (1998) Turbo decoding as an instanceof Pearlrsquos ldquobelief propagationrdquo algorithm IEEE Journal on Selected Areas inCommunication 16(2) 140ndash152

McEliece R amp Yildirim M (2003) Belief propagation on partially ordered setsIn D Gilliam amp J Rosenthal (Eds) Mathematical systems theory in biologycommunications computation and finance (pp 275ndash300) New York Springer

Minka T (2001) Expectation propagation for approximate Bayesian inferenceIn J Breese amp D Koller (Eds) Uncertainty in artificial intelligence Proceedingsof the Seventeenth Conference (UAI-2001) (pp 362ndash369) San Francisco MorganKaufmann

Murphy K Weiss Y amp Jordan M (1999) Loopy belief propagation for ap-proximate inference An empirical study In K Laskey amp H Prade (Eds)Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence(pp 467ndash475) San Francisco Morgan Kaufmann

Pakzad P amp Anantharam V (2002) Belief propagation and statistical physicsIn 2002 Conference on Information Sciences and Systems Princeton NJ PrincetonUniversity

Pearl J (1988) Probabilistic reasoning in intelligent systems Networks of plausibleinference San Francisco Morgan Kaufmann

Tatikonda S amp Jordan M (2002) Loopy belief propagation and Gibbs mea-sures In A Darwiche amp N Friedman (Eds) Uncertainty in artificial intelli-gence Proceedings of the Eighteenth Conference (UAI-2002) (pp 493ndash500) SanFrancisco Morgan Kaufmann

Teh Y amp Welling M (2002) The unified propagation and scaling algorithm InT Dietterich S Becker amp Z Ghahramani (Eds) Advances in neural informationprocessing systems 14 (pp 953ndash960) Cambridge MA MIT Press

Wainwright M Jaakkola T amp Willsky A (2002) A new class of upper boundson the log partition function In A Darwiche amp N Friedman (Eds) Uncer-tainty in artificial intelligence Proceedings of the Eighteenth Conference (UAI-2002)(pp 536ndash543) San Francisco Morgan Kaufmann

Weiss Y (2000) Correctness of local probability propagation in graphical modelswith loops Neural Computation 12(1) 1ndash41

Weiss Y amp Freeman W (2001) Correctness of belief propagation in graphicalmodels with arbitrary topology Neural Computation 13(10) 2173ndash2200

Welling M amp Teh Y (2003) Approximate inference in Boltzmann machinesArtificial Intelligence 143(1) 19ndash50

Yedidia J Freeman W amp Weiss Y (2001) Generalized belief propagation InT Leen T Dietterich amp V Tresp (Eds) Advances in neural information processingsystems 13 (pp 689ndash695) Cambridge MA MIT Press

Yuille A (2002) CCCP algorithms to minimize the Bethe and Kikuchi free ener-gies Convergent alternatives to belief propagation Neural Computation 141691ndash1722

Received December 2 2003 accepted April 29 2004

Uniqueness of Loopy Belief Propagation Fixed Points 2409

The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

σ ≤ 1/3, and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

ω < 1, and thus α ≤ 1/(1 + e^(−1/2)) ≈ 0.62.

Next we consider the same grid with aperiodic boundary conditions, as in Figure 3b. Numerically we find a critical α_critical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4, combined with symmetry considerations, yields the allocation matrix elements shown along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

(2 − 2A)σ + 3/4 ≤ 1 and (1/2)σ + A ≤ 1.

The optimal choice for A is the one in which both conditions turn out to be identical. In this way we obtain A = 3/4, yielding

σ ≤ 1/2, and thus α ≤ 1/(1 + √(1/2)) ≈ 0.58,

still slightly worse than the condition from corollary 3.

An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis following the same recipe as for Figure 3b yields A = 1 − √(1/8), with

σ ≤ √(1/2), and thus α ≤ 1/(1 + √(1 − √(1/2))) ≈ 0.65,

better than the α < 0.62 from corollary 3, and to be compared with the critical α_critical ≈ 0.88.
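The bounds on α quoted in the three grid examples above all follow one numerical pattern: a strength bound σ ≤ s translates into α ≤ 1/(1 + √(1 − s)). This closed form is inferred here from the quoted values (the defining relation between σ and α appears earlier in the article), so treat the sketch below as a check of the arithmetic, not a restatement of the theory:

```python
import math

def alpha_from_sigma(sigma):
    # Bound on alpha implied by sigma <= s in the grid examples;
    # the form 1/(1 + sqrt(1 - sigma)) is inferred from the quoted numbers.
    return 1.0 / (1.0 + math.sqrt(1.0 - sigma))

# Periodic grid, theorem 4:   sigma <= 1/3        ->  alpha <= 0.551
# Aperiodic grid, theorem 4:  sigma <= 1/2        ->  alpha <= 0.586
# Figure 3c, theorem 4:       sigma <= sqrt(1/2)  ->  alpha <= 0.649
for s in (1 / 3, 1 / 2, math.sqrt(1 / 2)):
    print(round(alpha_from_sigma(s), 3))

# Corollary 3 on the same grids: omega < 1 -> alpha <= 1/(1 + exp(-1/2)) ~ 0.622
print(round(1.0 / (1.0 + math.exp(-0.5)), 3))
```

Reproducing all four quoted constants (0.55, 0.58, 0.65, 0.62 up to rounding) is a useful sanity check when comparing theorem 4 against corollary 3 on a new graph.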

9 Discussion

In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:


• Generalize the conditions for convexity of the Bethe free energy.

• Incorporate the (local) strength of potentials.

• Scale naturally as a function of the "temperature."

• Are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides its theoretical usefulness, there are more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy) and have a unique minimum. Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.
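The inner-loop/outer-loop scheme just described is the generic bound-optimization (majorize-minimize) pattern, and the claim that tighter bounds converge faster is easy to reproduce on a toy problem. Everything below is an illustrative choice, not taken from the article: we minimize f(x) = x² + eˣ by repeatedly minimizing the quadratic upper bound g(y) = f(x) + f′(x)(y − x) + (L/2)(y − x)², which gives the update x ← x − f′(x)/L; a smaller valid curvature constant L means a tighter bound and fewer outer iterations.

```python
import math

def minimize_mm(L, x=0.0, tol=1e-10, max_iter=10000):
    """Bound optimization for the toy objective f(x) = x**2 + exp(x).

    Each step exactly minimizes the quadratic upper bound with curvature L,
    i.e. x <- x - f'(x)/L.  The bound is valid whenever L >= f''(y) = 2 + exp(y)
    along the iterates; starting from x = 0 all iterates stay <= 0, so any
    L >= 3 works.  Returns the minimizer and the number of iterations used.
    """
    for it in range(max_iter):
        grad = 2 * x + math.exp(x)   # f'(x)
        if abs(grad) < tol:
            return x, it
        x = x - grad / L
    return x, max_iter

x_tight, n_tight = minimize_mm(L=3.0)   # tight (but still valid) bound
x_loose, n_loose = minimize_mm(L=6.0)   # looser bound, same fixed point
print(n_tight, n_loose)                 # tighter bound -> fewer iterations
```

Both runs reach the same stationary point (the root of 2x + eˣ = 0); only the number of outer iterations differs, which is the behavior the convergence argument above appeals to.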

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Q_β, λ_αβ) with respect to Q_β. But in fact we need only G(Q_β) ≡ max_{λ_αβ} G(Q_β, λ_αβ) to be convex, which is a weaker requirement. The Hessian of G(Q_β), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions, or substituting a particular choice of A_αβ).

• It may be possible to strengthen the bound of equation 7.1 on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.


Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations does seem to suggest otherwise.

We consider a Boltzmann machine with four binary nodes, weights

w = ω [  0   1  −1  −1
         1   0   1  −1
        −1   1   0  −1
        −1  −1  −1   0 ],

zero thresholds, and potentials

ψ_ij(x_i, x_j) = exp[w_ij/4] if x_i = x_j and ψ_ij(x_i, x_j) = exp[−w_ij/4] if x_i ≠ x_j.

Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with P_i(x_i) = 0.5 for all nodes i and x_i = 0, 1, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.⁷ This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size).

For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point, even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node): for very large weights, loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.

7. Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where "nonconvergent" behavior sets in.
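The four-node experiment is straightforward to reproduce. The sketch below is our own minimal implementation of damped loopy belief propagation for a pairwise binary model with the potentials defined above; the parallel message schedule, the damping convention new = (1 − ε)·old + ε·update (with ε the step size), and the iteration count are illustrative choices, not the article's code. For a small weight strength such as ω = 1, well inside the uniqueness regime, all single-node marginals settle at the trivial fixed point 0.5:

```python
import math
import random

def loopy_bp(w, step=0.5, iters=2000, seed=0):
    """Damped loopy belief propagation on a pairwise binary Boltzmann machine.

    Potentials: psi_ij(xi, xj) = exp(+w[i][j]/4) if xi == xj else exp(-w[i][j]/4),
    zero thresholds.  Returns the single-node marginals P_i(x_i = 1).
    """
    n = len(w)
    rng = random.Random(seed)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and w[i][j] != 0]
    psi = {(i, j): [[math.exp(w[i][j] / 4 if xi == xj else -w[i][j] / 4)
                     for xj in (0, 1)] for xi in (0, 1)]
           for (i, j) in edges}
    # random normalized initial messages m_{i->j}(x_j)
    m = {}
    for e in edges:
        a, b = rng.random() + 0.5, rng.random() + 0.5
        m[e] = [a / (a + b), b / (a + b)]
    for _ in range(iters):
        new = {}
        for (i, j) in edges:
            msg = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    prod = psi[(i, j)][xi][xj]
                    for (k, t) in edges:      # incoming messages to i, except from j
                        if t == i and k != j:
                            prod *= m[(k, i)][xi]
                    total += prod
                msg.append(total)
            z = sum(msg)
            # damping: step size 1 is undamped loopy belief propagation
            new[(i, j)] = [(1 - step) * m[(i, j)][x] + step * msg[x] / z
                           for x in (0, 1)]
        m = new
    marginals = []
    for i in range(n):
        b = [1.0, 1.0]
        for (k, t) in edges:
            if t == i:
                for xi in (0, 1):
                    b[xi] *= m[(k, i)][xi]
        marginals.append(b[1] / (b[0] + b[1]))
    return marginals

omega = 1.0
base = [[0, 1, -1, -1], [1, 0, 1, -1], [-1, 1, 0, -1], [-1, -1, -1, 0]]
w = [[omega * v for v in row] for row in base]
print([round(p, 3) for p in loopy_bp(w)])  # all close to 0.5
```

Raising omega toward the strengths in Figure 4 (and pushing step toward 1) reproduces the oscillatory, "nonconvergent" behavior described in the text.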


Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P_1(x_1 = 1) as a function of the number of loopy belief iterations, for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α_critical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.

2410 T Heskes

bull Generalize the conditions for convexity of the Bethe free energy

bull Incorporate the (local) strength of potentials

bull Scale naturally as a function of the ldquotemperaturerdquo

bull Are invariant to arbitrary definitions of potentials and self-interactions

Although the analysis that led to these conditions may seem quite involvedit basically consists of a relatively straightforward combination of two ob-servations The first observation is that we can exploit the arbitrariness inthe definition of the Bethe free energy when we incorporate the constraintsThis forms the basis of the resource allocation argument And the secondobservation concerns the bound on the correlation of a loopy belief propa-gation marginal that leads to the introduction of the strength of a potential

Besides its theoretical usefulness there are more practical uses Firstalgorithms for guaranteed convergence explicitly minimize the Bethe freeenergy They can be considered ldquobound optimization algorithmsrdquo similarto expectation maximization and iterative proportional fitting in the innerloop they minimize a bound on the Bethe free energy which is then up-dated in the outer loop In practice it appears that the tighter the boundthe faster the convergence (see eg Heskes et al 2003) Instead of a boundthat is convex (Yuille 2002) or convex over the set of constraints (Teh ampWelling 2002 Heskes et al 2003) we might relax the convexity conditionand choose a tighter bound that still has a unique minimum thereby speed-ing up the convergence Second in Wainwright et al (2002) a convexifiedBethe free energy is proposed The arguments for this class of free energiesare twofold they yield a bound on the partition function (instead of justan approximation as the standard Bethe free energy) and have a uniqueminimum Focusing on the second argument the conditions in this articlecan be used to construct Bethe free energies that may not be convex (overthe set of constraints) but do have a unique minimum and being closer tothe standard Bethe free energy may yield better approximations

We can think of the following opportunities to make the sufficient con-ditions derived here stricter and thus closer to necessary conditions

bull The conditions guarantee convexity of the dual G(Qβ λαβ) with re-spect to Qβ But in fact we need only G(Qβ) equiv maxλαβ G(Qβ λαβ) to beconvex which is a weaker requirement The Hessian of G(Qβ) how-ever appears to be more difficult to compute and to analyze in generalbut may lead to stronger results in specific cases (eg only pairwiseinteractions or substituting a particular choice of Aαβ )

bull It may be possible to strengthen the bound equation 71 on loopybelief correlations especially for interactions that involve more thantwo nodes

An important question is how the uniqueness of loopy belief propaga-tion fixed points relates to the convergence of loopy belief propagation

Uniqueness of Loopy Belief Propagation Fixed Points 2411

Intuitively one might expect that if loopy belief propagation has a uniquefixed point it will also converge to it This also seems to be the argumenta-tion in Tatikonda and Jordan (2002) However to the best of our knowledgethere is no proof of such correspondence Furthermore the following set ofsimulations does seem to suggest otherwise

We consider a Boltzmann machine with four binary nodes weights

w = ω

0 1 minus1 minus11 0 1 minus1minus1 1 0 minus1minus1 minus1 minus1 0

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 58 loopy belief propagation failed toconverge to the trivial fixed point even for very small step sizes Howeverrunning a convergent double-loop algorithm from many different initialconditions and many weight strengths considerably larger than 58 we al-ways ended up in the trivial fixed point and never in another one We foundsimilar behavior for a three-node Boltzmann machine (same weight matrixas above except for the fourth node) for very large weights loopy beliefpropagation ends up in a limit cycle whereas a convergent double-loopalgorithm converges to the trivial fixed point which here by corollary 2is guaranteed to be unique In future work we hope to elaborate on theseissues

7 Note that the conditions for guaranteed uniqueness imply ω = 43 for corollary 3and ω = log(2) asymp 069 for theorem 4 both far below the weight strengths where ldquonon-convergentrdquo behavior sets in

2412 T Heskes

0 02 04 06 08 135

4

45

5

55

6

step size

wei

ght s

tren

gth

0 2000495

0505

0 1000

1

Figure 4 The transition between ldquoconvergentrdquo and ldquononconvergentrdquo behavioras a function of the step size used for damping loopy belief propagation andthe weight strength Simulations on a four-node Boltzmann machine The insetsshow the marginal P1(x1 = 1) as a function of the number of loopy belief itera-tions for step size 02 and strength 4 (lower left) and step size 06 and strength6 (upper right) See the text for further detail

Acknowledgments

This work has been supported in part by the Dutch Technology FoundationSTW I thank the anonymous reviewers for their constructive comments andJoris Mooij for computing the critical αcriticalrsquos in section 83

References

Heskes T (2002) Stable fixed points of loopy belief propagation are minima ofthe Bethe free energy In S Becker S Thrun amp K Obermayer (Eds) Advancesin neural information processing systems 15 (pp 359ndash366) Cambridge MA MITPress

Heskes T Albers K amp Kappen B (2003) Approximate inference and con-strained optimization In Uncertainty in artificial intelligence Proceedings of theNineteenth Conference (UAI-2003) (pp 313ndash320) San Francisco Morgan Kauf-mann

Kschischang F Frey B amp Loeliger H (2001) Factor graphs and the sum-product algorithm IEEE Transactions on Information Theory 47(2) 498ndash519

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen S amp Spiegelhalter D (1988) Local computations with probabilitieson graphical structures and their application to expert systems Journal of theRoyal Statistics Society B 50 157ndash224

Luenberger D (1984) Linear and nonlinear programming Reading MA Addison-Wesley

McEliece R MacKay D amp Cheng J (1998) Turbo decoding as an instanceof Pearlrsquos ldquobelief propagationrdquo algorithm IEEE Journal on Selected Areas inCommunication 16(2) 140ndash152

McEliece R amp Yildirim M (2003) Belief propagation on partially ordered setsIn D Gilliam amp J Rosenthal (Eds) Mathematical systems theory in biologycommunications computation and finance (pp 275ndash300) New York Springer

Minka T (2001) Expectation propagation for approximate Bayesian inferenceIn J Breese amp D Koller (Eds) Uncertainty in artificial intelligence Proceedingsof the Seventeenth Conference (UAI-2001) (pp 362ndash369) San Francisco MorganKaufmann

Murphy K Weiss Y amp Jordan M (1999) Loopy belief propagation for ap-proximate inference An empirical study In K Laskey amp H Prade (Eds)Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence(pp 467ndash475) San Francisco Morgan Kaufmann

Pakzad P amp Anantharam V (2002) Belief propagation and statistical physicsIn 2002 Conference on Information Sciences and Systems Princeton NJ PrincetonUniversity

Pearl J (1988) Probabilistic reasoning in intelligent systems Networks of plausibleinference San Francisco Morgan Kaufmann

Tatikonda S amp Jordan M (2002) Loopy belief propagation and Gibbs mea-sures In A Darwiche amp N Friedman (Eds) Uncertainty in artificial intelli-gence Proceedings of the Eighteenth Conference (UAI-2002) (pp 493ndash500) SanFrancisco Morgan Kaufmann

Teh Y amp Welling M (2002) The unified propagation and scaling algorithm InT Dietterich S Becker amp Z Ghahramani (Eds) Advances in neural informationprocessing systems 14 (pp 953ndash960) Cambridge MA MIT Press

Wainwright M Jaakkola T amp Willsky A (2002) A new class of upper boundson the log partition function In A Darwiche amp N Friedman (Eds) Uncer-tainty in artificial intelligence Proceedings of the Eighteenth Conference (UAI-2002)(pp 536ndash543) San Francisco Morgan Kaufmann

Weiss Y (2000) Correctness of local probability propagation in graphical modelswith loops Neural Computation 12(1) 1ndash41

Weiss Y amp Freeman W (2001) Correctness of belief propagation in graphicalmodels with arbitrary topology Neural Computation 13(10) 2173ndash2200

Welling M amp Teh Y (2003) Approximate inference in Boltzmann machinesArtificial Intelligence 143(1) 19ndash50

Yedidia J Freeman W amp Weiss Y (2001) Generalized belief propagation InT Leen T Dietterich amp V Tresp (Eds) Advances in neural information processingsystems 13 (pp 689ndash695) Cambridge MA MIT Press

Yuille A (2002) CCCP algorithms to minimize the Bethe and Kikuchi free ener-gies Convergent alternatives to belief propagation Neural Computation 141691ndash1722

Received December 2 2003 accepted April 29 2004

Uniqueness of Loopy Belief Propagation Fixed Points 2411

Intuitively one might expect that if loopy belief propagation has a uniquefixed point it will also converge to it This also seems to be the argumenta-tion in Tatikonda and Jordan (2002) However to the best of our knowledgethere is no proof of such correspondence Furthermore the following set ofsimulations does seem to suggest otherwise

We consider a Boltzmann machine with four binary nodes weights

w = ω

0 1 minus1 minus11 0 1 minus1minus1 1 0 minus1minus1 minus1 minus1 0

zero thresholds and potentials

ij(xi xj) = exp[wij4] if xi = xj and ij(xi xj) = exp[minuswij4] if xi = xj

Running loopy belief propagation possibly damped as in equation 39 weobserve ldquoconvergentrdquo and ldquononconvergentrdquo behavior For relatively smallweights loopy belief propagation converges to the trivial fixed point withPi(xi) = 05 for all nodes i and xi = 0 1 as in the lower left inset inFigure 4 For relatively large weights it ends up in a limit cycle as shown inthe upper right inset The weight strength that forms the transition betweenthis ldquoconvergentrdquo and ldquononconvergentrdquo behavior strongly depends on thestep size7 This by itself makes it hard to defend a one-to-one correspondencebetween convergence of loopy belief propagation (apparently dependingon step size) and uniqueness of fixed points (obviously independent of stepsize)

For weights larger than roughly 58 loopy belief propagation failed toconverge to the trivial fixed point even for very small step sizes Howeverrunning a convergent double-loop algorithm from many different initialconditions and many weight strengths considerably larger than 58 we al-ways ended up in the trivial fixed point and never in another one We foundsimilar behavior for a three-node Boltzmann machine (same weight matrixas above except for the fourth node) for very large weights loopy beliefpropagation ends up in a limit cycle whereas a convergent double-loopalgorithm converges to the trivial fixed point which here by corollary 2is guaranteed to be unique In future work we hope to elaborate on theseissues

7 Note that the conditions for guaranteed uniqueness imply ω = 43 for corollary 3and ω = log(2) asymp 069 for theorem 4 both far below the weight strengths where ldquonon-convergentrdquo behavior sets in

2412 T Heskes

0 02 04 06 08 135

4

45

5

55

6

step size

wei

ght s

tren

gth

0 2000495

0505

0 1000

1

Figure 4 The transition between ldquoconvergentrdquo and ldquononconvergentrdquo behavioras a function of the step size used for damping loopy belief propagation andthe weight strength Simulations on a four-node Boltzmann machine The insetsshow the marginal P1(x1 = 1) as a function of the number of loopy belief itera-tions for step size 02 and strength 4 (lower left) and step size 06 and strength6 (upper right) See the text for further detail

Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical α's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

Uniqueness of Loopy Belief Propagation Fixed Points 2413

Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.
