Loopy belief propagation
Yingzhen Li and Alex Matthews
Cambridge Machine Learning Group
April 3rd, 2014
Why Loopy belief propagation?
Family of approximate inference methods for factor graphs.
Exact and efficient on trees.
Can give good results on graphs with cycles.
Of particular interest to Bayesians.
Heavily used in:
▶ Machine learning
▶ Statistics
▶ Information theory
▶ Physics
Message passing for Big Data
Often naturally suited to use in big data applications.
Locality of computation with respect to factor graph.
Locality of data with respect to factor graph.
Therefore often easy to parallelize [MacKay, 2002].
Example 1: Turbo codes
Sparse graph decoders may be viewed as inference on a factor graph using Loopy BP [McEliece et al., 1998].
These codes were the first practical ones to get close to the Shannon rate for the channels in question.
Used in:
▶ 3G and 4G mobile communication protocols.
▶ Satellite communications including the Mars Orbiter.
There is thus a great deal of theoretical interest from this community.
Example 2: TrueSkill
TrueSkill is used by Microsoft to match players on Xbox and Games for Windows [Herbrich et al., 2007].
Regarded as one of the best examples of an industrial Bayesian algorithm on Big Data.
Inference is done using EP.
EP can be understood in terms of LBP with the extra complication of continuous message approximation.
Why does EP work so well here despite the presence of loops?
Some papers from this group using EP
[Lopez-Paz et al., 2013] ‘Gaussian Process Vine Copulas for Multivariate Dependence’
[Hernandez-Lobato et al., 2013] ‘Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation’
[Turner and Sahani, 2011] ‘Probabilistic amplitude and frequency demodulation.’
[Nickisch and Rasmussen, 2008] ‘Approximations for Binary Gaussian Process Classification’
Today’s talk
Yingzhen will give a tutorial on some theoretical aspects of LBP
▶ Revision of basics.
▶ Relation to variational energies.
▶ Divergence measures.
Alex will talk about two 2013 papers by Noorshams and Wainwright on stochastic belief propagation.
Stochastic Belief propagation
Today we review two recent papers by Noorshams and Wainwright [Noorshams and Wainwright, 2013b, Noorshams and Wainwright, 2013a].
Interested in marginals for x_i ∈ {1, 2, ..., d} of some discrete vector x.
Address the Θ(d²) dependence of the BP update using stochastic updates.
This is of interest in applications where d is prohibitively large.
Can also be extended to continuous x_i using orthogonal basis function approximation.
Analysis is systematic and elegant.
In my opinion ideas have yet to reach their full practical potential.
The general belief propagation equations (1)
Revise notation:
Indices i, j, k, ... for variables. Indices a, b, c, ... for factors.
Variables x_i etc. Factors ψ_a etc.
The set N(a) ⊆ {1, 2, ..., N} contains the indices of all variables for factor a.
Define the set x_a = {x_i | i ∈ N(a)}.
Similarly the set N(i) contains the indices of all the factors connected to variable i. In general we are interested in the marginals of the following joint probability distribution with M factors:

$$\mu(x) = \frac{1}{Z} \prod_a \psi_a(x_a) \qquad (1)$$
The general belief propagation equations (2)
To recap, the general belief propagation equations involve factor-to-variable messages m_{a→i} and variable-to-factor messages m_{i→a}. The recursion is:

$$m^{(t)}_{a\to i}(x_i) \propto \sum_{x_{a\setminus i}} \psi_a(x_a) \prod_{j\in\mathcal{N}(a)\setminus i} m^{(t)}_{j\to a}(x_j) \qquad (2)$$

$$m^{(t+1)}_{i\to a}(x_i) \propto \prod_{b\in\mathcal{N}(i)\setminus a} m^{(t)}_{b\to i}(x_i) \qquad (3)$$
The symbol ∝ is shorthand for normalisation.
The marginals are given by:
$$\mu^{(t)}(x_i) \propto \prod_{a\in\mathcal{N}(i)} m^{(t)}_{a\to i}(x_i) \qquad (4)$$
Pairwise factor graphs
The two papers are restricted to pairwise Markov random fields with strictly positive factors.

$$\mu(x) = \frac{1}{Z} \prod_{i\in V} \psi_i(x_i) \prod_{(ij)\in E} \psi_{ij}(x_i, x_j) \qquad (5)$$

The second product here in eq. (5) is taken over undirected edges E.
With this assumption BP may be reformulated as passing messages m between all directed pairs of variables (i, j) ∈ E⃗.

$$m^{(t+1)}_{i\to j}(x_j) \propto \sum_{x_i} \psi_{ij}(x_i, x_j)\,\psi_i(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i) \qquad (6)$$
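As a concrete sketch, update (6) is a few lines of NumPy. All names here (the function, the dictionaries holding potentials and messages) are our own scaffolding, not notation from the papers:

```python
import numpy as np

def pairwise_bp_update(messages, psi_node, psi_edge, neighbours, i, j):
    """One application of the pairwise update (6): returns the new
    normalised message m_{i->j} as a length-d probability vector."""
    d = psi_node[i].shape[0]
    incoming = np.ones(d)
    for k in neighbours[i]:
        if k != j:                        # product over N(i) \ j
            incoming *= messages[(k, i)]
    # sum over x_i of psi_ij(x_i, x_j) psi_i(x_i) * incoming(x_i);
    # rows of psi_edge[(i, j)] index x_i, columns index x_j
    new_msg = psi_edge[(i, j)].T @ (psi_node[i] * incoming)
    return new_msg / new_msg.sum()        # '∝' is shorthand for normalisation

# tiny 3-variable chain 0 - 1 - 2 with d = 3 states, random positive factors
rng = np.random.default_rng(0)
d = 3
psi_node = {v: rng.random(d) + 0.1 for v in range(3)}
psi_edge = {e: rng.random((d, d)) + 0.1 for e in [(0, 1), (1, 0), (1, 2), (2, 1)]}
neighbours = {0: [1], 1: [0, 2], 2: [1]}
messages = {e: np.ones(d) / d for e in psi_edge}   # uniform initialisation

m01 = pairwise_bp_update(messages, psi_node, psi_edge, neighbours, 0, 1)
```

Each call costs Θ(d²) through the matrix-vector product; this is exactly the cost the stochastic scheme below attacks.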
Generality of pairwise assumption?
In the ‘monster’ paper [Wainwright and Jordan, 2008] conversion to pairwise Markov random fields is discussed. Consider:

$$\psi_{ijk}(x_i, x_j, x_k) \qquad (7)$$

where x_ν ∈ χ_ν for ν ∈ {i, j, k}.
Introduce an auxiliary variable z = (z_i, z_j, z_k) ∈ χ_i × χ_j × χ_k. Define:

$$\psi_\nu(z, x_\nu) = [\psi_{ijk}(z_i, z_j, z_k)]^{1/3}\,\mathbb{1}(z_\nu = x_\nu) \quad \text{for } \nu = i, j, k \qquad (8)$$

Then:

$$\psi_{ijk}(x_i, x_j, x_k) = \sum_z \prod_{\nu\in\{i,j,k\}} \psi_\nu(z, x_\nu) \qquad (9)$$
So conversion is possible but not necessarily practical.
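The construction in eqs. (7)–(9) is easy to check numerically. A small sketch (shapes and names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.random((2, 2, 2)) + 0.1          # a strictly positive 3-way factor

def psi_nu(nu, z, x):
    # eq. (8): [psi(z)]^(1/3) * 1(z_nu = x)
    return psi[z] ** (1 / 3) * (z[nu] == x)

# eq. (9): sum over the auxiliary z of the product of the three new factors
recon = np.zeros_like(psi)
for xi in range(2):
    for xj in range(2):
        for xk in range(2):
            s = 0.0
            for z in np.ndindex(2, 2, 2):
                s += psi_nu(0, z, xi) * psi_nu(1, z, xj) * psi_nu(2, z, xk)
            recon[xi, xj, xk] = s

assert np.allclose(recon, psi)             # the 3-way factor is recovered
```

The indicator functions force z = (x_i, x_j, x_k) inside the sum, so the three cube roots multiply back to the original factor; the cost is the enlarged state space of z, which is why the conversion is "possible but not necessarily practical".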
Some manipulation of the pairwise update (1)
Explicitly normalised update:
$$m^{(t+1)}_{i\to j}(x_j) = \frac{\sum_{x_i} \psi_{ij}(x_i, x_j)\,\psi_i(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i)}{\sum_{x'_i, x'_j} \psi_{ij}(x'_i, x'_j)\,\psi_i(x'_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x'_i)} \qquad (10)$$

Define:

$$\Gamma_{ij}(x_i, x_j) := \frac{\psi_{ij}(x_i, x_j)}{\sum_{x'_j} \psi_{ij}(x_i, x'_j)} \qquad (11)$$

$$\beta_{ij}(x_i) := \Big(\sum_{x'_j} \psi_{ij}(x_i, x'_j)\Big)\,\psi_i(x_i) \qquad (12)$$

so that:

$$\psi_{ij}(x_i, x_j)\,\psi_i(x_i) = \Gamma_{ij}(x_i, x_j)\,\beta_{ij}(x_i) \qquad (13)$$
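A quick numerical check of the factorisation (11)–(13), with hypothetical arrays standing in for the potentials:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
psi_ij = rng.random((d, d)) + 0.1          # rows index x_i, columns index x_j
psi_i = rng.random(d) + 0.1

row_sums = psi_ij.sum(axis=1)              # sum over x'_j for each x_i
Gamma = psi_ij / row_sums[:, None]         # eq. (11): each row sums to 1
beta = row_sums * psi_i                    # eq. (12)

# eq. (13): the edge and node potentials factor through Gamma and beta
assert np.allclose(Gamma * beta[:, None], psi_ij * psi_i[:, None])
assert np.allclose(Gamma.sum(axis=1), 1.0)
```

The point of the split is that Γ carries all the x_j dependence, with each row a normalised distribution over x_j; this is what makes the column-sampling trick below work.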
Some manipulation of the pairwise update (2)
Substituting we obtain:

$$m^{(t+1)}_{i\to j}(x_j) = \frac{\sum_{x_i} \Gamma_{ij}(x_i, x_j)\,\beta_{ij}(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i)}{\sum_{x'_i, x'_j} \Gamma_{ij}(x'_i, x'_j)\,\beta_{ij}(x'_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x'_i)} \qquad (14)$$

$$= \frac{\sum_{x_i} \Gamma_{ij}(x_i, x_j)\,\beta_{ij}(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i)}{\sum_{x'_i} \beta_{ij}(x'_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x'_i)} \qquad (15)$$

(the x'_j sum disappears because each row of Γ is normalised to 1)

$$= (Bv)_j \qquad (16)$$

where B is now a matrix with (suppressing some indices):

$$B_{ji} = \Gamma(x_i, x_j) \qquad (17)$$

and v is now a vector where (suppressing some indices):

$$v_i = \frac{\beta(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i)}{\sum_{x'_i} \beta(x'_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x'_i)} \qquad (18)$$
An elegant matrix form
We have reached the following matrix-vector form for the update:
$$m = Bv \qquad (19)$$

This is the origin of the Θ(d²) dependence, which will be time consuming when d is large.
This is particularly elegant because the elements of v and each column of B are normalised to 1.
Thus randomly select an element Y of v according to its mass and then approximate the multiplication with the Yth column of B.
Conveniently this will be normalised as we require for a valid ongoing message.
This operation requires Θ(d) computation and only requires communicating an index along edges.
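The column-sampling trick can be illustrated directly. The sketch below (variable names are ours) draws an index Y with probability v_Y and keeps column Y of B; averaging many such draws recovers Bv, consistent with the estimator being unbiased:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
B = rng.random((d, d))
B /= B.sum(axis=0)                 # each column of B sums to 1
v = rng.random(d)
v /= v.sum()                       # v lies in the simplex

exact = B @ v                      # the Θ(d²) matrix-vector update

# Θ(d) unbiased estimate: draw index Y with probability v[Y], keep column Y
Ys = rng.choice(d, size=20000, p=v)
avg = B[:, Ys].mean(axis=1)        # averaging many draws recovers B @ v

assert abs(exact.sum() - 1.0) < 1e-12    # the exact update stays normalised
assert np.allclose(avg, exact, atol=0.02)
```

A single draw is itself a valid point in the simplex, which is why it can be passed on as a message; the price is the variance visible in the Monte Carlo average.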
Too good to be true?
This is not a miraculous method to speed up matrix multiplication.
Although unbiased, the estimator may have high variance.
Although it needs only Θ(d) per iteration, the algorithm will likely require more iterations to achieve the same accuracy.
The balance between these two competing requirements will need checking before we can accept the claim of a speed up.
Summary of algorithm: initialisation
Initialise each message vector m_{i→j}, one per directed edge, in the simplex.
For each directed edge pre-compute:

$$\Gamma_{ij}(x_i, x_j) := \frac{\psi_{ij}(x_i, x_j)}{\sum_{x'_j} \psi_{ij}(x_i, x'_j)} \qquad (20)$$

$$\beta_{ij}(x_i) := \Big(\sum_{x'_j} \psi_{ij}(x_i, x'_j)\Big)\,\psi_i(x_i) \qquad (21)$$

In general this requires Θ(d²) compute and memory.
Could be troublesome? Perhaps in some cases it can be done analytically.
Summary of algorithm: recursion
For t = 0, 1, 2, ...
For each directed edge (i, j) ∈ E⃗:
1. Compute the product of incoming messages:

$$M^{(t)}_{ij}(x_i) = \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i) \qquad (22)$$

2. Pick a random index r^{(t+1)}_{i→j} ∈ {1, 2, ..., d} according to the distribution:

$$p^{(t)}_{i\to j}(r^{(t+1)}) \propto M^{(t)}_{ij}(x_i = r^{(t+1)})\,\beta_{ij}(x_i = r^{(t+1)}) \qquad (23)$$

3. For a given step size λ^{(t)} update the message:

$$m^{(t+1)}_{i\to j} = (1 - \lambda^{(t)})\,m^{(t)}_{i\to j} + \lambda^{(t)}\,\Gamma_{ij}(x_i = r^{(t+1)},\, \cdot\,) \qquad (24)$$
Requires Θ(d) computation per iteration.
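Putting steps 1–3 together, one SBP message update might look like the following NumPy sketch (names and data layout are our own assumptions; Γ is stored with rows indexing x_i, so each row is normalised over x_j):

```python
import numpy as np

def sbp_step(messages, Gamma, beta, neighbours, i, j, lam, rng):
    """One stochastic update of m_{i->j}, following steps 1-3 (eqs. 22-24)."""
    d = beta[(i, j)].shape[0]
    M = np.ones(d)
    for k in neighbours[i]:
        if k != j:                        # step 1: product over N(i) \ j
            M *= messages[(k, i)]
    p = M * beta[(i, j)]                  # step 2: sample index r ~ M * beta
    r = rng.choice(d, p=p / p.sum())
    # step 3: convex combination with row r of Gamma; rows of Gamma are
    # normalised, so the new message stays in the simplex
    return (1 - lam) * messages[(i, j)] + lam * Gamma[(i, j)][r, :]

# tiny two-variable example with hypothetical potentials
rng = np.random.default_rng(3)
d = 4
psi_ij = rng.random((d, d)) + 0.1
psi_i = rng.random(d) + 0.1
Gamma = {(0, 1): psi_ij / psi_ij.sum(axis=1, keepdims=True)}   # eq. (20)
beta = {(0, 1): psi_ij.sum(axis=1) * psi_i}                    # eq. (21)
messages = {(0, 1): np.ones(d) / d, (1, 0): np.ones(d) / d}
neighbours = {0: [1], 1: [0]}

m_new = sbp_step(messages, Gamma, beta, neighbours, 0, 1, lam=0.5, rng=rng)
```

Note that only the sampled index r, not a length-d vector of sums over ψ, determines which row of the pre-computed Γ is mixed in, which is where the Θ(d) per-edge cost comes from.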
Quantitative bounds on error and rates
For what factor graphs and λ^{(t)} is this a useful approximation algorithm?
The stochastic belief propagation papers prove strong consistency and rates of convergence in the case of trees.
Under a contractivity assumption on the message updates, similar results are proved for graphs with loops.
To anticipate: as we would expect, there is a trade-off between error and number of iterations.
This trade-off becomes more favourable as d becomes large. But how large?
Convergence and rates for trees
Theorem:
1. For any tree structured graph and λ^{(t)} = 1/(t + 1) the messages converge almost surely to the unique, exact fixed point of the unrandomised algorithm as t → ∞.
2. With the same λ^{(t)}, in expectation the element-wise error of the messages is O(1/√t). To be specific, let m be the concatenation of all 2|E| message vectors and let m* be the exact result; then:

$$\mathbb{E}\big[|m_\nu - m^*_\nu|\big] \le O(1/\sqrt{t}) \quad \forall \nu \qquad (25)$$
Contractivity assumption for general graphs
In general BP will not have a unique fixed point and will not converge.
Consider F(m) to be the operation of the BP updates on the concatenated message vector m.
The assumed contractivity condition is that for any two messages m and m′ the updates obey:

$$\|F(m) - F(m')\|_2 \le (1 - \gamma/2)\,\|m - m'\|_2 \quad \forall m, m' \qquad (26)$$

where γ ∈ (0, 2) ensures the Lipschitz constant is less than 1.
These are sufficient conditions for convergence of the non-stochastic algorithm to some m*, which may not have the exact marginals of the true factor graph.
As an example the paper shows what this means for the potentials of a Potts model.
Error rates for non-stochastic BP
We know:
$$\|F(m) - F(m')\|_2 \le (1 - \gamma/2)\,\|m - m'\|_2 \quad \forall m, m' \qquad (27)$$

The Banach fixed point theorem implies:

$$\exists!\, m^* : F(m^*) = m^* \ \wedge \ \lim_{n\to\infty} F^{(n)}(m) = m^* \quad \forall m \qquad (28)$$

Use this to bound the rate:

$$\|F^{(n)}(m) - m^*\|_2 \le (1 - \gamma/2)^n \|m - m^*\|_2 \le (1 - \gamma/2)^n \|m^*\|_2 \qquad (29)$$

So to guarantee percentage squared error ε we require:

$$n_E \ge \frac{1}{2}\,\frac{\log(1/\varepsilon)}{\log(1/(1 - \gamma/2))} \ge \frac{\log(1/\varepsilon)}{2} \qquad (30)$$
Convergence and rates of SBP for contractive graphs
Assume contractivity:
Theorem:
1. With λ^{(t)} = O(1/t) the messages converge almost surely to m* as t → ∞.
2. Assume additionally γ ≥ 1. With λ^{(t)} = 1/(γ(t + 1)), we have:

$$\frac{\mathbb{E}\big[\|m - m^*\|_2^2\big]}{\|m^*\|_2^2} \le \frac{K}{\gamma^2}\left(\frac{1 + \log(t)}{t}\right) \qquad (31)$$

They also prove concentration bounds near the mean.
How many iterations for Stochastic BP?
We have a bound on the mean squared error (MSE) ε:

$$\varepsilon \le \frac{K}{\gamma^2}\left(\frac{1 + \log(t)}{t}\right) \qquad (32)$$

And this behaviour is reasonably concentrated near the mean.
We would expect n_S to be enough where:

$$n_S \ge \frac{K}{\varepsilon} \qquad (33)$$
Comparing stochastic and non-stochastic variants
We have n_E ≥ log(1/ε)/2 and n_S ≥ K/ε.
The dependence of these bounds on graph structure, through γ and K, is slightly murky, at least to me.
The claim is that substituting for n into the two algorithms, exact BP requires:

$$\Theta\big(|E|\, d^2 \log(1/\varepsilon)\big) \qquad (34)$$

and stochastic BP requires:

$$\Theta\Big(|E|\, \frac{d}{\varepsilon}\Big) \qquad (35)$$

For a given d, if our desired accuracy is not as good as some ε we should start using SBP. The larger d, the smaller this ε.
Note that ε is a squared error. The exact trade-off is seen in experiments.
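The claimed crossover is easy to eyeball by plugging numbers into the two bounds, dropping constants and the |E| factor common to both. Illustrative only; `exact_cost` and `sbp_cost` are our shorthand, not from the papers:

```python
import math

def exact_cost(d, eps):
    # Θ(d² log(1/ε)) per-edge cost of exact BP, constants dropped
    return d * d * math.log(1 / eps)

def sbp_cost(d, eps):
    # Θ(d / ε) per-edge cost of stochastic BP, constants dropped
    return d / eps

# For fixed d, SBP wins only when the target accuracy is coarse enough:
# d/eps < d^2 log(1/eps)  <=>  1/eps < d log(1/eps)
d = 1000
assert sbp_cost(d, 1e-2) < exact_cost(d, 1e-2)   # SBP cheaper at 1% MSE
assert sbp_cost(d, 1e-6) > exact_cost(d, 1e-6)   # exact BP cheaper at 1e-6
```

This matches the slide's claim: the larger d is, the smaller the accuracy threshold ε below which exact BP regains the advantage.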
Experiments (1)
Simulated Potts model example. ε = 0.01 corresponds to 10% error!
Experiments (2)
Experiments (3)
Extension to continuous spaces
Intuitively one case which seems similar to large d is when the messages are functions.
The analysis given so far is extended to this case in the JMLR paper.
We will briefly discuss this case.
Replace sums with integrals. The normalised message update equation becomes:

$$m^{(t+1)}_{i\to j}(x_j) = \frac{\int \Gamma_{ij}(x_i, x_j)\,\beta_{ij}(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i)\, dx_i}{\int \beta_{ij}(x'_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x'_i)\, dx'_i} \qquad (36)$$
Orthogonal basis approximation
The algorithm can run as before except we need to deal with the fact that the ‘columns’ passed around would be functions.
Choose an orthonormal basis {φ_α}_{α=1}^∞ of the function space. That is to say:

$$\int \phi_u(x)\,\phi_v(x)\, dx = \delta_{uv} \qquad (37)$$

And for functions f:

$$f = \sum_{u=1}^{\infty} a(u)\,\phi_u \qquad (38)$$

The approximation comes from truncating at some finite L.
Care will need to be taken to deal with the possibility of negative areas in the approximate message.
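As a toy illustration of the truncation of eq. (38), here is a sketch using the orthonormal cosine basis on [0, 1] (our choice of basis for the example; the papers leave it generic):

```python
import numpy as np

# Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_u(x) = sqrt(2) cos(u pi x)
def phi(u, x):
    return np.ones_like(x) if u == 0 else np.sqrt(2) * np.cos(u * np.pi * x)

N = 20000
x = (np.arange(N) + 0.5) / N               # midpoint grid, dx = 1/N
f = np.exp(-5 * (x - 0.3) ** 2)            # a smooth 'message' to approximate

def truncate(f, L):
    # a(u) = ∫ f(x) phi_u(x) dx by midpoint quadrature, truncated at L terms
    return sum(np.mean(f * phi(u, x)) * phi(u, x) for u in range(L))

err = lambda L: np.max(np.abs(truncate(f, L) - f))
assert err(20) < err(2)                    # more basis functions, less error
```

For smooth messages the coefficients decay quickly, so a modest L suffices; note the truncated sum can dip negative, which is why the algorithm below clips with [·]₊.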
Summary of algorithm (1)
For each directed edge pre-compute:
$$\Gamma_{ij}(x_i, x_j) := \frac{\psi_{ij}(x_i, x_j)}{\int \psi_{ij}(x_i, x'_j)\, dx'_j} \qquad (39)$$

$$\beta_{ij}(x_i) := \Big(\int \psi_{ij}(x_i, x'_j)\, dx'_j\Big)\,\psi_i(x_i) \qquad (40)$$

Additionally we need to compute, for u = 1, 2, ..., L:

$$\omega_{i\to j}(u, x_i) = \int \Gamma_{ij}(x_i, x_j)\,\phi_u(x_j)\, dx_j \qquad (41)$$

The claim is that this can be done easily numerically. Certainly true if the basis functions are eigenfunctions of Γ.
Summary of algorithm (2)
For t = 0, 1, 2, .... For each directed edge (i, j) ∈ E⃗:
1. Approximate the incoming messages. For each k ∈ N(i) \ j:

$$m^{(t)}_{k\to i}(x_i) = \left[\sum_{u=1}^{L} a(u)^{(t)}_{k\to i}\,\phi_u(x_i)\right]_+ \qquad (42)$$

2. Draw k i.i.d. samples X^{(t+1)}_s from a density proportional to:

$$\beta_{ij}(x_i) \prod_{k\in\mathcal{N}(i)\setminus j} m^{(t)}_{k\to i}(x_i) \qquad (43)$$

3. Compute the message update coefficients {b(u)^{(t+1)}_{i→j}}:

$$b(u)^{(t+1)}_{i\to j} = \frac{1}{k} \sum_{s=1}^{k} \omega_{i\to j}(u, X_s), \quad u = 1, 2, ..., L \qquad (44)$$

4. For u = 1, 2, ..., L and step size λ^{(t)}, update the coefficients:

$$a(u)^{(t+1)}_{i\to j} = (1 - \lambda^{(t)})\,a(u)^{(t)}_{i\to j} + \lambda^{(t)}\,b(u)^{(t+1)}_{i\to j} \qquad (45)$$
Related ideas to consider
Mentioned in the paper
Particle filters are used on trees [Doucet et al., 2001]. Loosely speaking they are asymptotically correct as the number of particles is increased, whereas these methods on trees are asymptotically correct as the number of iterations is increased.
For people interested in this area, kernel belief propagation [Song et al., 2011] looks to have quite a bit in common with the continuous paper.
Not mentioned in the paper
In the discrete case how is Gibbs sampling connected to these models?
Suggestions for future research
Authors’ suggestions
Currently no interpretation in terms of Bethe energy.
Analysis of max-product and other related algorithms.
Less ‘cumbersome’ extension to models with higher order factors.
Currently they assume that the factors are strictly positive, which is not useful for hard constraints; they predict it is possible to relax this.
Suggest it is possible to reconsider the contractivity assumption.
Our suggestion
Wins are largest when d is large. It seems likely there aren’t many applications in this realm yet simply because this hasn’t been possible.
References I
Doucet, A., De Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill: A Bayesian skill rating system. In Advances in Neural Information Processing Systems, pages 569–576. MIT Press.
Hernandez-Lobato, D., Hernandez-Lobato, J. M., and Dupont, P. (2013). Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14:1891–1945.
Lopez-Paz, D., Hernandez-Lobato, J. M., and Ghahramani, Z. (2013). Gaussian process vine copulas for multivariate dependence. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA.
References II
MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA.
McEliece, R., MacKay, D. J. C., and Cheng, J.-F. (1998). Turbo decoding as an instance of Pearl’s “belief propagation” algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140–152.
Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078.
Noorshams, N. and Wainwright, M. J. (2013a). Belief propagation for continuous state spaces: stochastic message-passing with quantitative guarantees. Journal of Machine Learning Research, 14(1):2799–2835.
References III
Noorshams, N. and Wainwright, M. J. (2013b). Stochastic belief propagation: A low-complexity alternative to the sum-product algorithm. IEEE Transactions on Information Theory, 59(4):1981–2000.
Song, L., Gretton, A., Bickson, D., Low, Y., and Guestrin, C. (2011). Kernel belief propagation. In Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL.
Turner, R. E. and Sahani, M. (2011). Probabilistic amplitude and frequency demodulation. In Advances in Neural Information Processing Systems, pages 981–989.
References IV
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.