
Page 1

Orbit-Product Analysis of (Generalized) Gaussian Belief Propagation

Jason Johnson, Post-Doctoral Fellow, LANL
Joint work with Michael Chertkov and Vladimir Chernyak

Physics of Algorithms Workshop
Santa Fe, New Mexico

September 3, 2009

Page 2

Overview

Introduction

- graphical models + belief propagation

- specialization to Gaussian model

Analysis of Gaussian BP

- walk-sum analysis for means, variances, covariances [1]

- orbit-product analysis/corrections for determinant [2]

Current Work on Generalized Belief Propagation (GBP) [Yedidia et al.]

- uses larger “regions” to capture more walks/orbits of the graph (better approximation)

- However, it can also lead to over-counting of walks/orbits (bad approximation/unstable algorithm)!

[1] Earlier joint work with Malioutov & Willsky (NIPS, JMLR ’06).
[2] Johnson, Chernyak & Chertkov (ICML ’09).

Page 3

Graphical Models

A graphical model is a multivariate probability distribution that is expressed in terms of interactions among subsets of variables (e.g. pairwise interactions on the edges of a graph $G$).

$$P(x) = \frac{1}{Z} \prod_{i \in V} \psi_i(x_i) \prod_{\{i,j\} \in G} \psi_{ij}(x_i, x_j)$$

Markov property:

[Figure: sets A and B separated by a set S]

$$P(x_A, x_B \mid x_S) = P(x_A \mid x_S)\, P(x_B \mid x_S)$$

Given the potential functions $\psi$, the goal of inference is to compute marginals $P(x_i) = \sum_{x_{V \setminus i}} P(x)$ or the normalization constant $Z$, which is generally difficult in large, complex graphical models.

Page 4

Gaussian Graphical Model

Information form of Gaussian density.

$$P(x) \propto \exp\{-\tfrac{1}{2} x^T J x + h^T x\}$$

Gaussian graphical model: sparse J matrix

$J_{ij} \neq 0$ if and only if $\{i,j\} \in G$

Potentials:

$$\psi_i(x_i) = e^{-\frac{1}{2} J_{ii} x_i^2 + h_i x_i}, \qquad \psi_{ij}(x_i, x_j) = e^{-J_{ij} x_i x_j}$$

Inference corresponds to calculation of the mean vector $\mu = J^{-1}h$, the covariance matrix $K = J^{-1}$, or the determinant $Z = \det J^{-1}$. Marginals $P(x_i)$ are specified by the means $\mu_i$ and variances $K_{ii}$.
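As a concrete illustration (a minimal sketch, not from the talk), these quantities can be computed directly with dense linear algebra on a small example; the 3-node chain below is hypothetical:

```python
import numpy as np

# Hypothetical 3-node chain in information form: J symmetric positive definite.
J = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
h = np.array([1.0, 0.0, -1.0])

K = np.linalg.inv(J)          # covariance matrix K = J^{-1}
mu = K @ h                    # mean vector mu = J^{-1} h
Z = 1.0 / np.linalg.det(J)    # determinant Z = det J^{-1}

# Marginal P(x_i) is Gaussian with mean mu[i] and variance K[i, i].
print(mu, np.diag(K), Z)
```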

Page 5

Belief Propagation

Belief Propagation iteratively updates a set of messages $\mu_{i \to j}(x_j)$ defined on the directed edges of the graph $G$ using the rule:

$$\mu_{i \to j}(x_j) \propto \sum_{x_i} \psi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} \mu_{k \to i}(x_i)$$

Iterate the message updates until they converge to a fixed point.

Marginal Estimates: combine messages at a node

$$P(x_i) = \frac{1}{Z_i} \underbrace{\psi_i(x_i) \prod_{k \in N(i)} \mu_{k \to i}(x_i)}_{\tilde{\psi}_i(x_i)}$$

Page 6

Belief Propagation II

Pairwise Estimates (on edges of graph):

$$P(x_i, x_j) = \frac{1}{Z_{ij}} \underbrace{\tilde{\psi}_i(x_i)\, \tilde{\psi}_j(x_j)\, \frac{\psi_{ij}(x_i, x_j)}{\mu_{i \to j}(x_j)\, \mu_{j \to i}(x_i)}}_{\tilde{\psi}_{ij}(x_i, x_j)}$$

Estimate of Normalization Constant:

$$Z_{\mathrm{bp}} = \prod_{i \in V} Z_i \prod_{\{i,j\} \in G} \frac{Z_{ij}}{Z_i Z_j}$$

A BP fixed point is a saddle point of the right-hand side with respect to messages/reparameterizations.

In trees, BP converges in a finite number of steps and is exact (equivalent to variable elimination).

Page 7

Gaussian Belief Propagation (GaBP)

Messages: $\mu_{i \to j}(x_j) \propto \exp\{\tfrac{1}{2} \alpha_{i \to j} x_j^2 + \beta_{i \to j} x_j\}$.

BP fixed-point equations reduce to:

$$\alpha_{i \to j} = J_{ij}^2 (J_{ii} - \alpha_{i \setminus j})^{-1}, \qquad \beta_{i \to j} = -J_{ij} (J_{ii} - \alpha_{i \setminus j})^{-1} (h_i + \beta_{i \setminus j})$$

where $\alpha_{i \setminus j} = \sum_{k \in N(i) \setminus j} \alpha_{k \to i}$ and $\beta_{i \setminus j} = \sum_{k \in N(i) \setminus j} \beta_{k \to i}$.

Marginals specified by:

$$K^{\mathrm{bp}}_i = \Big(J_{ii} - \sum_{k \in N(i)} \alpha_{k \to i}\Big)^{-1}, \qquad \mu^{\mathrm{bp}}_i = K^{\mathrm{bp}}_i \Big(h_i + \sum_{k \in N(i)} \beta_{k \to i}\Big)$$
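A minimal sketch of these scalar updates (illustrative, not the authors' implementation): it takes a symmetric $J$ whose nonzero off-diagonal pattern defines the graph, iterates the fixed-point equations synchronously, and combines messages into the marginal estimates:

```python
import numpy as np

def gabp(J, h, iters=100):
    """Synchronous scalar Gaussian BP; graph given by nonzeros of symmetric J."""
    n = len(h)
    N = {i: [j for j in range(n) if j != i and J[i, j] != 0] for i in range(n)}
    alpha = {(i, j): 0.0 for i in range(n) for j in N[i]}  # alpha[i, j] = alpha_{i->j}
    beta = {(i, j): 0.0 for i in range(n) for j in N[i]}
    for _ in range(iters):
        new_a, new_b = {}, {}
        for i in range(n):
            for j in N[i]:
                a = sum(alpha[k, i] for k in N[i] if k != j)  # alpha_{i\j}
                b = sum(beta[k, i] for k in N[i] if k != j)   # beta_{i\j}
                new_a[i, j] = J[i, j] ** 2 / (J[i, i] - a)
                new_b[i, j] = -J[i, j] * (h[i] + b) / (J[i, i] - a)
        alpha, beta = new_a, new_b
    K = np.array([1 / (J[i, i] - sum(alpha[k, i] for k in N[i])) for i in range(n)])
    mu = K * np.array([h[i] + sum(beta[k, i] for k in N[i]) for i in range(n)])
    return mu, K, alpha

# On a tree (e.g. the 3-node chain from the earlier sketch), mu and K are exact.
```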

Page 8

Gaussian BP Determinant Estimate

Estimates of pairwise covariance on edges:

$$K^{\mathrm{bp}}_{(ij)} = \begin{pmatrix} J_{ii} - \alpha_{i \setminus j} & J_{ij} \\ J_{ij} & J_{jj} - \alpha_{j \setminus i} \end{pmatrix}^{-1}$$

Estimate of $Z \triangleq \det K = \det J^{-1}$:

$$Z_{\mathrm{bp}} = \prod_{i \in V} Z_i \prod_{\{i,j\} \in G} \frac{Z_{ij}}{Z_i Z_j}$$

where $Z_i = K^{\mathrm{bp}}_i$ and $Z_{ij} = \det K^{\mathrm{bp}}_{(ij)}$.

Exact in tree models (equivalent to Gaussian elimination), approximate in loopy models.
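Continuing the sketch above, $Z_{\mathrm{bp}}$ can be assembled from the converged $\alpha$ messages; the helper below is illustrative and assumes the `gabp` function from the previous sketch:

```python
import numpy as np

def z_bp(J, alpha):
    """Z_bp from converged GaBP messages (alpha as returned by gabp above)."""
    n = J.shape[0]
    N = {i: [j for j in range(n) if j != i and J[i, j] != 0] for i in range(n)}
    Zi = np.array([1 / (J[i, i] - sum(alpha[k, i] for k in N[i])) for i in range(n)])
    logZ = np.log(Zi).sum()
    for i in range(n):
        for j in N[i]:
            if i < j:  # visit each undirected edge once
                a_ij = sum(alpha[k, i] for k in N[i] if k != j)
                a_ji = sum(alpha[k, j] for k in N[j] if k != i)
                M = np.array([[J[i, i] - a_ij, J[i, j]],
                              [J[i, j], J[j, j] - a_ji]])
                # Z_ij = det Kbp_(ij) = 1/det(M); accumulate log(Z_ij / (Z_i Z_j)).
                logZ += -np.log(np.linalg.det(M)) - np.log(Zi[i]) - np.log(Zi[j])
    return np.exp(logZ)

# On a tree, z_bp(J, alpha) equals 1/det(J) up to numerical error.
```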

Page 9

The BP Computation Tree

BP marginal estimates are equivalent to the exact marginals in a tree-structured model [Weiss & Freeman].

[Figure: a 3×3 grid G and its BP computation tree; messages $\mu^{(1)}_{5\to6}$, $\mu^{(2)}_{6\to3}$, $\mu^{(3)}_{3\to2}$, $\mu^{(4)}_{2\to1}$ mark successive elimination steps up the tree.]

The BP messages correspond to upward variable elimination steps in this computation tree.

Page 10

Walk-Summable Gaussian Models

Let $J = I - R$. If $\rho(R) < 1$ then $(I - R)^{-1} = \sum_{L=0}^{\infty} R^L$.

Walk-Sum interpretation of inference:

$$K_{ij} = \sum_{L=0}^{\infty} \sum_{w:\, i \xrightarrow{L} j} R^w \overset{?}{=} \sum_{w:\, i \to j} R^w$$

$$\mu_i = \sum_j h_j \sum_{L=0}^{\infty} \sum_{w:\, j \xrightarrow{L} i} R^w \overset{?}{=} \sum_{w:\, * \to i} h_* R^w$$

Walk-summable if $\sum_{w:\, i \to j} |R^w|$ converges for all $i, j$. Absolute convergence implies convergence of walk-sums (to the same value) for arbitrary orderings and partitions of the set of walks. Equivalent to $\rho(|R|) < 1$.
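A quick numerical check of this picture (a sketch under the stated assumption $\rho(|R|) < 1$, with randomly generated weights): truncated walk-sums, i.e. partial sums of the power series, converge to $K = (I - R)^{-1}$:

```python
import numpy as np

np.random.seed(0)
n = 6
R = 0.1 * np.random.rand(n, n)
R = (R + R.T) / 2                 # symmetric edge weights
np.fill_diagonal(R, 0.0)
assert np.max(np.abs(np.linalg.eigvals(np.abs(R)))) < 1   # walk-summability

# Truncated walk-sum: (R^L)_{ij} sums all length-L walks from i to j.
K_series = sum(np.linalg.matrix_power(R, L) for L in range(200))
K_exact = np.linalg.inv(np.eye(n) - R)
print(np.max(np.abs(K_series - K_exact)))   # ~1e-16
```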

Page 11

Walk-Sum Interpretation of GaBP

Combine the interpretation of BP as exact inference on the computation tree with the walk-sum interpretation of Gaussian inference in trees:

- messages represent walk-sums in subtrees of the computation tree

- Gaussian BP converges in walk-summable models

- complete walk-sum for the means

- incomplete walk-sum for the variances

Page 12

Complete Walk-Sum for Means

Every walk in $G$ ending at a node $i$ maps to a walk of the computation tree $T_i$ (ending at the root node of $T_i$)...

[Figure: a walk in the 3×3 grid G ending at node 1 and its image in the computation tree $T_1$, ending at the root.]

Gaussian BP converges to the correct means in walk-summable models.

Page 13

Incomplete Walk-Sum for Variances

Only the totally backtracking walks of $G$ can be embedded as closed walks in the computation tree...

[Figure: the 3×3 grid and its computation tree; closed walks around a cycle of G have no closed image in the tree.]

Gaussian BP converges to incorrect variance estimates (an underestimate in non-negative models).

Page 14

Zeta Function and Orbit-Product

What about the determinant?

Definition of Orbits:

- A walk is closed if it begins and ends at the same vertex.

- It is primitive if it does not repeat a shorter walk.

- Two primitive walks are equivalent if one is a cyclic shift of the other.

- Define orbits $\ell \in L$ of $G$ to be equivalence classes of closed, primitive walks.

Theorem. Let $Z \triangleq \det(I - R)^{-1}$. If $\rho(|R|) < 1$ then

$$Z = \prod_{\ell} (1 - R^\ell)^{-1} \triangleq \prod_{\ell} Z_\ell$$

A kind of zeta function in graph theory.
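This identity can be sanity-checked numerically without enumerating orbits (a sketch, not the orbit enumeration itself): taking logs gives the equivalent closed-walk expansion $\log Z = -\log \det(I - R) = \sum_{L \ge 1} \operatorname{tr}(R^L)/L$, which the orbit product regroups into primitive equivalence classes:

```python
import numpy as np

np.random.seed(1)
n = 5
R = 0.1 * np.random.rand(n, n)
R = (R + R.T) / 2
np.fill_diagonal(R, 0.0)

# tr(R^L) sums all closed walks of length L; the 1/L factor matches the
# identification of cyclic shifts when closed walks are grouped into orbits.
logZ_walks = sum(np.trace(np.linalg.matrix_power(R, L)) / L for L in range(1, 500))
logZ_exact = -np.log(np.linalg.det(np.eye(n) - R))
print(logZ_walks, logZ_exact)   # agree when rho(|R|) < 1
```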

Page 15

Zbp as Totally-Backtracking Orbit-Product

Definition of Totally-Backtracking Orbits:

- An orbit is reducible if it contains backtracking steps $\ldots(ij)(ji)\ldots$; else it is irreducible (or backtrackless).

- Every orbit $\ell$ has a unique irreducible core $\gamma = \Gamma(\ell)$ obtained by iteratively deleting pairs of backtracking steps until no more remain. Let $L_\gamma$ denote the set of all orbits that reduce to $\gamma$.

- An orbit is totally backtracking (or trivial) if it reduces to the empty orbit, $\Gamma(\ell) = \emptyset$; else it is non-trivial.

Theorem. If $\rho(|R|) < 1$ then $Z_{\mathrm{bp}}$ (defined earlier) is equal to the totally-backtracking orbit-product:

$$Z_{\mathrm{bp}} = \prod_{\ell \in L_\emptyset} Z_\ell$$

Page 16

Orbit-Product Correction and Error Bound

Orbit-product correction to Zbp:

$$Z = Z_{\mathrm{bp}} \prod_{\ell \notin L_\emptyset} Z_\ell$$

Error Bound: missing orbits must all involve cycles of the graph...

$$\frac{1}{n} \left| \log \frac{Z}{Z_{\mathrm{bp}}} \right| \le \frac{\rho^g}{g(1-\rho)}$$

where $\rho \triangleq \rho(|R|) < 1$ and $g$ is the girth of the graph (the length of the shortest cycle).

Page 17

Reduction to Backtrackless Orbit-Product Correction

We may reduce the orbit-product correction to one over just the backtrackless orbits $\gamma$:

$$Z = Z_{\mathrm{bp}} \prod_{\ell \notin L_\emptyset} Z_\ell = Z_{\mathrm{bp}} \prod_{\gamma} \underbrace{\prod_{\ell \in L_\gamma} Z_\ell}_{Z'_\gamma}$$

with modified orbit-factors $Z'_\gamma$ based on GaBP:

$$Z'_\gamma = \Big(1 - \prod_{(ij) \in \gamma} r'_{ij}\Big)^{-1} \quad \text{where} \quad r'_{ij} \triangleq (1 - \alpha_{i \setminus j})^{-1} r_{ij}$$

The factor $(1 - \alpha_{i \setminus j})^{-1}$ serves to reconstruct the totally-backtracking walks at each point $i$ along the backtrackless orbit $\gamma$.

Page 18

Backtrackless Determinant Correction

Define the backtrackless graph $G'$ of $G$ as follows: nodes of $G'$ correspond to directed edges of $G$, with edges $(ij) \to (jk)$ for $k \neq i$.

[Figure: a 3×3 grid G and its backtrackless graph G′, whose nodes are the directed edges of G (12, 21, 23, ...).]

Let $R'$ be the adjacency matrix of $G'$ with modified edge-weights $r'$ based on GaBP. Then,

$$Z = Z_{\mathrm{bp}} \det(I - R')^{-1}$$
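A sketch of the construction (illustrative code, not from the talk): build the edge-adjacency matrix of $G'$ from a weight matrix R, optionally substituting the GaBP-modified weights $r'$ from the previous slide:

```python
import numpy as np

def backtrackless_matrix(R, weights=None):
    """Nodes of G' are directed edges (i, j) of G; G' has an edge
    (i, j) -> (j, k) whenever k != i.  Entry weight defaults to R[j, k];
    pass weights={(j, k): r'_jk} to use the GaBP-modified weights."""
    n = R.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if i != j and R[i, j] != 0]
    idx = {e: m for m, e in enumerate(edges)}
    Rp = np.zeros((len(edges), len(edges)))
    for (i, j) in edges:
        for k in range(n):
            if k != i and (j, k) in idx:          # no backtracking: k != i
                w = weights[j, k] if weights else R[j, k]
                Rp[idx[i, j], idx[j, k]] = w
    return Rp, edges

# With r' from a GaBP fixed point: Z = Zbp * det(I - Rp)^{-1}, per the slide.
```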

Page 19

Region-Based Estimates/Corrections

Select a set of regions $\mathcal{R} \subset 2^V$ that is closed under intersections and covers all vertices and edges of $G$.

Define region counts ($n_A \in \mathbb{Z}$, $A \in \mathcal{R}$) by the inclusion-exclusion rule:

$$n_A = 1 - \sum_{B \in \mathcal{R} \,:\, A \subsetneq B} n_B$$

To capture all orbits covered by any region (without over-counting) we calculate the estimate:

$$Z_{\mathcal{R}} \triangleq \prod_B Z_B^{n_B} = \prod_B \left( \det(I - R_B)^{-1} \right)^{n_B}$$

Error Bounds. Select regions to cover all orbits up to length $L$. Then,

$$\frac{1}{n} \left| \log \frac{Z_{\mathcal{R}}}{Z} \right| \le \frac{\rho^L}{L(1-\rho)}$$
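The inclusion-exclusion counts are easy to compute programmatically (a minimal sketch, assuming regions are given as frozensets of vertices closed under intersection):

```python
def region_counts(regions):
    """n_A = 1 - sum of n_B over regions B strictly containing A."""
    regions = sorted(set(regions), key=len, reverse=True)  # supersets first
    n = {}
    for A in regions:
        n[A] = 1 - sum(n[B] for B in regions if A < B)  # '<' is proper subset
    return n

# Toy usage: two overlapping 2x2 blocks of a 2x3 grid plus their intersection.
R1, R2 = frozenset({0, 1, 3, 4}), frozenset({1, 2, 4, 5})
print(region_counts([R1, R2, R1 & R2]))  # blocks get n = +1, the overlap n = -1
```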

Page 20

Example: 2-D Grids

Choice of regions for grids: overlapping $L \times L$, $\frac{L}{2} \times L$, $L \times \frac{L}{2}$, and $\frac{L}{2} \times \frac{L}{2}$ blocks (shifted by $\frac{L}{2}$).

For example, in a $6 \times 6$ grid with block size $L = 4$:

[Figure: block regions with n = +1, their overlaps with n = −1, and overlaps-of-overlaps with n = +1.]

Page 21

$256 \times 256$ periodic grid, uniform edge weights $r \in [0, 0.25]$. Tests with $L = 2, 4, 8, 16, 32$.

[Figure, four panels vs. r: (1) spectral radii ρ(|R|) and ρ(|R′|); (2) n⁻¹ log Z_true, n⁻¹ log Z_bp, and n⁻¹ log Z_B for L = 2, 4, 8, ...; (3) n⁻¹ log Z_true, n⁻¹ log Z_bp, and n⁻¹ log Z_bp Z′_B; (4) errors n⁻¹|log Z_true⁻¹ Z_B| and n⁻¹|log Z_true⁻¹ Z_bp Z′_B| vs. L, on a log scale from 10⁻² down to 10⁻¹².]

Page 22

Generalized Belief Propagation

Select a set of regions $\mathcal{R} \subset 2^V$ that is closed under intersections and covers all vertices and edges of $G$.

Define region counts ($n_A \in \mathbb{Z}$, $A \in \mathcal{R}$) by the inclusion-exclusion rule:

$$n_A = 1 - \sum_{B \in \mathcal{R} \,:\, A \subsetneq B} n_B$$

Then, GBP solves for a saddle point of

$$Z_{\mathcal{R}}(\psi) \triangleq \prod_{A \in \mathcal{R}} Z(\psi_A)^{n_A}$$

over reparameterizations $\{\psi_A, A \in \mathcal{R}\}$ of the form

$$P(x) = \frac{1}{Z} \prod_{A \in \mathcal{R}} \psi_A(x_A)^{n_A}$$

Denote the saddle point by $Z_{\mathrm{gbp}} = Z_{\mathcal{R}}(\psi_{\mathrm{gbp}})$.

Page 23

Example: 2-D Grid Revisited

[Figure: free-energy error vs. block size on a log scale (10⁻² down to 10⁻¹⁶) for the block estimate and the GBP estimate.]

Page 24

GBP Toy Example

Look at the graph $G = K_4$ and consider different choices of regions...

[Figure: the complete graph $K_4$ on vertices 1, 2, 3, 4.]

BP Regions: the six edges {12, 13, 14, 23, 24, 34} with n = +1; the four vertices with n = −2.

Page 25

GBP “3∆” Regions: triangles {123, 124, 134} with n = +1; edges {12, 13, 14} with n = −1; vertex {1} with n = +1.

GBP “4∆” Regions: all four triangles {123, 124, 134, 234} with n = +1; all six edges with n = −1; all four vertices with n = +1.

Page 26

Computational Experiment with equal edge weights $r = 0.32$ (the model becomes singular/indefinite for $r \ge \frac{1}{3}$).

Z = 10.9

Zbp = 2.5

Zgbp(3∆) = 9.9

Zgbp(4∆) = 54.4!!!

GBP with 3∆ regions is a big improvement over BP (GBP captures more orbits).

What went wrong with the 4∆ method?

Page 27

Orbit-Product Interpretation of GBP

Answer: sometimes GBP can overcount orbits of the graph.

- Let $\mathcal{T}(\mathcal{R})$ be the set of hypertrees $T$ one may construct from the regions $\mathcal{R}$.

- An orbit $\ell$ spans $T$ if we can embed $\ell$ in $T$ but cannot embed it in any sub-hypertree of $T$.

- Let $g_\ell \triangleq \#\{T \in \mathcal{T}(\mathcal{R}) \mid \ell \text{ spans } T\}$.

Orbit-Product Interpretation of GBP:

$$Z_{\mathrm{gbp}} = \prod_{\ell} Z_\ell^{g_\ell}$$

Remark. GBP may also include multiples of an orbit as independent orbits (these are not counted by $Z$).

We say GBP is consistent if $g_\ell \le 1$ for all (primitive) orbits and $g_\ell = 0$ for multiples of orbits (no over-counting).

Page 28

Examples of Over-Counting

Orbit $\ell = [(12)(23)(34)(41)]$:

[Figure: embeddings of this orbit in two different hypertrees of triangle regions ({134, 123} and {234, 124}).]

Orbit $\ell = [(12)(23)(34)(42)(21)]$:

[Figure: embeddings of this orbit in two different region hypertrees, illustrating over-counting.]

Page 29

Conclusion and Future Work

Graphical view of inference in walk-summable Gaussian graphical models that is very intuitive for understanding iterative inference algorithms and approximation methods.

Future Work:

- many open questions on GBP.

- multiscale method to approximate longer orbits from a coarse-grained model.

- beyond walk-summable?