1
Probabilistic Inference in Distributed Systems
Stanislav Funiak
Disclaimer: Statements made in this talk are the sole opinions of the presenter and do not necessarily represent the official position of the University or the presenter's advisor.
2
Monitoring in Emergency Response Systems
Firefighters enter a building.
As they run around, they place a bunch of sensors.
We want to monitor the temperature in various places:
p(Xi | z) = p(temperature at location i | temperatures observed at all sensors)
3
Monitoring in Emergency Response Systems
You ask a 10-701 graduate for help: "learn the model" → a nice model.
You ask a 10-708 graduate for help: "implement efficient inference" → efficient inference.
You put them on an Intel™ Core-Trio machine with 30GB RAM.
Simulation experiments work great.
[Figure: chain model with hidden state X1–X6 and observed temperatures Z2, Z4, Z6]
Done!
4
D-Day arrives…
You start up your machine and…
Firefighters deploy the sensors
The network goes down. It got flooded.
You call up an old-time friend at MIT.
He sends you a patch with highly optimized routing in 24 minutes.*
Oops! Part of the ceiling just came down, lost the connection again.
5
Last-minute Link Stats
Mhm, communication is lossy. Mhm, link qualities change.
* Joke warning: "24 minutes" = 1 week
Maybe having good routing was not such a bad idea…
6
What’s wrong here?
• Cannot rely on centralized infrastructure
  – too costly to gather all observations
  – need to be robust against node failures and message losses
  – may want to perform online control (nodes equipped with actuators)
• Want to perform inference directly on the network nodes

Also: autonomous teams of mobile robots
7
Distributed Inference – The Big Picture
Each node n issues a query p(Qn | z) = p(Qn | temperatures observed at all sensors), where Qn is a set of query variables, e.g., the temperature at locations 1, 2, 3.
Nodes collaborate to compute the query.
8
Probabilistic model vs. physical layer
[Figure: the probabilistic model (variables X1–X6 with observations Z2, Z4, Z6) vs. the physical layer of the sensor network (physical nodes and available communication links)]
9
Natural solution: Loopy B.P.
Suppose: network nodes = variables.
[Figure: eight networked nodes, 1–8, one per variable]
10
Natural solution: Loopy B.P.
Suppose: network nodes = variables.
Then we could run loopy B.P. directly on the network, over the variables X1–X8: p(X4) could be viewed as node 4's belief, with messages such as μ4→6, μ5→6, μ6→8 [Pfeffer, 2003, 2005].
Issues:
• may not observe the network structure (not fully resolved)
• potentially non-converging
• definitely over-confident (will revisit in experimental results), e.g., loopy B.P.: 99% hot; truth: 51% hot, 49% cold
11
Want the Following Properties
1. Global correctness: eventually, each node obtains the true distribution p(Qn | z).
2. Partial correctness: before convergence, a node can form a meaningful approximation of p(Qn | z).
3. Local correctness: without seeing other nodes' beliefs, each node can condition on its own observations.
12
Outline
[Figure: sensor network (variables X1–X6, observations Z2, Z4, Z6), communication links, routing tree, reparametrized model]

Offline: distribute the input model (BN / MRF) [Paskin & Guestrin, 2004]
1. Nodes make local observations
2. Nodes establish a routing structure
3. Nodes communicate to compute the query
13
Standard parameterization not robust
Exact model: a Bayes net over X1, …, X4, e.g. with factors p(X2 | X1) × p(X3 | X1,X2) × p(X4 | X2,X3), so that marginalizing out X2 and X3 yields p(X4 | X1).
Suppose we "lose" a CPD / potential (not communicated yet, or a node failed): we observe high temp. and ask for the probability of high temp. at X4, but with the lost CPD we are effectively assuming a uniform prior on X2, and the distribution changes dramatically.
Much better: inference in a simpler model. Now suppose someone told us p(X2 | X3) and p(X3 | X1) instead: this constructs an approximation in which X2 ⊥ X1 | X3, and it preserves the correlation between X1 and X3.
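For concreteness, the marginalization claim above written out (a standard chain-rule identity; the factorization is the one given on this slide):

$$\sum_{x_2,\,x_3} p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_2, x_3) \;=\; p(x_4 \mid x_1)$$

If p(x2 | x1) is lost and silently replaced by a uniform factor, the same sum no longer equals p(x4 | x1), which is why the standard parameterization degrades so badly.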
15
Review: Junction Tree representation
BN / MN → junction tree: cliques X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6 (e.g., separator X4,X5 between cliques X3,X4,X5 and X4,X5,X6).
Properties: running intersection, family-preserving (think of it as writing the CPDs p(X6 | X4,X5), etc.).
The representation keeps the clique marginals; the separator marginals are not important (they can be computed).
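The junction tree on this slide corresponds to the standard clique-over-separator reparametrization (a textbook identity, spelled out here for concreteness):

$$p(X_{1:6}) \;=\; \frac{p(X_1,X_2)\; p(X_2,X_3,X_4)\; p(X_3,X_4,X_5)\; p(X_4,X_5,X_6)}{p(X_2)\; p(X_3,X_4)\; p(X_4,X_5)}$$

so keeping the clique marginals suffices: the separator marginals in the denominator can be computed from them.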
16
Properties used by the Algorithm
Key properties:
1. Marginalization amounts to pruning cliques: e.g., marginalizing out X1 turns the junction tree T (cliques X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6) into T' (cliques X2,X3,X4; X3,X4,X5; X4,X5,X6, with separators X3,X4 and X4,X5).
2. Using a subset of cliques amounts to KL-projection onto all distributions that factor as T': with the clique X3,X4,X5 missing, the approximation over X2,X3,X4 and X4,X5,X6 asserts X2,3 ⊥ X5,6 | X4 (exact vs. approximate).
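Both properties written out for the running example (standard junction-tree facts; the algebra is shown only for concreteness). Pruning the leaf clique X1,X2 when marginalizing out X1:

$$\sum_{x_1} p(X_{1:6}) \;=\; \frac{p(X_2,X_3,X_4)\; p(X_3,X_4,X_5)\; p(X_4,X_5,X_6)}{p(X_3,X_4)\; p(X_4,X_5)}$$

and dropping the clique X3,X4,X5 as well gives the KL projection onto T':

$$q(X_{2:6}) \;=\; \frac{p(X_2,X_3,X_4)\; p(X_4,X_5,X_6)}{p(X_4)} \;=\; \arg\min_{q' \text{ factoring as } T'} \mathrm{KL}(p \,\|\, q')$$

which is exactly the distribution asserting X2,X3 ⊥ X5,X6 | X4.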
17
From clique marginals to distributed inference
How are these structures used for distributed inference?
The clique marginals X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6 are assigned to network nodes (e.g., nodes 1, 3, 4, 6).
[Figure: network junction tree over the physical nodes, with stronger and weaker links; edges carry variable sets such as X2,X3,X4 and X2,X3,X4,X5]
Network junction tree [Paskin et al, 2005]:
• used for communication
• satisfies the running intersection property
• adaptive, can be optimized
18
Robust message passing algorithm
Global model: external junction tree with cliques X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6.
Nodes communicate clique marginals along the network junction tree; e.g., node 3 obtains X2,X3,X4 exactly.
Each node locally decides which of its local cliques are sufficient for its neighbors.
19
Message passing = pruning leaf cliques
Theorem: On a path towards some network node, the cliques that are not passed form branches of an external junction tree.
Corollary: At convergence, each node obtains a subtree of the external junction tree.
[Ch 6, Paskin, 2004]
[Figure replay: external junction tree X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6 over nodes 1, 3, 4, 6; cliques obtained by node 1 vs. pruned cliques]
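A minimal sketch of the pruning step behind this theorem, assuming cliques are represented as frozensets of variable names; the function and variable names are illustrative, not taken from the talk's implementation. The message a node sends is its set of clique marginals with unneeded leaf cliques pruned, which by Property 1 is still an exact representation of the marginal over the remaining variables.

```python
def prune_unneeded_leaves(cliques, needed_vars):
    """cliques: dict mapping frozenset(variables) -> clique marginal.
    Repeatedly prune leaf cliques whose private variables are not needed;
    by Property 1, pruning a leaf clique marginalizes out its private vars."""
    cliques = dict(cliques)
    changed = True
    while changed:
        changed = False
        for c in list(cliques):
            rest = [d for d in cliques if d != c]
            if not rest:
                break
            sep = c & frozenset().union(*rest)
            # c is a prunable leaf if its separator to the rest lies inside a
            # single other clique and none of its private variables are needed
            if any(sep <= d for d in rest) and not ((c - sep) & needed_vars):
                del cliques[c]
                changed = True
    return cliques

# On the external junction tree from the slides, with only X4,X5,X6 needed,
# the cliques {X1,X2}, {X2,X3,X4}, {X3,X4,X5} are pruned one by one,
# leaving just the clique {X4,X5,X6} (marginal tables omitted here).
jt = {frozenset({"X1", "X2"}): None,
      frozenset({"X2", "X3", "X4"}): None,
      frozenset({"X3", "X4", "X5"}): None,
      frozenset({"X4", "X5", "X6"}): None}
print(prune_unneeded_leaves(jt, needed_vars={"X4", "X5", "X6"}))
```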
20
Incorporating observations
Original model: X1–X6 with observations Z1, Z3, Z4, Z6, reparametrized as the junction tree X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6.
Suppose all observation variables are leaves. Then we can associate each likelihood with any clique that covers its parents:
• the algorithm will pass around clique priors and clique likelihoods
• marginalization still amounts to pruning (e.g., suppose we marginalize out X1)
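One way to write the decomposition this slide describes (standard form; the clique assignment of each likelihood is just one valid choice):

$$p(X_{1:6} \mid z) \;\propto\; \underbrace{\frac{\prod_{C} p(X_C)}{\prod_{S} p(X_S)}}_{\text{clique priors}} \;\times\; \underbrace{p(z_1 \mid X_1)\, p(z_3 \mid X_3)\, p(z_4 \mid X_4)\, p(z_6 \mid X_6)}_{\text{likelihoods, each attached to a covering clique}}$$

e.g., p(z1 | X1) rides with the clique X1,X2, and p(z3 | X3) with either clique containing X3.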
21
Putting it all together
Theorem (global correctness): At convergence, each node n obtains the exact distribution over its query variables, conditioned on all observations.
Theorem (partial correctness): Before convergence, each node n obtains a KL projection over its query variables, conditioned on the collected observations E (onto the junction tree formed by the collected cliques).
22
Results: Convergence
[Plot: error vs. iteration; lower is better]
Robust message passing: converges early, close to the global optimum.
Standard sum-product algorithm: bad answers for a long time, then "snaps" in.
Model: nodes estimate the temperature as well as an additive bias.
23
Results: Robustness
[Plot: robust message passing algorithm; lower is better]
Communication partitioned at t=60, restored at t=120: converges close to the global optimum.
Node failure: insensitive to node failures.
24
How about dynamic inference? Firefighters get fancier equipment…
Place wireless cameras around an environment.
We want to determine the camera locations Ci automatically from local observations. [Funiak et al 2006]
25
Firefighters get fancier equipment…
Distributed camera localization: camera locations Ci, object trajectory M1:T.
This is a dynamic inference problem.
26
How localization works in practice…
27
Model: (Dynamic) Bayesian Network
[Figure: DBN with camera locations C1, C2 and object locations (state processes) M1, M2, M5 at t = 1, 2, 5; observations O(t), e.g., O1(1), O1(2), O1(5), O2(5)]
Transition model: relates the state at time t-1 to the state at time t.
Measurement model: relates the camera image to the state.
Filtering: compute the posterior distribution.
28
Filtering: Summary
[Diagram: prior distribution → prediction → estimation → posterior distribution → roll-up]
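The textbook recursion behind this diagram, writing x^t for the state at time t and z^{1:t} for the observations so far:

$$\text{prediction:}\quad p(x^{t+1} \mid z^{1:t}) \;=\; \sum_{x^t} p(x^{t+1} \mid x^t)\, p(x^t \mid z^{1:t})$$

$$\text{estimation:}\quad p(x^{t+1} \mid z^{1:t+1}) \;\propto\; p(z^{t+1} \mid x^{t+1})\, p(x^{t+1} \mid z^{1:t})$$

Roll-up: discard the time-t variables and, in assumed density filtering, project the belief back onto the compact family.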
29
Observations & transitions introduce dependencies
Suppose a person is observed by cameras 1 & 2 at two consecutive time steps, t and t+1.
At time t: cliques C1,Mt and C2,Mt (C3 separate).
At time t+1: clique C1,C2,Mt+1 (C3 separate); no independence assertions among C1, C2, Mt+1.
Typically, after a while, there are no independence assertions among the state variables C1, C2, …, CN, Mt+1.
30
Junction Tree Assumed Density Filtering
[Figure: the prior distribution at time t is a Markov network over A, B, C, D, E with junction tree ABC; BCD; CDE. Prediction and estimation yield the exact prior at time t+1 with larger cliques ABCD; BCDE. Roll-up / KL projection yields the approximate belief at time t+1, back on ABC; BCD; CDE.]
Periodically project to a "small" junction tree [Boyen, Koller 1998].
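The projection step for the slide's example, using the standard result that the KL projection onto a junction tree keeps the tree's clique marginals:

$$\hat p^{\,t+1} \;=\; \arg\min_{q \text{ factoring as } T} \mathrm{KL}\!\left(p^{t+1} \,\|\, q\right) \;=\; \frac{p^{t+1}(A,B,C)\; p^{t+1}(B,C,D)\; p^{t+1}(C,D,E)}{p^{t+1}(B,C)\; p^{t+1}(C,D)}$$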
31
Distributed Assumed Density Filtering
[Figure: junction tree X1,X2; X2,X3,X4; X3,X4,X5; X4,X5,X6 assigned to network nodes 1, 3, 4, 6]
At each time step, a node computes a marginal over its clique(s):
1. Initialization
2. Estimation: condition on evidence (distributed)
3. Prediction: advance to the next step (local)
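A per-node sketch of the loop these three steps describe; the helper names (initialize, sense, condition_distributed, advance_local) are placeholders for this sketch, not APIs from the talk.

```python
def node_filter_loop(node, model, num_steps, iters_per_step):
    """Distributed assumed density filtering, as seen by a single node
    that owns one or more cliques of the junction tree."""
    belief = node.initialize(model)                 # marginal over own clique(s)
    for t in range(num_steps):
        z = node.sense()                            # local observation at time t
        for _ in range(iters_per_step):             # Estimation: condition on
            belief = node.condition_distributed(belief, z)  # evidence (distributed)
        belief = node.advance_local(belief, model)  # Prediction: advance (local)
    return belief
```

With enough communication rounds per time step (iters_per_step), the theorem on the next slide says this matches the centralized B&K98 filter.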
32
Results: Convergence
[Plot: RMS error vs. iterations per time step (3, 5, 10, 15, 20), compared with the centralized solution; lower is better]
Theorem: Given sufficient communication at each time step, the distribution obtained by the algorithm is equal to that of running the B&K98 algorithm.
33
Convergence: Temperature monitoring
[Plot: error vs. iterations per time step; lower is better]
34
Comparison with Loopy B.P.
[Plot: error for loopy B.P. (window 5 and window 1) vs. the distributed filter (1 and 3 iterations per step); lower is better]
[Diagram: unrolled DBN, t=1 … t=5]
35
Partitions introduce inconsistencies
[Figure: real camera network with a network partition; camera poses and object location; the distribution computed by nodes on the left vs. by nodes on the right]
The beliefs obtained by the left and right sub-networks do not agree on the shared variables and do not represent a globally consistent distribution.
Good news: the beliefs are not too different. The main difference is how certain the beliefs are.
36
The “two Bayesians meet on a street” problem
I believe the sun is up. Man, isn’t it down?
Hard problem, in general. Need samples to decide…
37
Alignment
Idea: formulate alignment as an optimization problem.
Suppose we define the aligned distribution to match the inconsistent prior clique marginals.
Not so great for Gaussians: [Figure: belief 1 over x is uncertain, belief 2 is certain; the aligned distribution comes out uncertain] This objective tends to forget information…
38
Alignment
Suppose we use the KL divergence in the "wrong" order, again matching the aligned distribution to the inconsistent prior marginals.
Good: this tends to prefer more certain distributions q.
For Gaussians, this is a convex problem: determinant maximization [Vandenberghe et al, SIAM 1998]; linear regression, can be distributed [Guestrin, IPSN 04].
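The equations on these two slides did not survive extraction; here is a plausible reconstruction, treating the exact form as an assumption. With inconsistent prior marginals p̂_i and an aligned distribution q with clique marginals q_{C_i}:

$$\text{slide 37:}\quad \min_{q} \sum_i \mathrm{KL}\!\left(\hat p_i \,\|\, q_{C_i}\right) \qquad \text{(tends to forget information)}$$

$$\text{slide 38:}\quad \min_{q} \sum_i \mathrm{KL}\!\left(q_{C_i} \,\|\, \hat p_i\right) \qquad \text{(prefers more certain } q\text{; for Gaussians, a determinant-maximization problem)}$$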
39
Results: Partition
[Plot: error vs. number of partition components, as the communication graph is progressively partitioned; curves: omniscient best, omniscient worst, a simpler alignment, KL minimization; lower is better]
KL minimization performs as well as the best unaligned solution.
40
Conclusion
Distributed inference presents many interesting challenges:
• perform inference directly on the sensor nodes
• be robust to message losses and node failures

Static inference: message passing on a routing tree
• messages = collections of clique marginals and likelihoods
• nodes obtain the joint distribution
• convergence and partial-correctness properties

Dynamic inference: assumed density filtering
• addresses inconsistencies