
Query-Specific Learning and Inference for Probabilistic Graphical Models


Page 1: Query-Specific Learning and Inference for Probabilistic Graphical Models

Carnegie Mellon

Query-Specific Learning and Inference for Probabilistic Graphical Models

Thesis committee: Carlos Guestrin Eric Xing J. Andrew Bagnell Pedro Domingos (University of Washington)

14 June 2011

Anton Chechetka

Page 2: Query-Specific Learning and Inference for Probabilistic Graphical Models

2

Motivation

Fundamental problem: reasoning accurately about noisy, high-dimensional data with local interactions

Page 3: Query-Specific Learning and Inference for Probabilistic Graphical Models

3

Sensor networks

• noisy: sensors fail; noise in readings
• high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations

Page 4: Query-Specific Learning and Inference for Probabilistic Graphical Models

4

Hypertext classification

• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics

Page 5: Query-Specific Learning and Inference for Probabilistic Graphical Models

5

Image segmentation

• noisy: local information is not enough; camera sensor noise; compression artifacts
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky

Page 6: Query-Specific Learning and Inference for Probabilistic Graphical Models

6

Probabilistic graphical models

Noisy, high-dimensional data with local interactions
→ a graph to encode only direct interactions
→ probabilistic inference over many variables:

P(Q | E) = P(Q, E) / P(E)

Q: query, E: evidence

Page 7: Query-Specific Learning and Inference for Probabilistic Graphical Models

7

Graphical models semantics

Factorized distributions:

P(X) = (1/Z) ∏_{f_α ∈ F} f_α(X_α)

Graph structure: [figure: nodes X1–X7 with edges for direct interactions; example clique X_α = {X3, X4, X5}; a separator is highlighted]

X_α are small subsets of X ⇒ compact representation
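The factorization above can be made concrete with a short sketch. This is illustrative code, not the thesis implementation: it assumes binary variables and a toy Factor class, and shows why summing out all assignments (the partition function Z) is exactly the exponential blow-up that structure is meant to avoid.

```python
import itertools
import numpy as np

# Minimal sketch (illustrative): a factor over a small subset of binary
# variables, and the unnormalized factorized distribution P(x) ∝ ∏_a f_a(x_a).
class Factor:
    def __init__(self, scope, table):
        self.scope = scope          # tuple of variable indices X_a
        self.table = table          # np.array indexed by the values of X_a

    def value(self, assignment):
        return self.table[tuple(assignment[v] for v in self.scope)]

def unnormalized_prob(factors, assignment):
    p = 1.0
    for f in factors:
        p *= f.value(assignment)
    return p

def partition_function(factors, n_vars):
    # Z sums over all 2^n assignments -- the exponential cost that
    # low-treewidth (junction tree) structure avoids.
    return sum(unnormalized_prob(factors, x)
               for x in itertools.product([0, 1], repeat=n_vars))

# toy model over X1..X3 with pairwise factors (X1,X2) and (X2,X3)
factors = [Factor((0, 1), np.array([[2.0, 1.0], [1.0, 2.0]])),
           Factor((1, 2), np.array([[2.0, 1.0], [1.0, 2.0]]))]
Z = partition_function(factors, 3)
print(unnormalized_prob(factors, (0, 0, 0)) / Z)
```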

Page 8: Query-Specific Learning and Inference for Probabilistic Graphical Models

8

Graphical models workflow

P(X) = (1/Z) ∏_{f_α ∈ F} f_α(X_α)

Learn/construct structure → learn/define parameters → inference P(Q | E=e)

[Figure: factorized distribution and graph structure over X1–X7]

Page 9: Query-Specific Learning and Inference for Probabilistic Graphical Models

9

Graphical models: fundamental problems

Learn/construct structure (NP-complete)
→ learn/define parameters (exp(|X|))
→ inference P(Q | E=e) (#P-complete exact, NP-complete approximate)

Compounding errors along the pipeline

Page 10: Query-Specific Learning and Inference for Probabilistic Graphical Models

10

Domain-knowledge structures don’t help

[Figure: web of linked webpages]

Domain knowledge-based structures do not support tractable inference

Page 11: Query-Specific Learning and Inference for Probabilistic Graphical Models

11

This thesis: general directions

Emphasizing the computational aspects of the graph:

Learn accurate and tractable models
• Compensate for reduced expressive power with exact inference and optimal parameters
• Gain significant speedups

Inference speedups via better prioritization of computation
• Estimate the long-term effects of propagating information through the graph
• Use long-term estimates to prioritize updates

New algorithms for learning and inference in graphical models to make answering the queries better

Page 12: Query-Specific Learning and Inference for Probabilistic Graphical Models

12

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 13: Query-Specific Learning and Inference for Probabilistic Graphical Models

13

Generative learning

P(Q | E) = P(Q, E) / P(E)

query goal: P(Q | E);  learning goal: the joint P(Q, E)

Useful when E is not known in advance

Sensors fail unpredictably

Measurements are expensive (e.g. user time), want adaptive evidence selection

Page 14: Query-Specific Learning and Inference for Probabilistic Graphical Models

14

Tractable vs intractable models workflow

Tractable models: learn a simple tractable structure from domain knowledge + data
→ optimal parameters, exact inference → approx. P(Q | E=e)

Intractable models: construct an intractable structure from domain knowledge,
or learn an intractable structure from data
→ approximate algorithms, no quality guarantees → approx. P(Q | E=e)

Page 15: Query-Specific Learning and Inference for Probabilistic Graphical Models

Tractability via low treewidth

• Exact inference exponential in treewidth (sum-product)
• Treewidth is NP-complete to compute in general
• Low-treewidth graphs are easy to construct
• Convenient representation: junction tree
• Other tractable model classes exist too

15

[Figure: example graph on nodes 1–7 and its triangulation]

Treewidth: size of the largest clique in a triangulated graph, minus one

Page 16: Query-Specific Learning and Inference for Probabilistic Graphical Models

16

Junction trees

• Cliques connected by edges with separators
• Running intersection property
• Finding the most likely junction tree of given treewidth > 1 is NP-complete
• We will look for good approximations

[Figure: graph on nodes 1–7 and a corresponding junction tree with cliques
C1 = {X1,X2,X7}, C2 = {X1,X2,X5}, C3 = {X1,X3,X5}, C4 = {X1,X4,X5}, C5 = {X4,X5,X6},
connected by separators {X1,X2}, {X1,X5}, {X1,X5}, {X4,X5}]
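The defining property of the junction tree on this slide can be checked mechanically. Below is a minimal sketch (illustrative, not thesis code) that verifies the running intersection property for one consistent choice of edges over the cliques above: for every variable, the cliques containing it must form a connected subtree.

```python
from collections import defaultdict

# Minimal sketch: running intersection property check for the example junction tree.
cliques = {
    "C1": {"X1", "X2", "X7"},
    "C2": {"X1", "X2", "X5"},
    "C3": {"X1", "X3", "X5"},
    "C4": {"X1", "X4", "X5"},
    "C5": {"X4", "X5", "X6"},
}
edges = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "C5")]  # one consistent choice

def running_intersection_holds(cliques, edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    for v in set().union(*cliques.values()):
        containing = {c for c, scope in cliques.items() if v in scope}
        # the cliques containing v must be connected in the tree
        start = next(iter(containing))
        seen, stack = {start}, [start]
        while stack:
            c = stack.pop()
            for nb in adj[c]:
                if nb in containing and nb not in seen:
                    seen.add(nb); stack.append(nb)
        if seen != containing:
            return False
    return True

print(running_intersection_holds(cliques, edges))   # True for this tree
```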

Page 17: Query-Specific Learning and Inference for Probabilistic Graphical Models

17

Independencies in low-treewidth distributions

P(X) factorizes according to a junction tree (C, E):

P_{(C,E)}(X) = ∏_{C ∈ C} P(X_C) / ∏_{S ∈ E} P(X_S)

⇒ conditional independencies hold: for every separator S, with X_C and X_C̄ the variables
on the two sides of S (e.g. S = {X1, X5}: X_C = {X2, X3, X7}, X_C̄ = {X4, X6}),

I(X_C, X_C̄ | S) = 0

It works in the other way too (conditional mutual information):

KL( P || P_{(C,E)} ) ≤ Σ_{S ∈ E} I(X_C, X_C̄ | S)

Page 18: Query-Specific Learning and Inference for Probabilistic Graphical Models

18

Constraint-based structure learning

KL( P || P_{(C,E)} ) ≤ Σ_{S} I(X_C, X_C̄ | S)

Look for JTs where this holds (constraint-based structure learning):

1. Enumerate all candidate separators: S1 = X1X2, S2 = X1X3, S3 = X1X4, …, Sm = Xn-1Xn
2. For each candidate separator, partition the remaining variables into weakly dependent subsets: I(X_β, X_γ | S) < ε
3. Find a junction tree (cliques C1, …, C5) consistent with these partitions and separators

Page 19: Query-Specific Learning and Inference for Probabilistic Graphical Models

19

Mutual information complexity

I(X_β, X_-β | S) = H(X_β | S) − H(X_β | X_-β, S)

X_-β: everything except X_β;  H(· | ·): conditional entropy

I(X_β, X_-β | S) depends on all assignments to X: exp(|X|) complexity in general

Our contribution: polynomial-time upper bound
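To make the complexity point concrete, here is a hedged sketch (illustrative, not thesis code) of the direct computation of a conditional mutual information from a full joint table: the table itself has 2^|X| entries, which is exactly where the exponential cost comes from.

```python
import numpy as np

# Minimal sketch: exact I(A, B | C) from a full joint table over binary variables,
# via I(A,B|C) = H(A,C) + H(B,C) - H(C) - H(A,B,C).
def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def conditional_mi(joint, A, B, C):
    """joint: np.array of shape (2,)*n summing to 1; A, B, C: disjoint index lists."""
    n = joint.ndim
    def marginal(keep):
        axes = tuple(i for i in range(n) if i not in keep)
        return joint.sum(axis=axes)
    return (entropy(marginal(A + C).ravel()) + entropy(marginal(B + C).ravel())
            - entropy(marginal(C).ravel()) - entropy(marginal(A + B + C).ravel()))

# toy joint over 4 binary variables
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2)); joint /= joint.sum()
print(conditional_mi(joint, A=[0], B=[1], C=[2]))
```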

Page 20: Query-Specific Learning and Inference for Probabilistic Graphical Models

20

Mutual info upper bound: intuition

I(A, B | C) = ??  — hard

[Figure: large sets A and B; small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k; I(D, F | C) — easy]

Only look at small subsets D, F:
• polynomial number of small subsets
• polynomial complexity for every pair

Any conclusions about I(A, B | C)?
• In general, no
• If a good junction tree exists, yes

Page 21: Query-Specific Learning and Inference for Probabilistic Graphical Models

21

Contribution: mutual info upper bound

Theorem: Suppose an ε-JT of treewidth k for P(A ∪ B ∪ C) exists
(i.e. I(X_C, X_C̄ | S) ≤ ε for every separator S).

Let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1.

Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε).

Page 22: Query-Specific Learning and Inference for Probabilistic Graphical Models

22

Mutual info upper bound: complexity

Direct computation: complexity exp(|A ∪ B ∪ C|)

Our upper bound:
• O(|A ∪ B|^(treewidth+1)) small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1
• exp(|C| + treewidth) time for each I(D, F | C)
• |C| = treewidth for structure learning

⇒ polynomial(|A ∪ B ∪ C|) complexity
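The bound itself is just a maximization over the small subsets. A hedged sketch follows, reusing conditional_mi() from the earlier sketch; the junction-tree quality ε is assumed given, and the enumeration pattern is illustrative rather than the thesis implementation.

```python
from itertools import combinations

# Sketch of the upper bound from the theorem: enumerate only small subsets
# D ⊆ A, F ⊆ B with |D ∪ F| <= k + 1, take delta = max I(D, F | C), and return
# |A ∪ B ∪ C| * (delta + eps).  Reuses conditional_mi() from the sketch above.
def mi_upper_bound(joint, A, B, C, k, eps):
    delta = 0.0
    for d_size in range(1, k + 1):
        for D in combinations(A, d_size):
            for f_size in range(1, k + 2 - d_size):
                for F in combinations(B, f_size):
                    delta = max(delta, conditional_mi(joint, list(D), list(F), C))
    return (len(A) + len(B) + len(C)) * (delta + eps)
```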

Page 23: Query-Specific Learning and Inference for Probabilistic Graphical Models

23

Guarantees on learned model quality

Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists.
Then our algorithm will, with probability at least (1 − γ), find a JT such that

KL( P || P_JT ) ≤ |X| · k · (ε + 2δ)      [quality guarantee]

using poly(|X|) samples and poly(|X|) time (exponential only in the treewidth k).

Corollary: strongly connected junction trees are PAC-learnable.

Page 24: Query-Specific Learning and Inference for Probabilistic Graphical Models

24

Related work

Reference                    Model                           Guarantees     Time
[Bach+Jordan:2002]           tractable                       local          poly(n)
[Chow+Liu:1968]              tree                            global         O(n² log n)
[Meila+Jordan:2001]          tree mixture                    local          O(n² log n)
[Teyssier+Koller:2005]       compact                         local          poly(n)
[Singh+Moore:2005]           all                             global         exp(n)
[Karger+Srebro:2001]         tractable                       const-factor   poly(n)
[Abbeel+al:2006]             compact                         PAC            poly(n)
[Narasimhan+Bilmes:2004]     tractable                       PAC            exp(n)
our work                     tractable                       PAC            poly(n)
[Gogate+al:2010]             tractable with high treewidth   PAC            poly(n)

Page 25: Query-Specific Learning and Inference for Probabilistic Graphical Models

25

Results – typical convergence time

good results early on in practice

[Plot: test log-likelihood over training time; higher is better]

Page 26: Query-Specific Learning and Inference for Probabilistic Graphical Models

26

Results – log-likelihood

[Plot: test log-likelihood of our method vs. baselines; higher is better]

OBS: local search in limited in-degree Bayes nets
Chow-Liu: most likely JTs of treewidth 1
Karger-Srebro: constant-factor approximation JTs

Page 27: Query-Specific Learning and Inference for Probabilistic Graphical Models

27

Conclusions

• A tractable upper bound on conditional mutual information
• Graceful quality degradation and PAC learnability guarantees
• Analysis of when dynamic programming works [in the thesis]
• Dealing with an unknown mutual information threshold [in the thesis]
• Speedups preserving the guarantees; further speedups without guarantees

Page 28: Query-Specific Learning and Inference for Probabilistic Graphical Models

28

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 29: Query-Specific Learning and Inference for Probabilistic Graphical Models

29

Discriminative learning

P(Q | E) = P(Q, E) / P(E)

query goal: P(Q | E);  learning goal: model P(Q | E) directly

Useful when the evidence variables E are always the same (non-adaptive, one-shot observation):
• image pixels → scene description
• document text → topic, named entities

Better accuracy than generative models

Page 30: Query-Specific Learning and Inference for Probabilistic Graphical Models

30

Discriminative log-linear models

P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) )

f_α: feature (domain knowledge);  w_α: weight (learned from data);  Z(E, w): evidence-dependent normalization

• Don’t sum over all values of E
• Don’t model P(E)
⇒ no need for structure over E

[Figure: evidence and query variables connected by features f12, f34]
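A minimal sketch of this conditional log-linear model follows, assuming a tiny query so that Z(E, w) can be summed explicitly. The feature functions and values below are toy stand-ins, not the features used in the thesis experiments.

```python
import itertools
import math

# Minimal sketch (illustrative): P(Q | E, w) = exp(sum_a w_a * f_a(Q_a, E)) / Z(E, w).
# Note: no summation over values of E and no model of P(E).
def score(q, e, w, features):
    return sum(w_a * f_a(q, e) for w_a, f_a in zip(w, features))

def conditional_prob(q, e, w, features, n_query_vars):
    z = sum(math.exp(score(q2, e, w, features))
            for q2 in itertools.product([0, 1], repeat=n_query_vars))
    return math.exp(score(q, e, w, features)) / z

# toy: two binary query variables, evidence is a real vector
features = [lambda q, e: q[0] * e[0],          # query-evidence feature
            lambda q, e: q[1] * e[1],
            lambda q, e: float(q[0] == q[1])]  # query-query ("link") feature
w = [1.5, -0.7, 0.8]
print(conditional_prob((1, 0), e=[0.3, 2.0], w=w, features=features, n_query_vars=2))
```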

Page 31: Query-Specific Learning and Inference for Probabilistic Graphical Models

31

Model tractability still important

Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting

Tractability is determined by the structure over query

Page 32: Query-Specific Learning and Inference for Probabilistic Graphical Models

32

Simple local models: motivation

[Figure: query Q as a function of evidence E, Q = f(E); locally almost linear]

Exploiting evidence values overcomes the expressive power deficit of simple models

We will learn local tractable models

Page 33: Query-Specific Learning and Inference for Probabilistic Graphical Models

33

Context-specific independence

Observation #2: use evidence values at test time to tune the structure of the models, do not commit to a single tractable model


Page 34: Query-Specific Learning and Inference for Probabilistic Graphical Models

34

Low-dimensional dependencies in generative structure learning

Generative structure learning often relies only on low-dimensional marginals:

• Junction trees: decomposable scores
  LLH(C, E) = Σ_{separators S} H(X_S) − Σ_{cliques C} H(X_C)
• Low-dimensional independence tests: I(A, B | S)
• Small changes to structure ⇒ quick score recomputation

Discriminative structure learning: need inference in the full model
for every datapoint, even for small changes in structure

Page 35: Query-Specific Learning and Inference for Probabilistic Graphical Models

35

Leverage generative learning

Observation #3: generative structure learning algorithms have very useful properties, can we leverage them?

Page 36: Query-Specific Learning and Inference for Probabilistic Graphical Models

36

Observations so far

• The discriminative setting has extra information, including evidence values at test time
  — want to use it to learn local tractable models
• Good structure learning algorithms exist for the generative setting
  that only require low-dimensional marginals P(Qβ)

Approach:
1. Use local conditionals P(Qβ | E=E) as “fake marginals” to learn local tractable structures
2. Learn exact discriminative feature weights

Page 37: Query-Specific Learning and Inference for Probabilistic Graphical Models

37

Evidence-specific CRF overview

Approach:
1. Use local conditionals P(Qβ | E=E) as “fake marginals” to learn local tractable structures
2. Learn exact discriminative feature weights

Pipeline:
local conditional density estimators P(Qβ | E) + evidence value E=E
→ P(Qβ | E=E)
→ generative structure learning algorithm
→ tractable structure for E=E
→ (combined with feature weights w) tractable evidence-specific CRF

Page 38: Query-Specific Learning and Inference for Probabilistic Graphical Models

Evidence-specific CRF formalism

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

Observation: an identically zero feature (f_α ≡ 0) does not affect the model.

Evidence-specific structure: I_α(E, u) ∈ {0, 1}, with extra “structural” parameters u.

[Figure: (fixed dense model) × (evidence-specific tree “mask”) = (evidence-specific model);
for each evidence value E=E1, E2, E3 the mask selects different evidence-specific feature values]
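The masking idea can be sketched directly. This is illustrative code (not the thesis implementation): each feature corresponds to one query edge, and select_tree_edges is a hypothetical stand-in for the plugged-in generative structure learner that returns the evidence-specific tree mask I(E, u).

```python
import itertools
import math

# Sketch: P(Q | E, w, u) ∝ exp( sum_a w_a * f_a(Q_a, E) * I_a(E, u) )
# The dense model has one feature per query edge; the mask keeps only the edges
# of an evidence-specific tree chosen from the evidence value E and parameters u.
def ess_crf_prob(q, e, w, edge_features, select_tree_edges, u, n_query_vars):
    """w, edge_features: dicts keyed by query edge (i, j); q: tuple of 0/1 values."""
    mask = select_tree_edges(e, u)                 # dict: edge -> 0 or 1
    def masked_score(q_):
        return sum(w[edge] * f(q_, e) * mask.get(edge, 0)
                   for edge, f in edge_features.items())
    z = sum(math.exp(masked_score(q_))
            for q_ in itertools.product([0, 1], repeat=n_query_vars))
    return math.exp(masked_score(q)) / z
```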

Page 39: Query-Specific Learning and Inference for Probabilistic Graphical Models

39

Evidence-specific CRF learning

Learning is in the same order as testing

Local conditional density estimators P(Qβ | E) + evidence value E=E
→ P(Qβ | E=E)
→ generative structure learning algorithm
→ tractable structure for E=E
→ (combined with feature weights w) tractable evidence-specific CRF

Page 40: Query-Specific Learning and Inference for Probabilistic Graphical Models

40

Plug in generative structure learning

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

I(E, u) encodes the output of the chosen structure learning algorithm.

Directly generalize generative algorithms:

Generative:      P(Qi, Qj) (pairwise marginals)         + Chow-Liu algorithm = optimal tree
Discriminative:  P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
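The Chow-Liu step in this pipeline is a maximum spanning tree over pairwise mutual information scores; in the evidence-specific case the scores would be computed from the learned conditionals P̂(Qi, Qj | E=E, u). A hedged sketch with toy scores (Prim-style, illustrative only):

```python
# Sketch of the Chow-Liu step: given pairwise (conditional) mutual information
# scores, return the edges of a maximum spanning tree over the query variables.
def chow_liu_tree(n_vars, mi):
    """mi: dict mapping frozenset({i, j}) -> mutual information score."""
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:
        best = max(((i, j) for i in in_tree for j in range(n_vars) if j not in in_tree),
                   key=lambda e: mi[frozenset(e)])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# toy scores over 4 query variables
mi = {frozenset({0, 1}): 0.9, frozenset({0, 2}): 0.1, frozenset({0, 3}): 0.2,
      frozenset({1, 2}): 0.5, frozenset({1, 3}): 0.05, frozenset({2, 3}): 0.7}
print(chow_liu_tree(4, mi))   # [(0, 1), (1, 2), (2, 3)]
```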

Page 41: Query-Specific Learning and Inference for Probabilistic Graphical Models

41

Evidence-specific CRF learning: structure

1. Choose a generative structure learning algorithm A
2. Identify the low-dimensional subsets Qβ that A may need
   (Chow-Liu: all pairs (Qi, Qj))
3. Replace the original problem (E, Q) with low-dimensional pairwise problems
   (E, {Q1,Q2}), (E, {Q1,Q3}), (E, {Q3,Q4}), …
   and estimate P̂(Q1,Q2 | E, u), P̂(Q1,Q3 | E, u), P̂(Q3,Q4 | E, u), …

Page 42: Query-Specific Learning and Inference for Probabilistic Graphical Models

42

Estimating low-dimensional conditionals

Use the same features as the baseline high-treewidth model

Baseline CRF:           P(Q | E, w)  = (1/Z(E, w))   exp( Σ_α w_α f_α(Q_α, E) )

Low-dimensional model:  P(Qβ | E, u) = (1/Z_β(E, u)) exp( Σ_{α s.t. Q_α ⊆ Qβ} u_α f_α(Q_α, E) )

Scope restriction: only features whose query scope falls inside Qβ.

End result: optimal u.

Page 43: Query-Specific Learning and Inference for Probabilistic Graphical Models

43

Evidence-specific CRF learning: weights

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

• Already chose the algorithm behind I(E, u)
• Already learned the structural parameters u
• Only need to learn the feature weights w

log P(Q | E, w, u) is concave in w ⇒ unique global optimum
(the products f_α(Q_α, E) · I_α(E, u) act as “effective features”)

Page 44: Query-Specific Learning and Inference for Probabilistic Graphical Models

44

Evidence-specific CRF learning: weights

∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)} f_α(Q_α, E) ]

P(Q | E, w, u) is a tree-structured distribution:
(fixed dense model) × (evidence-specific tree “mask”) for each evidence value E = E1, E2, E3, …

⇒ exact tree-structured gradients w.r.t. w for each datapoint (Q, E),
summed into the overall (dense) gradient
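A hedged sketch of this gradient follows. For brevity the expectation is computed by brute-force enumeration over the (small) query; in practice the point of the evidence-specific tree mask is that exact tree inference replaces this enumeration. Names and features are illustrative.

```python
import itertools
import math

# Sketch: gradient of log P(Q | E, w, u) w.r.t. w for one datapoint,
#   grad_a = I_a(E, u) * ( f_a(Q, E) - E_{P(.|E,w,u)}[ f_a(Q, E) ] ).
def ess_crf_gradient(q, e, w, features, mask, n_query_vars):
    def score(q_):
        return sum(w_a * f(q_, e) * m for w_a, f, m in zip(w, features, mask))
    assignments = list(itertools.product([0, 1], repeat=n_query_vars))
    weights = [math.exp(score(q_)) for q_ in assignments]
    z = sum(weights)
    grad = []
    for f, m in zip(features, mask):
        expected = sum(wt * f(q_, e) for wt, q_ in zip(weights, assignments)) / z
        grad.append(m * (f(q, e) - expected))   # masked feature - expected masked feature
    return grad
```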

Page 45: Query-Specific Learning and Inference for Probabilistic Graphical Models

45

Results – WebKB: text + links → webpage topic

[Plots: prediction error (SVM, RMN, ESS-CRF, M3N; lower is better) and training time (RMN, ESS-CRF, M3N)]

SVM: ignore links;  RMN: standard dense CRF;  ESS-CRF: our work;  M3N: max-margin model

Page 46: Query-Specific Learning and Inference for Probabilistic Graphical Models

46

Image segmentation – accuracy: local segment features + neighbor segments → type of object

[Plot: accuracy of logistic regression, dense CRF, and ESS-CRF; higher is better]

Logistic regression: ignore links;  Dense CRF: standard dense CRF;  ESS-CRF: our work

Page 47: Query-Specific Learning and Inference for Probabilistic Graphical Models

47

Image segmentation - time

[Plots, log scale: train time and test time for logistic regression, dense CRF, and ESS-CRF; lower is better]

Logistic regression: ignore links;  Dense CRF: standard dense CRF;  ESS-CRF: our work

Page 48: Query-Specific Learning and Inference for Probabilistic Graphical Models

48

Conclusions

• Using evidence values to tune low-treewidth model structure
  – compensates for the reduced expressive power
  – order-of-magnitude speedup at test time (sometimes train time too)
• General framework for plugging in existing generative structure learners
• Straightforward relational extension [in the thesis]

Page 49: Query-Specific Learning and Inference for Probabilistic Graphical Models

49

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 50: Query-Specific Learning and Inference for Probabilistic Graphical Models

50

Why high-treewidth models?

A dense model expressing laws of nature

Protein folding

Max-margin parameters don’t work well (yet?) with evidence-specific structures

Page 51: Query-Specific Learning and Inference for Probabilistic Graphical Models

51

Query-specific inference problem

[Figure: a large model with query variables, evidence variables, and variables that are not interesting (nuisance)]

Using information about the query to speed up convergence of belief propagation for the query marginals

P(X) ∝ ∏_{(i,j) ∈ E} f_ij(X_i, X_j)

Page 52: Query-Specific Learning and Inference for Probabilistic Graphical Models

52

(loopy) Belief Propagation

Passing messages along edges (i, j) ∈ E.

Update rule:     m_{i→j}^{(t+1)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(t)}(x_i)

Variable belief: P^{(t)}(x_i) ∝ ∏_{j ∈ Γ(i)} m_{j→i}^{(t)}(x_i)

Result: all single-variable beliefs.

[Figure: message passing between nodes r, s, j, k, i, h, u]
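The update and belief equations above translate into a short sketch. This is illustrative code for a pairwise model with a plain round-robin schedule, which the residual and query-specific schedules on the following slides then refine; data structures and names are assumptions, not the thesis code.

```python
import numpy as np

# Sketch of loopy BP for a pairwise model:
#   m_{i->j}(x_j) = sum_{x_i} f_ij(x_i, x_j) * prod_{k in nbrs(i)\j} m_{k->i}(x_i)
#   P(x_i) ~ prod_{j in nbrs(i)} m_{j->i}(x_i)
def round_robin_bp(factors, n_vars, n_states, n_iters=50):
    nbrs = {i: set() for i in range(n_vars)}
    for (i, j) in factors:
        nbrs[i].add(j); nbrs[j].add(i)
    msgs = {(i, j): np.ones(n_states) / n_states
            for i in range(n_vars) for j in nbrs[i]}

    def factor(i, j):   # f_ij with rows indexed by x_i, columns by x_j
        return factors[(i, j)] if (i, j) in factors else factors[(j, i)].T

    for _ in range(n_iters):
        for (i, j) in list(msgs):
            incoming = np.ones(n_states)
            for k in nbrs[i] - {j}:
                incoming *= msgs[(k, i)]
            new = factor(i, j).T @ incoming          # sum over x_i
            msgs[(i, j)] = new / new.sum()           # keep messages normalized

    beliefs = []
    for i in range(n_vars):
        b = np.ones(n_states)
        for j in nbrs[i]:
            b *= msgs[(j, i)]
        beliefs.append(b / b.sum())
    return beliefs
```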

Page 53: Query-Specific Learning and Inference for Probabilistic Graphical Models

53

(loopy) Belief Propagation

Message dependencies are local ⇒ freedom in scheduling updates.

Round-robin schedule: fix a message order and apply updates in that order until convergence.

[Figure: local dependencies between messages around nodes r, s, j, k, i, h, u]

Page 54: Query-Specific Learning and Inference for Probabilistic Graphical Models

54

Dynamic update prioritization

A fixed update sequence is not the best option; dynamic update scheduling can speed up convergence:
• Tree-Reweighted BP [Wainwright et al., AISTATS 2003]
• Residual BP [Elidan et al., UAI 2006]

Residual BP: apply the largest change first.

[Figure: a large change propagated where it matters is an informative update;
a large change propagated into a region of small changes is wasted computation]

Page 55: Query-Specific Learning and Inference for Probabilistic Graphical Models

55

Residual BP [Elidan et al., UAI 2006]

Update rule:  m_{i→j}^{(NEW)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(OLD)}(x_i)

1. Pick the edge with the largest residual:  max_{ij} || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
2. Update: m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}

More effort on the difficult parts of the model.

But no query
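A hedged sketch of the residual BP loop follows. It assumes the nbrs / msgs / factor structures from the BP sketch above and rescans all edges for simplicity (a real implementation keeps residuals in a priority queue). Note the comment on the single change that query-specific BP makes.

```python
import numpy as np

# Sketch of residual BP: recompute candidate messages and commit the one with the
# largest residual ||m_new - m_old||.  Query-specific BP (next slides) changes one
# thing only: the priority becomes A[(i, j)] * residual, with A the edge importance.
def residual_bp(nbrs, msgs, factor, n_states, n_updates=1000, tol=1e-6):
    def candidate(i, j):
        incoming = np.ones(n_states)
        for k in nbrs[i] - {j}:
            incoming *= msgs[(k, i)]
        new = factor(i, j).T @ incoming
        return new / new.sum()

    for _ in range(n_updates):
        best_edge, best_res, best_msg = None, 0.0, None
        for (i, j) in msgs:
            new = candidate(i, j)
            res = np.abs(new - msgs[(i, j)]).max()       # residual of this edge
            if res > best_res:
                best_edge, best_res, best_msg = (i, j), res, new
        if best_res < tol:
            break
        msgs[best_edge] = best_msg                        # apply the largest change first
    return msgs
```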

Page 56: Query-Specific Learning and Inference for Probabilistic Graphical Models

56

Why edge importance weights?

[Figure: two candidate updates with residual₁ < residual₂ — which to update?]

• Residual BP updates the larger residual even when it has no influence on the query: wasted computation
• We want to update the edge with influence on the query in the future

Residual BP: maximize the immediate residual reduction.
Our work: maximize the approximate eventual effect on P(query).

Page 57: Query-Specific Learning and Inference for Probabilistic Graphical Models

57

Query-Specific BP

Update rule:  m_{i→j}^{(NEW)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(OLD)}(x_i)

1. Pick the edge maximizing the weighted residual:  max_{ij} A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
   (A_{ij} is the edge importance — the only change!)
2. Update: m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}

Rest of the talk: defining and computing the edge importance A_{ij}.

Page 58: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance base case

Goal: approximate the eventual effect of an update m^{(OLD)} → m^{(NEW)} on P(Q).

Base case: an edge (j→i) directly connected to the query. A_{ji} = ?

|| P^{(NEW)}(Q) − P^{(OLD)}(Q) ||   ≤   1 · || m_{j→i}^{(NEW)} − m_{j→i}^{(OLD)} ||
(change in query belief)                 (change in message)

The bound is tight ⇒ A_{ji} = 1.

Page 59: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance one step away

Edge (r→j) one step away from the query: A_{rj} = ?

|| ΔP(Q) ||  ≤  sup || ∂m_{j→i} / ∂m_{r→j} ||  ·  || Δm_{r→j} ||
(change in query belief)   (message importance, sup over values of all other messages)   (change in message)

The message importance can be computed in closed form, looking only at f_ji [Mooij, Kappen; 2007].

Page 60: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance: general case

Base case: A_{ji} = 1.     One step away: A_{rj} = sup || ∂m_{j→i} / ∂m_{r→j} ||.

Generalization: A_{sh} = sup || ∂P(Q) / ∂m_{s→h} || ?
Expensive to compute, and the bound may be infinite.

Instead, chain the one-step bounds along a path π from the edge to the query:

sensitivity(π) = sup || ∂m_{h→r} / ∂m_{s→h} || · sup || ∂m_{r→j} / ∂m_{h→r} || · sup || ∂m_{j→i} / ∂m_{r→j} ||

— the maximum impact along the path π.

Page 61: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance: general case

sensitivity(π) = maximum impact along the path π (product of the one-step sup terms)

A_{sh} = max over all paths π from (s→h) to the query of sensitivity(π)

There are a lot of paths in a graph; trying out every one is intractable.

Page 62: Query-Specific Learning and Inference for Probabilistic Graphical Models

62

Efficient edge importance computation

A = max over all paths π from the edge to the query of sensitivity(π)

There are a lot of paths in a graph; trying out every one is intractable. But:

sensitivity(h→r→j→i) = sup || ∂m_{r→j} / ∂m_{h→r} || · sup || ∂m_{j→i} / ∂m_{r→j} ||

• every factor is always ≤ 1
• so sensitivity always decreases as the path grows
• and it decomposes into individual edge contributions

⇒ Dijkstra’s (shortest paths) algorithm will efficiently find max-sensitivity paths for every edge.
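Because every per-edge factor lies in [0, 1] and the path score only shrinks, a Dijkstra-style search on the max-product semiring starting from the query recovers all importance weights. A hedged sketch (illustrative; the per-edge sensitivities sup||∂m/∂m|| are assumed precomputed, e.g. with the Mooij-Kappen closed form):

```python
import heapq

# Sketch: importance weight A_e for every message edge e, where A_e is the maximum
# over paths from e to the query of the product of per-edge sensitivities (each <= 1).
def edge_importance(query_edges, predecessors, sensitivity):
    """query_edges: edges feeding the query directly (A = 1).
    predecessors[e]: message edges that the computation of e depends on.
    sensitivity[(p, e)]: precomputed sup || dm_e / dm_p || in [0, 1]."""
    A = {e: 1.0 for e in query_edges}
    heap = [(-1.0, e) for e in query_edges]          # max-heap via negated weights
    heapq.heapify(heap)
    while heap:
        neg_w, e = heapq.heappop(heap)
        if -neg_w < A.get(e, 0.0):                   # stale heap entry, skip
            continue
        for p in predecessors.get(e, ()):
            w = (-neg_w) * sensitivity[(p, e)]       # extend the path by one edge
            if w > A.get(p, 0.0):
                A[p] = w
                heapq.heappush(heap, (-w, p))
    return A
```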

Page 63: Query-Specific Learning and Inference for Probabilistic Graphical Models

63

Query-Specific BP

A_{ji} = max over all paths π from i to the query of sensitivity(π)

1. Run Dijkstra’s algorithm starting at the query to get the edge weights A
2. Pick the edge with the largest weighted residual:  max_{ij} A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
3. Update m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}; repeat from step 2

More effort on the parts of the model that are both difficult and relevant to the query —
takes into account not only the graphical structure, but also the strength of dependencies.

Page 64: Query-Specific Learning and Inference for Probabilistic Graphical Models

64

Experiments – single query

[Plots: convergence on an easy model (sparse connectivity, weak interactions) and a hard model
(dense connectivity, strong interactions); standard residual BP vs. our work; faster is better]

Faster convergence, but long initialization still a problem

Page 65: Query-Specific Learning and Inference for Probabilistic Graphical Models

65

Anytime query-specific BP

Query-specific BP: run Dijkstra’s algorithm to completion, then start BP updates.
Anytime QSBP: interleave Dijkstra’s expansions with BP updates — the same BP update sequence!

[Figure: timeline of Dijkstra’s algorithm steps and BP updates around the query]

Page 66: Query-Specific Learning and Inference for Probabilistic Graphical Models

66

Experiments – anytime QSBP

[Plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity,
strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]

Much shorter initialization.

Page 67: Query-Specific Learning and Inference for Probabilistic Graphical Models

67

Experiments – multiquery

[Plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity,
strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]

Page 68: Query-Specific Learning and Inference for Probabilistic Graphical Models

68

Conclusions

• Weighting edges is a simple and effective way to improve prioritization
• We introduce a principled notion of edge importance based on both the structure and the parameters of the model
• Robust speedups in the query-specific setting

Don’t spend computation on nuisance variables unless needed for the query marginal

Deferring BP initialization has a large impact

Page 69: Query-Specific Learning and Inference for Probabilistic Graphical Models

69

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 70: Query-Specific Learning and Inference for Probabilistic Graphical Models

70

Future work

More practical JT learning:

SAT solvers to construct structure, pruning heuristics, …

Evidence-specific learning:
• Trade efficiency for accuracy
• Max-margin evidence-specific models

Theory on ES structures too

Inference:
• Beyond query-specific: better prioritization in general
• Beyond BP: query-specific Gibbs sampling?

Page 71: Query-Specific Learning and Inference for Probabilistic Graphical Models

71

Thesis conclusions

• Graphical models are a regularization technique for high-dimensional distributions
• Representation-based structure is well understood

Conditional independencies

Right now, structured computation is a “consequence” of representation

Major issues with tractability, approximation quality

Logical next step: structured computation as a primary basis of regularization.

This thesis: computation-centric approaches have better efficiency and do not sacrifice accuracy.

Page 72: Query-Specific Learning and Inference for Probabilistic Graphical Models

72

Thank you!

Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf

Page 73: Query-Specific Learning and Inference for Probabilistic Graphical Models

73

Mutual info upper bound: quality

Upper bound: suppose an ε-JT exists, and δ is the largest mutual information over small subsets.
Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε).

• No need to know the ε-JT, only that it exists
• No connection between C and the JT separators
• C can be of any size, no connection to the JT treewidth
• The bound is loose only when there is no hope of learning a good JT

Page 74: Query-Specific Learning and Inference for Probabilistic Graphical Models

74

Typical graphical models workflow

Learn/construct structure → learn/define parameters → inference P(Q | E=e)

Typical: a reasonable intractable structure from domain knowledge
→ approximate algorithms, no quality guarantees → approx. P(Q | E=e)

The graph is primarily a representation tool.

Page 75: Query-Specific Learning and Inference for Probabilistic Graphical Models

75

Contributions – tractable models

Learn accurate and tractable models:

In the generative setting [NIPS 2007]:
• Polynomial-time conditional mutual information upper bound
• First PAC-learning result for strongly connected junction trees
• Graceful degradation guarantees
• Speedup heuristics

In the discriminative setting [NIPS 2010]:
• General framework for learning CRF structure that depends on evidence values at test time
• Extensions to the relational setting
• Empirical: order-of-magnitude speedups with the same accuracy as high-treewidth models

Page 76: Query-Specific Learning and Inference for Probabilistic Graphical Models

76

Contributions – faster inference

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]:

• A framework of importance-weighted residual belief propagation
  – a principled measure of the eventual impact of an edge update on the query belief
  – prioritize updates by importance for the query instead of absolute magnitude
• An anytime modification to defer much of the initialization
  – initial inference results available much sooner
  – often much faster eventual convergence
  – the same fixed points as the full model

Page 77: Query-Specific Learning and Inference for Probabilistic Graphical Models

77

Future work

Two main bottlenecks:

1. Constructing JTs given mutual information values, esp. with non-uniform treewidth and dependence strength
   • Large sample: learnability guarantees for non-uniform treewidth
   • Small sample: non-uniform treewidth for regularization
   • Constraint satisfaction, SAT solvers, etc.?
   • Relax the strong connectivity requirement?

2. Evaluating mutual information: need to look at 2k+1 variables instead of k+1 — a large penalty
   • Branch on features instead of sets of variables? [Gogate+al:2010]

Speedups without guarantees: local search, greedy separator construction, …

Page 78: Query-Specific Learning and Inference for Probabilistic Graphical Models

78

Log-linear parameter learning

Conditional log-likelihood:

LLH(w | D) = Σ_{(Q, E) ∈ D} log P(Q | E, w)

Convex optimization: unique global maximum.

Gradient: features − [expected features]

∂ log P(Q | E, w) / ∂w_α = f_α(Q_α, E) − E_{P(Q | E, w)} f_α(Q_α, E)

⇒ need inference for every E given w.

Page 79: Query-Specific Learning and Inference for Probabilistic Graphical Models

79

Log-linear parameter learning

                Generative (E = ∅)              Discriminative

Tractable       closed-form                     exact gradient-based
Intractable     approximate gradient-based      approximate gradient-based
                (no guarantees)                 (no guarantees)

Generative: inference once per weights update.
Discriminative: inference for every datapoint (Q, E) once per weights update —
a “manageable” slowdown by the number of datapoints.

The complexity “phase transition” is between tractable and intractable models.

Page 80: Query-Specific Learning and Inference for Probabilistic Graphical Models

80

Plug in generative structure learning

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

I(E, u) encodes the output of the chosen structure learning algorithm:
• Chow-Liu for optimal trees
• Our thin junction tree learning from part 1
• Karger-Srebro for high-quality low-diameter junction trees
• Local search, etc.

Fix the algorithm ⇒ always get structures with the desired properties (e.g. treewidth):
replace the marginals P(Qβ) with approximate conditionals P(Qβ | E=E, u) everywhere.

Page 81: Query-Specific Learning and Inference for Probabilistic Graphical Models

81

Evidence-specific CRF learning: weights

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

• Already know the algorithm behind I(E, u)
• Already learned u
• Only need to learn w

The structure induced by I(E, u) is always tractable
⇒ can find the evidence-specific structure I(E=E, u) for every training datapoint (Q, E)
⇒ learn the optimal w exactly:

∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)} f_α(Q_α, E) ]
(a tree-structured distribution, so the expectation is exact)

Page 82: Query-Specific Learning and Inference for Probabilistic Graphical Models

82

Relational evidence-specific CRF

Relational models: templated features + shared weights.

Relation: LinksTo(webpage, webpage).
Groundings: every linked pair of webpages.
Learn a single weight wLINK and copy it for every grounding.

Page 83: Query-Specific Learning and Inference for Probabilistic Graphical Models

83

Relational evidence-specific CRF

Relational models: templated features + shared weights.

Every grounding is a separate datapoint for structure training
⇒ use the propositional approach + shared weights.

[Figure: grounded model over x1, …, x5; the training datasets for the “structural” parameters u
are all grounded pairs x1x2, x1x3, x1x4, x1x5, x2x3, x2x4, x2x5, x3x4, x3x5, x4x5]

Page 84: Query-Specific Learning and Inference for Probabilistic Graphical Models

84

Future work

• Faster learning: pseudolikelihood is really fast, need to compete
• Larger treewidth: trade time for accuracy
• Theory on learning the “structural parameters” u
• Max-margin learning
  – inference is a basic step in max-margin learning too ⇒ tractable models are useful beyond log-likelihood
  – optimizing feature weights w given local trees is straightforward
  – optimizing “structural parameters” u for max-margin is hard; what is the right objective?
• Almost-tractable structures, other tractable models
  – make sure loops don’t hurt too much

Page 85: Query-Specific Learning and Inference for Probabilistic Graphical Models

85

Query versus nuisance variables

We may actually care about only a few variables:
• What are the topics of the webpages on the first page of Google search results for my query?
• Smart heating control: is anybody going to be at home for the next hour?
• Does the patient need immediate doctor attention?

But the model may need a lot of other variables to be accurate enough —
we don’t care about them per se, but we must look at them to get the query right.

Both query and nuisance variables are unknown; inference algorithms don’t see a difference.
Speed up inference by focusing on the query:
only look at nuisance variables to the extent needed to answer the query.

Page 86: Query-Specific Learning and Inference for Probabilistic Graphical Models

86

Our contributions

Using weighted residuals to prioritize updates

Define message weights reflecting the importance of the message to the query

Computing importance weights efficiently

Experiments: faster convergence on large relational models

Page 87: Query-Specific Learning and Inference for Probabilistic Graphical Models

87

Interleaving

Dijkstra’s expands the highest-weight edges first.

[Figure: query; edges expanded on previous iterations, just expanded, and not yet expanded]

Let M ≥ max over ALL edges of || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||.
Every not-yet-expanded edge has A ≤ (min A over expanded edges),
so its actual priority is at most M · (min expanded A) — an upper bound on its priority.

Suppose   max over EXPANDED edges of  A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||   ≥   M · (min expanded A):
then there is no need to expand further at this point.

Page 88: Query-Specific Learning and Inference for Probabilistic Graphical Models

88

Deferring BP initialization

Observation: Dijkstra’s alg. expands the most important edges first

Do we really need to look at every low-importance edge before applying BP updates?

No! Can use upper bounds on priority instead.

Page 89: Query-Specific Learning and Inference for Probabilistic Graphical Models

89

Upper bounds in priority queue

Observation: for edges low in the priority queue, an upper bound on the priority is enough

[Figure: the updates priority queue]

Exact priority is needed only for the top element; a priority upper bound is enough lower in the queue.

Page 90: Query-Specific Learning and Inference for Probabilistic Graphical Models

90

Priority upper bound for not-yet-seen edges

priority(edge) = residual(edge) · importance weight(edge)

Expand several edges with Dijkstra’s: for those, (residual) · (weight) = exact priority.

For all the other edges:
• residual(edge) ≤ || factor(edge) ||   (component-wise bound)
• importance weight(edge) ≤ importance weight of any edge that is already expanded

⇒ an upper bound on the priority without looking at the edge!
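The resulting interleaving rule is small enough to sketch. This is illustrative pseudocode-made-runnable, with hypothetical helper names (expand_next, apply_update): BP updates proceed on already-expanded edges as long as their best exact priority beats the upper bound available for the unseen ones.

```python
# Sketch: one step of anytime QSBP.  Unseen edges are bounded by
# (residual bound from the factors) * (min importance weight expanded so far).
def anytime_qsbp_step(expanded, residual, A, residual_bound, expand_next, apply_update):
    """expanded: set of edges with exact weights A[e]; residual[e]: exact residual;
    residual_bound: bound for not-yet-seen edges; expand_next(): one Dijkstra expansion;
    apply_update(e): one BP update on edge e."""
    best_edge = max(expanded, key=lambda e: residual[e] * A[e], default=None)
    best_priority = residual[best_edge] * A[best_edge] if best_edge else 0.0
    unseen_bound = residual_bound * min((A[e] for e in expanded), default=float("inf"))
    if best_edge is None or unseen_bound > best_priority:
        expand_next()            # an unexpanded edge might still have higher priority
    else:
        apply_update(best_edge)  # safe: no unseen edge can beat this priority
```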

Page 91: Query-Specific Learning and Inference for Probabilistic Graphical Models

91

Interleaving BP and Dijkstra’s

Alternate between Dijkstra’s expansions and BP updates: Dijkstra → BP → Dijkstra → Dijkstra → BP → BP → …

• If the best exact priority ≥ the upper bound for unseen edges: do a BP update.
• If the best exact priority < the upper bound: let Dijkstra’s expand an edge.

[Figure: the query within the full model]