
Query-Specific Learning and Inference for Probabilistic Graphical Models


Page 1: Query-Specific Learning and Inference for Probabilistic Graphical Models

Carnegie Mellon

Query-Specific Learning and Inference for Probabilistic Graphical Models

Thesis committee: Carlos Guestrin Eric Xing J. Andrew Bagnell Pedro Domingos (University of Washington)

14 June 2011

Anton Chechetka

Page 2: Query-Specific Learning and Inference for Probabilistic Graphical Models

2

Motivation

Fundamental problem: reasoning accurately about noisy, high-dimensional data with local interactions

Page 3: Query-Specific Learning and Inference for Probabilistic Graphical Models

3

Sensor networks

• noisy: sensors fail; noise in readings
• high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations

Page 4: Query-Specific Learning and Inference for Probabilistic Graphical Models

4

Hypertext classification

• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics

Page 5: Query-Specific Learning and Inference for Probabilistic Graphical Models

5

Image segmentation

• noisy: local information is not enough; camera sensor noise; compression artifacts
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky

Page 6: Query-Specific Learning and Inference for Probabilistic Graphical Models

6

Probabilistic graphical models

Noisy, high-dimensional data with local interactions
→ a graph to encode only direct interactions
→ probabilistic inference over many variables:

P(Q | E) = P(Q, E) / P(E)

Q: query, E: evidence

Page 7: Query-Specific Learning and Inference for Probabilistic Graphical Models

7

Graphical models semantics

Factorized distributions:

P(X) = (1/Z) ∏_{f_α ∈ F} f_α(X_α)

Graph structure: [figure: nodes X1–X7 with edges for direct interactions; example clique X_α = {X3, X4, X5}; a separator is highlighted]

X_α are small subsets of X ⇒ compact representation
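The factorization above can be made concrete with a short sketch. This is illustrative code, not the thesis implementation: it assumes binary variables and a toy Factor class, and shows why summing out all assignments (the partition function Z) is exactly the exponential blow-up that structure is meant to avoid.

```python
import itertools
import numpy as np

# Minimal sketch (illustrative): a factor over a small subset of binary
# variables, and the unnormalized factorized distribution P(x) ∝ ∏_a f_a(x_a).
class Factor:
    def __init__(self, scope, table):
        self.scope = scope          # tuple of variable indices X_a
        self.table = table          # np.array indexed by the values of X_a

    def value(self, assignment):
        return self.table[tuple(assignment[v] for v in self.scope)]

def unnormalized_prob(factors, assignment):
    p = 1.0
    for f in factors:
        p *= f.value(assignment)
    return p

def partition_function(factors, n_vars):
    # Z sums over all 2^n assignments -- the exponential cost that
    # low-treewidth (junction tree) structure avoids.
    return sum(unnormalized_prob(factors, x)
               for x in itertools.product([0, 1], repeat=n_vars))

# toy model over X1..X3 with pairwise factors (X1,X2) and (X2,X3)
factors = [Factor((0, 1), np.array([[2.0, 1.0], [1.0, 2.0]])),
           Factor((1, 2), np.array([[2.0, 1.0], [1.0, 2.0]]))]
Z = partition_function(factors, 3)
print(unnormalized_prob(factors, (0, 0, 0)) / Z)
```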

Page 8: Query-Specific Learning and Inference for Probabilistic Graphical Models

8

Graphical models workflow

P(X) = (1/Z) ∏_{f_α ∈ F} f_α(X_α)

Learn/construct structure → learn/define parameters → inference P(Q | E=e)

[Figure: factorized distribution and graph structure over X1–X7]

Page 9: Query-Specific Learning and Inference for Probabilistic Graphical Models

9

Graphical models: fundamental problems

Learn/construct structure (NP-complete)
→ learn/define parameters (exp(|X|))
→ inference P(Q | E=e) (#P-complete exact, NP-complete approximate)

Compounding errors along the pipeline

Page 10: Query-Specific Learning and Inference for Probabilistic Graphical Models

10

Domain-knowledge structures don’t help

[Figure: web of linked webpages]

Domain knowledge-based structures do not support tractable inference

Page 11: Query-Specific Learning and Inference for Probabilistic Graphical Models

11

This thesis: general directions

Emphasizing the computational aspects of the graph:

Learn accurate and tractable models
• Compensate for reduced expressive power with exact inference and optimal parameters
• Gain significant speedups

Inference speedups via better prioritization of computation
• Estimate the long-term effects of propagating information through the graph
• Use long-term estimates to prioritize updates

New algorithms for learning and inference in graphical models to make answering the queries better

Page 12: Query-Specific Learning and Inference for Probabilistic Graphical Models

12

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 13: Query-Specific Learning and Inference for Probabilistic Graphical Models

13

Generative learning

P(Q | E) = P(Q, E) / P(E)

query goal: P(Q | E);  learning goal: the joint P(Q, E)

Useful when E is not known in advance

Sensors fail unpredictably

Measurements are expensive (e.g. user time), want adaptive evidence selection

Page 14: Query-Specific Learning and Inference for Probabilistic Graphical Models

14

Tractable vs intractable models workflow

Tractable models: learn a simple tractable structure from domain knowledge + data
→ optimal parameters, exact inference → approx. P(Q | E=e)

Intractable models: construct an intractable structure from domain knowledge,
or learn an intractable structure from data
→ approximate algorithms, no quality guarantees → approx. P(Q | E=e)

Page 15: Query-Specific Learning and Inference for Probabilistic Graphical Models

Tractability via low treewidth

• Exact inference exponential in treewidth (sum-product)
• Treewidth is NP-complete to compute in general
• Low-treewidth graphs are easy to construct
• Convenient representation: junction tree
• Other tractable model classes exist too

15

[Figure: example graph on nodes 1–7 and its triangulation]

Treewidth: size of the largest clique in a triangulated graph, minus one

Page 16: Query-Specific Learning and Inference for Probabilistic Graphical Models

16

Junction trees

• Cliques connected by edges with separators
• Running intersection property
• Finding the most likely junction tree of given treewidth > 1 is NP-complete
• We will look for good approximations

[Figure: graph on nodes 1–7 and a corresponding junction tree with cliques
C1 = {X1,X2,X7}, C2 = {X1,X2,X5}, C3 = {X1,X3,X5}, C4 = {X1,X4,X5}, C5 = {X4,X5,X6},
connected by separators {X1,X2}, {X1,X5}, {X1,X5}, {X4,X5}]
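The defining property of the junction tree on this slide can be checked mechanically. Below is a minimal sketch (illustrative, not thesis code) that verifies the running intersection property for one consistent choice of edges over the cliques above: for every variable, the cliques containing it must form a connected subtree.

```python
from collections import defaultdict

# Minimal sketch: running intersection property check for the example junction tree.
cliques = {
    "C1": {"X1", "X2", "X7"},
    "C2": {"X1", "X2", "X5"},
    "C3": {"X1", "X3", "X5"},
    "C4": {"X1", "X4", "X5"},
    "C5": {"X4", "X5", "X6"},
}
edges = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "C5")]  # one consistent choice

def running_intersection_holds(cliques, edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    for v in set().union(*cliques.values()):
        containing = {c for c, scope in cliques.items() if v in scope}
        # the cliques containing v must be connected in the tree
        start = next(iter(containing))
        seen, stack = {start}, [start]
        while stack:
            c = stack.pop()
            for nb in adj[c]:
                if nb in containing and nb not in seen:
                    seen.add(nb); stack.append(nb)
        if seen != containing:
            return False
    return True

print(running_intersection_holds(cliques, edges))   # True for this tree
```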

Page 17: Query-Specific Learning and Inference for Probabilistic Graphical Models

17

Independencies in low-treewidth distributions

P(X) factorizes according to a junction tree (C, E):

P_{(C,E)}(X) = ∏_{C ∈ C} P(X_C) / ∏_{S ∈ E} P(X_S)

⇒ conditional independencies hold: for every separator S, with X_C and X_C̄ the variables
on the two sides of S (e.g. S = {X1, X5}: X_C = {X2, X3, X7}, X_C̄ = {X4, X6}),

I(X_C, X_C̄ | S) = 0

It works in the other way too (conditional mutual information):

KL( P || P_{(C,E)} ) ≤ Σ_{S ∈ E} I(X_C, X_C̄ | S)

Page 18: Query-Specific Learning and Inference for Probabilistic Graphical Models

18

Constraint-based structure learning

KL( P || P_{(C,E)} ) ≤ Σ_{S} I(X_C, X_C̄ | S)

Look for JTs where this holds (constraint-based structure learning):

1. Enumerate all candidate separators: S1 = X1X2, S2 = X1X3, S3 = X1X4, …, Sm = Xn-1Xn
2. For each candidate separator, partition the remaining variables into weakly dependent subsets: I(X_β, X_γ | S) < ε
3. Find a junction tree (cliques C1, …, C5) consistent with these partitions and separators

Page 19: Query-Specific Learning and Inference for Probabilistic Graphical Models

19

Mutual information complexity

I(X_β, X_-β | S) = H(X_β | S) − H(X_β | X_-β, S)

X_-β: everything except X_β;  H(· | ·): conditional entropy

I(X_β, X_-β | S) depends on all assignments to X: exp(|X|) complexity in general

Our contribution: polynomial-time upper bound
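To make the complexity point concrete, here is a hedged sketch (illustrative, not thesis code) of the direct computation of a conditional mutual information from a full joint table: the table itself has 2^|X| entries, which is exactly where the exponential cost comes from.

```python
import numpy as np

# Minimal sketch: exact I(A, B | C) from a full joint table over binary variables,
# via I(A,B|C) = H(A,C) + H(B,C) - H(C) - H(A,B,C).
def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def conditional_mi(joint, A, B, C):
    """joint: np.array of shape (2,)*n summing to 1; A, B, C: disjoint index lists."""
    n = joint.ndim
    def marginal(keep):
        axes = tuple(i for i in range(n) if i not in keep)
        return joint.sum(axis=axes)
    return (entropy(marginal(A + C).ravel()) + entropy(marginal(B + C).ravel())
            - entropy(marginal(C).ravel()) - entropy(marginal(A + B + C).ravel()))

# toy joint over 4 binary variables
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2)); joint /= joint.sum()
print(conditional_mi(joint, A=[0], B=[1], C=[2]))
```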

Page 20: Query-Specific Learning and Inference for Probabilistic Graphical Models

20

Mutual info upper bound: intuition

I(A, B | C) = ??  — hard

[Figure: large sets A and B; small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k; I(D, F | C) — easy]

Only look at small subsets D, F:
• polynomial number of small subsets
• polynomial complexity for every pair

Any conclusions about I(A, B | C)?
• In general, no
• If a good junction tree exists, yes

Page 21: Query-Specific Learning and Inference for Probabilistic Graphical Models

21

Contribution: mutual info upper bound

Theorem: Suppose an ε-JT of treewidth k for P(A ∪ B ∪ C) exists
(i.e. I(X_C, X_C̄ | S) ≤ ε for every separator S).

Let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1.

Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε).

Page 22: Query-Specific Learning and Inference for Probabilistic Graphical Models

22

Mutual info upper bound: complexity

Direct computation: complexity exp(|A ∪ B ∪ C|)

Our upper bound:
• O(|A ∪ B|^(treewidth+1)) small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1
• exp(|C| + treewidth) time for each I(D, F | C)
• |C| = treewidth for structure learning

⇒ polynomial(|A ∪ B ∪ C|) complexity
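The bound itself is just a maximization over the small subsets. A hedged sketch follows, reusing conditional_mi() from the earlier sketch; the junction-tree quality ε is assumed given, and the enumeration pattern is illustrative rather than the thesis implementation.

```python
from itertools import combinations

# Sketch of the upper bound from the theorem: enumerate only small subsets
# D ⊆ A, F ⊆ B with |D ∪ F| <= k + 1, take delta = max I(D, F | C), and return
# |A ∪ B ∪ C| * (delta + eps).  Reuses conditional_mi() from the sketch above.
def mi_upper_bound(joint, A, B, C, k, eps):
    delta = 0.0
    for d_size in range(1, k + 1):
        for D in combinations(A, d_size):
            for f_size in range(1, k + 2 - d_size):
                for F in combinations(B, f_size):
                    delta = max(delta, conditional_mi(joint, list(D), list(F), C))
    return (len(A) + len(B) + len(C)) * (delta + eps)
```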

Page 23: Query-Specific Learning and Inference for Probabilistic Graphical Models

23

Guarantees on learned model quality

Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists.
Then our algorithm will, with probability at least (1 − γ), find a JT such that

KL( P || P_JT ) ≤ |X| · k · (ε + 2δ)      [quality guarantee]

using poly(|X|) samples and poly(|X|) time (exponential only in the treewidth k).

Corollary: strongly connected junction trees are PAC-learnable.

Page 24: Query-Specific Learning and Inference for Probabilistic Graphical Models

24

Related work

Reference                    Model                           Guarantees     Time
[Bach+Jordan:2002]           tractable                       local          poly(n)
[Chow+Liu:1968]              tree                            global         O(n² log n)
[Meila+Jordan:2001]          tree mixture                    local          O(n² log n)
[Teyssier+Koller:2005]       compact                         local          poly(n)
[Singh+Moore:2005]           all                             global         exp(n)
[Karger+Srebro:2001]         tractable                       const-factor   poly(n)
[Abbeel+al:2006]             compact                         PAC            poly(n)
[Narasimhan+Bilmes:2004]     tractable                       PAC            exp(n)
our work                     tractable                       PAC            poly(n)
[Gogate+al:2010]             tractable with high treewidth   PAC            poly(n)

Page 25: Query-Specific Learning and Inference for Probabilistic Graphical Models

25

Results – typical convergence time

good results early on in practice

[Plot: test log-likelihood over training time; higher is better]

Page 26: Query-Specific Learning and Inference for Probabilistic Graphical Models

26

Results – log-likelihood

[Plot: test log-likelihood of our method vs. baselines; higher is better]

OBS: local search in limited in-degree Bayes nets
Chow-Liu: most likely JTs of treewidth 1
Karger-Srebro: constant-factor approximation JTs

Page 27: Query-Specific Learning and Inference for Probabilistic Graphical Models

27

Conclusions

• A tractable upper bound on conditional mutual information
• Graceful quality degradation and PAC learnability guarantees
• Analysis of when dynamic programming works [in the thesis]
• Dealing with an unknown mutual information threshold [in the thesis]
• Speedups preserving the guarantees; further speedups without guarantees

Page 28: Query-Specific Learning and Inference for Probabilistic Graphical Models

28

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 29: Query-Specific Learning and Inference for Probabilistic Graphical Models

29

Discriminative learning

P(Q | E) = P(Q, E) / P(E)

query goal: P(Q | E);  learning goal: model P(Q | E) directly

Useful when the evidence variables E are always the same (non-adaptive, one-shot observation):
• image pixels → scene description
• document text → topic, named entities

Better accuracy than generative models

Page 30: Query-Specific Learning and Inference for Probabilistic Graphical Models

30

Discriminative log-linear models

P(Q | E, w) = (1/Z(E, w)) exp( Σ_α w_α f_α(Q_α, E) )

f_α: feature (domain knowledge);  w_α: weight (learned from data);  Z(E, w): evidence-dependent normalization

• Don’t sum over all values of E
• Don’t model P(E)
⇒ no need for structure over E

[Figure: evidence and query variables connected by features f12, f34]
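A minimal sketch of this conditional log-linear model follows, assuming a tiny query so that Z(E, w) can be summed explicitly. The feature functions and values below are toy stand-ins, not the features used in the thesis experiments.

```python
import itertools
import math

# Minimal sketch (illustrative): P(Q | E, w) = exp(sum_a w_a * f_a(Q_a, E)) / Z(E, w).
# Note: no summation over values of E and no model of P(E).
def score(q, e, w, features):
    return sum(w_a * f_a(q, e) for w_a, f_a in zip(w, features))

def conditional_prob(q, e, w, features, n_query_vars):
    z = sum(math.exp(score(q2, e, w, features))
            for q2 in itertools.product([0, 1], repeat=n_query_vars))
    return math.exp(score(q, e, w, features)) / z

# toy: two binary query variables, evidence is a real vector
features = [lambda q, e: q[0] * e[0],          # query-evidence feature
            lambda q, e: q[1] * e[1],
            lambda q, e: float(q[0] == q[1])]  # query-query ("link") feature
w = [1.5, -0.7, 0.8]
print(conditional_prob((1, 0), e=[0.3, 2.0], w=w, features=features, n_query_vars=2))
```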

Page 31: Query-Specific Learning and Inference for Probabilistic Graphical Models

31

Model tractability still important

Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting

Tractability is determined by the structure over query

Page 32: Query-Specific Learning and Inference for Probabilistic Graphical Models

32

Simple local models: motivation

[Figure: query Q as a function of evidence E, Q = f(E); locally almost linear]

Exploiting evidence values overcomes the expressive power deficit of simple models

We will learn local tractable models

Page 33: Query-Specific Learning and Inference for Probabilistic Graphical Models

33

Context-specific independence

Observation #2: use evidence values at test time to tune the structure of the models, do not commit to a single tractable model


Page 34: Query-Specific Learning and Inference for Probabilistic Graphical Models

34

Low-dimensional dependencies in generative structure learning

Generative structure learning often relies only on low-dimensional marginals:

• Junction trees: decomposable scores
  LLH(C, E) = Σ_{separators S} H(X_S) − Σ_{cliques C} H(X_C)
• Low-dimensional independence tests: I(A, B | S)
• Small changes to structure ⇒ quick score recomputation

Discriminative structure learning: need inference in the full model
for every datapoint, even for small changes in structure

Page 35: Query-Specific Learning and Inference for Probabilistic Graphical Models

35

Leverage generative learning

Observation #3: generative structure learning algorithms have very useful properties, can we leverage them?

Page 36: Query-Specific Learning and Inference for Probabilistic Graphical Models

36

Observations so far

• The discriminative setting has extra information, including evidence values at test time
  — want to use it to learn local tractable models
• Good structure learning algorithms exist for the generative setting
  that only require low-dimensional marginals P(Qβ)

Approach:
1. Use local conditionals P(Qβ | E=E) as “fake marginals” to learn local tractable structures
2. Learn exact discriminative feature weights

Page 37: Query-Specific Learning and Inference for Probabilistic Graphical Models

37

Evidence-specific CRF overview

Approach:
1. Use local conditionals P(Qβ | E=E) as “fake marginals” to learn local tractable structures
2. Learn exact discriminative feature weights

Pipeline:
local conditional density estimators P(Qβ | E) + evidence value E=E
→ P(Qβ | E=E)
→ generative structure learning algorithm
→ tractable structure for E=E
→ (combined with feature weights w) tractable evidence-specific CRF

Page 38: Query-Specific Learning and Inference for Probabilistic Graphical Models

Evidence-specific CRF formalism

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

Observation: an identically zero feature (f_α ≡ 0) does not affect the model.

Evidence-specific structure: I_α(E, u) ∈ {0, 1}, with extra “structural” parameters u.

[Figure: (fixed dense model) × (evidence-specific tree “mask”) = (evidence-specific model);
for each evidence value E=E1, E2, E3 the mask selects different evidence-specific feature values]
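The masking idea can be sketched directly. This is illustrative code (not the thesis implementation): each feature corresponds to one query edge, and select_tree_edges is a hypothetical stand-in for the plugged-in generative structure learner that returns the evidence-specific tree mask I(E, u).

```python
import itertools
import math

# Sketch: P(Q | E, w, u) ∝ exp( sum_a w_a * f_a(Q_a, E) * I_a(E, u) )
# The dense model has one feature per query edge; the mask keeps only the edges
# of an evidence-specific tree chosen from the evidence value E and parameters u.
def ess_crf_prob(q, e, w, edge_features, select_tree_edges, u, n_query_vars):
    """w, edge_features: dicts keyed by query edge (i, j); q: tuple of 0/1 values."""
    mask = select_tree_edges(e, u)                 # dict: edge -> 0 or 1
    def masked_score(q_):
        return sum(w[edge] * f(q_, e) * mask.get(edge, 0)
                   for edge, f in edge_features.items())
    z = sum(math.exp(masked_score(q_))
            for q_ in itertools.product([0, 1], repeat=n_query_vars))
    return math.exp(masked_score(q)) / z
```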

Page 39: Query-Specific Learning and Inference for Probabilistic Graphical Models

39

Evidence-specific CRF learning

Learning is in the same order as testing

Local conditional density estimators P(Qβ | E) + evidence value E=E
→ P(Qβ | E=E)
→ generative structure learning algorithm
→ tractable structure for E=E
→ (combined with feature weights w) tractable evidence-specific CRF

Page 40: Query-Specific Learning and Inference for Probabilistic Graphical Models

40

Plug in generative structure learning

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

I(E, u) encodes the output of the chosen structure learning algorithm.

Directly generalize generative algorithms:

Generative:      P(Qi, Qj) (pairwise marginals)         + Chow-Liu algorithm = optimal tree
Discriminative:  P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
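The Chow-Liu step in this pipeline is a maximum spanning tree over pairwise mutual information scores; in the evidence-specific case the scores would be computed from the learned conditionals P̂(Qi, Qj | E=E, u). A hedged sketch with toy scores (Prim-style, illustrative only):

```python
# Sketch of the Chow-Liu step: given pairwise (conditional) mutual information
# scores, return the edges of a maximum spanning tree over the query variables.
def chow_liu_tree(n_vars, mi):
    """mi: dict mapping frozenset({i, j}) -> mutual information score."""
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:
        best = max(((i, j) for i in in_tree for j in range(n_vars) if j not in in_tree),
                   key=lambda e: mi[frozenset(e)])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# toy scores over 4 query variables
mi = {frozenset({0, 1}): 0.9, frozenset({0, 2}): 0.1, frozenset({0, 3}): 0.2,
      frozenset({1, 2}): 0.5, frozenset({1, 3}): 0.05, frozenset({2, 3}): 0.7}
print(chow_liu_tree(4, mi))   # [(0, 1), (1, 2), (2, 3)]
```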

Page 41: Query-Specific Learning and Inference for Probabilistic Graphical Models

41

Evidence-specific CRF learning: structure

1. Choose a generative structure learning algorithm A
2. Identify the low-dimensional subsets Qβ that A may need
   (Chow-Liu: all pairs (Qi, Qj))
3. Replace the original problem (E, Q) with low-dimensional pairwise problems
   (E, {Q1,Q2}), (E, {Q1,Q3}), (E, {Q3,Q4}), …
   and estimate P̂(Q1,Q2 | E, u), P̂(Q1,Q3 | E, u), P̂(Q3,Q4 | E, u), …

Page 42: Query-Specific Learning and Inference for Probabilistic Graphical Models

42

Estimating low-dimensional conditionals

Use the same features as the baseline high-treewidth model

Baseline CRF:           P(Q | E, w)  = (1/Z(E, w))   exp( Σ_α w_α f_α(Q_α, E) )

Low-dimensional model:  P(Qβ | E, u) = (1/Z_β(E, u)) exp( Σ_{α s.t. Q_α ⊆ Qβ} u_α f_α(Q_α, E) )

Scope restriction: only features whose query scope falls inside Qβ.

End result: optimal u.

Page 43: Query-Specific Learning and Inference for Probabilistic Graphical Models

43

Evidence-specific CRF learning: weights

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

• Already chose the algorithm behind I(E, u)
• Already learned the structural parameters u
• Only need to learn the feature weights w

log P(Q | E, w, u) is concave in w ⇒ unique global optimum
(the products f_α(Q_α, E) · I_α(E, u) act as “effective features”)

Page 44: Query-Specific Learning and Inference for Probabilistic Graphical Models

44

Evidence-specific CRF learning: weights

∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)} f_α(Q_α, E) ]

P(Q | E, w, u) is a tree-structured distribution:
(fixed dense model) × (evidence-specific tree “mask”) for each evidence value E = E1, E2, E3, …

⇒ exact tree-structured gradients w.r.t. w for each datapoint (Q, E),
summed into the overall (dense) gradient
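A hedged sketch of this gradient follows. For brevity the expectation is computed by brute-force enumeration over the (small) query; in practice the point of the evidence-specific tree mask is that exact tree inference replaces this enumeration. Names and features are illustrative.

```python
import itertools
import math

# Sketch: gradient of log P(Q | E, w, u) w.r.t. w for one datapoint,
#   grad_a = I_a(E, u) * ( f_a(Q, E) - E_{P(.|E,w,u)}[ f_a(Q, E) ] ).
def ess_crf_gradient(q, e, w, features, mask, n_query_vars):
    def score(q_):
        return sum(w_a * f(q_, e) * m for w_a, f, m in zip(w, features, mask))
    assignments = list(itertools.product([0, 1], repeat=n_query_vars))
    weights = [math.exp(score(q_)) for q_ in assignments]
    z = sum(weights)
    grad = []
    for f, m in zip(features, mask):
        expected = sum(wt * f(q_, e) for wt, q_ in zip(weights, assignments)) / z
        grad.append(m * (f(q, e) - expected))   # masked feature - expected masked feature
    return grad
```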

Page 45: Query-Specific Learning and Inference for Probabilistic Graphical Models

45

Results – WebKB: text + links → webpage topic

[Plots: prediction error (SVM, RMN, ESS-CRF, M3N; lower is better) and training time (RMN, ESS-CRF, M3N)]

SVM: ignore links;  RMN: standard dense CRF;  ESS-CRF: our work;  M3N: max-margin model

Page 46: Query-Specific Learning and Inference for Probabilistic Graphical Models

46

Image segmentation – accuracy: local segment features + neighbor segments → type of object

[Plot: accuracy of logistic regression, dense CRF, and ESS-CRF; higher is better]

Logistic regression: ignore links;  Dense CRF: standard dense CRF;  ESS-CRF: our work

Page 47: Query-Specific Learning and Inference for Probabilistic Graphical Models

47

Image segmentation - time

[Plots, log scale: train time and test time for logistic regression, dense CRF, and ESS-CRF; lower is better]

Logistic regression: ignore links;  Dense CRF: standard dense CRF;  ESS-CRF: our work

Page 48: Query-Specific Learning and Inference for Probabilistic Graphical Models

48

Conclusions

• Using evidence values to tune low-treewidth model structure
  – compensates for the reduced expressive power
  – order-of-magnitude speedup at test time (sometimes train time too)
• General framework for plugging in existing generative structure learners
• Straightforward relational extension [in the thesis]

Page 49: Query-Specific Learning and Inference for Probabilistic Graphical Models

49

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 50: Query-Specific Learning and Inference for Probabilistic Graphical Models

50

Why high-treewidth models?

A dense model expressing laws of nature

Protein folding

Max-margin parameters don’t work well (yet?) with evidence-specific structures

Page 51: Query-Specific Learning and Inference for Probabilistic Graphical Models

51

Query-specific inference problem

[Figure: a large model with query variables, evidence variables, and variables that are not interesting (nuisance)]

Using information about the query to speed up convergence of belief propagation for the query marginals

P(X) ∝ ∏_{(i,j) ∈ E} f_ij(X_i, X_j)

Page 52: Query-Specific Learning and Inference for Probabilistic Graphical Models

52

(loopy) Belief Propagation

Passing messages along edges (i, j) ∈ E.

Update rule:     m_{i→j}^{(t+1)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(t)}(x_i)

Variable belief: P^{(t)}(x_i) ∝ ∏_{j ∈ Γ(i)} m_{j→i}^{(t)}(x_i)

Result: all single-variable beliefs.

[Figure: message passing between nodes r, s, j, k, i, h, u]
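The update and belief equations above translate into a short sketch. This is illustrative code for a pairwise model with a plain round-robin schedule, which the residual and query-specific schedules on the following slides then refine; data structures and names are assumptions, not the thesis code.

```python
import numpy as np

# Sketch of loopy BP for a pairwise model:
#   m_{i->j}(x_j) = sum_{x_i} f_ij(x_i, x_j) * prod_{k in nbrs(i)\j} m_{k->i}(x_i)
#   P(x_i) ~ prod_{j in nbrs(i)} m_{j->i}(x_i)
def round_robin_bp(factors, n_vars, n_states, n_iters=50):
    nbrs = {i: set() for i in range(n_vars)}
    for (i, j) in factors:
        nbrs[i].add(j); nbrs[j].add(i)
    msgs = {(i, j): np.ones(n_states) / n_states
            for i in range(n_vars) for j in nbrs[i]}

    def factor(i, j):   # f_ij with rows indexed by x_i, columns by x_j
        return factors[(i, j)] if (i, j) in factors else factors[(j, i)].T

    for _ in range(n_iters):
        for (i, j) in list(msgs):
            incoming = np.ones(n_states)
            for k in nbrs[i] - {j}:
                incoming *= msgs[(k, i)]
            new = factor(i, j).T @ incoming          # sum over x_i
            msgs[(i, j)] = new / new.sum()           # keep messages normalized

    beliefs = []
    for i in range(n_vars):
        b = np.ones(n_states)
        for j in nbrs[i]:
            b *= msgs[(j, i)]
        beliefs.append(b / b.sum())
    return beliefs
```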

Page 53: Query-Specific Learning and Inference for Probabilistic Graphical Models

53

(loopy) Belief Propagation

Message dependencies are local ⇒ freedom in scheduling updates.

Round-robin schedule: fix a message order and apply updates in that order until convergence.

[Figure: local dependencies between messages around nodes r, s, j, k, i, h, u]

Page 54: Query-Specific Learning and Inference for Probabilistic Graphical Models

54

Dynamic update prioritization

A fixed update sequence is not the best option; dynamic update scheduling can speed up convergence:
• Tree-Reweighted BP [Wainwright et al., AISTATS 2003]
• Residual BP [Elidan et al., UAI 2006]

Residual BP: apply the largest change first.

[Figure: a large change propagated where it matters is an informative update;
a large change propagated into a region of small changes is wasted computation]

Page 55: Query-Specific Learning and Inference for Probabilistic Graphical Models

55

Residual BP [Elidan et al., UAI 2006]

Update rule:  m_{i→j}^{(NEW)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(OLD)}(x_i)

1. Pick the edge with the largest residual:  max_{ij} || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
2. Update: m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}

More effort on the difficult parts of the model.

But no query
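A hedged sketch of the residual BP loop follows. It assumes the nbrs / msgs / factor structures from the BP sketch above and rescans all edges for simplicity (a real implementation keeps residuals in a priority queue). Note the comment on the single change that query-specific BP makes.

```python
import numpy as np

# Sketch of residual BP: recompute candidate messages and commit the one with the
# largest residual ||m_new - m_old||.  Query-specific BP (next slides) changes one
# thing only: the priority becomes A[(i, j)] * residual, with A the edge importance.
def residual_bp(nbrs, msgs, factor, n_states, n_updates=1000, tol=1e-6):
    def candidate(i, j):
        incoming = np.ones(n_states)
        for k in nbrs[i] - {j}:
            incoming *= msgs[(k, i)]
        new = factor(i, j).T @ incoming
        return new / new.sum()

    for _ in range(n_updates):
        best_edge, best_res, best_msg = None, 0.0, None
        for (i, j) in msgs:
            new = candidate(i, j)
            res = np.abs(new - msgs[(i, j)]).max()       # residual of this edge
            if res > best_res:
                best_edge, best_res, best_msg = (i, j), res, new
        if best_res < tol:
            break
        msgs[best_edge] = best_msg                        # apply the largest change first
    return msgs
```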

Page 56: Query-Specific Learning and Inference for Probabilistic Graphical Models

56

Why edge importance weights?

[Figure: two candidate updates with residual₁ < residual₂ — which to update?]

• Residual BP updates the larger residual even when it has no influence on the query: wasted computation
• We want to update the edge with influence on the query in the future

Residual BP: maximize the immediate residual reduction.
Our work: maximize the approximate eventual effect on P(query).

Page 57: Query-Specific Learning and Inference for Probabilistic Graphical Models

57

Query-Specific BP

Update rule:  m_{i→j}^{(NEW)}(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i)\j} m_{k→i}^{(OLD)}(x_i)

1. Pick the edge maximizing the weighted residual:  max_{ij} A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
   (A_{ij} is the edge importance — the only change!)
2. Update: m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}

Rest of the talk: defining and computing the edge importance A_{ij}.

Page 58: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance base case

Goal: approximate the eventual effect of an update m^{(OLD)} → m^{(NEW)} on P(Q).

Base case: an edge (j→i) directly connected to the query. A_{ji} = ?

|| P^{(NEW)}(Q) − P^{(OLD)}(Q) ||   ≤   1 · || m_{j→i}^{(NEW)} − m_{j→i}^{(OLD)} ||
(change in query belief)                 (change in message)

The bound is tight ⇒ A_{ji} = 1.

Page 59: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance one step away

Edge (r→j) one step away from the query: A_{rj} = ?

|| ΔP(Q) ||  ≤  sup || ∂m_{j→i} / ∂m_{r→j} ||  ·  || Δm_{r→j} ||
(change in query belief)   (message importance, sup over values of all other messages)   (change in message)

The message importance can be computed in closed form, looking only at f_ji [Mooij, Kappen; 2007].

Page 60: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance: general case

Base case: A_{ji} = 1.     One step away: A_{rj} = sup || ∂m_{j→i} / ∂m_{r→j} ||.

Generalization: A_{sh} = sup || ∂P(Q) / ∂m_{s→h} || ?
Expensive to compute, and the bound may be infinite.

Instead, chain the one-step bounds along a path π from the edge to the query:

sensitivity(π) = sup || ∂m_{h→r} / ∂m_{s→h} || · sup || ∂m_{r→j} / ∂m_{h→r} || · sup || ∂m_{j→i} / ∂m_{r→j} ||

— the maximum impact along the path π.

Page 61: Query-Specific Learning and Inference for Probabilistic Graphical Models

Edge importance: general case

sensitivity(π) = maximum impact along the path π (product of the one-step sup terms)

A_{sh} = max over all paths π from (s→h) to the query of sensitivity(π)

There are a lot of paths in a graph; trying out every one is intractable.

Page 62: Query-Specific Learning and Inference for Probabilistic Graphical Models

62

Efficient edge importance computation

A = max over all paths π from the edge to the query of sensitivity(π)

There are a lot of paths in a graph; trying out every one is intractable. But:

sensitivity(h→r→j→i) = sup || ∂m_{r→j} / ∂m_{h→r} || · sup || ∂m_{j→i} / ∂m_{r→j} ||

• every factor is always ≤ 1
• so sensitivity always decreases as the path grows
• and it decomposes into individual edge contributions

⇒ Dijkstra’s (shortest paths) algorithm will efficiently find max-sensitivity paths for every edge.
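Because every per-edge factor lies in [0, 1] and the path score only shrinks, a Dijkstra-style search on the max-product semiring starting from the query recovers all importance weights. A hedged sketch (illustrative; the per-edge sensitivities sup||∂m/∂m|| are assumed precomputed, e.g. with the Mooij-Kappen closed form):

```python
import heapq

# Sketch: importance weight A_e for every message edge e, where A_e is the maximum
# over paths from e to the query of the product of per-edge sensitivities (each <= 1).
def edge_importance(query_edges, predecessors, sensitivity):
    """query_edges: edges feeding the query directly (A = 1).
    predecessors[e]: message edges that the computation of e depends on.
    sensitivity[(p, e)]: precomputed sup || dm_e / dm_p || in [0, 1]."""
    A = {e: 1.0 for e in query_edges}
    heap = [(-1.0, e) for e in query_edges]          # max-heap via negated weights
    heapq.heapify(heap)
    while heap:
        neg_w, e = heapq.heappop(heap)
        if -neg_w < A.get(e, 0.0):                   # stale heap entry, skip
            continue
        for p in predecessors.get(e, ()):
            w = (-neg_w) * sensitivity[(p, e)]       # extend the path by one edge
            if w > A.get(p, 0.0):
                A[p] = w
                heapq.heappush(heap, (-w, p))
    return A
```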

Page 63: Query-Specific Learning and Inference for Probabilistic Graphical Models

63

Query-Specific BP

A_{ji} = max over all paths π from i to the query of sensitivity(π)

1. Run Dijkstra’s algorithm starting at the query to get the edge weights A
2. Pick the edge with the largest weighted residual:  max_{ij} A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||
3. Update m_{i→j}^{(OLD)} ← m_{i→j}^{(NEW)}; repeat from step 2

More effort on the parts of the model that are both difficult and relevant to the query —
takes into account not only the graphical structure, but also the strength of dependencies.

Page 64: Query-Specific Learning and Inference for Probabilistic Graphical Models

64

Experiments – single query

[Plots: convergence on an easy model (sparse connectivity, weak interactions) and a hard model
(dense connectivity, strong interactions); standard residual BP vs. our work; faster is better]

Faster convergence, but long initialization still a problem

Page 65: Query-Specific Learning and Inference for Probabilistic Graphical Models

65

Anytime query-specific BP

Query-specific BP: run Dijkstra’s algorithm to completion, then start BP updates.
Anytime QSBP: interleave Dijkstra’s expansions with BP updates — the same BP update sequence!

[Figure: timeline of Dijkstra’s algorithm steps and BP updates around the query]

Page 66: Query-Specific Learning and Inference for Probabilistic Graphical Models

66

Experiments – anytime QSBP

[Plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity,
strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]

Much shorter initialization.

Page 67: Query-Specific Learning and Inference for Probabilistic Graphical Models

67

Experiments – multiquery

[Plots: easy model (sparse connectivity, weak interactions) and hard model (dense connectivity,
strong interactions); standard residual BP vs. our work vs. our work + anytime; faster is better]

Page 68: Query-Specific Learning and Inference for Probabilistic Graphical Models

68

Conclusions

• Weighting edges is a simple and effective way to improve prioritization
• We introduce a principled notion of edge importance based on both the structure and the parameters of the model
• Robust speedups in the query-specific setting

Don’t spend computation on nuisance variables unless needed for the query marginal

Deferring BP initialization has a large impact

Page 69: Query-Specific Learning and Inference for Probabilistic Graphical Models

69

Thesis contributions

Learn accurate and tractable models:

In the generative setting P(Q,E) [NIPS 2007]

In the discriminative setting P(Q|E) [NIPS 2010]

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

Page 70: Query-Specific Learning and Inference for Probabilistic Graphical Models

70

Future work

More practical JT learning:

SAT solvers to construct structure, pruning heuristics, …

Evidence-specific learning:
• Trade efficiency for accuracy
• Max-margin evidence-specific models

Theory on ES structures too

Inference:
• Beyond query-specific: better prioritization in general
• Beyond BP: query-specific Gibbs sampling?

Page 71: Query-Specific Learning and Inference for Probabilistic Graphical Models

71

Thesis conclusions

• Graphical models are a regularization technique for high-dimensional distributions
• Representation-based structure is well understood

Conditional independencies

Right now, structured computation is a “consequence” of representation

Major issues with tractability, approximation quality

Logical next step: structured computation as a primary basis of regularization.

This thesis: computation-centric approaches have better efficiency and do not sacrifice accuracy.

Page 72: Query-Specific Learning and Inference for Probabilistic Graphical Models

72

Thank you!

Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf

Page 73: Query-Specific Learning and Inference for Probabilistic Graphical Models

73

Mutual info upper bound: quality

Upper bound: suppose an ε-JT exists, and δ is the largest mutual information over small subsets.
Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε).

• No need to know the ε-JT, only that it exists
• No connection between C and the JT separators
• C can be of any size, no connection to the JT treewidth
• The bound is loose only when there is no hope of learning a good JT

Page 74: Query-Specific Learning and Inference for Probabilistic Graphical Models

74

Typical graphical models workflow

Learn/construct structure → learn/define parameters → inference P(Q | E=e)

Typical: a reasonable intractable structure from domain knowledge
→ approximate algorithms, no quality guarantees → approx. P(Q | E=e)

The graph is primarily a representation tool.

Page 75: Query-Specific Learning and Inference for Probabilistic Graphical Models

75

Contributions – tractable models

Learn accurate and tractable models:

In the generative setting [NIPS 2007]:
• Polynomial-time conditional mutual information upper bound
• First PAC-learning result for strongly connected junction trees
• Graceful degradation guarantees
• Speedup heuristics

In the discriminative setting [NIPS 2010]:
• General framework for learning CRF structure that depends on evidence values at test time
• Extensions to the relational setting
• Empirical: order-of-magnitude speedups with the same accuracy as high-treewidth models

Page 76: Query-Specific Learning and Inference for Probabilistic Graphical Models

76

Contributions – faster inference

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]:

• A framework of importance-weighted residual belief propagation
  – a principled measure of the eventual impact of an edge update on the query belief
  – prioritize updates by importance for the query instead of absolute magnitude
• An anytime modification to defer much of the initialization
  – initial inference results available much sooner
  – often much faster eventual convergence
  – the same fixed points as the full model

Page 77: Query-Specific Learning and Inference for Probabilistic Graphical Models

77

Future work

Two main bottlenecks:

1. Constructing JTs given mutual information values, esp. with non-uniform treewidth and dependence strength
   • Large sample: learnability guarantees for non-uniform treewidth
   • Small sample: non-uniform treewidth for regularization
   • Constraint satisfaction, SAT solvers, etc.?
   • Relax the strong connectivity requirement?

2. Evaluating mutual information: need to look at 2k+1 variables instead of k+1 — a large penalty
   • Branch on features instead of sets of variables? [Gogate+al:2010]

Speedups without guarantees: local search, greedy separator construction, …

Page 78: Query-Specific Learning and Inference for Probabilistic Graphical Models

78

Log-linear parameter learning

Conditional log-likelihood:

LLH(w | D) = Σ_{(Q, E) ∈ D} log P(Q | E, w)

Convex optimization: unique global maximum.

Gradient: features − [expected features]

∂ log P(Q | E, w) / ∂w_α = f_α(Q_α, E) − E_{P(Q | E, w)} f_α(Q_α, E)

⇒ need inference for every E given w.

Page 79: Query-Specific Learning and Inference for Probabilistic Graphical Models

79

Log-linear parameter learning

                Generative (E = ∅)              Discriminative

Tractable       closed-form                     exact gradient-based
Intractable     approximate gradient-based      approximate gradient-based
                (no guarantees)                 (no guarantees)

Generative: inference once per weights update.
Discriminative: inference for every datapoint (Q, E) once per weights update —
a “manageable” slowdown by the number of datapoints.

The complexity “phase transition” is between tractable and intractable models.

Page 80: Query-Specific Learning and Inference for Probabilistic Graphical Models

80

Plug in generative structure learning

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

I(E, u) encodes the output of the chosen structure learning algorithm:
• Chow-Liu for optimal trees
• Our thin junction tree learning from part 1
• Karger-Srebro for high-quality low-diameter junction trees
• Local search, etc.

Fix the algorithm ⇒ always get structures with the desired properties (e.g. treewidth):
replace the marginals P(Qβ) with approximate conditionals P(Qβ | E=E, u) everywhere.

Page 81: Query-Specific Learning and Inference for Probabilistic Graphical Models

81

Evidence-specific CRF learning: weights

P(Q | E, w, u) = (1/Z(E, w, u)) exp( Σ_α w_α f_α(Q_α, E) I_α(E, u) )

• Already know the algorithm behind I(E, u)
• Already learned u
• Only need to learn w

The structure induced by I(E, u) is always tractable
⇒ can find the evidence-specific structure I(E=E, u) for every training datapoint (Q, E)
⇒ learn the optimal w exactly:

∂ log P(Q | E, w, u) / ∂w_α = I_α(E, u) [ f_α(Q_α, E) − E_{P(Q | E, w, u)} f_α(Q_α, E) ]
(a tree-structured distribution, so the expectation is exact)

Page 82: Query-Specific Learning and Inference for Probabilistic Graphical Models

82

Relational evidence-specific CRF

Relational models: templated features + shared weights.

Relation: LinksTo(webpage, webpage).
Groundings: every linked pair of webpages.
Learn a single weight wLINK and copy it for every grounding.

Page 83: Query-Specific Learning and Inference for Probabilistic Graphical Models

83

Relational evidence-specific CRF

Relational models: templated features + shared weights.

Every grounding is a separate datapoint for structure training
⇒ use the propositional approach + shared weights.

[Figure: grounded model over x1, …, x5; the training datasets for the “structural” parameters u
are all grounded pairs x1x2, x1x3, x1x4, x1x5, x2x3, x2x4, x2x5, x3x4, x3x5, x4x5]

Page 84: Query-Specific Learning and Inference for Probabilistic Graphical Models

84

Future work

• Faster learning: pseudolikelihood is really fast, need to compete
• Larger treewidth: trade time for accuracy
• Theory on learning the “structural parameters” u
• Max-margin learning
  – inference is a basic step in max-margin learning too ⇒ tractable models are useful beyond log-likelihood
  – optimizing feature weights w given local trees is straightforward
  – optimizing “structural parameters” u for max-margin is hard; what is the right objective?
• Almost-tractable structures, other tractable models
  – make sure loops don’t hurt too much

Page 85: Query-Specific Learning and Inference for Probabilistic Graphical Models

85

Query versus nuisance variables

We may actually care about only a few variables:
• What are the topics of the webpages on the first page of Google search results for my query?
• Smart heating control: is anybody going to be at home for the next hour?
• Does the patient need immediate doctor attention?

But the model may need a lot of other variables to be accurate enough —
we don’t care about them per se, but we must look at them to get the query right.

Both query and nuisance variables are unknown; inference algorithms don’t see a difference.
Speed up inference by focusing on the query:
only look at nuisance variables to the extent needed to answer the query.

Page 86: Query-Specific Learning and Inference for Probabilistic Graphical Models

86

Our contributions

Using weighted residuals to prioritize updates

Define message weights reflecting the importance of the message to the query

Computing importance weights efficiently

Experiments: faster convergence on large relational models

Page 87: Query-Specific Learning and Inference for Probabilistic Graphical Models

87

Interleaving

Dijkstra’s expands the highest-weight edges first.

[Figure: query; edges expanded on previous iterations, just expanded, and not yet expanded]

Let M ≥ max over ALL edges of || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||.
Every not-yet-expanded edge has A ≤ (min A over expanded edges),
so its actual priority is at most M · (min expanded A) — an upper bound on its priority.

Suppose   max over EXPANDED edges of  A_{ij} · || m_{i→j}^{(NEW)} − m_{i→j}^{(OLD)} ||   ≥   M · (min expanded A):
then there is no need to expand further at this point.

Page 88: Query-Specific Learning and Inference for Probabilistic Graphical Models

88

Deferring BP initialization

Observation: Dijkstra’s alg. expands the most important edges first

Do we really need to look at every low-importance edge before applying BP updates?

No! Can use upper bounds on priority instead.

Page 89: Query-Specific Learning and Inference for Probabilistic Graphical Models

89

Upper bounds in priority queue

Observation: for edges low in the priority queue, an upper bound on the priority is enough

[Figure: the updates priority queue]

Exact priority is needed only for the top element; a priority upper bound is enough lower in the queue.

Page 90: Query-Specific Learning and Inference for Probabilistic Graphical Models

90

Priority upper bound for not-yet-seen edges

priority(edge) = residual(edge) · importance weight(edge)

Expand several edges with Dijkstra’s: for those, (residual) · (weight) = exact priority.

For all the other edges:
• residual(edge) ≤ || factor(edge) ||   (component-wise bound)
• importance weight(edge) ≤ importance weight of any edge that is already expanded

⇒ an upper bound on the priority without looking at the edge!
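The resulting interleaving rule is small enough to sketch. This is illustrative pseudocode-made-runnable, with hypothetical helper names (expand_next, apply_update): BP updates proceed on already-expanded edges as long as their best exact priority beats the upper bound available for the unseen ones.

```python
# Sketch: one step of anytime QSBP.  Unseen edges are bounded by
# (residual bound from the factors) * (min importance weight expanded so far).
def anytime_qsbp_step(expanded, residual, A, residual_bound, expand_next, apply_update):
    """expanded: set of edges with exact weights A[e]; residual[e]: exact residual;
    residual_bound: bound for not-yet-seen edges; expand_next(): one Dijkstra expansion;
    apply_update(e): one BP update on edge e."""
    best_edge = max(expanded, key=lambda e: residual[e] * A[e], default=None)
    best_priority = residual[best_edge] * A[best_edge] if best_edge else 0.0
    unseen_bound = residual_bound * min((A[e] for e in expanded), default=float("inf"))
    if best_edge is None or unseen_bound > best_priority:
        expand_next()            # an unexpanded edge might still have higher priority
    else:
        apply_update(best_edge)  # safe: no unseen edge can beat this priority
```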

Page 91: Query-Specific Learning and Inference for Probabilistic Graphical Models

91

Interleaving BP and Dijkstra’s

Alternate between Dijkstra’s expansions and BP updates: Dijkstra → BP → Dijkstra → Dijkstra → BP → BP → …

• If the best exact priority ≥ the upper bound for unseen edges: do a BP update.
• If the best exact priority < the upper bound: let Dijkstra’s expand an edge.

[Figure: the query within the full model]