ECE 6504: Advanced Topics in Machine Learning
Probabilistic Graphical Models and Large-Scale Learning
Dhruv Batra, Virginia Tech
Topics: Markov Random Fields: Inference; Approximate: Variational Inference
Readings: KF 11.1, 11.2, 11.5; Barber 28.1, 28.3, 28.4


Page 1: ECE 6504: Advanced Topics in Machine Learning (s14ece6504/slides/L15_mrf...)

ECE 6504: Advanced Topics in Machine Learning

Probabilistic Graphical Models and Large-Scale Learning

Dhruv Batra Virginia Tech

Topics –  Markov Random Fields: Inference

–  Approximate: Variational Inference

Readings: KF 11.1,11.2,11.5, Barber 28.1,28.3,28.4

Page 2:

Administrivia

•  HW3
  –  Out 2 days ago
  –  Due: Apr 4, 11:55pm
  –  Implementation: Loopy Belief Propagation in MRFs

•  Project Presentations
  –  When: April 22, 24
  –  Where: in class
  –  5 min talk
    •  Main results
    •  The semester ends two weeks after that point, so nearly finished results are expected
  –  Slides due: April 21, 11:55pm

(C) Dhruv Batra 2

Page 3:

Recap of Last Time

(C) Dhruv Batra 3

Page 4:

Message Passing •  Variables/Factors “talk” to each other via messages:

(C) Dhruv Batra 4 Image Credit: James Coughlan

“I (variable X3) think that you (variable X2):

belong to state 1 with confidence 0.4 belong to state 2 with confidence 10

belong to state 3 with confidence 1.5”
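In code, one such message is just a sum over the sender's states. A minimal sketch with hypothetical 3-state potentials psi_3 and psi_23 (the names and numbers are made up for illustration, not taken from the lecture):

```python
import numpy as np

# Hypothetical unary potential on X3 and pairwise potential on (X2, X3)
psi_3 = np.array([0.7, 2.0, 0.3])            # psi_3[x3]
psi_23 = np.array([[1.0, 0.5, 0.1],          # psi_23[x2, x3]
                   [0.5, 1.0, 0.5],
                   [0.1, 0.5, 1.0]])

# m_{3->2}(x2) = sum_{x3} psi_3(x3) * psi_23(x2, x3):
# what X3 "thinks" about each state of X2
m_3_to_2 = psi_23 @ psi_3

print(m_3_to_2)  # → [1.73 2.5  1.37], unnormalized "confidences" per state of X2
```

As on the slide, the entries need not be normalized; only their relative sizes matter.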

Page 5:

Generalized BP

•  Initialization:
  –  Assign each factor φ to a cluster α(φ), with Scope[φ] ⊆ Cα(φ)
  –  Initialize clusters (on board)
  –  Initialize messages (on board)

•  While not converged, send messages (on board)

•  Beliefs: on board

(C) Dhruv Batra 5 Slide Credit: Carlos Guestrin

Page 6:

Example

•  Chain MRF: X1 - X2 - X3 - X4 - X5
•  Compute: P(X1 | X5 = x5)
•  VE steps on board

(C) Dhruv Batra 6
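The VE steps for this chain can be sketched in a few lines. This is an illustrative toy, assuming random positive pairwise potentials psi[i] between consecutive variables (no unary terms) and a hypothetical observed value x5 = 0; it checks the answer against brute-force enumeration:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                               # states per variable
psi = [rng.random((K, K)) + 0.1 for _ in range(4)]  # psi[i] couples X_{i+1}, X_{i+2}

x5 = 0                                   # observed value of X5 (hypothetical)

# Eliminate X4, X3, X2 in turn: each step sums a factor against the
# incoming message -- exactly the VE steps done on the board.
msg = psi[3][:, x5]                      # clamp X5: a factor over X4
msg = psi[2] @ msg                       # sum out X4 -> factor over X3
msg = psi[1] @ msg                       # sum out X3 -> factor over X2
msg = psi[0] @ msg                       # sum out X2 -> factor over X1

p_x1_given_x5 = msg / msg.sum()          # renormalize to get P(X1 | X5 = x5)

# Sanity check against brute-force enumeration of the joint
joint = np.einsum('ab,bc,cd,de->abcde', *psi)
brute = joint[:, :, :, :, x5].sum(axis=(1, 2, 3))
brute /= brute.sum()
assert np.allclose(p_x1_given_x5, brute)
```

Each elimination is a single matrix-vector product, which is why VE on a chain is linear in the chain length rather than exponential.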

Page 7:

Elimination order: O = {C,D,I,H,G,S,L,J}

[Figure: student network with variables Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Factors Generated

(C) Dhruv Batra 7 Slide Credit: Carlos Guestrin

Clusters: CD, DIG, GSI, GJSL, HGJ, JSL
Sepsets: D, GI, GS, GJ, JSL

Page 8:

Cluster graph for VE

•  VE generates a cluster tree! (Also called a Clique Tree or Junction Tree)
  –  One cluster for each factor used/generated
  –  Edge i - j if fi is used to generate fj
  –  "Message" from i to j generated when marginalizing a variable from fi
  –  Tree, because each factor is only used once

•  Proposition:
  –  The "message" δij from i to j satisfies Scope[δij] ⊆ Sij

Clusters: CD, DIG, GSI, GJSL, HGJ, JSL

(C) Dhruv Batra 8 Slide Credit: Carlos Guestrin

Sepsets: D, GI, GS, GJ, JSL

Page 9:

Approximate Inference

•  So far: Exact Inference
  –  VE & Junction Trees
  –  Exponential in tree-width

•  There are many approximate inference algorithms for PGMs
  –  You have already seen BP

•  Next
  –  Variational Inference
  –  Connections to BP / Message-Passing

(C) Dhruv Batra 9

Page 10:

What is Variational Inference?

•  A class of methods for approximate inference
  –  And parameter learning
  –  And approximating integrals, basically

•  Key idea
  –  Reality is complex
  –  Instead of performing approximate computation in something complex, can we perform exact computation in something "simple"?
  –  We just need to make sure the simple thing is "close" to the complex thing

•  Key Problems
  –  What does "close" mean?
  –  How do we measure closeness when we can't perform operations on the complex thing?

(C) Dhruv Batra 10

Page 11:

KL divergence: "Distance" between distributions

•  Given two distributions p and q, the KL divergence is:
  D(p||q) = Σx p(x) log [ p(x) / q(x) ]
•  D(p||q) ≥ 0, with equality iff p = q
•  Not symmetric: p determines where the difference is important

(C) Dhruv Batra 11 Slide Credit: Carlos Guestrin
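A quick numeric illustration of these properties, using made-up three-state distributions:

```python
import numpy as np

def kl(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                         # 0 * log(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.8, 0.15, 0.05])          # a peaked distribution
q = np.array([1/3, 1/3, 1/3])            # uniform

print(kl(p, q), kl(q, p))   # two different values: KL is not symmetric
print(kl(p, p))             # 0.0: D(p||q) = 0 iff p = q
```

Note that kl(p, q) weights each state's log-ratio by p(x), which is exactly the "p determines where the difference is important" point on the slide.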

Page 12:

Find simple approximate distribution

•  Suppose p is intractable posterior •  Want to find simple q that approximates p •  KL divergence not symmetric

•  D(p||q) –  true distribution p defines support of diff. –  the “correct” direction –  will be intractable to compute

•  D(q||p) –  approximate distribution defines support –  tends to give overconfident results –  will be tractable

(C) Dhruv Batra 12 Slide Credit: Carlos Guestrin

Page 13:

Example 1

•  p = 2D Gaussian with arbitrary covariance
•  q = 2D Gaussian with diagonal covariance

(C) Dhruv Batra 13

[Figure: contours of p (green) and q (red) over (z1, z2); panels show argmin_q KL(p || q) and argmin_q KL(q || p)]

Page 14:

Example 2

•  p = Mixture of Two Gaussians
•  q = Single Gaussian

(C) Dhruv Batra 14

[Figure: p (blue) vs q (red); panels show argmin_q KL(p || q) and argmin_q KL(q || p)]

Page 15:

Back to graphical models

•  Inference in a graphical model:
  –  p(x) = (1/Z) ∏_{(s,t)∈E} ψst(xs, xt) ∏_{s∈V} ψs(xs)
  –  we want to compute P(Xi)
  –  this is our p

•  What is the simplest q?
  –  every variable is independent: q(x) = ∏_{s∈V} qs(xs)
  –  the mean field approximation
  –  can compute any probability very efficiently

Page 16:

Variational Approximate Inference

•  Choose a family of approximating distributions which is tractable. The simplest [Mean Field] approximation:
  q(x) = ∏_{s∈V} qs(xs)

•  Measure the quality of the approximation. Two possibilities:
  D(p || q) = Σx p(x) log [ p(x) / q(x) ]
  D(q || p) = Σx q(x) log [ q(x) / p(x) ]

•  Find the approximation minimizing this distance

Here the model is p(x) = (1/Z) ∏_{(s,t)∈E} ψst(xs, xt) ∏_{s∈V} ψs(xs)

(C) Dhruv Batra 16 Slide Credit: Erik Sudderth

Page 17:

D(p||q) for mean field – KL the right way

•  D(p||q) = Σx p(x) log [ p(x) / q(x) ]
•  For a fully factorized q, this is trivially minimized by setting qi(xi) = pi(xi), the true marginals
•  But this doesn't provide a computational method: the true marginals pi(xi) are exactly what we set out to compute

Page 18:

Plan for today

•  MRF Inference
  –  Message-Passing as Variational Inference
    •  Mean Field
    •  Structured Mean Field
  –  (Specialized) MAP Inference
    •  Integer Programming Formulation
    •  Linear Programming Relaxation
    •  Dual Decomposition

(C) Dhruv Batra 18

Page 19:

D(q||p) for mean field – KL the reverse direction

•  D(q||p) = Σx q(x) log [ q(x) / p(x) ]
•  Evaluating this only requires p up to its normalizer, which is what makes this direction tractable to optimize

Page 20:

Reverse KL & The Partition Function

•  D(q||p): p is a Markov net, with partition function Z

•  Theorem:
  log Z = F[p, q] + D(q||p)

•  Where the "Gibbs Free Energy" is:
  F[p, q] = Hq(X) + Eq[ Σc log ψc(Xc) ]
          = Hq(X) + Eq[ Score(X) ]
          = Hq(X) + Σc Σ_{xc} q(xc) θ(xc)

[Figure: student network with variables Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]
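The theorem can be verified numerically on a tiny two-variable model. Everything here (the potentials, the choice of q) is hypothetical, chosen only to check the identity log Z = F[p, q] + D(q||p):

```python
import numpy as np

K = 2
theta_i = np.log(np.array([[1.0, 2.0], [0.5, 1.5]]))   # unary log-potentials for X1, X2
theta_ij = np.log(np.array([[2.0, 0.5], [0.5, 2.0]]))  # pairwise log-potential

# Unnormalized log-score, partition function, and joint on a 2-node model
score = theta_i[0][:, None] + theta_i[1][None, :] + theta_ij
Z = np.exp(score).sum()
p = np.exp(score) / Z

# The identity holds for any q; take a factorized (mean field style) one
q1 = np.array([0.3, 0.7]); q2 = np.array([0.6, 0.4])
q = q1[:, None] * q2[None, :]

H_q = -np.sum(q * np.log(q))                 # entropy Hq(X)
E_score = np.sum(q * score)                  # Eq[Score(X)]
F = H_q + E_score                            # Gibbs free energy F[p, q]
D_qp = np.sum(q * np.log(q / p))             # reverse KL D(q||p)

assert np.isclose(np.log(Z), F + D_qp)       # log Z = F[p, q] + D(q||p)
```

Since D(q||p) ≥ 0, the same computation shows F[p, q] ≤ log Z, the lower bound used on the next slide.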

Page 21:

Understanding Reverse KL, Free Energy & The Partition Function

•  Maximizing the Energy Functional ⇔ Minimizing Reverse KL

•  Theorem: the energy functional is a lower bound on the log partition function
  –  log Z = F[p, q] + D(q||p) and D(q||p) ≥ 0, so log Z ≥ F[p, q]
  –  where F[p, q] = Hq(X) + Eq[ Σc log ψc(Xc) ]
  –  Maximizing the energy functional corresponds to searching for a tight lower bound on the partition function

(C) Dhruv Batra 21 Slide Credit: Carlos Guestrin

Page 22:

Mean Field Equations

•  Energy functional:
  F[p, q] = Hq(X) + Eq[ Σc log ψc(Xc) ]

•  For fully factorized q, the entropy decomposes over variables:
  H(q) = Σ_{s∈V} Hs(qs) = − Σ_{s∈V} Σ_{xs} qs(xs) log qs(xs)

•  and the expected score decomposes over nodes and edges:
  Σc Σ_{xc} qc(xc) θ(xc) = Σi Σ_{xi} qi(xi) θi(xi) + Σ_{(i,j)∈E} Σ_{xi} Σ_{xj} qi(xi) qj(xj) θij(xi, xj)

•  Add Lagrange multipliers to enforce Σ_{xs} qs(xs) = 1

•  Taking derivatives and simplifying, we find a set of fixed point equations:
  qi(xi) ∝ ψi(xi) ∏_{j∈N(i)} exp( Σ_{xj} θij(xi, xj) qj(xj) )

•  Updating one marginal at a time gives convergent coordinate descent
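These fixed point updates can be sketched as coordinate ascent on a small toy MRF. All names and numbers below are hypothetical (a random 4-cycle with 3 states per variable), not course code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 4, 3
theta_i = rng.normal(size=(n, K))                 # unary log-potentials theta_i(x_i)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle (so the graph has a loop)
theta_ij = {e: rng.normal(size=(K, K)) for e in edges}

def neighbors(i):
    """Yield (j, table) with table[x_i, x_j] = theta_ij(x_i, x_j)."""
    out = []
    for (a, b), th in theta_ij.items():
        if a == i: out.append((b, th))
        if b == i: out.append((a, th.T))
    return out

# Fixed point: q_i(x_i) ∝ exp( theta_i(x_i) + sum_{j in N(i)} sum_{x_j} theta_ij(x_i, x_j) q_j(x_j) )
q = np.full((n, K), 1.0 / K)                      # start from uniform marginals
for _ in range(100):                              # coordinate ascent sweeps
    for i in range(n):                            # update one marginal at a time
        log_qi = theta_i[i].copy()
        for j, th in neighbors(i):
            log_qi += th @ q[j]
        log_qi -= log_qi.max()                    # subtract max for numerical stability
        q[i] = np.exp(log_qi) / np.exp(log_qi).sum()

print(q)                                          # approximate marginals, one row per variable
```

Each inner update is exactly the slide's equation with ψi(xi) = exp(θi(xi)); sweeping variables one at a time is the convergent coordinate descent mentioned above.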

Page 23:

Mean Field versus Belief Propagation

For p(x) = (1/Z) ∏_{(s,t)∈E} ψst(xs, xt) ∏_{s∈V} ψs(xs):

BP:  qt(xt) ∝ ψt(xt) ∏_{u∈Γ(t)} mut(xt)

MF:  qt(xt) ∝ ψt(xt) ∏_{u∈Γ(t)} exp( Σ_{xu} θtu(xt, xu) qu(xu) )

(C) Dhruv Batra 23 Slide Credit: Erik Sudderth

Big implications from small changes:

•  Belief Propagation: produces exact marginals on any tree, but on general graphs has no guarantees of convergence or accuracy

•  Mean Field: guaranteed to converge on general graphs and always lower-bounds the log partition function, but is approximate even on trees
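For contrast with the mean field update, here is a sketch of loopy BP (synchronous sum-product) on the same kind of toy model: a hypothetical random 4-cycle with 3 states per variable. Message normalization is only for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 4, 3
psi_i = rng.random((n, K)) + 0.1                  # positive unary potentials
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle, so BP is "loopy"
psi_ij = {e: rng.random((K, K)) + 0.1 for e in edges}

def pot(i, j):
    """Pairwise potential as a [x_i, x_j] table, whichever way the edge is stored."""
    return psi_ij[(i, j)] if (i, j) in psi_ij else psi_ij[(j, i)].T

nbrs = {i: [b if a == i else a for (a, b) in edges if i in (a, b)] for i in range(n)}

# Directed messages m_{i->j}(x_j), initialized uniform
msgs = {(i, j): np.full(K, 1.0 / K) for i in range(n) for j in nbrs[i]}

for _ in range(50):                               # synchronous update rounds
    new = {}
    for i in range(n):
        for j in nbrs[i]:
            # m_{i->j}(x_j) = sum_{x_i} psi_i(x_i) psi_ij(x_i,x_j) prod_{k in N(i)\{j}} m_{k->i}(x_i)
            incoming = psi_i[i].copy()
            for k in nbrs[i]:
                if k != j:
                    incoming *= msgs[(k, i)]
            m = pot(i, j).T @ incoming
            new[(i, j)] = m / m.sum()             # normalize to avoid under/overflow
    msgs = new

# Beliefs: b_i(x_i) ∝ psi_i(x_i) prod_{k in N(i)} m_{k->i}(x_i)
beliefs = np.array([psi_i[i] * np.prod([msgs[(k, i)] for k in nbrs[i]], axis=0)
                    for i in range(n)])
beliefs /= beliefs.sum(axis=1, keepdims=True)
print(beliefs)                                    # approximate marginals
```

Note the structural difference from mean field: BP's message excludes the recipient's influence (the product over N(i)\{j}), while the mean field update multiplies in every neighbor's current marginal.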

Page 24:

There are many stationary points!

Pages 25–36: figure-only slides; content not captured in this extraction.

Page 37:

What you need to know about variational methods

•  Structured Variational method:
  –  select a form for the approximate distribution
  –  minimize reverse KL

•  Equivalent to maximizing the energy functional
  –  searching for a tight lower bound on the log partition function

•  Many possible models for Q:
  –  independent (mean field)
  –  structured as a Markov net
  –  cluster variational

•  Several subtleties outlined in the book