
Page 1:

Probabilistic Graphical Models Srihari

1

Inference as Optimization

Sargur Srihari [email protected]

Page 2:

Probabilistic Graphical Models Srihari

2

Topics in Inference as Optimization

•  Overview
•  Exact Inference revisited
•  The Energy Functional
•  Optimizing the Energy Functional

Page 3:

Probabilistic Graphical Models Srihari

3

Exact and Approximate Inference

•  PGMs represent probability distributions PΦ(χ)
  –  where χ is a set of variables and Φ is a set of factors
•  Inference is the task of answering queries
  –  e.g., compute the conditional probability PΦ(Y|E=e), where Y, E ⊆ χ
•  The problem of inference in PGMs is NP-hard
  –  The worst case is exponential
•  Exact inference is often efficient using
  –  Variable Elimination or Clique Tree algorithms
  –  But their complexity is exponential in the tree width of the network
  –  In such cases exact algorithms become infeasible
•  This motivates approximate inference

Page 4:

Probabilistic Graphical Models Srihari

4

Approximate Target Distribution

•  We consider approximate inference methods where the approximation arises from constructing an approximation to the target distribution PΦ
•  This approximation takes a simpler form that allows inference
•  The simpler approximating form exploits the factorization structure of the PGM

Page 5:

Probabilistic Graphical Models Srihari

Principles of Approximate Algorithms

•  Approximate inference methods share common conceptual principles:
  1.  Define a target class Q of “easy” distributions Q
  2.  Search for an instance within that class that best approximates PΦ
  3.  Answer queries using inference on Q instead of PΦ
  4.  All methods optimize the same target function for measuring similarity between Q and PΦ
•  This reformulates the inference problem as:
  –  Optimizing an objective function over the class Q

Page 6:

Probabilistic Graphical Models Srihari

Reformulated Inference Problem

•  This problem is one of constrained optimization
  –  i.e., find the distribution Q that minimizes D(Q || PΦ)
  –  Such problems can be solved by a variety of different optimization techniques
•  The technique most often used for PGMs is based on Lagrange multipliers
•  Constrained optimization and the Lagrange solution are discussed next

6

Page 7:

Probabilistic Graphical Models Srihari

What is constrained optimization?

•  Ex: find the maximum entropy distribution over X with Val(X) = {x1,...,xK}, where the entropy is

  H(X) = −Σ_{k=1}^{K} P(x_k) log P(x_k)

–  Unconstrained Optimization
  •  Use a gradient method treating each P(x_k) as a parameter θ_k
  •  Compute the gradient of H_P(X) wrt the parameters:

    ∂/∂θ_k H(X) = −log(θ_k) − 1

  •  Setting the partial derivative to 0 we get log(θ_k) = −1, i.e., θ_k = 1/e
  •  But these numbers do not add up to 1, and hence do not form a distribution
  •  Flaw in the analysis: we need the constraints Σ_k θ_k = 1 and θ_k ≥ 0

–  Constrained Optimization
  •  Maximizing a function f under equality constraints:
    Find θ maximizing f(θ) subject to c_1(θ) = 0, ..., c_m(θ) = 0

The method of Lagrange multipliers allows us to solve constrained optimization problems using tools for unconstrained optimization. The Lagrangian is

  J(θ, λ) = f(θ) − Σ_{j=1}^{m} λ_j c_j(θ)
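
Completing the entropy example with this Lagrangian (a worked sketch, not taken from the slides): with f(θ) = H(X) and the single equality constraint c(θ) = Σ_k θ_k − 1,

```latex
\begin{aligned}
J(\theta,\lambda) &= -\sum_{k=1}^{K}\theta_k\log\theta_k
  \;-\; \lambda\Big(\sum_{k=1}^{K}\theta_k-1\Big)\\
\frac{\partial J}{\partial\theta_k} &= -\log\theta_k - 1 - \lambda = 0
  \;\Rightarrow\; \theta_k = e^{-1-\lambda}
  \quad\text{(the same value for every } k\text{)}\\
\frac{\partial J}{\partial\lambda} &= -\Big(\sum_{k}\theta_k-1\Big) = 0
  \;\Rightarrow\; K\,e^{-1-\lambda}=1
  \;\Rightarrow\; \theta_k = \tfrac{1}{K}
\end{aligned}
```

So the maximum entropy distribution is uniform. The inequality constraints θ_k ≥ 0 hold automatically at this solution and need no multipliers.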

Page 8:

Probabilistic Graphical Models Srihari

Lagrange leads to Message Passing

•  The method of Lagrange multipliers produces a set of equations that characterize the optima of the objective
•  It produces a set of fixed-point equations that define each variable in terms of the others
•  The fixed-point equations derived from constrained energy optimization can be viewed as passing messages over a graph object

8

Page 9:

Probabilistic Graphical Models Srihari

9

Categories of methods in this class

1.  Message passing on Cluster Graphs
  –  Loopy belief propagation
  •  Optimizes approximate versions of the energy functional
2.  Message passing on Clique Trees with approximate messages
  –  Called expectation propagation
  •  Maximizes the exact energy functional, but with relaxed constraints on Q
3.  Mean-field method
  –  Originates in statistical physics
  •  Focuses on Q that has a simple factorization

Page 10:

Probabilistic Graphical Models Srihari

Examples of Clique Trees

(Figures: Bayesian Network 1 with its moralized graph, clique tree, and cluster graph; Bayesian Network 2 with its moralized graph, triangulation, and clique tree.)

Page 11:

Probabilistic Graphical Models Srihari

Calibrated Clique Tree

(Figure, panels (a)–(c): the Markov network over A, B, C, D and the corresponding clique tree with sepset {B, D}.)

1. Gibbs Distribution:

  P(A,B,C,D) = (1/Z) φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

  where Z = Σ_{A,B,C,D} φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A); here Z = 7,201,840

2. Clique Tree (triangulated): C1 = {A,B,D}, C2 = {B,C,D}, sepset S1,2 = {B,D}

Initial Potentials (each factor is assigned to a clique containing its scope):

  ψ1(A,B,D) = φ1(A,B)·φ4(D,A)    ψ2(B,C,D) = φ2(B,C)·φ3(C,D)

  so that ψ1·ψ2 = P̃Φ(A,B,C,D) = φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

Clique Beliefs (marginals of the unnormalized measure):

  β1(A,B,D) = P̃Φ(A,B,D) = Σ_C φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
    e.g., β1(a1,b0,d0) = 100 + 100 = 200

  β2(B,C,D) = P̃Φ(B,C,D) = Σ_A φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
    e.g., β2(b0,c1,d0) = 300,000 + 100 = 300,100

Sepset Beliefs:

  μ1,2(B,D) = Σ_{C1−S1,2} β1(C1) = Σ_A β1(A,B,D)
    e.g., μ1,2(b0,d0) = 600,000 + 200 = 600,200

(Figure 10.6: the tables of clique and sepset beliefs β1(A,B,D), β2(B,C,D), μ1,2(B,D) for the Misconception example.)

Measure induced by the calibrated tree T (unnormalized):

  Q_T = Π_i β_i(C_i) / Π_{(i−j)} μ_{i,j}(S_{i,j}),  where μ_{i,j} = Σ_{C_i−S_{i,j}} β_i(C_i) = Σ_{C_j−S_{i,j}} β_j(C_j)

  E.g., P̃Φ(a1,b0,c1,d0) = 100, and the measure induced is

    β1(a1,b0,d0)·β2(b0,c1,d0) / μ1,2(b0,d0) = (200 · 300,100) / 600,200 = 100

This illustrates reparameterization and the clique tree invariant. Using equation (10.9), the denominator can be rewritten as Π_{(i−j)∈E_T} δ_{i→j}·δ_{j→i}. Each message δ_{i→j} appears exactly once in the numerator and exactly once in the denominator, so all messages cancel; the remaining expression is simply Π_{i∈V_T} ψ_i(C_i) = P̃Φ. Thus, via equation (10.10), the clique and sepset beliefs provide a reparameterization of the unnormalized measure. This property is called the clique tree invariant, for reasons that become clear later in that chapter.

Another intuition for this result can be obtained from Example 10.5: consider a clique tree obtained from the Markov network A—B—C—D with factors Φ. The clique tree in this case has three cliques, C1 = {A,B}, C2 = {B,C}, and C3 = {C,D}. When the clique tree is calibrated, we have β1(A,B) = PΦ(A,B) and β2(B,C) = PΦ(B,C). From the conditional independence properties of this network,

  PΦ(A,B,C) = PΦ(A,B)·PΦ(C | B), and PΦ(C | B) = β2(B,C) / PΦ(B)

As β2(B,C) = PΦ(B,C), we can obtain PΦ(B) by marginalizing β2(B,C). Thus,

  β1(A,B)·β2(B,C) / Σ_C β2(B,C) = PΦ(A,B,C)
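
The numbers above can be checked mechanically. Below is a short Python sketch (not part of the slides); the factor tables are the standard Misconception-example values from Koller & Friedman, an assumption that is consistent with every quantity quoted above.

```python
import itertools

# Assumed factor tables (Misconception example); consistent with the numbers above.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi4(D,A)

def p_tilde(a, b, c, d):
    """Unnormalized Gibbs measure P~_Phi(A,B,C,D)."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

V = (0, 1)
Z = sum(p_tilde(*x) for x in itertools.product(V, repeat=4))
print(Z)  # 7201840

def beta1(a, b, d):   # clique belief over C1 = {A,B,D}
    return sum(p_tilde(a, b, c, d) for c in V)

def beta2(b, c, d):   # clique belief over C2 = {B,C,D}
    return sum(p_tilde(a, b, c, d) for a in V)

def mu12(b, d):       # sepset belief over S_{1,2} = {B,D}
    return sum(beta1(a, b, d) for a in V)

print(beta1(1, 0, 0), beta2(0, 1, 0), mu12(0, 0))  # 200 300100 600200

# Clique tree invariant: beta1 * beta2 / mu12 == P~_Phi for every assignment
for a, b, c, d in itertools.product(V, repeat=4):
    assert beta1(a, b, d) * beta2(b, c, d) == p_tilde(a, b, c, d) * mu12(b, d)
```

Running it reproduces Z = 7,201,840, the example entries β1(a1,b0,d0) = 200, β2(b0,c1,d0) = 300,100, μ1,2(b0,d0) = 600,200, and confirms the invariant exactly.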

Page 12:

Probabilistic Graphical Models Srihari

Belief Propagation

12

(Figures: a simple network, a clique tree for it, and a cluster graph for it.)

•  The clique tree and the cluster graph are alternative structures for doing inference
•  A cluster graph may contain loops
•  Inference on a cluster graph is called Loopy Belief Propagation (see the sketch below)
•  Its clusters are smaller than those in a clique tree
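
To make the loopy case concrete, here is a minimal sketch (not from the slides) of synchronous loopy belief propagation on the cluster graph of a four-factor loop, one cluster per pairwise factor. It reuses the assumed Misconception factor tables from above and compares a cluster belief with the exact marginal; on this loopy graph the beliefs are only approximate.

```python
import numpy as np

# Pairwise MRF on the loop A - B - C - D - A (assumed factor values as above).
# Each factor is one cluster; sepsets are single shared variables.
factors = {
    ("A", "B"): np.array([[30.0, 5.0], [1.0, 10.0]]),
    ("B", "C"): np.array([[100.0, 1.0], [1.0, 100.0]]),
    ("C", "D"): np.array([[1.0, 100.0], [100.0, 1.0]]),
    ("D", "A"): np.array([[100.0, 1.0], [1.0, 100.0]]),
}
clusters = list(factors)

def sepset(ci, cj):
    """The single shared variable, if the two clusters are adjacent."""
    s = set(ci) & set(cj)
    return next(iter(s)) if len(s) == 1 else None

edges = [(ci, cj) for ci in clusters for cj in clusters
         if ci != cj and sepset(ci, cj)]
msgs = {e: np.ones(2) for e in edges}   # message over the sepset variable

def absorb(ci, exclude):
    """psi_i times all incoming messages except the one from `exclude`."""
    psi = factors[ci].copy()
    for (ck, c2) in edges:
        if c2 == ci and ck != exclude:
            m = msgs[(ck, ci)]
            psi *= m[:, None] if sepset(ck, ci) == ci[0] else m[None, :]
    return psi

for _ in range(50):   # synchronous message updates
    new = {}
    for (ci, cj) in edges:
        psi = absorb(ci, exclude=cj)
        m = psi.sum(axis=1) if sepset(ci, cj) == ci[0] else psi.sum(axis=0)
        new[(ci, cj)] = m / m.sum()      # normalize for numerical stability
    msgs = new

belief_AB = absorb(("A", "B"), exclude=None)
belief_AB /= belief_AB.sum()

# Exact P(A,B) for comparison: loopy BP is only approximate on this loop
joint = np.array([[[[factors[("A", "B")][a, b] * factors[("B", "C")][b, c]
                     * factors[("C", "D")][c, d] * factors[("D", "A")][d, a]
                     for d in (0, 1)] for c in (0, 1)]
                   for b in (0, 1)] for a in (0, 1)], dtype=float)
print(belief_AB)                         # approximate cluster belief
print(joint.sum(axis=(2, 3)) / joint.sum())  # exact marginal P(A,B)
```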

Page 13:

Probabilistic Graphical Models Srihari

13

Exact Inference Revisited

•  We have a factorized distribution of the form

  PΦ(X) = (1/Z) Π_{φ∈Φ} φ(Uφ)

  –  where Uφ = Scope(φ)
  –  The factors are:
    •  CPDs in a Bayesian network, or
    •  potentials in a Markov network
•  We are interested in answering queries:
  –  about marginal probabilities of variables, and
  –  about the partition function

Page 14:

Probabilistic Graphical Models Srihari

14

Cluster Graph Representation

•  The end-product of Belief Propagation is a calibrated cluster tree
  –  A calibrated set of beliefs represents a distribution
•  We view exact inference as searching over the set of distributions Q that are representable by the clique tree, to find a distribution Q* that matches PΦ

•  A cluster graph U for factors Φ over χ is an undirected graph, each of whose nodes i is associated with a subset Ci ⊆ χ. Each edge between a pair of clusters Ci and Cj is associated with a sepset Si,j ⊆ Ci ∩ Cj.
•  A tree T is a clique tree for a graph H if:
  –  each node in T corresponds to a clique in H, and each maximal clique in H is a node in T
  –  each sepset Si,j separates W<(i,j) and W<(j,i) in H
(a minimal data-structure sketch follows)
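
To make the definition concrete, here is a small Python sketch (not from the slides) of a cluster graph with the sepset validity check Si,j ⊆ Ci ∩ Cj; all names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterGraph:
    """Cluster graph U for factors over variables chi (illustrative sketch)."""
    clusters: dict = field(default_factory=dict)  # i -> C_i (frozenset of vars)
    sepsets: dict = field(default_factory=dict)   # (i, j) -> S_{i,j}

    def add_cluster(self, i, scope):
        self.clusters[i] = frozenset(scope)

    def add_edge(self, i, j, sep):
        # The definition requires S_{i,j} to be a subset of C_i intersect C_j
        s = frozenset(sep)
        if not s <= (self.clusters[i] & self.clusters[j]):
            raise ValueError(f"sepset {set(s)} not contained in C_{i} ∩ C_{j}")
        self.sepsets[(i, j)] = s

# The clique tree from the Misconception example above:
g = ClusterGraph()
g.add_cluster(1, {"A", "B", "D"})
g.add_cluster(2, {"B", "C", "D"})
g.add_edge(1, 2, {"B", "D"})   # S_{1,2} = {B, D}
```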

Page 15:

Probabilistic Graphical Models Srihari

15

Distance between Q and PΦ

•  We need to optimize the distance between Q and PΦ without answering hard queries about PΦ
•  Relative entropy (or K-L divergence) allows us to exploit the structure of PΦ without performing reasoning with it
  –  The relative entropy of P1 and P2 is defined as

    D(P1 || P2) = E_{P1}[ ln( P1(χ) / P2(χ) ) ]

  •  It is always non-negative
  •  It is equal to 0 if and only if P1 = P2
•  We search for the distribution Q that minimizes D(Q || PΦ)
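
As a small numeric illustration (not from the slides), the following Python snippet computes D(P1 || P2) for two explicit distributions over one binary variable; the particular distributions are arbitrary examples.

```python
import math

def kl_divergence(p1, p2):
    """Relative entropy D(P1 || P2) = sum_x P1(x) * ln(P1(x) / P2(x))."""
    return sum(p * math.log(p / q) for p, q in zip(p1, p2) if p > 0)

p = [0.7, 0.3]
q = [0.5, 0.5]
print(kl_divergence(p, q))  # ~0.0823, non-negative
print(kl_divergence(p, p))  # 0.0, zero iff P1 = P2
print(kl_divergence(q, p))  # ~0.0872: note the asymmetry, D(P1||P2) != D(P2||P1)
```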

Page 16:

Probabilistic Graphical Models Srihari

16

Specifying the set Q

•  We need to specify the objects to optimize over
•  Suppose we are given:
  –  a clique tree structure T for PΦ, and
  –  a set of beliefs Q = {βi : i ∈ V_T} ∪ {μi,j : (i−j) ∈ E_T},
    where the Ci are the clusters in T, βi denotes the beliefs over Ci, and μi,j denotes the beliefs over the sepset Si,j of edge (i−j) in T
•  The set of beliefs in T defines a distribution Q by

    Q(χ) = Π_{i∈V_T} βi(Ci) / Π_{(i−j)∈E_T} μi,j(Si,j)

  –  The beliefs correspond to the marginals of Q: βi[ci] = Q(ci) and μi,j[si,j] = Q(si,j)
•  We are now searching over the set of distributions Q that are representable by a set of beliefs Q over the cliques and sepsets in a particular clique tree structure T

Page 17:

Probabilistic Graphical Models Srihari

17

Statement of Inference as Optimization

•  Exact inference is one of maximizing −D(Q || PΦ) over the space of calibrated sets Q

Ctree-Optimize-KL:
  •  Find Q = {βi : i ∈ V_T} ∪ {μi,j : (i−j) ∈ E_T}
  •  Maximizing −D(Q || PΦ)
  •  Subject to

    μi,j[si,j] = Σ_{Ci−Si,j} βi(ci)  ∀(i−j) ∈ E_T, ∀si,j ∈ Val(Si,j)
    Σ_{ci} βi(ci) = 1  ∀i ∈ V_T

•  Theorem: If T is an I-map of PΦ, then there is a unique solution to Ctree-Optimize-KL
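
To preview how this connects to the earlier slide “Lagrange leads to Message Passing”: forming the Lagrangian with multipliers for the two constraint families and setting its derivatives to zero yields fixed-point equations that are exactly sum-product message passing. This is a sketch of the standard derivation for the energy form of the objective, not quoted from the slides:

```latex
\begin{aligned}
\delta_{i\to j}(S_{i,j}) &\propto
  \sum_{C_i \setminus S_{i,j}} \psi_i(C_i)
  \prod_{k \in \mathrm{Nb}_i \setminus \{j\}} \delta_{k\to i}(S_{k,i})\\
\beta_i(C_i) &\propto \psi_i(C_i)\prod_{j \in \mathrm{Nb}_i}\delta_{j\to i}(S_{j,i}),
\qquad
\mu_{i,j}(S_{i,j}) \propto \delta_{i\to j}\,\delta_{j\to i}
\end{aligned}
```

Here the messages δ arise as exponentiated Lagrange multipliers for the marginalization constraints.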

Page 18:

Probabilistic Graphical Models Srihari

Possible approach •  Examine different configurations of beliefs that

satisfy marginal consistency constraints – Select the configuration that maximizes the

objective – Such as exhaustive examination is impossible to

perform •  Instead of searching over a space of all

calibrated trees we can search over a space of simpler distributions – We will not find a distribution equivalent to PΦ but

one that is reasonably close 18

Page 19:

Probabilistic Graphical Models Srihari

19

The Energy Functional

•  Directly evaluating D(Q || PΦ) is unwieldy
  –  because the summation over all of χ is infeasible in practice:

    D(P1 || P2) = E_{P1}[ ln P1(χ) − ln P2(χ) ] = Σ_χ P1(χ) [ ln P1(χ) − ln P2(χ) ]

•  Instead we use an equivalent form

    D(Q || PΦ) = ln Z − F[P̃Φ, Q]

  –  where F[P̃Φ, Q] is the energy functional
•  Theorem:

    F[P̃Φ, Q] = E_Q[ ln P̃(χ) ] + H_Q(χ) = Σ_{φ∈Φ} E_Q[ ln φ ] + H_Q(χ)

•  Since the term ln Z does not depend on Q,
  –  minimizing the relative entropy D(Q || PΦ) is equivalent to maximizing the energy functional
•  The energy functional has two terms:
  –  an energy term (the expectations of the logs of the factors in Φ) and an entropy term

Page 20:

Probabilistic Graphical Models Srihari

20

Optimizing the Energy Functional

•  From here onward we pose the problem of finding a good Q as one of maximizing the energy functional
  –  equivalently, minimizing the relative entropy
  –  Importantly, the energy functional involves expectations under Q
  –  By choosing Q that allows efficient inference, we can evaluate and optimize the energy functional
•  Moreover, the energy functional is a lower bound on the log partition function:
  –  since D(Q || PΦ) ≥ 0, we have ln Z ≥ F[P̃Φ, Q]
  –  This is useful since the partition function is usually the hardest part of inference
  –  It plays an important role in learning

Page 21:

Probabilistic Graphical Models Srihari

21

Strategies for optimizing the energy functional

•  These methods are referred to as Variational Methods
•  The name refers to a strategy in which we introduce new parameters that increase the degrees of freedom
•  Each choice of these parameters gives a different approximation
•  We attempt to optimize the variational parameters to get the best approximation
•  Variational calculus: finding the optima of a functional
  –  e.g., the distribution that maximizes entropy

Page 22:

Probabilistic Graphical Models Srihari

Further Topics in Variational Methods

•  Exact Inference
•  Propagation-Based Approximations
•  Propagation with Approximate Messages
•  Structured Variational Approximations

22