Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 19, Nov 1, 2012
Reading: Chap 8, C. Bishop Book
Inference in Graphical Models
Conditional Independence Assumptions
Global Markov Assumption (undirected graphs):
$A \perp B \mid C$ whenever $\mathrm{sep}_G(A, B; C)$, i.e., $C$ separates $A$ from $B$ in the graph
Local Markov Assumption (directed graphs):
$X \perp \mathrm{NonDescendants}_X \mid \mathrm{Pa}_X$
[Figures: a chain $A - C - B$ illustrating separation; a node $X$ with its parents $\mathrm{Pa}_X$ and its non-descendants $\mathrm{NonDescendants}_X$]
[Diagram: the families of distributions representable by BNs and MNs inside the space of all distributions $P$; a BN is converted to an MN by moralization, an MN to a BN by triangulation; undirected trees and undirected chordal graphs live in the intersection]
Distribution Factorization

Bayesian Networks (Directed Graphical Models), $I$-map: $I_\ell(G) \subseteq I(P)$
$\Leftrightarrow \; P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
(the factors $P(X_i \mid \mathrm{Pa}_{X_i})$ are Conditional Probability Tables, CPTs)

Markov Networks (Undirected Graphical Models), strictly positive $P$, $I$-map: $I(G) \subseteq I(P)$
$\Leftrightarrow \; P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{i=1}^{m} \Psi_i(D_i), \qquad Z = \sum_{x_1, x_2, \ldots, x_n} \prod_{i=1}^{m} \Psi_i(D_i)$
(the $\Psi_i$ are potentials over maximal cliques $D_i$; the normalization constant $Z$ is called the partition function)
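The Markov-network factorization can be checked numerically on a toy example. The sketch below (potential values made up for illustration) builds a small network over three binary variables and verifies that dividing by the partition function $Z$ yields a normalized distribution:

```python
import itertools
import numpy as np

# Toy Markov network over binary A, B, C with two clique potentials
# psi1(A, B) and psi2(B, C); the values are arbitrary positive numbers.
psi1 = np.array([[1.0, 0.5],
                 [0.5, 2.0]])
psi2 = np.array([[2.0, 1.0],
                 [1.0, 3.0]])

# Partition function: sum over all configurations of the product of potentials.
Z = sum(psi1[a, b] * psi2[b, c]
        for a, b, c in itertools.product([0, 1], repeat=3))

def p(a, b, c):
    """Joint probability under the Markov-network factorization."""
    return psi1[a, b] * psi2[b, c] / Z

# Dividing by Z makes the table a proper distribution.
total = sum(p(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
```

In contrast, a Bayesian-network product of CPTs would sum to one with no normalization step, which is why $Z$ appears only in the undirected case.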
Inference in Graphical Models
Graphical models give compact representations of probability distributions $P(X_1, \ldots, X_n)$ (reducing $n$-way tables to much smaller tables)
How do we answer queries about $P$?
Compute likelihoods
Compute conditionals
Compute maximum a posteriori assignments
We use "inference" as a name for the process of computing answers to such queries
Query Type 1: Likelihood
Most queries involve evidence
Evidence $e$ is an assignment of values to a set $E$ of variables
Evidence means observations on some variables
Without loss of generality $E = \{X_{k+1}, \ldots, X_n\}$
Simplest query: compute the probability of the evidence
$P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e)$
This is often referred to as computing the likelihood of $e$
[Figure: a network with the evidence set $E$ highlighted; the remaining variables are summed over]
Query Type 2: Conditional Probability
Often we are interested in the conditional probability distribution of a variable given the evidence
$P(X \mid e) = \frac{P(X, e)}{P(e)} = \frac{P(X, e)}{\sum_x P(X = x, e)}$
This is also called the a posteriori belief in $X$ given evidence $e$
We usually query a subset $Y$ of all variables $\mathcal{X} = \{Y, Z, e\}$ and "don't care" about the remaining $Z$
$P(Y \mid e) = \sum_z P(Y, Z = z \mid e)$
This takes all possible configurations of $Z$ into account
The process of summing out the unwanted variables $Z$ is called marginalization
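A minimal sketch of these two operations, marginalization and renormalization, on a hypothetical joint table over binary $(Y, Z, E)$ (the numbers are made up for illustration):

```python
import itertools

# Hypothetical joint table P(y, z, e) over three binary variables.
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05,
         (0, 1, 0): 0.15, (0, 1, 1): 0.10,
         (1, 0, 0): 0.20, (1, 0, 1): 0.05,
         (1, 1, 0): 0.05, (1, 1, 1): 0.30}

e = 1                                            # observed evidence E = 1

# Marginalization: sum out the "don't care" variable Z.
tau = {y: sum(joint[(y, z, e)] for z in (0, 1)) for y in (0, 1)}

# Renormalizing by P(e) gives the a posteriori belief P(Y | e).
p_e = sum(tau.values())
posterior = {y: tau[y] / p_e for y in (0, 1)}
```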
Query Type 2: Conditional Probability Example
[Figures: two networks with evidence sets $E$ highlighted; the query is the conditional over the variables of interest, summing over the rest]

Application of a posteriori Belief
Prediction: what is the probability of an outcome given the starting condition?
The query node is a descendant of the evidence
Diagnosis: what is the probability of a disease/fault given symptoms?
The query node is an ancestor of the evidence
Learning under partial observations (fill in the unobserved values)
Information can flow in either direction
Inference can combine evidence from all parts of the network
[Figure: a chain $A - B - C$ queried in both directions]
Query Type 3: Most Probable Assignment
Want to find the most probable joint assignment for some variables of interest
Such reasoning is usually performed under some given evidence $e$, ignoring the values of the other variables $Z$
Also called the maximum a posteriori (MAP) assignment for $Y$
$\mathrm{MAP}(Y \mid e) = \arg\max_y P(y \mid e) = \arg\max_y \sum_z P(y, Z = z \mid e)$
[Figure: a network with evidence $E$; sum over the nuisance variables, maximize over the query variables]
Application of MAP assignment
Classification
Find the most likely label, given the evidence
Explanation
What is the most likely scenario, given the evidence?
Cautionary note:
The MAP assignment of a variable depends on its context, i.e., the set of variables being jointly queried
Example:
MAP of $(X, Y)$?  $(0, 0)$
MAP of $X$?  $1$

X  Y  P(X,Y)
0  0  0.35
0  1  0.05
1  0  0.30
1  1  0.30

X  P(X)
0  0.40
1  0.60
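The tables above can be checked directly; the short sketch below reproduces the slide's point that the MAP of $(X, Y)$ and the MAP of $X$ alone disagree:

```python
# The slide's joint table over binary X, Y.
p_xy = {(0, 0): 0.35, (0, 1): 0.05,
        (1, 0): 0.30, (1, 1): 0.30}

# MAP of the pair (X, Y): the single most probable joint assignment.
map_joint = max(p_xy, key=p_xy.get)

# MAP of X alone: marginalize out Y first, then maximize.
p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
map_x = max(p_x, key=p_x.get)
```

Here `map_joint` is `(0, 0)` while `map_x` is `1`: maximizing after marginalization is not the same as projecting the joint maximizer.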
Complexity of Inference
Computing the a posteriori belief $P(X \mid e)$ in a GM is NP-hard in general
Hardness implies we cannot find a general procedure that works efficiently for arbitrary GMs
For particular families of GMs, we have provably efficient procedures (e.g. trees)
For some families of GMs, we need to design efficient approximate inference algorithms (e.g. grids)
Approaches to inference
Exact inference algorithms
Variable elimination algorithm
Message-passing algorithm (sum-product, belief propagation algorithm)
The junction tree algorithm
Approximate inference algorithms
Sampling methods/Stochastic simulation
Variational algorithms
Marginalization and Elimination
A metabolic pathway: what is the likelihood that protein $E$ is produced?
Query: $P(E)$
$P(E) = \sum_a \sum_b \sum_c \sum_d P(a, b, c, d, E)$
Using the graphical model, we get
$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d)$
[Figure: chain $A \to B \to C \to D \to E$]
Naïve summation needs to enumerate over an exponential number of terms
Elimination in Chains
Rearranging the terms and the summations:
$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(E \mid d) \sum_a P(a) P(b \mid a)$
[Figure: chain $A \to B \to C \to D \to E$]
Elimination in Chains (cont.)
Now we can perform the innermost summation efficiently:
$P(E) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(E \mid d) \sum_a P(a) P(b \mid a) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(E \mid d) P(b)$
The innermost summation eliminates one variable from our summation argument at a local cost.
[Figure: chain with $A$ eliminated, leaving $P(b)$ at node $B$]
This step is equivalent to a matrix-vector multiplication, costing $|\mathrm{Val}(A)| \times |\mathrm{Val}(B)|$ operations
Elimination in Chains (cont.)
Rearranging and then summing again, we get
$P(E) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(E \mid d) P(b) = \sum_d \sum_c P(d \mid c) P(E \mid d) \sum_b P(c \mid b) P(b) = \sum_d \sum_c P(d \mid c) P(E \mid d) P(c)$
[Figure: chain with $A$ and $B$ eliminated, leaving $P(b)$ and then $P(c)$]
This step is equivalent to a matrix-vector multiplication, costing $|\mathrm{Val}(B)| \times |\mathrm{Val}(C)|$ operations

P(c|b):   b=0   b=1
  c=0     0.15  0.35
  c=1     0.85  0.65

P(b):
  b=0     0.25
  b=1     0.75
Elimination in Chains (cont.)
Eliminate nodes one by one all the way to the end:
$P(E) = \sum_d P(E \mid d) P(d)$
In general, each step computes $\Psi(X_i) = \sum_{x_{i-1}} P(X_i \mid x_{i-1}) P(x_{i-1})$
Computational complexity for a chain of length $k$ with $n$ states per variable:
Each step costs $O(|\mathrm{Val}(X_{i-1})| \times |\mathrm{Val}(X_i)|)$ operations, for a total of $O(k n^2)$
Compare to naïve summation $\sum_{x_1} \cdots \sum_{x_{k-1}} P(x_1, \ldots, X_k)$: $O(n^k)$
[Figure: chain $A \to B \to C \to D \to E$ with intermediate results $P(b)$, $P(c)$]
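The chain computation above is exactly a sequence of vector-matrix products. The sketch below (random CPTs, made up for illustration) computes the marginal of the last node of a 5-node chain by elimination and checks it against naïve summation over the full joint:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k = 3                                          # states per variable

def random_cpt(size):
    """Random table t[parent, child] = P(child | parent); rows sum to 1."""
    t = rng.random((size, size))
    return t / t.sum(axis=1, keepdims=True)

p_a = rng.random(k)
p_a /= p_a.sum()                               # prior P(A)
cpts = [random_cpt(k) for _ in range(4)]       # P(B|A), P(C|B), P(D|C), P(E|D)

# Variable elimination along the chain A -> B -> C -> D -> E:
# each step is one vector-matrix product, O(k^2), for a total of O(4 k^2).
msg = p_a
for cpt in cpts:
    msg = msg @ cpt                            # eliminate the leftmost variable
p_e = msg

# Naive summation over all k^5 joint configurations, for comparison.
brute = np.zeros(k)
for a, b, c, d, e in itertools.product(range(k), repeat=5):
    brute[e] += (p_a[a] * cpts[0][a, b] * cpts[1][b, c]
                 * cpts[2][c, d] * cpts[3][d, e])
```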
Undirected Chains
[Figure: undirected chain $A - B - C - D - E$]
Rearrange terms, perform local summations:
$P(E) = \frac{1}{Z} \sum_a \sum_b \sum_c \sum_d \Psi(b, a) \Psi(c, b) \Psi(d, c) \Psi(E, d) = \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(c, b) \Psi(d, c) \Psi(E, d) \sum_a \Psi(b, a) = \frac{1}{Z} \sum_d \sum_c \sum_b \Psi(c, b) \Psi(d, c) \Psi(E, d) \Psi(b)$
The Sum-Product Operation
During inference, we try to compute an expression in sum-product form: $\sum_{\mathcal{Z}} \prod_{\Psi \in \mathcal{F}} \Psi$
$\mathcal{X} = \{X_1, \ldots, X_n\}$ is the set of variables
$\mathcal{F}$ is a set of factors such that for each $\Psi \in \mathcal{F}$, $\mathrm{Scope}(\Psi) \subseteq \mathcal{X}$
$\mathcal{Y} \subset \mathcal{X}$ is the set of query variables
$\mathcal{Z} = \mathcal{X} - \mathcal{Y}$ are the variables to eliminate
The result of eliminating the variables in $\mathcal{Z}$ is a factor
$\tau(\mathcal{Y}) = \sum_{z} \prod_{\Psi \in \mathcal{F}} \Psi$
This factor does not necessarily correspond to any probability or conditional probability in the network; to get probabilities, renormalize:
$P(\mathcal{Y}) = \frac{\tau(\mathcal{Y})}{\sum_{\mathcal{Y}} \tau(\mathcal{Y})}$
Inference via Variable Elimination
General Idea
Write the query in the form
$P(X_1, e) = \sum_{x_n} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid \mathrm{Pa}_{x_i})$
The sums are ordered to suggest an elimination order
Then iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product
Finally renormalize:
$P(X_1 \mid e) = \frac{\tau(X_1, e)}{\sum_{x_1} \tau(x_1, e)}$
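The iteration above can be sketched as a generic variable elimination over table factors. The implementation below is a minimal illustration (binary variables only, factors as plain dictionaries; all names are ours, not from the slides), checked on a two-node chain:

```python
import itertools

# A factor is a pair (vars, table): `vars` is a tuple of variable names and
# `table` maps full assignments (tuples of 0/1, in `vars` order) to numbers.

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in itertools.product((0, 1), repeat=len(vs)):
        val = dict(zip(vs, assign))
        table[assign] = (ft[tuple(val[v] for v in fv)]
                         * gt[tuple(val[v] for v in gv)])
    return vs, table

def sum_out(f, var):
    """Marginalize `var` out of factor f: the innermost sum."""
    fv, ft = f
    i = fv.index(var)
    vs = fv[:i] + fv[i + 1:]
    table = {}
    for assign, val in ft.items():
        rest = assign[:i] + assign[i + 1:]
        table[rest] = table.get(rest, 0.0) + val
    return vs, table

def eliminate(factors, order):
    """Eliminate the variables in `order`; return the product of the rest."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]    # factors containing var
        factors = [f for f in factors if var not in f[0]]  # irrelevant terms stay out
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))                 # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Tiny check on the chain A -> B with P(A) and P(B|A): eliminating A
# should produce the marginal P(B).
p_a = (('A',), {(0,): 0.4, (1,): 0.6})
p_b_given_a = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1,
                            (1, 0): 0.2, (1, 1): 0.8})
vars_left, p_b = eliminate([p_a, p_b_given_a], ['A'])
```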
A more complex network
A food web
What is the probability $P(A \mid H)$ that hawks are leaving given that the grass condition is poor?
[Figure: food web DAG with edges $B \to C$, $A \to D$, $C \to E$, $D \to E$, $A \to F$, $E \to G$, $E \to H$, $F \to H$]
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G, H$
Initial factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) P(h \mid e, f)$
Choose an elimination order: $H, G, F, E, D, C, B$
Step 1: Eliminate $H$
Conditioning (fix the evidence node on its observed value):
$m_h(e, f) = P(H = h \mid e, f)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F, G$
Current factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) P(g \mid e) \, m_h(e, f)$
Step 2: Eliminate $G$
Compute $m_g(e) = \sum_g P(g \mid e) = 1$
$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) \, m_h(e, f)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E, F$
Current factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) P(f \mid a) \, m_h(e, f)$
Step 3: Eliminate $F$
Compute $m_f(a, e) = \sum_f P(f \mid a) \, m_h(e, f)$
$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) \, m_f(a, e)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D, E$
Current factors: $P(a) P(b) P(c \mid b) P(d \mid a) P(e \mid c, d) \, m_f(a, e)$
Step 4: Eliminate $E$
Compute $m_e(a, c, d) = \sum_e P(e \mid c, d) \, m_f(a, e)$
$\Rightarrow P(a) P(b) P(c \mid b) P(d \mid a) \, m_e(a, c, d)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C, D$
Current factors: $P(a) P(b) P(c \mid b) P(d \mid a) \, m_e(a, c, d)$
Step 5: Eliminate $D$
Compute $m_d(a, c) = \sum_d P(d \mid a) \, m_e(a, c, d)$
$\Rightarrow P(a) P(b) P(c \mid b) \, m_d(a, c)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B, C$
Current factors: $P(a) P(b) P(c \mid b) \, m_d(a, c)$
Step 6: Eliminate $C$
Compute $m_c(a, b) = \sum_c P(c \mid b) \, m_d(a, c)$
$\Rightarrow P(a) P(b) \, m_c(a, b)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to eliminate $B$
Current factors: $P(a) P(b) \, m_c(a, b)$
Step 7: Eliminate $B$
Compute $m_b(a) = \sum_b P(b) \, m_c(a, b)$
$\Rightarrow P(a) \, m_b(a)$
Example: Variable Elimination
Query: $P(A \mid h)$; need to renormalize over $A$
Current factors: $P(a) \, m_b(a)$
Step 8: Renormalize
$P(a, h) = P(a) \, m_b(a)$; compute $P(h) = \sum_a P(a) \, m_b(a)$
$\Rightarrow P(a \mid h) = \frac{P(a) \, m_b(a)}{\sum_a P(a) \, m_b(a)}$
Complexity of variable elimination
Suppose in one elimination step we compute
$m_x(y_1, \ldots, y_k) = \sum_x m_x'(x, y_1, \ldots, y_k)$, where $m_x'(x, y_1, \ldots, y_k) = \prod_{i=1}^{k} m_i(x, \mathbf{y}_{c_i})$
This requires
$k \cdot |\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_{c_i})|$ multiplications: for each value of $x, y_1, \ldots, y_k$, we do $k$ multiplications
$|\mathrm{Val}(X)| \cdot \prod_i |\mathrm{Val}(Y_{c_i})|$ additions: for each value of $y_1, \ldots, y_k$, we do $|\mathrm{Val}(X)|$ additions
Complexity is exponential in the number of variables in the intermediate factor
[Figure: node $X$ connected to $y_1, \ldots, y_i, \ldots, y_k$]
Inference in Graphical Models
General form of the inference problem:
$P(X_1, \ldots, X_n) \propto \prod_i \Psi(D_i)$
Want to query variables $Y$ given evidence $e$, and "don't care" about a set of variables $Z$
Compute $\tau(Y, e) = \sum_Z \prod_i \Psi(D_i)$ using variable elimination
Renormalize to obtain the conditionals: $P(Y \mid e) = \frac{\tau(Y, e)}{\sum_Y \tau(Y, e)}$
Two examples of using the graph structure to order the computation:
[Figures: the chain $A \to B \to C \to D \to E$ and the food web DAG]
From Variable Elimination to Message Passing
Recall that the dependency induced during marginalization is captured in elimination cliques
Summation ⟷ Elimination
Intermediate terms ⟷ Elimination cliques
Can this lead to a generic inference algorithm?
Chain: Query E
Nice localization in computation:
$P(E) = \sum_a \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(E \mid d)$
$P(E) = \sum_d P(E \mid d) \sum_c P(d \mid c) \sum_b P(c \mid b) \sum_a P(b \mid a) P(a)$
[Figure: chain $A \to B \to C \to D \to E$ with messages $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(E)$]
$P(E) = m_{DE}(E)$
Chain: Query C
Start elimination away from the query variable:
$P(C) = \sum_a \sum_b \sum_d \sum_e P(a) P(b \mid a) P(C \mid b) P(d \mid C) P(e \mid d)$
$P(C) = \Big( \sum_d P(d \mid C) \big( \sum_e P(e \mid d) \big) \Big) \Big( \sum_b P(C \mid b) \big( \sum_a P(b \mid a) P(a) \big) \Big)$
[Figure: chain with messages $m_{AB}(b)$, $m_{BC}(C)$ from the left and $m_{ED}(d)$, $m_{DC}(C)$ from the right]
$P(C) = m_{DC}(C) \, m_{BC}(C)$
Chain: What if I want to query everybody?
Query $P(A), P(B), P(C), P(D), P(E)$
e.g. $P(B) = \Big( \sum_c P(c \mid B) \big( \sum_d P(d \mid c) ( \sum_e P(e \mid d) ) \big) \Big) \sum_a P(B \mid a) P(a)$
Computational cost:
Each message costs $O(K^2)$
Chain length is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, cost is about $O(L^2 K^2)$
[Figure: chain with messages $m_{AB}(B)$, $m_{CB}(B)$, $m_{DC}(c)$, $m_{ED}(d)$]
What is shared in these queries?
$P(E) = \sum_d P(E \mid d) \sum_c P(d \mid c) \sum_b P(c \mid b) \sum_a P(b \mid a) P(a)$
$P(C) = \Big( \sum_d P(d \mid C) \sum_e P(e \mid d) \Big) \Big( \sum_b P(C \mid b) \sum_a P(b \mid a) P(a) \Big)$
$P(B) = \Big( \sum_c P(c \mid B) \sum_d P(d \mid c) \sum_e P(e \mid d) \Big) \sum_a P(B \mid a) P(a)$
[Figures: the same chain annotated with the messages used by each query]
The number of unique messages is $2(L - 1)$
Forward-backward algorithm
Compute and cache the $2(L - 1)$ unique messages
At query time, just multiply together the messages from the neighbors
e.g. $P(D) = m_{CD}(D) \, m_{ED}(D)$
Forward pass: $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(e)$
Backward pass: $m_{BA}(a)$, $m_{CB}(b)$, $m_{DC}(c)$, $m_{ED}(d)$
For all queries, $O(2 L K^2)$
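The forward-backward idea can be sketched for an undirected chain: compute all left-to-right messages, all right-to-left messages, and answer every marginal query by multiplying the two cached messages at a node. The potentials below are random values for illustration, and the result is checked against brute-force summation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K, L = 3, 5                                        # states per node, chain length
psis = [rng.random((K, K)) for _ in range(L - 1)]  # pairwise potentials psi(x_i, x_{i+1})

# Forward pass: fwd[i](x_i) sums out everything to the left of node i.
fwd = [np.ones(K)]
for psi in psis:
    fwd.append(fwd[-1] @ psi)

# Backward pass: bwd[i](x_i) sums out everything to the right of node i.
bwd = [np.ones(K)]
for psi in reversed(psis):
    bwd.append(psi @ bwd[-1])
bwd = bwd[::-1]

def marginal(i):
    """P(X_i): product of the two cached messages at node i, renormalized."""
    m = fwd[i] * bwd[i]
    return m / m.sum()

# Brute-force check against the fully enumerated joint.
joint = np.zeros((K,) * L)
for xs in itertools.product(range(K), repeat=L):
    joint[xs] = np.prod([psis[i][xs[i], xs[i + 1]] for i in range(L - 1)])
joint /= joint.sum()
```

All $L$ marginals come from the same $2(L-1)$ cached messages, which is where the $O(2LK^2)$ total comes from.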
DAG: Variable elimination
Elimination order $H, G, F, E, B, C, D$:
$P(A) = P(A) \sum_d P(d \mid A) \sum_c \big( \sum_b P(b) P(c \mid b) \big) \Big( \sum_e P(e \mid c, d) \big( \sum_g P(g \mid e) \big) \big( \sum_f P(f \mid A) \sum_h P(h \mid e, f) \big) \Big)$
Messages generated: $m_H(e, f)$, $m_{GE}(e)$, $m_F(A, e)$, $m_E(A, c, d)$, $m_{BC}(c)$, $m_C(A, d)$, $m_{DA}(A)$
4-way tables created!
DAG: Cliques of size 4 are generated
[Figures: snapshots of the food web graph as $H, G, F, E, B, C, D$ are eliminated in turn, producing the messages $m_H(e, f)$, $m_{GE}(e)$, $m_F(A, e)$, $m_E(A, c, d)$, $m_{BC}(c)$, $m_C(A, d)$, $m_{DA}(A)$; the step that eliminates $E$ involves a 4-way table]
DAG: A different elimination order
Elimination order $G, H, F, B, C, D, E$:
$P(A) = P(A) \sum_e \Big( \sum_d P(d \mid A) \sum_c P(e \mid c, d) \sum_b P(b) P(c \mid b) \Big) \Big( \sum_f P(f \mid A) \sum_h P(h \mid e, f) \Big) \Big( \sum_g P(g \mid e) \Big)$
Messages generated: $m_{GE}(e)$, $m_H(e, f)$, $m_F(A, e)$, $m_{BC}(c)$, $m_C(e, d)$, $m_D(A, e)$, $m_{EA}(A)$
NO 4-way tables!
DAG: No cliques of size 4
[Figures: snapshots of the food web graph as $G, H, F, B, C, D, E$ are eliminated in turn, producing the messages $m_{GE}(e)$, $m_H(e, f)$, $m_F(A, e)$, $m_{BC}(c)$, $m_C(d, e)$, $m_D(A, e)$, $m_{EA}(A)$; no step involves a 4-way table]
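The effect of elimination order can be reproduced with a few lines of graph bookkeeping. The sketch below simulates elimination on the moralized food web (moral edges $C\!-\!D$ and $E\!-\!F$ added for the common children $E$ and $H$); the size of each intermediate factor is the eliminated node plus its current neighbors:

```python
# Moralized food web: the DAG edges made undirected, plus moral edges
# C-D (parents of E) and E-F (parents of H).
edges = [('A', 'D'), ('A', 'F'), ('B', 'C'), ('C', 'E'), ('D', 'E'),
         ('E', 'G'), ('E', 'H'), ('F', 'H'),
         ('C', 'D'), ('E', 'F')]               # moral edges

def factor_sizes(edges, order):
    """Number of variables in each intermediate factor (eliminated node
    plus its current neighbors) for the given elimination order."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sizes = []
    for v in order:
        nbrs = adj.pop(v)
        sizes.append(len(nbrs) + 1)
        for u in nbrs:                         # eliminating v marries its neighbors
            adj[u].discard(v)
            adj[u] |= nbrs - {u}
    return sizes

sizes1 = factor_sizes(edges, ['H', 'G', 'F', 'E', 'B', 'C', 'D'])  # slide 38 order
sizes2 = factor_sizes(edges, ['G', 'H', 'F', 'B', 'C', 'D', 'E'])  # slide 40 order
```

The first order hits a 4-variable factor (eliminating $E$ while $A$, $C$, $D$ are still in scope), while the second never exceeds 3 variables.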
Any thoughts?
Chains have nice properties:
The forward-backward algorithm works
Intermediate results (messages) live along edges
Can we generalize to other graphs (trees, loopy graphs)?
How about undirected trees? Is there a forward-backward algorithm?
Loopy graphs are more complicated: different elimination orders result in different computational costs
Can we somehow make loopy graphs behave like trees?
Tree Graphical Models
Undirected tree: a unique path exists between any pair of nodes
Directed tree: all nodes except the root have exactly one parent

Equivalence of directed and undirected trees
Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it
A directed tree and the corresponding undirected tree make the same conditional independence assertions
The parameterizations are essentially the same
Undirected tree: $P(X) = \frac{1}{Z} \prod_{i \in V} \Psi(X_i) \prod_{(i,j) \in E} \Psi(X_i, X_j)$
Directed tree: $P(X) = P(X_r) \prod_{(i,j) \in E} P(X_j \mid X_i)$
Equivalence: $\Psi(X_r) = P(X_r)$, $\Psi(X_i, X_j) = P(X_j \mid X_i)$, $Z = 1$, and $\Psi(X_i) = 1$ for $i \neq r$
Message passing on trees
Messages are passed along tree edges
$P(X_i, X_j, X_k, X_l, X_f) \propto \Psi(X_i) \Psi(X_j) \Psi(X_k) \Psi(X_l) \Psi(X_f) \, \Psi(X_i, X_j) \Psi(X_k, X_j) \Psi(X_l, X_j) \Psi(X_i, X_f)$
$P(X_f) \propto \Psi(X_f) \sum_{x_i} \Psi(x_i) \Psi(x_i, X_f) \sum_{x_j} \Psi(x_j) \Psi(x_i, x_j) \Big( \sum_{x_k} \Psi(x_k) \Psi(x_k, x_j) \Big) \Big( \sum_{x_l} \Psi(x_l) \Psi(x_l, x_j) \Big)$
[Figure: tree $f - i - j$ with leaves $k$ and $l$ attached to $j$; messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ji}(X_i)$, $m_{if}(X_f)$]
Sharing messages on trees
Query $f$: messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{ji}(X_i)$, $m_{if}(X_f)$ all flow towards $f$
Query $j$: messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, $m_{fi}(X_i)$, $m_{ij}(X_j)$ all flow towards $j$
The messages $m_{kj}(X_j)$ and $m_{lj}(X_j)$ are shared between the two queries
Computational cost for all queries
Query $P(X_k), P(X_l), P(X_j), P(X_i), P(X_f)$
Doing things separately:
Each message costs $O(K^2)$
The number of edges is $L$
Cost for each query is about $O(L K^2)$
For $L$ queries, cost is about $O(L^2 K^2)$
Forward-backward algorithm in trees
Forward: pick one leaf as the root, compute all messages towards it, cache them
Backward: pick another root, compute all messages, cache them; messages already computed are reused
e.g. Query $j$: multiply the cached messages $m_{kj}(X_j)$, $m_{lj}(X_j)$, and $m_{ij}(X_j)$
[Figures: the tree with messages oriented towards each choice of root, showing which messages are reused]
Computational saving for trees
Compute the forward and backward messages for each edge and save them
Each message costs $O(K^2)$
The number of edges is $L$, so there are $2L$ unique messages
Cost for all queries is about $O(2 L K^2)$
Message passing algorithm
$m_{ji}(X_i) \propto \sum_{X_j} \Psi(X_i, X_j) \, \Psi(X_j) \prod_{s \in N(j) \setminus i} m_{sj}(X_j)$
Take the product of the incoming messages, multiply by the local potential, then sum out $X_j$
Node $j$ can send its message to $i$ once the incoming messages from all of $N(j) \setminus i$ have arrived
[Figure: tree with messages $m_{kj}(X_j)$, $m_{lj}(X_j)$ entering $j$ and $m_{ji}(X_i)$ leaving towards $i$]
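The message equation can be sketched directly for the 5-node tree in the figure (random node and edge potentials for illustration); the marginals obtained from messages are checked against brute-force summation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
K = 2
nodes = ['f', 'i', 'j', 'k', 'l']
edges = [('f', 'i'), ('i', 'j'), ('j', 'k'), ('j', 'l')]   # the slide's tree

node_pot = {v: rng.random(K) + 0.1 for v in nodes}
edge_pot = {e: rng.random((K, K)) + 0.1 for e in edges}    # indexed [x_u, x_v]

nbrs = {v: set() for v in nodes}
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

def psi(u, v):
    """Edge potential as a [x_u, x_v] table, regardless of storage direction."""
    return edge_pot[(u, v)] if (u, v) in edge_pot else edge_pot[(v, u)].T

messages = {}
def message(j, i):
    """m_{ji}(x_i) = sum_{x_j} psi(x_i, x_j) psi(x_j) prod_{s in N(j)\\{i}} m_{sj}(x_j)."""
    if (j, i) not in messages:
        prod = node_pot[j].copy()
        for s in nbrs[j] - {i}:
            prod = prod * message(s, j)       # recurse: wait for upstream messages
        messages[(j, i)] = psi(i, j) @ prod   # sum out x_j
    return messages[(j, i)]

def marginal(v):
    m = node_pot[v].copy()
    for s in nbrs[v]:
        m = m * message(s, v)
    return m / m.sum()

# Brute-force marginals from the fully enumerated joint, for comparison.
joint = np.zeros((K,) * len(nodes))
for xs in itertools.product(range(K), repeat=len(nodes)):
    x = dict(zip(nodes, xs))
    val = np.prod([node_pot[v][x[v]] for v in nodes])
    for u, v in edges:
        val *= edge_pot[(u, v)][x[u], x[v]]
    joint[xs] = val
joint /= joint.sum()
```

Because messages are cached, querying every node computes exactly two messages per edge, matching the forward-backward accounting above.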
From Variable Elimination to Message Passing
Recall the Variable Elimination Algorithm:
Choose an ordering in which the query node $f$ is the final node
Eliminate node $i$ by removing all potentials containing $i$, taking the sum/product over $x_i$
Place the resultant factor back
For a tree graphical model:
Choose the query node $f$ as the root of the tree
View the tree as a directed tree with edges pointing towards $f$
Elimination of each node can be viewed as message passing directly along tree branches, rather than on some transformed graph
Thus, we can use the tree itself as a data structure for inference
How about general graphs?
Trees are nice:
Can just compute two messages for each edge
Computation is ordered along the graph
Intermediate results are associated with edges
General graphs are not so clear:
Different elimination orders generate different cliques and factor sizes
Computation and intermediate results are not associated with edges
The local-computation view is not so clear
[Figures: the message-passing tree vs. the loopy food web graph]
Can we make them tree-like, or treat them as trees?
Message passing for loopy graphs
Local message passing for trees guarantees the consistency of the local marginals:
The $P(X_i)$ computed is the correct one
The $P(X_i, X_j)$ computed is the correct one
…
For loopy graphs, there are no such consistency guarantees for local message passing
Inference for loopy graphical models is NP-hard in general

Loopy belief propagation
Treat loopy graphs locally as if they were trees
Iteratively estimate the marginals:
Read in messages
Process messages
Send updated outgoing messages
Repeat for all variables until convergence
Message update schedule
Synchronous update:
$X_j$ can send a message once the incoming messages from $N(j) \setminus i$ have arrived
Slow
Provably correct for trees; may converge for loopy graphs
Asynchronous update:
$X_j$ can send a message whenever there is a change in any incoming message from $N(j) \setminus i$
Fast
Not easy to prove convergence, but empirically it often works
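A minimal sketch of synchronous loopy belief propagation on a 4-node cycle (random pairwise potentials, made up for illustration). Messages are updated in parallel and normalized each round; on a single loop with positive potentials the updates typically converge, but the resulting beliefs are only approximations to the true marginals:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 2
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # one loop: not a tree
pots = {e: rng.random((K, K)) + 0.5 for e in edges}

nbrs = {v: [] for v in nodes}
for u, v in edges:
    nbrs[u].append(v)
    nbrs[v].append(u)

def psi(u, v):
    return pots[(u, v)] if (u, v) in pots else pots[(v, u)].T

# Synchronous schedule: all messages start uniform and are updated in parallel.
msgs = {(j, i): np.ones(K) / K for j in nodes for i in nbrs[j]}
delta = np.inf
for _ in range(500):
    new = {}
    for (j, i) in msgs:
        prod = np.ones(K)
        for s in nbrs[j]:
            if s != i:
                prod = prod * msgs[(s, j)]     # read in incoming messages
        m = psi(i, j) @ prod                   # process: sum out x_j
        new[(j, i)] = m / m.sum()              # normalize for numerical stability
    delta = max(np.abs(new[key] - msgs[key]).max() for key in msgs)
    msgs = new                                 # send updated messages
    if delta < 1e-10:
        break

def belief(v):
    """Estimated marginal: product of incoming messages, renormalized."""
    b = np.ones(K)
    for s in nbrs[v]:
        b = b * msgs[(s, v)]
    return b / b.sum()
```

With an asynchronous schedule one would instead update a message as soon as any of its inputs changed, which usually converges in fewer sweeps.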