Bayesian networks: Inference and learning
CS194-10 Fall 2011 Lecture 22
Outline
♦ Exact inference (briefly)
♦ Approximate inference (rejection sampling, MCMC)
♦ Parameter learning
Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
Simple query on the burglary network:

[Figure: burglary network; B and E are parents of A, which is the parent of J and M]

P(B | j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)

Rewrite full joint entries using product of CPT entries:

P(B | j, m) = α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)
            = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)
Recursive depth-first enumeration: O(D) space, O(K^D) time
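As a concrete sketch, the nested summation can be coded directly in Python. The CPT entries below for the ¬b branch (e.g. P(a | ¬b, e) = 0.29) do not appear on this slide; they are the standard values from the textbook burglary example:

```python
# Depth-first enumeration for P(B | j, m) in the burglary network.
# CPTs not shown in the lecture (the b=False rows of P(A | B, E)) are
# taken from the standard textbook example.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(a=true | b, e)
P_J = {True: 0.90, False: 0.05}                     # P(j=true | a)
P_M = {True: 0.70, False: 0.01}                     # P(m=true | a)

def prob_b_jm(b):
    """Unnormalized P(b, j, m) = P(b) Σe P(e) Σa P(a|b,e) P(j|a) P(m|a)."""
    total = 0.0
    for e in (True, False):
        inner = 0.0
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
            inner += p_a * P_J[a] * P_M[a]
        total += P_E[e] * inner
    return P_B[b] * total

unnorm = {b: prob_b_jm(b) for b in (True, False)}
alpha = 1.0 / sum(unnorm.values())
posterior = {b: alpha * p for b, p in unnorm.items()}
print(posterior[True])   # ≈ 0.284: probability of burglary given both calls
```

The loop structure mirrors the factored form above: one pass over e, an inner pass over a, constant space.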
Evaluation tree
[Figure: evaluation tree for Σe Σa P(b) P(e) P(a|b,e) P(j|a) P(m|a), with branch values P(b) = .001, P(e) = .002, P(¬e) = .998, P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06, P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01]
Enumeration is inefficient: repeated computation, e.g., it computes P(j|a)P(m|a) for each value of e
Efficient exact inference
Junction tree and variable elimination algorithms avoid repeated computation (a generalized form of dynamic programming)

♦ Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of exact inference are O(K^L D) (L = maximum number of parents)

♦ Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete
[Figure: reduction from 3SAT: sentence variables A, B, C, D, each with prior 0.5, feed three clause nodes, which are combined by an AND node]
Inference by stochastic simulation
Idea: replace the sum over hidden-variable assignments with a random sample

1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network
function Prior-Sample(bn) returns an event sampled from the prior specified by bn
   inputs: bn, a Bayesian network specifying joint distribution P(X1, …, XD)
   x ← an event with D elements
   for j = 1, …, D do
      x[j] ← a random sample from P(Xj | values of parents(Xj) in x)
   return x
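As a runnable sketch (Python; names are hypothetical), here is Prior-Sample specialized to the sprinkler network used on the following slides, with its CPTs hard-coded in topological order:

```python
import random

# Prior-Sample for the sprinkler network (Cloudy, Sprinkler, Rain, WetGrass).
# Each variable is sampled given the already-sampled values of its parents.
def prior_sample(rng=random):
    c = rng.random() < 0.50                      # P(C) = .50
    s = rng.random() < (0.10 if c else 0.50)     # P(S | C)
    r = rng.random() < (0.80 if c else 0.20)     # P(R | C)
    w_probs = {(True, True): 0.99, (True, False): 0.90,
               (False, True): 0.90, (False, False): 0.01}
    w = rng.random() < w_probs[(s, r)]           # P(W | S, R)
    return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}
```

Averaging indicator counts over many calls estimates any joint probability; e.g. the event (t, f, t, t) should appear with frequency near 0.324, matching the hand computation two slides ahead.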
Example

[Figure: sprinkler network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass; with CPTs:]

P(C) = .50
P(S|C): C = T: .10; C = F: .50
P(R|C): C = T: .80; C = F: .20
P(W|S,R): S,R = T,T: .99; T,F: .90; F,T: .90; F,F: .01
Sampling from an empty network contd.
Probability that Prior-Sample generates a particular event:

SPS(x1 … xD) = Πj=1..D P(xj | parents(Xj)) = P(x1 … xD)

i.e., the true prior probability

E.g., SPS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let NPS(x1 … xD) be the number of samples generated for the event x1, …, xD

Then we have

lim N→∞ P̂(x1, …, xD) = lim N→∞ NPS(x1, …, xD)/N = SPS(x1, …, xD) = P(x1 … xD)

That is, estimates derived from Prior-Sample are consistent

Shorthand: P̂(x1, …, xD) ≈ P(x1 … xD)
Rejection sampling
P̂(X|e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N, a vector of counts for each value of X, initially zero
   for i = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then
         N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N)

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure
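A self-contained sketch of rejection sampling for this query (Python; the prior sampler is inlined so the snippet stands alone):

```python
import random

# Rejection sampling for P(Rain | Sprinkler=true) in the sprinkler network.
W = {(True, True): 0.99, (True, False): 0.90,
     (False, True): 0.90, (False, False): 0.01}

def prior_sample(rng):
    c = rng.random() < 0.50
    s = rng.random() < (0.10 if c else 0.50)
    r = rng.random() < (0.80 if c else 0.20)
    w = rng.random() < W[(s, r)]
    return c, s, r, w

def rejection_sample_rain_given_sprinkler(n_samples, seed=0):
    rng = random.Random(seed)
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        c, s, r, w = prior_sample(rng)
        if s:                      # keep only samples consistent with evidence
            counts[r] += 1
    total = counts[True] + counts[False]
    return {value: k / total for value, k in counts.items()}
```

With large N the estimate approaches the exact value P(Rain | Sprinkler = true) = 0.3, consistent with the ⟨0.296, 0.704⟩ estimate above.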
Analysis of rejection sampling
P̂(X|e) = α NPS(X, e)        (algorithm defn.)
        = NPS(X, e)/NPS(e)   (normalized by NPS(e))
        ≈ P(X, e)/P(e)       (property of Prior-Sample)
        = P(X|e)             (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if P (e) is small
P (e) drops off exponentially with number of evidence variables!
Approximate inference using MCMC
General idea of Markov chain Monte Carlo
♦ Sample space Ω, probability π(ω) (e.g., posterior given e)
♦ Would like to sample directly from π(ω), but it’s hard
♦ Instead, wander around Ω randomly, collecting samples
♦ Random wandering is controlled by a transition kernel φ(ω → ω′) specifying the probability of moving to ω′ from ω (so the random state sequence ω0, ω1, …, ωt is a Markov chain)

♦ If φ is defined appropriately, the stationary distribution is π(ω), so that after a while (the mixing time) the collected samples are drawn from π
Gibbs sampling in Bayes nets
Markov chain state ωt = current assignment xt to all variables
Transition kernel: pick a variable Xj, sample it conditioned on all others
Markov blanket property: P(Xj | all other variables) = P(Xj | mb(Xj)), so generate the next state by sampling a variable given its Markov blanket
function Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N, a vector of counts for each value of X, initially zero
      Z, the nonevidence variables in bn
      z, the current state of the variables Z, initially random
   for i = 1 to N do
      choose Zj in Z uniformly at random
      set the value of Zj in z by sampling from P(Zj | mb(Zj))
      N[x] ← N[x] + 1 where x is the value of X in z
   return Normalize(N)
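A runnable sketch of Gibbs-Ask specialized to the query on the next slides (Python; CPTs as in the earlier sprinkler example, function names hypothetical):

```python
import random

# Gibbs sampling for P(Rain | Sprinkler=true, WetGrass=true).
# Nonevidence variables are Cloudy and Rain; each step picks one
# uniformly and resamples it from its Markov-blanket distribution.
P_S = {True: 0.10, False: 0.50}   # P(Sprinkler=true | Cloudy)
P_R = {True: 0.80, False: 0.20}   # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}  # P(WetGrass=true | S, R)

def gibbs_rain(n_steps, seed=0):
    rng = random.Random(seed)
    c, r = rng.random() < 0.5, rng.random() < 0.5   # random initial state
    rain_count = 0
    for _ in range(n_steps):
        if rng.random() < 0.5:
            # resample Cloudy: weight(c') ∝ P(c') P(s=true | c') P(r | c')
            w = {v: 0.5 * P_S[v] * (P_R[v] if r else 1 - P_R[v])
                 for v in (True, False)}
            c = rng.random() < w[True] / (w[True] + w[False])
        else:
            # resample Rain: weight(r') ∝ P(r' | c) P(w=true | s=true, r')
            w = {v: (P_R[c] if v else 1 - P_R[c]) * P_W[(True, v)]
                 for v in (True, False)}
            r = rng.random() < w[True] / (w[True] + w[False])
        rain_count += r
    return rain_count / n_steps
```

The exact posterior for this query is about ⟨0.32, 0.68⟩, which the long-run state frequencies approach.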
The Markov chain
With Sprinkler = true, WetGrass = true, there are four states:
[Figure: the four states of the Markov chain, i.e., the four assignments to (Cloudy, Rain) with Sprinkler = true and WetGrass = true fixed, with transitions between states differing in one variable]
MCMC example contd.
Estimate P(Rain|Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Markov blanket sampling
Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

[Figure: sprinkler network with the Markov blankets of Cloudy and Rain highlighted]
Probability given the Markov blanket is calculated as follows:

P(x′j | mb(Xj)) = α P(x′j | parents(Xj)) ΠZℓ∈Children(Xj) P(zℓ | parents(Zℓ))

E.g., φ(¬cloudy, rain → cloudy, rain)
  = 0.5 × α P(cloudy) P(sprinkler | cloudy) P(rain | cloudy)
  = 0.5 × α × 0.5 × 0.1 × 0.8
  = 0.5 × 0.040/(0.040 + 0.050) = 0.2222
(the leading 0.5 is the probability of picking Cloudy to resample)
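The arithmetic above can be checked in a few lines of Python:

```python
# Transition probability φ(¬cloudy, rain → cloudy, rain), with evidence
# Sprinkler=true, WetGrass=true. Weights come from the sprinkler CPTs.
w_cloudy     = 0.5 * 0.1 * 0.8   # P(c)  P(s | c)  P(r | c)  = 0.040
w_not_cloudy = 0.5 * 0.5 * 0.2   # P(¬c) P(s | ¬c) P(r | ¬c) = 0.050
phi = 0.5 * w_cloudy / (w_cloudy + w_not_cloudy)  # 0.5 = prob. of picking Cloudy
print(round(phi, 4))   # 0.2222
```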
(Easy for discrete variables; continuous case requires mathematicalanalysis for each combination of distribution types.)
Easily implemented in message-passing parallel systems, brains
Can converge slowly, especially for near-deterministic models
Theory for Gibbs sampling
Theorem: the stationary distribution for the Gibbs transition kernel is P(z | e); i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability
Proof sketch:
– The Gibbs transition kernel satisfies detailed balance for P(z | e), i.e., for all z, z′ the “flow” from z to z′ is the same as from z′ to z
– π is the unique stationary distribution for any ergodic transition kernel satisfying detailed balance for π
Detailed balance
Let πt(z) be the probability the chain is in state z at time t
Detailed balance condition: “outflow” = “inflow” for each pair of states:
πt(z)φ(z→ z′) = πt(z′)φ(z′ → z) for all z, z′
Detailed balance ⇒ stationarity:
πt+1(z) = Σz′ πt(z′)φ(z′ → z) = Σz′ πt(z)φ(z → z′)
        = πt(z) Σz′ φ(z → z′)
        = πt(z)

(the last step uses Σz′ φ(z → z′) = 1)
MCMC algorithms are typically constructed by designing a transition kernel φ that is in detailed balance with the desired π
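As a quick numeric sanity check (a sketch, not part of the proof), detailed balance can be verified for the sprinkler network's Gibbs kernel on the pair of states that differ only in Cloudy, using the unnormalized posterior weights:

```python
# Unnormalized posterior weights P(c, r, s=true, w=true) for two states
# of the sprinkler network (evidence Sprinkler=true, WetGrass=true).
pi = {("c", "r"):  0.5 * 0.1 * 0.8 * 0.99,   # Cloudy=true,  Rain=true
      ("~c", "r"): 0.5 * 0.5 * 0.2 * 0.99}   # Cloudy=false, Rain=true

# Gibbs step resampling Cloudy (picked with prob. 0.5) given Rain=true:
w_c, w_nc = 0.5 * 0.1 * 0.8, 0.5 * 0.5 * 0.2
phi_to_c  = 0.5 * w_c / (w_c + w_nc)    # φ(¬c,r → c,r)
phi_to_nc = 0.5 * w_nc / (w_c + w_nc)   # φ(c,r → ¬c,r)

flow_in  = pi[("~c", "r")] * phi_to_c   # "flow" ¬c → c
flow_out = pi[("c", "r")] * phi_to_nc   # "flow" c → ¬c
assert abs(flow_in - flow_out) < 1e-9   # detailed balance holds
```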
Gibbs sampling transition kernel
Probability of choosing variable Zj to sample is 1/(D − |E|)

Let Z̄j be the other nonevidence variables, i.e., Z − Zj
Current values are zj and z̄j; e is fixed; the transition probability is given by

φ(z → z′) = φ(zj, z̄j → z′j, z̄j) = P(z′j | z̄j, e)/(D − |E|)

This gives detailed balance with P(z | e):

π(z)φ(z → z′) = 1/(D − |E|) · P(z | e) P(z′j | z̄j, e)
              = 1/(D − |E|) · P(zj, z̄j | e) P(z′j | z̄j, e)
              = 1/(D − |E|) · P(zj | z̄j, e) P(z̄j | e) P(z′j | z̄j, e)   (chain rule)
              = 1/(D − |E|) · P(zj | z̄j, e) P(z′j, z̄j | e)              (chain rule backwards)
              = φ(z′ → z) π(z′) = π(z′)φ(z′ → z)
Summary (inference)
Exact inference:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology

Approximate inference by MCMC:
– generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– can handle arbitrary combinations of discrete and continuous variables
Parameter learning: Complete data
θjkℓ = P(Xj = k | Parents(Xj) = ℓ)

Let xj^(i) be the value of Xj in example i; assume Boolean for simplicity

Log likelihood:

L(θ) = Σi=1..N Σj=1..D log P(xj^(i) | parents(Xj)^(i))
     = Σi=1..N Σj=1..D log [ θj1ℓ(i)^xj^(i) (1 − θj1ℓ(i))^(1 − xj^(i)) ]

Setting ∂L/∂θj1ℓ = Nj1ℓ/θj1ℓ − Nj0ℓ/(1 − θj1ℓ) = 0 gives

θj1ℓ = Nj1ℓ/(Nj0ℓ + Nj1ℓ) = Nj1ℓ/Njℓ

I.e., learning is completely decomposed; MLE = observed conditional frequency
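As a toy illustration (hypothetical data), the decomposed MLE is just counting:

```python
# Decomposed MLE by counting: θ̂ = N(X=true, U=u) / N(U=u) for a Boolean
# child X with a Boolean parent U. The data set is purely hypothetical.
data = [(True, True), (True, True), (True, False),
        (False, True), (False, False), (False, False)]   # (u, x) pairs

def mle(data, parent_value):
    """Observed conditional frequency of X=true given U=parent_value."""
    n_true = sum(1 for u, x in data if u == parent_value and x)
    n_all  = sum(1 for u, x in data if u == parent_value)
    return n_true / n_all

print(mle(data, True))   # 2/3: two of the three u=true cases have x=true
```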
Example
Red/green wrapper depends probabilistically on flavor:

[Figure: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F): F = cherry: θ1; F = lime: θ2]

Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

P(X | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)]
  + [rc log θ1 + gc log(1 − θ1)]
  + [rℓ log θ2 + gℓ log(1 − θ2)]
Example contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0   ⇒  θ  = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)
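With hypothetical counts, the closed-form MLEs are immediate:

```python
# MLEs for the candy model from observed counts (numbers are hypothetical).
c, l  = 60, 40    # cherry / lime candies (N = 100)
rc, gc = 45, 15   # red- / green-wrapped cherry
rl, gl = 10, 30   # red- / green-wrapped lime

theta  = c / (c + l)       # P(F = cherry)
theta1 = rc / (rc + gc)    # P(W = red | F = cherry)
theta2 = rl / (rl + gl)    # P(W = red | F = lime)
print(theta, theta1, theta2)   # 0.6 0.75 0.25
```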
Hidden variables
Why learn models with hidden variables?
[Figure: (a) a network with a hidden HeartDisease node mediating Smoking, Diet, Exercise → Symptom1, Symptom2, Symptom3; per-node parameter counts 2, 2, 2 (roots), 54 (HeartDisease), 6, 6, 6 (symptoms), for a total of 78. (b) the same observables with HeartDisease removed; counts 2, 2, 2 and 54, 162, 486, for a total of 708]

Hidden variables ⇒ simplified structure, fewer parameters ⇒ faster learning
Learning with and without hidden variables
[Figure: average negative log likelihood per case (≈3.0–3.5) vs. number of training cases (0–5000), for the 3-3 and 3-1-3 networks trained with the APN algorithm, neural networks with 1, 3, and 5 hidden nodes, and the target network]
EM for Bayes nets
For t = 0 to ∞ (until convergence) do

E step: compute all pijkℓ = P(Xj = k, Parents(Xj) = ℓ | e^(i), θ^(t))

M step: θjkℓ^(t+1) = N̂jkℓ / Σk′ N̂jk′ℓ = Σi pijkℓ / Σi Σk′ pijk′ℓ
E step can be any exact or approximate inference algorithm
With MCMC, can treat each sample as a complete-data example
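A minimal EM sketch for a stripped-down version of the candy model (Bag hidden, only Flavor observed; data and starting parameters are hypothetical). The key EM property, that the log-likelihood never decreases across iterations, is easy to observe:

```python
import math

# Two-bag candy mixture: Bag hidden, Flavor observed.
# θ = P(Bag=1), t1 = P(F=cherry | Bag=1), t2 = P(F=cherry | Bag=2).
flavors = [1] * 70 + [0] * 30   # 1 = cherry, 0 = lime (hypothetical data)

def em_step(theta, t1, t2):
    # E step: responsibility p_i = P(Bag=1 | flavor_i) for each candy
    resp = []
    for f in flavors:
        a = theta * (t1 if f else 1 - t1)
        b = (1 - theta) * (t2 if f else 1 - t2)
        resp.append(a / (a + b))
    # M step: ratios of expected counts
    n, n1 = len(flavors), sum(resp)
    theta = n1 / n
    t1 = sum(p * f for p, f in zip(resp, flavors)) / n1
    t2 = sum((1 - p) * f for p, f in zip(resp, flavors)) / (n - n1)
    return theta, t1, t2

def loglik(theta, t1, t2):
    return sum(math.log(theta * (t1 if f else 1 - t1)
                        + (1 - theta) * (t2 if f else 1 - t2))
               for f in flavors)

params, lls = (0.6, 0.8, 0.3), []
for _ in range(20):
    lls.append(loglik(*params))
    params = em_step(*params)
```

After one M step the fitted marginal θ·t1 + (1 − θ)·t2 equals the empirical cherry frequency (0.7 here), and the log-likelihood is non-decreasing throughout, mirroring the plot on the next slide.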
Candy example
[Figure: candy network with hidden Bag: Bag → Flavor, Wrapper, Hole, with P(Bag = 1) = θ and P(F = cherry | B) = θF1 (bag 1), θF2 (bag 2); plus a plot of the log-likelihood L climbing from about −2020 to about −1980 over the first 120 iterations of EM]
Example: Car insurance
[Figure: car-insurance network with nodes Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, Mileage, VehicleYear, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, Cushioning, Ruggedness, Accident, PropertyCost, LiabilityCost, MedicalCost, OtherCost, OwnCost]
Example: Car insurance contd.
[Figure: average negative log likelihood per case (≈1.8–3.4) vs. number of training cases (0–5000), for the 12-3 network and the insurance network trained with the APN algorithm, neural networks with 1, 5, and 10 hidden nodes, and the target network]
Bayesian learning in Bayes nets
Parameters become variables (parents of their previous owners):
P (Wrapper = red | Flavor = cherry , Θ1 = θ1, Θ2 = θ2) = θ1 .
Network is replicated for each example, with parameters shared across all examples:
[Figure: left, the single-candy model Flavor → Wrapper with parameter nodes Θ, Θ1, Θ2 and CPTs P(F = cherry) = θ, P(W = red | F): cherry: θ1, lime: θ2; right, the network replicated for examples 1–3 (Flavor1–Flavor3, Wrapper1–Wrapper3), all sharing the same Θ, Θ1, Θ2]
Bayesian learning contd.
Priors for parameter variables: Beta, Dirichlet, Gamma, Gaussian, etc.
With independent Beta or Dirichlet priors,
MAP EM learning = pseudocounts + expected counts:

M step: θjkℓ^(t+1) = (ajkℓ + N̂jkℓ) / Σk′ (ajk′ℓ + N̂jk′ℓ)
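A sketch of this M step, assuming the convention above that the hyperparameters ajkℓ are simply added to the expected counts before normalizing (all numbers hypothetical):

```python
# MAP update for one CPT column: pseudocounts a_k are added to the
# (expected) counts N_k, then the totals are normalized.
def map_estimate(counts, pseudocounts):
    totals = [n + a for n, a in zip(counts, pseudocounts)]
    s = sum(totals)
    return [t / s for t in totals]

print(map_estimate([8.0, 2.0], [1.0, 1.0]))   # [0.75, 0.25]
```

With all pseudocounts zero this reduces to the plain EM update; larger pseudocounts pull the estimate toward the prior.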
Implemented in EM training mode for Hugin and other Bayes net packages
Summary (parameter learning)
Complete data: likelihood factorizes; each parameter θjkℓ is learned separately as the observed conditional frequency Njkℓ/Njℓ
Incomplete data: likelihood is a summation over all values of hidden variables;can apply EM by computing “expected counts”
Bayesian learning: parameters become variables in a replicated modelwith their own prior distributions defined by hyperparameters; thenBayesian learning is just ordinary inference in the model