Bayesian networks: Modeling
CS194-10 Fall 2011 Lecture 21
CS194-10 Fall 2011 Lecture 21 1
Outline
♦ Overview of Bayes nets
♦ Syntax and semantics
♦ Examples
♦ Compact conditional distributions
Learning with complex probability models
Learning cannot succeed without imposing some prior structure on the hypothesis space (by constraint or by preference)
Generative models P (X | θ) support MLE, MAP, and Bayesian learning for domains that (approximately) reflect the assumptions underlying the model:
♦ Naive Bayes: conditional independence of attributes given class value
♦ Mixture models: domain has a flat, discrete category structure
♦ All i.i.d. models: model doesn’t change over time
♦ Etc.
Would like to express arbitrarily complex and flexible prior knowledge:
♦ Some attributes depend on others
♦ Categories have hierarchical structure; objects may be mixtures of several categories
♦ Observations at time t may depend on earlier observations
♦ Etc.
Bayesian networks
A simple, graphical notation for conditional independence assertions among a predefined set of random variables Xj, j = 1, . . . , D, and hence for compact specification of arbitrary joint distributions
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ “directly influences”)
– a set of parameters for each node given its parents: θ(Xj | Parents(Xj))

In the simplest case, the parameters consist of a conditional probability table (CPT) giving the distribution over Xj for each combination of parent values
Example
Topology of network encodes conditional independence assertions:
[Network: Cavity → Toothache, Cavity → Catch; Weather is a separate, unconnected node]
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example contd.
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = .001        P(E) = .002

B  E  | P(A|B,E)
T  T  |   .95
T  F  |   .94
F  T  |   .29
F  F  |   .001

A  | P(J|A)        A  | P(M|A)
T  |  .90          T  |  .70
F  |  .05          F  |  .01
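As a concrete sketch (not from the lecture), a CPT can be stored as a plain dictionary keyed by tuples of parent values; the numbers below are the Alarm CPT from this slide.

```python
# Alarm CPT: key = (Burglary, Earthquake), value = P(Alarm = true | parents).
alarm_cpt = {
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}

def p_alarm(a, b, e):
    """P(Alarm = a | Burglary = b, Earthquake = e) looked up from the CPT."""
    p_true = alarm_cpt[(b, e)]
    return p_true if a else 1.0 - p_true

print(p_alarm(True, True, False))    # 0.94
print(p_alarm(False, False, False))  # ≈ 0.999
```

Only P(X = true) is stored per row; P(X = false) is recovered as 1 − p, matching the compactness argument on the next slide.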
Compactness
A CPT for Boolean Xj with L Boolean parents has 2^L rows for the combinations of parent values

Each row requires one parameter p for Xj = true (the parameter for Xj = false is just 1 − p)

If each variable has no more than L parents, the complete network requires O(D · 2^L) parameters

I.e., grows linearly with D, vs. O(2^D) for the full joint distribution
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 parameters (vs. 2^5 − 1 = 31)
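The count for the burglary net can be reproduced in a few lines; the parent map below encodes its topology.

```python
# Parent sets for the burglary network (all variables Boolean).
parents = {
    "B": [], "E": [],
    "A": ["B", "E"],
    "J": ["A"], "M": ["A"],
}

def num_parameters(parents):
    """One parameter per row of each CPT: 2^(number of parents) per node."""
    return sum(2 ** len(ps) for ps in parents.values())

print(num_parameters(parents))  # 1 + 1 + 4 + 2 + 2 = 10
print(2 ** len(parents) - 1)    # full joint distribution: 31
```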
Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x1, . . . , xD) = ∏_{j=1}^{D} θ(xj | parents(Xj))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= θ(j|a) θ(m|a) θ(a|¬b,¬e) θ(¬b) θ(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063

Theorem: θ(Xj | Parents(Xj)) = P(Xj | Parents(Xj))
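The product above can be checked numerically; a short sketch reproducing the slide’s calculation with the burglary network’s parameters:

```python
# P(j, m, a, ¬b, ¬e) as a product of local conditional probabilities.
theta_j_given_a = 0.90
theta_m_given_a = 0.70
theta_a_given_not_b_not_e = 0.001
theta_not_b = 1 - 0.001
theta_not_e = 1 - 0.002

p = (theta_j_given_a * theta_m_given_a
     * theta_a_given_not_b_not_e * theta_not_b * theta_not_e)
print(p)  # ≈ 0.00063
```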
Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents

[Figure: node X with parents U1, . . . , Um, children Y1, . . . , Yn, and other nondescendant nodes Z1j, . . . , Znj]
Theorem: Local semantics ⇔ global semantics
Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children’s parents

[Figure: the Markov blanket of X comprises its parents U1, . . . , Um, its children Y1, . . . , Yn, and the children’s other parents Z1j, . . . , Znj]
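The blanket definition above is easy to compute from a parent map; a sketch using the burglary network’s variable names:

```python
# Parent sets for the burglary network.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def markov_blanket(x, parents):
    """Parents + children + children's other parents of node x."""
    children = [c for c, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(x)  # x is not in its own blanket
    return blanket

print(sorted(markov_blanket("A", parents)))  # ['B', 'E', 'J', 'M']
print(sorted(markov_blanket("B", parents)))  # ['A', 'E']
```

Note that E enters B’s blanket only as a co-parent of A: given the alarm, burglary and earthquake are dependent (“explaining away”).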
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics
1. Choose an ordering of variables X1, . . . , XD
2. For j = 1 to D:
   add Xj to the network
   select parents from X1, . . . , Xj−1 such that
   P(Xj | Parents(Xj)) = P(Xj | X1, . . . , Xj−1)
   i.e., Xj is conditionally independent of the other preceding variables given its parents
This choice of parents guarantees the global semantics:
P(X1, . . . , XD) = ∏_{j=1}^{D} P(Xj | X1, . . . , Xj−1)   (chain rule)
                 = ∏_{j=1}^{D} P(Xj | Parents(Xj))   (by construction)
Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters
[Network over: battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, battery meter, dipstick, car won’t start]
Example: Car insurance
[Network over: Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, SeniorTrain, Mileage, VehicleYear, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, Accident, Cushioning, Ruggedness, MedicalCost, LiabilityCost, PropertyCost, OtherCost, OwnCost]
Compact conditional distributions
CPT grows exponentially with the number of parents
CPT becomes infinite with a continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions:
NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:
∂LakeLevel/∂t = inflow + precipitation − outflow − evaporation
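The two deterministic examples above can be written directly as functions of the parents; the step size dt in the lake-level version is an added assumption (a simple discrete-time reading of the differential equation), not from the lecture.

```python
# Deterministic Boolean node: NorthAmerican <=> Canadian v US v Mexican.
def north_american(canadian, us, mexican):
    return canadian or us or mexican

# Deterministic continuous relationship, discretized with step dt:
# level(t + dt) = level(t) + dt * (inflow + precipitation - outflow - evaporation)
def lake_level_next(level, inflow, precipitation, outflow, evaporation, dt=1.0):
    return level + dt * (inflow + precipitation - outflow - evaporation)

print(north_american(False, True, False))        # True
print(lake_level_next(10.0, 2.0, 1.0, 1.5, 0.5))  # 11.0
```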
Compact conditional distributions contd.
Noisy-OR distributions model multiple noninteracting causes:
1) Parents U1, . . . , UL include all causes (can add a leak node)
2) Independent failure probability qℓ for each cause alone

⇒ P(X | U1 . . . UM, ¬UM+1 . . . ¬UL) = 1 − ∏_{ℓ=1}^{M} qℓ

Cold  Flu  Malaria | P(Fever) | P(¬Fever)
 F     F     F     |  0.0     |  1.0
 F     F     T     |  0.9     |  0.1
 F     T     F     |  0.8     |  0.2
 F     T     T     |  0.98    |  0.02 = 0.2 × 0.1
 T     F     F     |  0.4     |  0.6
 T     F     T     |  0.94    |  0.06 = 0.6 × 0.1
 T     T     F     |  0.88    |  0.12 = 0.6 × 0.2
 T     T     T     |  0.988   |  0.012 = 0.6 × 0.2 × 0.1
Number of parameters linear in number of parents
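A sketch of the noisy-OR rule, using the failure probabilities implied by the Fever table (q_cold = 0.6, q_flu = 0.2, q_malaria = 0.1):

```python
# Noisy-OR: the effect is absent only if every active cause fails independently.
def noisy_or(q_active):
    """P(X = true) given the failure probabilities of the active causes."""
    p_all_fail = 1.0
    for q in q_active:
        p_all_fail *= q
    return 1.0 - p_all_fail

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}
print(noisy_or([q["flu"], q["malaria"]]))             # ≈ 0.98
print(noisy_or([q["cold"], q["flu"], q["malaria"]]))  # ≈ 0.988
```

Each table row is generated from just the three q values, which is why the parameter count is linear rather than exponential in the number of parents.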
Hybrid (discrete+continuous) networks
Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost)

[Network: Subsidy? → Cost ← Harvest; Cost → Buys?]

Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families

1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to the discrete parents
Most common is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t^2)(c)
= (1 / (σ_t √(2π))) exp(−(1/2) ((c − (a_t h + b_t)) / σ_t)^2)
Mean Cost varies linearly with Harvest, variance is fixed
Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
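A minimal sketch of evaluating this density; the parameter values a, b, and sigma below are illustrative, not from the lecture.

```python
import math

# Linear Gaussian: P(Cost = c | Harvest = h) = N(a*h + b, sigma^2)(c).
# a, b, sigma are assumed values for illustration.
def linear_gaussian_pdf(c, h, a=-0.5, b=5.0, sigma=1.0):
    mean = a * h + b          # mean of Cost varies linearly with Harvest
    z = (c - mean) / sigma    # variance is fixed
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Density peaks where c equals the mean a*h + b:
print(linear_gaussian_pdf(4.0, 2.0))  # 1/sqrt(2*pi) ≈ 0.3989
```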
Continuous child variables
[Plot: density P(c | h, subsidy) as a function of Cost c and Harvest h]
All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian

Discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a “soft” threshold:
[Plot: P(Buys? = true | Cost = c) decreasing smoothly from 1 to 0 as Cost c increases]
Probit distribution uses the integral of the Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)
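The probit can be computed from the standard library’s error function, since Φ(x) = (1 + erf(x/√2))/2; the values of µ and σ below are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys(c, mu=6.0, sigma=1.0):
    """Probit model: P(Buys? = true | Cost = c) = Phi((-c + mu) / sigma)."""
    return phi((-c + mu) / sigma)

print(p_buys(6.0))   # 0.5 exactly at the threshold c = mu
print(p_buys(10.0))  # ≈ 0 well above the threshold
```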
Why the probit?
1. It’s sort of the right shape
2. Can view as hard threshold whose location is subject to noise
[Network: Cost and a Noise node jointly determine Buys? via a hard threshold]
Discrete variable contd.
Sigmoid (or logit) distribution, also used in neural networks:

P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
Sigmoid has similar shape to probit but much longer tails:
[Plot: P(buys | c) vs. Cost c for the logit and probit forms]
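A small comparison of the two forms under the same (illustrative) µ and σ, showing the logit’s longer tails:

```python
import math

def probit(c, mu=6.0, sigma=1.0):
    """Probit: Phi((-c + mu) / sigma) via the error function."""
    x = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit(c, mu=6.0, sigma=1.0):
    """Sigmoid form from the slide: 1 / (1 + exp(-2(-c + mu)/sigma))."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

for c in [6.0, 8.0, 10.0]:
    print(c, probit(c), logit(c))
# Both equal 0.5 at c = mu, but at c = 10 the probit is ≈ 3e-5
# while the logit is ≈ 3e-4: the logit decays much more slowly.
```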
Summary (representation)
Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution
⇒ fast learning from few examples
Generally easy for (non)experts to construct
Canonical distributions (e.g., noisy-OR, linear Gaussian)
⇒ compact representation of CPTs
⇒ faster learning from fewer examples