Bayesian networks: Modeling
CS194-10 Fall 2011 Lecture 21
CS194-10 Fall 2011 Lecture 21 1
Outline
♦ Overview of Bayes nets
♦ Syntax and semantics
♦ Examples
♦ Compact conditional distributions
Learning with complex probability models
Learning cannot succeed without imposing some prior structure on the hypothesis space (by constraint or by preference)
Generative models P (X | θ) support MLE, MAP, and Bayesian learning for domains that (approximately) reflect the assumptions underlying the model:
♦ Naive Bayes: conditional independence of attributes given class value
♦ Mixture models: domain has a flat, discrete category structure
♦ All i.i.d. models: model doesn’t change over time
♦ Etc.
Would like to express arbitrarily complex and flexible prior knowledge:
♦ Some attributes depend on others
♦ Categories have hierarchical structure; objects may be mixtures of several categories
♦ Observations at time t may depend on earlier observations
♦ Etc.
Bayesian networks
A simple, graphical notation for conditional independence assertions among a predefined set of random variables Xj, j = 1, . . . , D, and hence for compact specification of arbitrary joint distributions
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ “directly influences”)
– a set of parameters for each node given its parents: θ(Xj | Parents(Xj))

In the simplest case, the parameters consist of a conditional probability table (CPT) giving the distribution over Xj for each combination of parent values
Example
Topology of network encodes conditional independence assertions:
[Network: Cavity → Toothache, Cavity → Catch; Weather is a separate, unconnected node]
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example contd.
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = .001        P(E) = .002

B  E  | P(A|B,E)
T  T  |   .95
T  F  |   .94
F  T  |   .29
F  F  |   .001

A  | P(J|A)        A  | P(M|A)
T  |  .90          T  |  .70
F  |  .05          F  |  .01
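As a concrete sketch (not from the lecture), a CPT can be stored as a plain dictionary keyed by tuples of parent values; the numbers below are the Alarm CPT from this slide.

```python
# Alarm CPT: key = (Burglary, Earthquake), value = P(Alarm = true | parents).
alarm_cpt = {
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}

def p_alarm(a, b, e):
    """P(Alarm = a | Burglary = b, Earthquake = e) looked up from the CPT."""
    p_true = alarm_cpt[(b, e)]
    return p_true if a else 1.0 - p_true

print(p_alarm(True, True, False))    # 0.94
print(p_alarm(False, False, False))  # ≈ 0.999
```

Only P(X = true) is stored per row; P(X = false) is recovered as 1 − p, matching the compactness argument on the next slide.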
Compactness
A CPT for Boolean Xj with L Boolean parents has 2^L rows for the combinations of parent values

Each row requires one parameter p for Xj = true (the parameter for Xj = false is just 1 − p)

If each variable has no more than L parents, the complete network requires O(D · 2^L) parameters

I.e., grows linearly with D, vs. O(2^D) for the full joint distribution
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 parameters (vs. 2^5 − 1 = 31)
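The count for the burglary net can be reproduced in a few lines; the parent map below encodes its topology.

```python
# Parent sets for the burglary network (all variables Boolean).
parents = {
    "B": [], "E": [],
    "A": ["B", "E"],
    "J": ["A"], "M": ["A"],
}

def num_parameters(parents):
    """One parameter per row of each CPT: 2^(number of parents) per node."""
    return sum(2 ** len(ps) for ps in parents.values())

print(num_parameters(parents))  # 1 + 1 + 4 + 2 + 2 = 10
print(2 ** len(parents) - 1)    # full joint distribution: 31
```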
Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x1, . . . , xD) = ∏_{j=1}^{D} θ(xj | parents(Xj))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= θ(j|a) θ(m|a) θ(a|¬b,¬e) θ(¬b) θ(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063

Theorem: θ(Xj | Parents(Xj)) = P(Xj | Parents(Xj))
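The product above can be checked numerically; a short sketch reproducing the slide’s calculation with the burglary network’s parameters:

```python
# P(j, m, a, ¬b, ¬e) as a product of local conditional probabilities.
theta_j_given_a = 0.90
theta_m_given_a = 0.70
theta_a_given_not_b_not_e = 0.001
theta_not_b = 1 - 0.001
theta_not_e = 1 - 0.002

p = (theta_j_given_a * theta_m_given_a
     * theta_a_given_not_b_not_e * theta_not_b * theta_not_e)
print(p)  # ≈ 0.00063
```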
Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents

[Figure: node X with parents U1, . . . , Um, children Y1, . . . , Yn, and other nondescendant nodes Z1j, . . . , Znj]
Theorem: Local semantics ⇔ global semantics
Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children’s parents

[Figure: the Markov blanket of X comprises its parents U1, . . . , Um, its children Y1, . . . , Yn, and the children’s other parents Z1j, . . . , Znj]
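The blanket definition above is easy to compute from a parent map; a sketch using the burglary network’s variable names:

```python
# Parent sets for the burglary network.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def markov_blanket(x, parents):
    """Parents + children + children's other parents of node x."""
    children = [c for c, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(x)  # x is not in its own blanket
    return blanket

print(sorted(markov_blanket("A", parents)))  # ['B', 'E', 'J', 'M']
print(sorted(markov_blanket("B", parents)))  # ['A', 'E']
```

Note that E enters B’s blanket only as a co-parent of A: given the alarm, burglary and earthquake are dependent (“explaining away”).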
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics
1. Choose an ordering of variables X1, . . . , XD
2. For j = 1 to D:
   add Xj to the network
   select parents from X1, . . . , Xj−1 such that
   P(Xj | Parents(Xj)) = P(Xj | X1, . . . , Xj−1)
   i.e., Xj is conditionally independent of the other preceding variables given its parents
This choice of parents guarantees the global semantics:
P(X1, . . . , XD) = ∏_{j=1}^{D} P(Xj | X1, . . . , Xj−1)   (chain rule)
                 = ∏_{j=1}^{D} P(Xj | Parents(Xj))   (by construction)
Example: Car diagnosis
Initial evidence: car won’t start
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters
[Network over: battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, battery meter, dipstick, car won’t start]
Example: Car insurance
[Network over: Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, SeniorTrain, Mileage, VehicleYear, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, Accident, Cushioning, Ruggedness, MedicalCost, LiabilityCost, PropertyCost, OtherCost, OwnCost]
Compact conditional distributions
CPT grows exponentially with the number of parents
CPT becomes infinite with a continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions:
NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:
∂LakeLevel/∂t = inflow + precipitation − outflow − evaporation
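The two deterministic examples above can be written directly as functions of the parents; the step size dt in the lake-level version is an added assumption (a simple discrete-time reading of the differential equation), not from the lecture.

```python
# Deterministic Boolean node: NorthAmerican <=> Canadian v US v Mexican.
def north_american(canadian, us, mexican):
    return canadian or us or mexican

# Deterministic continuous relationship, discretized with step dt:
# level(t + dt) = level(t) + dt * (inflow + precipitation - outflow - evaporation)
def lake_level_next(level, inflow, precipitation, outflow, evaporation, dt=1.0):
    return level + dt * (inflow + precipitation - outflow - evaporation)

print(north_american(False, True, False))        # True
print(lake_level_next(10.0, 2.0, 1.0, 1.5, 0.5))  # 11.0
```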
Compact conditional distributions contd.
Noisy-OR distributions model multiple noninteracting causes:
1) Parents U1, . . . , UL include all causes (can add a leak node)
2) Independent failure probability qℓ for each cause alone

⇒ P(X | U1 . . . UM, ¬UM+1 . . . ¬UL) = 1 − ∏_{ℓ=1}^{M} qℓ

Cold  Flu  Malaria | P(Fever) | P(¬Fever)
 F     F     F     |  0.0     |  1.0
 F     F     T     |  0.9     |  0.1
 F     T     F     |  0.8     |  0.2
 F     T     T     |  0.98    |  0.02 = 0.2 × 0.1
 T     F     F     |  0.4     |  0.6
 T     F     T     |  0.94    |  0.06 = 0.6 × 0.1
 T     T     F     |  0.88    |  0.12 = 0.6 × 0.2
 T     T     T     |  0.988   |  0.012 = 0.6 × 0.2 × 0.1
Number of parameters linear in number of parents
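A sketch of the noisy-OR rule, using the failure probabilities implied by the Fever table (q_cold = 0.6, q_flu = 0.2, q_malaria = 0.1):

```python
# Noisy-OR: the effect is absent only if every active cause fails independently.
def noisy_or(q_active):
    """P(X = true) given the failure probabilities of the active causes."""
    p_all_fail = 1.0
    for q in q_active:
        p_all_fail *= q
    return 1.0 - p_all_fail

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}
print(noisy_or([q["flu"], q["malaria"]]))             # ≈ 0.98
print(noisy_or([q["cold"], q["flu"], q["malaria"]]))  # ≈ 0.988
```

Each table row is generated from just the three q values, which is why the parameter count is linear rather than exponential in the number of parents.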
Hybrid (discrete+continuous) networks
Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost)

[Network: Subsidy? → Cost ← Harvest; Cost → Buys?]

Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families

1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to the discrete parents
Most common is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t^2)(c)
= (1 / (σ_t √(2π))) exp(−(1/2) ((c − (a_t h + b_t)) / σ_t)^2)
Mean Cost varies linearly with Harvest, variance is fixed
Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
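A minimal sketch of evaluating this density; the parameter values a, b, and sigma below are illustrative, not from the lecture.

```python
import math

# Linear Gaussian: P(Cost = c | Harvest = h) = N(a*h + b, sigma^2)(c).
# a, b, sigma are assumed values for illustration.
def linear_gaussian_pdf(c, h, a=-0.5, b=5.0, sigma=1.0):
    mean = a * h + b          # mean of Cost varies linearly with Harvest
    z = (c - mean) / sigma    # variance is fixed
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Density peaks where c equals the mean a*h + b:
print(linear_gaussian_pdf(4.0, 2.0))  # 1/sqrt(2*pi) ≈ 0.3989
```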
Continuous child variables
[Plot: density P(c | h, subsidy) as a function of Cost c and Harvest h]
All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian

Discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a “soft” threshold:
[Plot: P(Buys? = true | Cost = c) decreasing smoothly from 1 to 0 as Cost c increases]
Probit distribution uses the integral of the Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)
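The probit can be computed from the standard library’s error function, since Φ(x) = (1 + erf(x/√2))/2; the values of µ and σ below are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys(c, mu=6.0, sigma=1.0):
    """Probit model: P(Buys? = true | Cost = c) = Phi((-c + mu) / sigma)."""
    return phi((-c + mu) / sigma)

print(p_buys(6.0))   # 0.5 exactly at the threshold c = mu
print(p_buys(10.0))  # ≈ 0 well above the threshold
```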
Why the probit?
1. It’s sort of the right shape
2. Can view as hard threshold whose location is subject to noise
[Network: Cost and a Noise node jointly determine Buys? via a hard threshold]
Discrete variable contd.
Sigmoid (or logit) distribution, also used in neural networks:

P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
Sigmoid has similar shape to probit but much longer tails:
[Plot: P(buys | c) vs. Cost c for the logit and probit forms]
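A small comparison of the two forms under the same (illustrative) µ and σ, showing the logit’s longer tails:

```python
import math

def probit(c, mu=6.0, sigma=1.0):
    """Probit: Phi((-c + mu) / sigma) via the error function."""
    x = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit(c, mu=6.0, sigma=1.0):
    """Sigmoid form from the slide: 1 / (1 + exp(-2(-c + mu)/sigma))."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

for c in [6.0, 8.0, 10.0]:
    print(c, probit(c), logit(c))
# Both equal 0.5 at c = mu, but at c = 10 the probit is ≈ 3e-5
# while the logit is ≈ 3e-4: the logit decays much more slowly.
```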
Summary (representation)
Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution
⇒ fast learning from few examples
Generally easy for (non)experts to construct
Canonical distributions (e.g., noisy-OR, linear Gaussian)
⇒ compact representation of CPTs
⇒ faster learning from fewer examples