
1

CMSC 671 Fall 2010

Class #18/19 – Wednesday, November 3 / Monday, November 8

Some material borrowed with permission from Lise Getoor

2

Next two classes

• Probability theory (quick review!)
• Bayesian networks
  – Network structure
  – Conditional probability tables
  – Conditional independence
• Bayesian inference
  – From the joint distribution
  – Using independence/factoring
  – From sources of evidence

3

Bayesian Reasoning

Chapter 13

4

Sources of uncertainty

• Uncertain inputs
  – Missing data
  – Noisy data
• Uncertain knowledge
  – Multiple causes lead to multiple effects
  – Incomplete enumeration of conditions or effects
  – Incomplete knowledge of causality in the domain
  – Probabilistic/stochastic effects
• Uncertain outputs
  – Abduction and induction are inherently uncertain
  – Default reasoning, even in deductive fashion, is uncertain
  – Incomplete deductive inference may be uncertain

Probabilistic reasoning only gives probabilistic results (summarizes uncertainty from various sources)

5

Decision making with uncertainty

• Rational behavior:
  – For each possible action, identify the possible outcomes
  – Compute the probability of each outcome
  – Compute the utility of each outcome
  – Compute the probability-weighted (expected) utility over possible outcomes for each action
  – Select the action with the highest expected utility (principle of Maximum Expected Utility); a minimal code sketch of this computation follows below
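A minimal sketch of the MEU computation on a toy decision; the action names, outcome probabilities, and utilities are illustrative assumptions, not values from the lecture:

```python
# Maximum Expected Utility: score each action by its probability-weighted
# utility over outcomes, then pick the best-scoring action.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

def best_action(actions):
    """actions: dict mapping action name -> list of (probability, utility) pairs."""
    return max(actions, key=lambda a: expected_utility(actions[a]))

if __name__ == "__main__":
    actions = {
        "take_umbrella":  [(0.3, 60), (0.7, 80)],   # rain / no rain
        "leave_umbrella": [(0.3, 0),  (0.7, 100)],
    }
    for name, outcomes in actions.items():
        print(name, expected_utility(outcomes))
    print("MEU choice:", best_action(actions))
```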

6

Why probabilities anyway?

• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms
  1. All probabilities are between 0 and 1:  0 ≤ P(a) ≤ 1
  2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:  P(true) = 1; P(false) = 0
  3. The probability of a disjunction is given by:  P(a ∨ b) = P(a) + P(b) – P(a ∧ b)

(Venn diagram of a, b, and a ∧ b)

7

Probability theory

• Random variables
  – Domain
  – Example: Alarm, Burglary, Earthquake – Boolean (like these), discrete, or continuous
• Atomic event: complete specification of state
  – Example: Alarm=True ∧ Burglary=True ∧ Earthquake=False (alarm ∧ burglary ∧ ¬earthquake)
• Prior probability: degree of belief without any other evidence
  – Example: P(Burglary) = .1
• Joint probability: matrix of combined probabilities of a set of variables
  – Example: P(Alarm, Burglary) =

                  alarm   ¬alarm
    burglary       .09      .01
    ¬burglary      .1       .8

8

Probability theory (cont.)

• Conditional probability: probability of effect given causes
  – Example: P(burglary | alarm) = .47, P(alarm | burglary) = .9
• Computing conditional probabilities:
  – P(a | b) = P(a ∧ b) / P(b), where P(b) is a normalizing constant
  – Example: P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = .09 / .19 = .47
• Product rule:
  – P(a ∧ b) = P(a | b) P(b)
  – Example: P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = .47 * .19 = .09
• Marginalizing:
  – P(B) = Σa P(B, a)
  – P(B) = Σa P(B | a) P(a) (conditioning)
  – Example: P(alarm) = P(alarm ∧ burglary) + P(alarm ∧ ¬burglary) = .09 + .1 = .19
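A minimal sketch of these rules applied to the 2×2 joint P(Alarm, Burglary) from the previous slide (plain Python, no external libraries):

```python
# Marginalization, the product rule, and conditional probability,
# computed directly from the joint table on the previous slide.

joint = {  # (alarm, burglary) -> probability
    (True, True): 0.09, (True, False): 0.10,
    (False, True): 0.01, (False, False): 0.80,
}

p_alarm = sum(p for (a, b), p in joint.items() if a)        # marginalize out Burglary
p_burglary_and_alarm = joint[(True, True)]                  # joint entry
p_burglary_given_alarm = p_burglary_and_alarm / p_alarm     # conditional probability

print(p_alarm)                 # 0.19
print(p_burglary_given_alarm)  # ~0.474
```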

9

Example: Inference from the joint

                      alarm                        ¬alarm
             earthquake   ¬earthquake     earthquake   ¬earthquake
  burglary      .01           .08            .001          .009
  ¬burglary     .01           .09            .01           .79

P(Burglary | alarm) = α P(Burglary, alarm)
  = α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)]
  = α [(.01, .01) + (.08, .09)]
  = α (.09, .1)

Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.09+.1) = 5.26
(i.e., P(alarm) = 1/α = .19 – quizlet: how can you verify this?)

P(burglary | alarm) = .09 * 5.26 = .474

P(¬burglary | alarm) = .1 * 5.26 = .526
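A minimal sketch of the same normalization computed from the full joint above (the tuple order used for keys is burglary, earthquake, alarm):

```python
# Posterior P(Burglary | alarm) by summing out Earthquake and normalizing.

joint = {  # (burglary, earthquake, alarm) -> probability
    (True,  True,  True): 0.01,   (True,  False, True): 0.08,
    (True,  True,  False): 0.001, (True,  False, False): 0.009,
    (False, True,  True): 0.01,   (False, False, True): 0.09,
    (False, True,  False): 0.01,  (False, False, False): 0.79,
}

# Unnormalized values: sum out Earthquake with Alarm fixed to True.
unnorm = {b: sum(p for (bb, e, a), p in joint.items() if a and bb == b)
          for b in (True, False)}                      # {True: 0.09, False: 0.10}

alpha = 1.0 / sum(unnorm.values())                     # 1 / P(alarm) = 1 / 0.19
posterior = {b: alpha * v for b, v in unnorm.items()}  # {True: ~0.474, False: ~0.526}
print(posterior)
```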

10

Exercise: Inference from the joint

• Queries:
  – What is the prior probability of smart?
  – What is the prior probability of study?
  – What is the conditional probability of prepared, given study and smart?
• Save these answers for next time!

p(smart ∧ study ∧ prep):

                    smart              ¬smart
               study   ¬study     study   ¬study
  prepared      .432     .16       .084     .008
  ¬prepared     .048     .16       .036     .072

11

Independence

• When two sets of propositions do not affect each others' probabilities, we call them independent, and can easily compute their joint and conditional probability:
  – Independent(A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A)
• For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake}
  – Then again, it might not: Burglars might be more likely to burglarize houses when there's a new moon (and hence little light)
  – But if we know the light level, the moon phase doesn't affect whether we are burglarized
  – Once we're burglarized, light level doesn't affect whether the alarm goes off
• We need a more complex notion of independence, and methods for reasoning about these kinds of relationships

12

Exercise: Independence

• Queries:
  – Is smart independent of study?
  – Is prepared independent of study?

p(smart ∧ study ∧ prep):

                    smart              ¬smart
               study   ¬study     study   ¬study
  prepared      .432     .16       .084     .008
  ¬prepared     .048     .16       .036     .072

13

Conditional independence

• Absolute independence:
  – A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A)
• A and B are conditionally independent given C if
  – P(A ∧ B | C) = P(A | C) P(B | C)
• This lets us decompose the joint distribution:
  – P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C)
• Moon-Phase and Burglary are conditionally independent given Light-Level
• Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution

14

Exercise: Conditional independence

• Queries:
  – Is smart conditionally independent of prepared, given study?
  – Is study conditionally independent of prepared, given smart?

p(smart ∧ study ∧ prep):

                    smart              ¬smart
               study   ¬study     study   ¬study
  prepared      .432     .16       .084     .008
  ¬prepared     .048     .16       .036     .072

15

Bayes's rule

• Bayes's rule is derived from the product rule:
  – P(Y | X) = P(X | Y) P(Y) / P(X)
• Often useful for diagnosis:
  – If X are (observed) effects and Y are (hidden) causes,
  – We may have a model for how causes lead to effects (P(X | Y))
  – We may also have prior beliefs (based on experience) about the frequency of occurrence of the causes (P(Y))
  – Which allows us to reason abductively from effects to causes (P(Y | X))

16

Bayesian inference

• In the setting of diagnostic/evidential reasoning:
  – We know the prior probability of each hypothesis, P(Hi), and the conditional probability of each piece of evidence given a hypothesis, P(Ej | Hi)
  – We want to compute the posterior probability P(Hi | Ej)
• Bayes' theorem (formula 1):
  – P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)
  where H1, …, Hn are the hypotheses and E1, …, Em are the evidence/manifestations

17

Simple Bayesian diagnostic reasoning

• Knowledge base:
  – Evidence / manifestations: E1, …, Em
  – Hypotheses / disorders: H1, …, Hn
    • Ej and Hi are binary; hypotheses are mutually exclusive (non-overlapping) and exhaustive (cover all possible cases)
  – Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m
• Cases (evidence for a particular instance): E1, …, El
• Goal: Find the hypothesis Hi with the highest posterior
  – Maxi P(Hi | E1, …, El)

18

Bayesian diagnostic reasoning II

• Bayes' rule says that
  – P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)
• Assume each piece of evidence Ej is conditionally independent of the others, given a hypothesis Hi; then:
  – P(E1, …, El | Hi) = ∏j=1..l P(Ej | Hi)
• If we only care about relative probabilities for the Hi, then we have:
  – P(Hi | E1, …, El) = α P(Hi) ∏j=1..l P(Ej | Hi)
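A minimal sketch of this naive-Bayes style computation; the hypotheses, evidence variables, and all numbers below are illustrative assumptions:

```python
# P(Hi | E1..El) = alpha * P(Hi) * prod_j P(Ej | Hi), then normalize.

priors = {"flu": 0.1, "cold": 0.3, "healthy": 0.6}           # P(Hi)
likelihood = {                                               # P(Ej = true | Hi)
    "fever": {"flu": 0.9, "cold": 0.3, "healthy": 0.05},
    "cough": {"flu": 0.7, "cold": 0.8, "healthy": 0.10},
}
observed = ["fever", "cough"]                                # E1, ..., El

unnorm = dict(priors)
for e in observed:
    for h in unnorm:
        unnorm[h] *= likelihood[e][h]

alpha = 1.0 / sum(unnorm.values())
posterior = {h: alpha * v for h, v in unnorm.items()}
print(max(posterior, key=posterior.get), posterior)
```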

19

Limitations of simple Bayesian inference

• Cannot easily handle multi-fault situations, nor cases where intermediate (hidden) causes exist:
  – Disease D causes syndrome S, which causes correlated manifestations M1 and M2
• Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent. What is the relative posterior?
  – P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2)
    = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2)
    = α ∏j=1..l P(Ej | H1 ∧ H2) P(H1) P(H2)
• How do we compute P(Ej | H1 ∧ H2)??

20

Limitations of simple Bayesian inference II

• Assume H1 and H2 are independent, given E1, …, El?
  – P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)
• This is a very unreasonable assumption
  – Earthquake and Burglar are independent, but not given Alarm:
    • P(burglar | alarm, earthquake) << P(burglar | alarm)
• Another limitation is that simple application of Bayes's rule doesn't allow us to handle causal chaining:
  – A: this year's weather; B: cotton production; C: next year's cotton price
  – A influences C indirectly: A → B → C
  – P(C | B, A) = P(C | B)
• Need a richer representation to model interacting hypotheses, conditional independence, and causal chaining
• Next time: conditional independence and Bayesian networks!

21

Bayesian Networks

Chapter 14.1-14.3

Some material borrowed from Lise Getoor

22

Bayesian Belief Networks (BNs)

• Definition: BN = (DAG, CPD)
  – DAG: directed acyclic graph (BN's structure)
    • Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)
    • Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies conditional independence)
  – CPD: conditional probability distribution (BN's parameters)
    • Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT): P(xi | πi), where πi is the set of all parent nodes of xi
  – Root nodes are a special case – no parents, so just use priors in the CPD: πi = ∅, so P(xi | πi) = P(xi)

23

Example BN

Structure: a → b, a → c, b → d, c → d, c → e

P(A) = 0.001
P(B|A) = 0.3        P(B|¬A) = 0.001
P(C|A) = 0.2        P(C|¬A) = 0.005
P(D|B,C) = 0.1      P(D|B,¬C) = 0.01
P(D|¬B,C) = 0.01    P(D|¬B,¬C) = 0.00001
P(E|C) = 0.4        P(E|¬C) = 0.002

Note that we only specify P(A) etc., not P(¬A), since they have to add to one
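A minimal sketch of how this network's parameters could be stored in code; the representation (a dict of CPTs keyed by parent-value tuples) is one illustrative choice, not anything prescribed by the lecture:

```python
# Each CPT maps an assignment of the node's parents to P(node = True | parents).

parents = {"a": (), "b": ("a",), "c": ("a",), "d": ("b", "c"), "e": ("c",)}

cpt = {
    "a": {(): 0.001},
    "b": {(True,): 0.3, (False,): 0.001},
    "c": {(True,): 0.2, (False,): 0.005},
    "d": {(True, True): 0.1, (True, False): 0.01,
          (False, True): 0.01, (False, False): 0.00001},
    "e": {(True,): 0.4, (False,): 0.002},
}

def prob(node, value, assignment):
    """P(node = value | parent values taken from `assignment`)."""
    p_true = cpt[node][tuple(assignment[p] for p in parents[node])]
    return p_true if value else 1.0 - p_true
```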

24

Conditional independence and chaining

• Conditional independence assumption:
  – P(xi | πi, q) = P(xi | πi), where πi is the set of parents of xi and q is any set of variables (nodes) other than xi and its successors
  – πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)
  – With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) local CPDs by chaining these CPDs:
    P(x1, …, xn) = ∏i=1..n P(xi | πi)

25

Chaining: Example

Computing the joint probability for all variables is easy (network as before: a → b, a → c; b, c → d; c → e):

P(a, b, c, d, e)
  = P(e | a, b, c, d) P(a, b, c, d)    by the product rule
  = P(e | c) P(a, b, c, d)             by the cond. indep. assumption
  = P(e | c) P(d | a, b, c) P(a, b, c)
  = P(e | c) P(d | b, c) P(c | a, b) P(a, b)
  = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
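A minimal numeric sketch of this chaining for the fully true assignment, using the CPT values from the Example BN slide:

```python
# P(a, b, c, d, e) = P(e|c) P(d|b,c) P(c|a) P(b|a) P(a), with all variables True.

p_a = 0.001          # P(a)
p_b_given_a = 0.3    # P(b | a)
p_c_given_a = 0.2    # P(c | a)
p_d_given_bc = 0.1   # P(d | b, c)
p_e_given_c = 0.4    # P(e | c)

p_joint = p_e_given_c * p_d_given_bc * p_c_given_a * p_b_given_a * p_a
print(p_joint)  # 2.4e-06
```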


26

Topological semantics

• A node is conditionally independent of its non-descendants given its parents

• A node is conditionally independent of all other nodes in the network given its parents, children, and children’s parents (also known as its Markov blanket)

• The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z

28

Inference in Bayesian Networks

Chapter 14.4-14.5

Some material borrowed from Lise Getoor

29

Inference tasks

• Simple queries: Compute the posterior marginal P(Xi | E=e)
  – E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
• Conjunctive queries:
  – P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
• Optimal decisions: Decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most critical?
• Explanation: Why do I need a new starter motor?

30

Approaches to inference

• Exact inference
  – Enumeration
  – Belief propagation in polytrees
  – Variable elimination
  – Clustering / join tree algorithms
• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Genetic algorithms
  – Neural networks
  – Simulated annealing
  – Mean field theory

31

Direct inference with BNs

• Instead of computing the joint, suppose we just want the probability for one variable
• Exact methods of computation:
  – Enumeration
  – Variable elimination
  – Join trees: get the probabilities associated with every query variable

32

Inference by enumeration

• Add all of the terms (atomic event probabilities) from the full joint distribution
• If E are the evidence (observed) variables and Y are the other (unobserved) variables, then:
  P(X | e) = α P(X, e) = α Σy P(X, e, y)
• Each P(X, e, y) term can be computed using the chain rule
• Computationally expensive!

33

Example: Enumeration

(Network as before: a → b, a → c; b, c → d; c → e)

• P(xi) = Σπi P(xi | πi) P(πi)
• Suppose we want P(D=true), and only the value of E is given as true
• P(d | e) ∝ Σa Σb Σc P(a, b, c, d, e)
           = Σa Σb Σc P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
• With simple iteration to compute this expression, there's going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C=true); a code sketch of this enumeration follows below
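A minimal sketch of this enumeration for P(D | e=true), reusing the `parents`, `cpt`, and `prob` helpers from the sketch after the Example BN slide:

```python
from itertools import product

def joint(assign):
    """Full-joint probability via the chain rule over the network."""
    result = 1.0
    for node in parents:                      # "a", "b", "c", "d", "e"
        result *= prob(node, assign[node], assign)
    return result

# Sum the joint over the hidden variables a, b, c for each value of d, with e=True.
unnorm = {}
for d in (True, False):
    unnorm[d] = sum(joint({"a": a, "b": b, "c": c, "d": d, "e": True})
                    for a, b, c in product((True, False), repeat=3))

alpha = 1.0 / sum(unnorm.values())
print({d: round(alpha * v, 4) for d, v in unnorm.items()})  # exact P(D | e=true)
```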

34

Exercise: Enumeration

Network structure: smart → prepared, study → prepared; smart → pass, prepared → pass, fair → pass

p(smart) = .8     p(study) = .6     p(fair) = .9

p(prep | …):
              smart   ¬smart
  study        .9       .7
  ¬study       .5       .1

p(pass | …):
                   smart            ¬smart
               prep   ¬prep     prep   ¬prep
  fair          .9      .7        .7     .2
  ¬fair         .1      .1        .1     .1

Query: What is the probability that a student studied, given that they pass the exam?

35

Variable elimination

• Basically just enumeration, but with caching of local calculations
• Linear for polytrees (singly connected BNs)
• Potentially exponential for multiply connected BNs
  – Exact inference in Bayesian networks is NP-hard!
• Join tree algorithms are an extension of variable elimination methods that compute posterior probabilities for all nodes in a BN simultaneously

36

Variable elimination

General idea:
• Write the query in the form
  P(X1, e) = Σxk … Σx3 Σx2 ∏i P(xi | pai)
• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product
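A minimal sketch of the core step (multiply the factors that mention a variable, then sum that variable out); the factor representation here, a `(variables, table)` pair, is an illustrative choice, not the lecture's notation:

```python
from itertools import product

def sum_out(var, factors, domain=(True, False)):
    """Eliminate `var` from a list of factors; each factor is a pair
    (tuple_of_variable_names, dict mapping value-tuples to numbers)."""
    relevant = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]

    # Variables of the new factor: everything mentioned by the relevant
    # factors, minus the variable being eliminated.
    new_vars = tuple(sorted({v for vs, _ in relevant for v in vs} - {var}))

    new_table = {}
    for values in product(domain, repeat=len(new_vars)):
        assign = dict(zip(new_vars, values))
        total = 0.0
        for x in domain:
            assign[var] = x
            term = 1.0
            for vs, table in relevant:
                term *= table[tuple(assign[v] for v in vs)]
            total += term
        new_table[values] = total
    return rest + [(new_vars, new_table)]
```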

37

Variable elimination: Example

Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass

P(w) = Σc,s,r P(w | r,s) P(r | c) P(s | c) P(c)
     = Σs,r P(w | r,s) Σc P(r | c) P(s | c) P(c)
     = Σs,r P(w | r,s) f1(r,s),   where f1(r,s) = Σc P(r | c) P(s | c) P(c)

39

A more complex example

• The "Asia" network:
  – Nodes: Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, Dyspnea
  – Structure: Visit to Asia → Tuberculosis; Smoking → Lung Cancer, Bronchitis; Tuberculosis, Lung Cancer → Abnormality in Chest; Abnormality in Chest → X-Ray; Abnormality in Chest, Bronchitis → Dyspnea

40

(Variables: V = visit to Asia, S = smoking, T = tuberculosis, L = lung cancer, B = bronchitis, A = abnormality in chest, X = x-ray, D = dyspnea)

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

41

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: v
Compute: fv(t) = Σv P(v) P(t|v)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Note: fv(t) = P(t). In general, the result of elimination is not necessarily a probability term.

42

• We want to compute P(d)
• Need to eliminate: s, x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: s
Compute: fs(b,l) = Σs P(s) P(b|s) P(l|s)
  ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
Summing over s results in a factor with two arguments, fs(b,l). In general, the result of elimination may be a function of several variables.

43

• We want to compute P(d)
• Need to eliminate: x, t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: x
Compute: fx(a) = Σx P(x|a)
  ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
Note: fx(a) = 1 for all values of a!

44

• We want to compute P(d)
• Need to eliminate: t, l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)

Eliminate: t
Compute: ft(a,l) = Σt fv(t) P(a|t,l)
  ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)

45

• We want to compute P(d)
• Need to eliminate: l, a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
  ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)

Eliminate: l
Compute: fl(a,b) = Σl fs(b,l) ft(a,l)
  ⇒ fx(a) fl(a,b) P(d|a,b)

46

• We want to compute P(d)
• Need to eliminate: a, b
• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
  ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
  ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
  ⇒ fx(a) fl(a,b) P(d|a,b)

Eliminate: a, b
Compute: fa(b,d) = Σa fx(a) fl(a,b) P(d|a,b), then fb(d) = Σb fa(b,d)
  ⇒ fa(b,d) ⇒ fb(d)

47

Dealing with evidence

• How do we deal with evidence?
• Suppose we are given evidence V = t, S = f, D = t
• We want to compute P(L, V = t, S = f, D = t)

48

Dealing with evidence

• We start by writing the initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Since we know that V = t, we don't need to eliminate V
• Instead, we can replace the factors P(V) and P(T|V) with
  fP(V) = P(V = t)   and   fP(T|V)(T) = P(T | V = t)
• These "select" the appropriate parts of the original factors given the evidence
• Note that fP(V) is a constant, and thus does not appear in elimination of other variables

49

Dealing with evidence

• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)

50

Dealing with evidence

• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)

51

Dealing with evidence

• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)

52

Dealing with evidence

• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)
• Eliminating a, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) fa(b,l)

53

Dealing with evidence

• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
  fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)
• Eliminating a, we get
  fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) fa(b,l)
• Eliminating b, we get
  fP(v) fP(s) fP(l|s)(l) fb(l)

54

Variable elimination algorithm

• Let X1, …, Xm be an ordering on the non-query variables:
  ΣX1 ΣX2 … ΣXm ∏j P(Xj | Parents(Xj))
• For i = m, …, 1
  – Leave in the summation for Xi only factors mentioning Xi
  – Multiply the factors, getting a factor that contains a number for each value of the variables mentioned, including Xi
  – Sum out Xi, getting a factor f that contains a number for each value of the variables mentioned, not including Xi
  – Replace the multiplied factor in the summation

55

Complexity of variable elimination

Suppose in one elimination step we compute
  fx(y1, …, yk) = Σx f'x(x, y1, …, yk)
where
  f'x(x, y1, …, yk) = ∏i=1..m fi(x, yi,1, …, yi,li)

This requires
  m · |Val(X)| · ∏i |Val(Yi)| multiplications
  (for each value of x, y1, …, yk, we do m multiplications), and
  |Val(X)| · ∏i |Val(Yi)| additions
  (for each value of y1, …, yk, we do |Val(X)| additions)

► Complexity is exponential in the number of variables in the intermediate factors
► Finding an optimal ordering is NP-hard

56

Exercise: Variable elimination

Network structure: smart → prepared, study → prepared; smart → pass, prepared → pass, fair → pass

p(smart) = .8     p(study) = .6     p(fair) = .9

p(prep | …):
              smart   ¬smart
  study        .9       .7
  ¬study       .5       .1

p(pass | …):
                   smart            ¬smart
               prep   ¬prep     prep   ¬prep
  fair          .9      .7        .7     .2
  ¬fair         .1      .1        .1     .1

Query: What is the probability that a student is smart, given that they pass the exam?

57

Conditioning

• Conditioning: Find the network's smallest cutset S (a set of nodes whose removal renders the network singly connected)
  – In the earlier example network (a → b, a → c; b, c → d; c → e), S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with your favorite inference algorithm
• Combine the results from all instantiations of S
• Computationally expensive (finding the smallest cutset is in general NP-hard, and the total number of possible instantiations of S is O(2^|S|))

58

Approximate inference: Direct sampling

• Suppose you are given values for some subset of the variables, E, and want to infer values for unknown variables, Z
• Randomly generate a very large number of instantiations from the BN
  – Generate instantiations for all variables: start at root variables and work your way "forward" in topological order
• Rejection sampling: Only keep those instantiations that are consistent with the values for E
• Use the frequency of values for Z to get estimated probabilities
• Accuracy of the results depends on the size of the sample (asymptotically approaches exact results); a code sketch follows below
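A minimal sketch of direct sampling with rejection on the earlier a–e example network, estimating P(D | e=true) from its CPT values:

```python
import random

parents = {"a": (), "b": ("a",), "c": ("a",), "d": ("b", "c"), "e": ("c",)}
cpt_true = {  # P(node = True | parent assignment)
    "a": {(): 0.001},
    "b": {(True,): 0.3, (False,): 0.001},
    "c": {(True,): 0.2, (False,): 0.005},
    "d": {(True, True): 0.1, (True, False): 0.01,
          (False, True): 0.01, (False, False): 0.00001},
    "e": {(True,): 0.4, (False,): 0.002},
}
order = ["a", "b", "c", "d", "e"]          # a topological order

def sample_once():
    sample = {}
    for node in order:                     # sample "forward" from the roots
        p = cpt_true[node][tuple(sample[x] for x in parents[node])]
        sample[node] = random.random() < p
    return sample

kept = d_true = 0
for _ in range(200_000):
    s = sample_once()
    if s["e"]:                             # rejection: keep only samples with e = true
        kept += 1
        d_true += s["d"]
print(d_true / kept if kept else None)     # estimate of P(d | e=true)
```

Because P(e) is small in this network, most samples are rejected; that waste is the motivation for likelihood weighting on the next slide.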

59

Exercise: Direct sampling

Network structure: smart → prepared, study → prepared; smart → pass, prepared → pass, fair → pass

p(smart) = .8     p(study) = .6     p(fair) = .9

p(prep | …):
              smart   ¬smart
  study        .9       .7
  ¬study       .5       .1

p(pass | …):
                   smart            ¬smart
               prep   ¬prep     prep   ¬prep
  fair          .9      .7        .7     .2
  ¬fair         .1      .1        .1     .1

Topological order = …?
Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42

60

Likelihood weighting

• Idea: Don't generate samples that need to be rejected in the first place!
• Sample only from the unknown variables Z
• Weight each sample according to the likelihood that it would occur, given the evidence E
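A minimal sketch of likelihood weighting for the same estimate, P(D | e=true); it reuses the `parents`, `cpt_true`, and `order` definitions from the rejection-sampling sketch above:

```python
import random

def weighted_sample(evidence):
    sample, weight = dict(evidence), 1.0
    for node in order:
        p = cpt_true[node][tuple(sample[x] for x in parents[node])]
        if node in evidence:
            # Evidence variables are held fixed; multiply their likelihood into the weight.
            weight *= p if evidence[node] else 1.0 - p
        else:
            sample[node] = random.random() < p
    return sample, weight

num = den = 0.0
for _ in range(100_000):
    s, w = weighted_sample({"e": True})
    den += w
    if s["d"]:
        num += w
print(num / den)   # weighted estimate of P(d | e=true); no samples are rejected
```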

61

Markov chain Monte Carlo algorithm

• So called because
  – Markov chain: each instance generated in the sample is dependent on the previous instance
  – Monte Carlo: statistical sampling method
• Perform a random walk through variable assignment space, collecting statistics as you go
  – Start with a random instantiation, consistent with evidence variables
  – At each step, for some nonevidence variable, randomly sample its value, consistent with the other current assignments
• Given enough samples, MCMC gives an accurate estimate of the true distribution of values
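A minimal sketch of one Gibbs-style MCMC resampling step on the same a–e network (sampling a nonevidence variable from its distribution given its Markov blanket); it reuses `parents` and `cpt_true` from the direct-sampling sketch, and ignores burn-in for simplicity:

```python
import random

children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}

def p_node(node, value, assign):
    p_true = cpt_true[node][tuple(assign[x] for x in parents[node])]
    return p_true if value else 1.0 - p_true

def resample(node, assign):
    """Sample `node` from P(node | Markov blanket), updating `assign` in place."""
    weights = {}
    for v in (True, False):
        assign[node] = v
        w = p_node(node, v, assign)
        for child in children[node]:
            w *= p_node(child, assign[child], assign)
        weights[v] = w
    assign[node] = random.random() < weights[True] / (weights[True] + weights[False])

# Estimate P(d | e=true): fix the evidence, random-walk over the rest.
state = {"a": False, "b": False, "c": False, "d": False, "e": True}
count, steps = 0, 100_000
for _ in range(steps):
    for node in ("a", "b", "c", "d"):      # nonevidence variables
        resample(node, state)
    count += state["d"]
print(count / steps)
```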

62

Exercise: MCMC sampling

Network structure: smart → prepared, study → prepared; smart → pass, prepared → pass, fair → pass

p(smart) = .8     p(study) = .6     p(fair) = .9

p(prep | …):
              smart   ¬smart
  study        .9       .7
  ¬study       .5       .1

p(pass | …):
                   smart            ¬smart
               prep   ¬prep     prep   ¬prep
  fair          .9      .7        .7     .2
  ¬fair         .1      .1        .1     .1

Topological order = …?
Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42

63

Summary

• Bayes nets
  – Structure
  – Parameters
  – Conditional independence
  – Chaining
• BN inference
  – Enumeration
  – Variable elimination
  – Sampling methods