
Chapter 14

Probabilistic Reasoning


Outline

• Representing Knowledge in an Uncertain Domain
• The Semantics of Bayesian Networks
• Efficient Representation of Conditional Distributions
• Exact Inference in Bayesian Networks
• Approximate Inference in Bayesian Networks
• Extending Probability to First-Order Representations
• Other Approaches to Uncertain Reasoning


14-1 Representing Knowledge in an Uncertain Domain

• Full joint probability distribution

– can answer any question about the domain, but can become intractably large as the number of variables grows.

– Furthermore, specifying probabilities for atomic events is rather unnatural and can be very difficult unless a large amount of data is available from which to gather statistical estimates.

• We also saw that independence and conditional independence relationships among variables can greatly reduce the number of probabilities that need to be specified in order to define the full joint distribution.

• This section introduces a data structure called a Bayesian network to represent the dependencies among variables and to give a concise specification of any full joint probability distribution.


Definition

A Bayesian network is a directed acyclic graph (DAG) which consists of:

• A set of random variables which makes up the nodes of the network.

• A set of directed links (arrows) connecting pairs of nodes. If there is an arrow from node X to node Y, X is said to be a parent of Y.

• Each node Xi has a conditional probability distribution P(Xi|Parents(Xi)) that quantifies the effect of the parents on the node.

Intuitions:
• A Bayesian network models our incomplete understanding of the causal relationships in an application domain.
• A node represents some state of affairs or event.
• A link from X to Y means that X has a direct influence on Y.


What are Bayesian Networks?

• Graphical notation for conditional independence assertions

• Compact specification of full joint distributions

• What do they look like?
  – Set of nodes, one per variable
  – Directed, acyclic graph
  – Conditional distribution for each node given its parents: P(Xi | Parents(Xi))


Example (Fig. 14.1)

• Weather is independent of the other variables

• Toothache and Catch are conditionally independent given Cavity

[Figure 14.1: Weather stands alone; Cavity (the cause) points to its effects Toothache and Catch.]


Bayesian Network

Notice that Cavity is the "cause" of both Toothache and PCatch, and the links represent this causality explicitly.
Give the prior probability distribution of Cavity.
Give the conditional probability tables of Toothache and PCatch.

[Network: Cavity → Toothache, Cavity → PCatch]

P(Cavity) = 0.2

P(Toothache | Cavity) = 0.6    P(Toothache | ¬Cavity) = 0.1
P(PCatch | Cavity) = 0.9       P(PCatch | ¬Cavity) = 0.02

5 probabilities, instead of 7

P(c ∧ t ∧ pc) = P(t ∧ pc | c) P(c) = P(t|c) P(pc|c) P(c)
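As a minimal sketch (not part of the original slides), the following Python snippet encodes these three CPTs and computes a joint entry by multiplying them, exactly as the factorization above prescribes:

```python
# Minimal sketch of the Cavity network's CPTs (numbers from the slide).
P_cavity = 0.2                            # P(Cavity = true)
P_toothache = {True: 0.6, False: 0.1}     # P(Toothache = true | Cavity)
P_pcatch = {True: 0.9, False: 0.02}       # P(PCatch = true | Cavity)

def joint(cavity, toothache, pcatch):
    """P(Cavity, Toothache, PCatch) = P(t|c) P(pc|c) P(c)."""
    p = P_cavity if cavity else 1.0 - P_cavity
    p *= P_toothache[cavity] if toothache else 1.0 - P_toothache[cavity]
    p *= P_pcatch[cavity] if pcatch else 1.0 - P_pcatch[cavity]
    return p

print(joint(True, True, True))  # P(c ∧ t ∧ pc) = 0.2 * 0.6 * 0.9 = 0.108
```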


Another Example

Sample Domain:
• You have a burglar alarm installed in your home. It is fairly reliable at detecting a burglary, but it also responds on occasion to minor earthquakes.

• You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm.

• John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too.

• Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether.


Example

• I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

• What are the variables?
  – Burglary
  – Earthquake
  – Alarm
  – JohnCalls
  – MaryCalls


Another Example (continued)

• Network topology reflects causal knowledge:
  – A burglar can set the alarm off
  – An earthquake can set the alarm off
  – The alarm can cause Mary to call
  – The alarm can cause John to call

• Assumptions:
  – they do not perceive any burglaries directly,
  – they do not notice the minor earthquakes, and
  – they do not confer before calling.


Another example (Fig.14.2)

[Figure 14.2: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls]


Conditional Probability Table (CPT)

• Each distribution is shown as a conditional probability table, or CPT.

• CPTs can be used for discrete variables; other representations exist, including ones suitable for continuous variables.

• Each row in a CPT contains the conditional probability of each node value for a conditioning case.

• A conditioning case is just a possible combination of values for the parent nodes—a miniature atomic event, if you like.

• Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.

• For Boolean variables, once you know that the probability of a true value is p, the probability of false must be 1 – p, so we often omit the second number, as in Figure 14.2.

• In general, a table for a Boolean variable with k Boolean parents contains 2^k independently specifiable probabilities.

• A node with no parents has only one row, representing the prior probabilities of each possible value of the variable.


Another example (Fig.14.2)

B E P(A|B,E)

T T .95

T F .94

F T .29

F F .001

P(B)

.001

A P(M|A)

T .70

F .01

A P(J|A)

T .90

F .05

P(E)

.002

[Figure 14.2: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls]


Compactness

• Conditional Probability Table (CPT): distribution over Xi for each combination of parent values

• Each row requires one number p for Xi = true (since the false case is just 1 − p)

• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values

• Full network requires O(n · 2^k) numbers (instead of O(2^n))

B E P(A|B,E)

T T .95

T F .94

F T .29

F F .001

[Figure: the burglary network, B, E → A → J, M]
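To make the counting concrete, a quick sketch with illustrative sizes (the values of n and k below are hypothetical, not from the slides):

```python
# Hypothetical sizes: n Boolean variables, each with at most k Boolean parents.
n, k = 30, 5
bn_numbers = n * 2**k      # one number per CPT row: 30 * 32 = 960
full_joint = 2**n - 1      # independently specifiable joint entries: ~1.07e9
print(bn_numbers, full_joint)
```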


14-2 Semantics of Bayesian Networks

• Global: representing the full joint distribution
  – helpful in understanding how to construct networks

• Local: representing conditional independence
  – helpful in designing inference procedures


Global Semantics

• Global semantics defines the full joint distribution as the product of the local conditional distributions:

  P(X1 = x1, …, Xn = xn) = P(x1, …, xn) = ∏i=1..n P(xi | parents(Xi))

  – where parents(Xi) denotes the specific values of the variables in Parents(Xi).
  – Thus, each entry in the joint distribution is represented by the product of the appropriate elements of the conditional probability tables (CPTs) in the Bayesian network.
  – The CPTs therefore provide a decomposed representation of the joint distribution.

• Example: What is P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)?
  = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 = 0.00062
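The arithmetic of the example is just a product of five CPT entries; a one-line check in Python:

```python
# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = 0.90 * 0.70 * 0.001 * 0.999 * 0.998
print(round(p, 5))  # 0.00062
```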


But does a BN represent a belief state?

In other words, can we compute the full joint distribution of the propositions from it?


Calculation of Joint Probability

B E P(A|B,E)
T T 0.95
T F 0.94
F T 0.29
F F 0.001

P(B) = 0.001    P(E) = 0.002

A P(J|A)        A P(M|A)
T 0.90          T 0.70
F 0.05          F 0.01

[Network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls]

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = ?


P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J ∧ M | A, ¬B, ¬E) P(A ∧ ¬B ∧ ¬E)
= P(J | A, ¬B, ¬E) P(M | A, ¬B, ¬E) P(A ∧ ¬B ∧ ¬E)   (J and M are independent given A)

P(J | A, ¬B, ¬E) = P(J|A)   (J and ¬B ∧ ¬E are independent given A)
P(M | A, ¬B, ¬E) = P(M|A)

P(A ∧ ¬B ∧ ¬E) = P(A | ¬B, ¬E) P(¬B | ¬E) P(¬E)
             = P(A | ¬B, ¬E) P(¬B) P(¬E)   (B and E are independent)

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)


Calculation of Joint Probability

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
= 0.00062


In general:

P(x1 ∧ x2 ∧ … ∧ xn) = ∏i=1..n P(xi | parents(Xi))

This product gives every entry of the full joint distribution table.


Since a BN defines the full joint distribution of a set of propositions, it represents a belief state.


Chain Rule


Constructing Bayesian Networks

• We need to choose parents for each node such that this property holds. Intuitively, the parents of node Xi should contain all those nodes in X1, …, Xi−1 that directly influence Xi.

• For example, suppose we have completed the network in Figure 14.2 except for the choice of parents for MaryCalls. MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly. Intuitively, our knowledge of the domain tells us that these events influence Mary's calling behavior only through their effect on the alarm.

• Also, given the state of the alarm, whether John calls has no influence on Mary's calling. Formally speaking, we believe that the following conditional independence statement holds:

• P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)


Constructing Bayesian Networks (cont.)

• Direct influencers should be added to the network first

• The correct order in which to add nodes is to add “root causes” first, then the variables they influence, and so on.

• We need to choose parents for each node such that this property holds. Intuitively, the parents of node Xi should contain all those nodes in X1, X2, ... , Xi-1 that directly influence Xi.

• If we don’t follow these rules, we can end up with a very complicated network.


Constructing Bayesian Networks

[Figure: nodes added in the chosen order M, J, A, B, E]

P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)


Constructing Bayesian Networks (cont.)

MaryCalls

P(J|M)=P(J)?

Chosen order: M,J,A,B,E


Constructing Bayesian Networks (cont.)

JohnCalls

MaryCalls

P(J|M) = P(J)? No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?

Chosen order: M,J,A,B,E


Constructing Bayesian Networks (cont.)

Alarm

JohnCalls

MaryCalls

P(J|M) = P(J)? No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No.
P(B|A,J,M) = P(B|A)?  P(B|A,J,M) = P(B)?

Chosen order: M,J,A,B,E


Constructing Bayesian Networks (cont.)

Burglary

Alarm

JohnCalls

MaryCalls

P(J|M) = P(J)? No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No.
P(B|A,J,M) = P(B|A)? Yes!  P(B|A,J,M) = P(B)? No!
P(E|B,A,J,M) = P(E|A)?  P(E|B,A,J,M) = P(E|A,B)?

Chosen order: M,J,A,B,E


Constructing Bayesian Networks (cont.)

Alarm

JohnCalls

Burglary

Earthquake

MaryCalls

P(J|M) = P(J)? No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)? No.
P(B|A,J,M) = P(B|A)? Yes!  P(B|A,J,M) = P(B)? No!
P(E|B,A,J,M) = P(E|A)? No!  P(E|B,A,J,M) = P(E|A,B)? Yes!

Chosen order: M,J,A,B,E


Bad example


Constructing Bayesian Networks (cont.)

[Figure: the more complicated network that results from a bad node ordering]


Local Semantics

• Local Semantics: Each node is conditionally independent of its nondescendants given its parents


Markov Blanket

• A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents, that is, given its Markov blanket.


14-3 Efficient Representation of Conditional Distributions

• Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up to O(2^k) numbers and perhaps a great deal of experience with all the possible conditioning cases.

• In fact, this is a worst-case scenario in which the relationship between the parents and the child is completely arbitrary.

• Usually, such relationships are describable by a canonical distribution that fits some standard pattern.

• In such cases, the complete table can be specified by naming the pattern and perhaps supplying a few parameters—much easier than supplying an exponential number of parameters.


Deterministic nodes

• A deterministic node has its value specified exactly by the values of its parents, with no uncertainty.
  – The relationship can be a logical one:
    • for example, the relationship between the parent nodes Canadian, US, Mexican and the child node NorthAmerican is simply that the child is the disjunction of the parents.
  – The relationship can also be numerical:
    • for example, if the parent nodes are the prices of a particular model of car at several dealers and the child node is the price that a bargain hunter ends up paying, then the child node is the minimum of the parent values; or
    • if the parent nodes are the inflows (rivers, runoff, precipitation) into a lake and the outflows (rivers, evaporation, seepage) from the lake, and the child is the change in the water level of the lake, then the value of the child is the difference between the inflow parents and the outflow parents.


Efficient representation of PDs



Noisy-OR relation

• The standard example is the noisy-OR relation, which is a generalization of the logical OR.

• In propositional logic, we might say that Fever is true if and only if Cold, Flu, or Malaria is true.

• The noisy-OR model allows for uncertainty about the ability of each parent to cause the child to be true—the causal relationship between parent and child may be inhibited, and so a patient could have a cold, but not exhibit a fever.


Noisy-OR relation

• The model makes two assumptions.
  – First, it assumes that all the possible causes are listed. (This is not as strict as it seems, because we can always add a so-called leak node that covers "miscellaneous causes.")

– Second, it assumes that inhibition of each parent is independent of inhibition of any other parents:

• for example, whatever inhibits Malaria from causing a fever is independent of whatever inhibits Flu from causing a fever.

• Fever is false if and only if all its true parents are inhibited, and the probability of this is the product of the inhibition probabilities for each parent.


Example

• Let us suppose the individual inhibition probabilities (or noise parameters) are as follows:
  – P(¬fever | cold, ¬flu, ¬malaria) = 0.6,  [P(fever|cold) = 0.4]
  – P(¬fever | ¬cold, flu, ¬malaria) = 0.2,  [P(fever|flu) = 0.8]
  – P(¬fever | ¬cold, ¬flu, malaria) = 0.1.  [P(fever|malaria) = 0.9]

• Then, from this information and the noisy-OR assumptions, the entire CPT can be built.

• A noisy-OR node with k parents thus needs only O(k) parameters instead of O(2^k).
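A short sketch of how the full CPT follows from the three noise parameters (the loop below is illustrative, not code from the chapter): P(¬fever | parents) is the product of the noise values of the parents that are true.

```python
from itertools import product

# Noise (inhibition) parameters from the slide.
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

print("cold  flu   malaria  P(fever)")
for cold, flu, malaria in product([False, True], repeat=3):
    # Fever is false iff every true parent is inhibited, so P(¬fever)
    # is the product of the noise parameters of the true parents.
    p_no_fever = 1.0
    for name, on in zip(q, (cold, flu, malaria)):
        if on:
            p_no_fever *= q[name]
    print(f"{cold!s:5} {flu!s:5} {malaria!s:8} {1 - p_no_fever:.3f}")
```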


Bayesian nets with continuous variables

• Many real-world problems involve continuous quantities.
• Much of statistics deals with random variables whose domains are continuous. By definition, continuous variables have an infinite number of possible values, so it is impossible to specify conditional probabilities explicitly for each value.
• One way to handle continuous variables is to avoid them by using discretization, dividing up the possible values into a fixed set of intervals.
• Discretization is sometimes an adequate solution, but it often results in a considerable loss of accuracy and very large CPTs.


Continuous variables

• The other solution is to define standard families of probability density functions (see Appendix A) that are specified by a finite number of parameters.
  – For example, a Gaussian (or normal) distribution N(µ, σ²)(x) has the mean µ and the variance σ² as parameters.

• A network with both discrete and continuous variables is called a hybrid Bayesian network. We need:
  – the conditional distribution for a continuous variable given discrete or continuous parents: P(C|C) or P(C|D);
  – the conditional distribution for a discrete variable given continuous parents: P(D|C).


Hybrid (discrete + continuous) networks

Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)

• How to deal with this?


Probability density functions

• Instead of probability distributions
• For continuous variables
• Example: let X denote tomorrow's maximum temperature in the summer in Eindhoven. Belief that X is distributed uniformly between 18 and 26 degrees Celsius:
  P(X = x) = U[18,26](x)
  P(X = 20.5) = U[18,26](20.5) = 0.125/°C


PDF


CDF


Normal PDF


Hybrid (discrete + continuous) networks

Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)

• Option 1: discretization – possibly large errors, large CPTs
• Option 2: finitely parameterized canonical families
  a) Continuous variable, discrete + continuous parents (e.g., Cost)
  b) Discrete variable, continuous parents (e.g., Buys?)


a) Continuous child variables

• Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents

• Most common is the linear Gaussian model, e.g.:
  P(Cost = c | Harvest = h, Subsidy = true) = N(a_t h + b_t, σ_t²)(c)

• Mean Cost varies linearly w. Harvest, variance is fixed

• Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
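A minimal sketch of such a linear-Gaussian conditional density; the slope a, intercept b, and sigma below are illustrative parameters, not values given in the chapter:

```python
import math

def linear_gaussian(c, h, a=-1.0, b=10.0, sigma=1.0):
    """Density N(a*h + b, sigma^2)(c): the mean of Cost moves linearly with Harvest h."""
    mu = a * h + b
    return math.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(linear_gaussian(c=5.0, h=5.0))  # density at the mean: ~0.3989
```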


Continuous child variables – example

• An all-continuous network with linear-Gaussian distributions has a full joint that is a multivariate Gaussian.

• A discrete + continuous linear-Gaussian network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values.


b) Discrete child, continuous parent

• P(Buys = true | Cost = c) = Φ((−c + µ) / σ), with µ the threshold for buying
• Probit distribution:
  – Φ(x) = ∫ from −∞ to x of N(0,1)(t) dt, the integral (CDF) of the standard normal distribution
• Logit distribution:
  – uses the sigmoid function σ(x) = 1 / (1 + e^(−2x))
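Both soft thresholds are easy to evaluate; a sketch with an illustrative threshold µ and softness σ (hypothetical values, not from the chapter):

```python
import math

def probit(c, mu=10.0, sigma=2.0):
    # Phi((-c + mu) / sigma): standard normal CDF, via the error function.
    x = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit(c, mu=10.0, sigma=2.0):
    # Sigmoid 1 / (1 + exp(-2x)), which closely tracks the probit curve.
    x = (-c + mu) / sigma
    return 1.0 / (1.0 + math.exp(-2.0 * x))

print(probit(8.0), logit(8.0))  # a cheap item is likely to be bought
```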


Probit distribution


14-4 Exact inference in Bayesian Networks

• Let us use the following notation:
  – X denotes the query variable
  – e denotes the set of evidence variables E1, …, Em
  – y denotes the set of nonevidence (hidden) variables Y1, …, Yl
  – the complete set of variables is X = {X} ∪ E ∪ Y
• A typical query asks for the posterior probability distribution P(X|e).
• Example: P(Burglary | JohnCalls = true, MaryCalls = true) = <0.284, 0.716>


Inference by enumeration

• Any conditional probability can be computed by summing terms from the full joint distribution.
• A query P(X|e) can be answered using Equation (13.6):
  P(X|e) = α P(X, e) = α Σy P(X, e, y)
• A Bayesian network gives a complete representation of the full joint distribution.
• The terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network.
• Therefore, a query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network.
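A compact enumeration sketch for the burglary network (the CPT numbers are those of Figure 14.2; the helper names are mine, not from the book):

```python
from itertools import product

T, F = True, False
P_B = {T: 0.001}                                   # P(Burglary = true)
P_E = {T: 0.002}                                   # P(Earthquake = true)
P_A = {(T, T): 0.95, (T, F): 0.94, (F, T): 0.29, (F, F): 0.001}
P_J = {T: 0.90, F: 0.05}                           # P(JohnCalls = true | Alarm)
P_M = {T: 0.70, F: 0.01}                           # P(MaryCalls = true | Alarm)

def pt(table, key, value):
    """Turn a 'probability of true' entry into P(value)."""
    p = table[key]
    return p if value else 1.0 - p

def joint(b, e, a, j, m):
    return (pt(P_B, T, b) * pt(P_E, T, e) * pt(P_A, (b, e), a)
            * pt(P_J, a, j) * pt(P_M, a, m))

# P(Burglary | j, m): sum the joint over the hidden variables E, A; normalize.
scores = {b: sum(joint(b, e, a, T, T) for e, a in product([T, F], repeat=2))
          for b in (T, F)}
z = sum(scores.values())
print(scores[T] / z, scores[F] / z)  # ≈ 0.284, 0.716
```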


13-4 Enumerate-Joint-Ask


[Figure: the enumeration computation, summing over the hidden variables]


Improvement


Variable elimination algorithm

The enumeration algorithm can be improved substantially by eliminating repeated calculations of the kind illustrated in Figure 14.8.

The idea is simple: do the calculation once and save the results for later use. This is a form of dynamic programming. There are several versions of this approach; we present the variable elimination algorithm, which is the simplest.

Variable elimination works by evaluating expressions such as Equation (14.3) in right-to-left order (that is, bottom-up in Figure 14.8). Intermediate results are stored, and summations over each variable are done only for those portions of the expression that depend on the variable.
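A minimal sketch of the two factor operations that variable elimination relies on (the representation and the numbers below are illustrative, not from the chapter): a factor maps assignments of its variables to numbers; pointwise product multiplies matching entries, and summing out a variable adds the entries that differ only in that variable.

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    """Multiply two factors over Boolean variables; keys are value tuples."""
    all_vars = list(dict.fromkeys(vars1 + vars2))
    out = {}
    for vals in product([True, False], repeat=len(all_vars)):
        asg = dict(zip(all_vars, vals))
        out[vals] = (f1[tuple(asg[v] for v in vars1)]
                     * f2[tuple(asg[v] for v in vars2)])
    return out, all_vars

def sum_out(var, f, vars_):
    """Eliminate var by summing entries that agree on the other variables."""
    i = vars_.index(var)
    out = {}
    for vals, p in f.items():
        key = vals[:i] + vals[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, vars_[:i] + vars_[i + 1:]

# Example with hypothetical numbers: f(A) * g(A, B), then sum out A.
fA = {(True,): 0.3, (False,): 0.7}
gAB = {(True, True): 0.9, (True, False): 0.1,
       (False, True): 0.2, (False, False): 0.8}
h, hv = pointwise_product(fA, ["A"], gAB, ["A", "B"])
print(sum_out("A", h, hv)[0])  # {(True,): ≈0.41, (False,): ≈0.59}
```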



Pointwise product


Elimination-ASK


Querying the BN

New evidence E indicates that JohnCalls with some probability p.

We would like to know the posterior probability of the other beliefs, e.g. P(Burglary|E).

P(B|E) = P(B ∧ J | E) + P(B ∧ ¬J | E)
       = P(B | J, E) P(J|E) + P(B | ¬J, E) P(¬J|E)
       = P(B|J) P(J|E) + P(B|¬J) P(¬J|E)
       = p P(B|J) + (1 − p) P(B|¬J)

We need to compute P(B|J) and P(B|¬J).


Querying the BN

The BN gives P(t|c). What about P(c|t)?

P(Cavity | t) = P(Cavity ∧ t) / P(t)
              = P(t | Cavity) P(Cavity) / P(t)   [Bayes' rule]

P(c|t) = α P(t|c) P(c)

Querying a BN is just applying the trivial Bayes' rule on a larger scale.

[Network: Cavity → Toothache]

P(Cavity) = 0.1

C  P(T|C)
T  0.4
F  0.01111


Querying the BN

P(b|J) = α P(b ∧ J)
       = α Σm Σa Σe P(b ∧ J ∧ m ∧ a ∧ e)   [marginalization]
       = α Σm Σa Σe P(b) P(e) P(a|b,e) P(J|a) P(m|a)   [BN]
       = α P(b) Σe P(e) Σa P(a|b,e) P(J|a) Σm P(m|a)   [re-ordering]

Depth-first evaluation of P(b|J) leads to computing each of the four following products twice:
P(J|A) P(M|A), P(J|A) P(¬M|A), P(J|¬A) P(M|¬A), P(J|¬A) P(¬M|¬A)

Bottom-up (right-to-left) computation + caching, e.g. the variable elimination algorithm (see R&N), avoids such repetition.

For singly connected BNs, the computation takes time linear in the total number of CPT entries (time linear in the number of propositions if each CPT's size is bounded).


Singly Connected BN

A BN is singly connected (or polytree) if there is at most one undirected path between any two nodes

[Figure: the burglary network (Burglary, Earthquake → Alarm → JohnCalls, MaryCalls) is singly connected.]

The time and space complexity of exact inference in polytrees is linear in the size of the network (the number of CPT entries).


Multiply Connected BN

A BN is multiply connected if there is more than one undirected path between a pair of nodes

[Figure: a multiply connected variant of the network]

Variable elimination can have exponential time and space complexity in the worst case, even when the number of parents per node is bounded. Because it includes inference in propositional logic as a special case, inference in Bayesian networks is NP-hard.


Querying a multiply connected BN takes time exponential in the total number of CPT entries in the worst case.


Clustering algorithm

• Join tree algorithm, O(n)
• Widely used in commercial Bayesian network tools.
• Join individual nodes of the network to form cluster nodes in such a way that the resulting network is a polytree.
• Once the network is in polytree form, a special-purpose inference algorithm is applied. Essentially, the algorithm is a form of constraint propagation (see Chapter 5) where the constraints ensure that neighboring clusters agree on the posterior probability of any variables that they have in common.


14-5 Approximate inference in Bayesian Networks

• Given the intractability of exact inference in large, multiply connected networks, it is essential to consider approximate inference methods.

• This section describes randomized sampling algorithms, also called Monte Carlo algorithms, that provide approximate answers whose accuracy depends on the number of samples generated.

• In recent years, Monte Carlo algorithms have become widely used in computer science to estimate quantities that are difficult to calculate exactly; simulated annealing is one example.

• We describe two families of algorithms:
  – direct sampling, and
  – Markov chain sampling.
  (Variational methods and loopy propagation are skipped.)


Methods

i. Sampling from an empty network
ii. Rejection sampling: reject samples disagreeing with the evidence
iii. Likelihood weighting: use evidence to weight samples
iv. MCMC: sample from a stochastic process whose stationary distribution is the true posterior


Introduction

• The primitive element in any sampling algorithm is the generation of samples from a known probability distribution.

• For example, an unbiased coin can be thought of as a random variable Coin with values (heads, tails) and a prior distribution P(Coin) = (0.5, 0.5).

• Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails.

• Given a source of random numbers in the range [0, 1], it is a simple matter to sample any distribution on a single variable.


Sampling on Bayesian Network

• The simplest kind of random sampling process for Bayesian networks generates events from a network that has no evidence associated with it.

• The idea is to sample each variable in turn, in topological order.

• The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable's parents.


Prior-sample

• This algorithm is shown in Figure 14.12. We can illustrate its operation on the network in Figure 14.11(a), assuming an ordering [Cloudy, Sprinkler, Rain, WetGrass] :

• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.

• Sample from P(Sprinkler |Cloudy = true) = <0.1, 0.9>; suppose this returns false.

• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.

• Sample from P( WetGrass| Sprinkler = false, Rain = true) = <0.9, 0.1>; suppose this returns true.

• PRIOR-SAMPLE returns the event [true, false, true, true].
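A direct sketch of PRIOR-SAMPLE for this network (the CPT rows quoted above come from the slide; the remaining rows use the standard Figure 14.11(a) values):

```python
import random

def prior_sample():
    """Sample each variable in topological order, conditioned on its parents."""
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
    wet_grass = random.random() < p_wet
    return cloudy, sprinkler, rain, wet_grass

print(prior_sample())  # e.g., (True, False, True, True)
```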


i. Sampling from an empty network – cont.

• Probability that PRIOR-SAMPLE generates a particular event:
  S_PS(x1, …, xn) = ∏i=1..n P(xi | parents(Xi)) = P(x1, …, xn)

• Let N_PS(Y = y) be the number of samples generated for which Y = y, for any set of variables Y.

• Then the estimate is P̂(Y = y) = N_PS(Y = y) / N, and

  lim N→∞ P̂(Y = y) = Σh S_PS(Y = y, H = h) = Σh P(Y = y, H = h) = P(Y = y)

⇒ estimates derived from PRIOR-SAMPLE are consistent


ii. Rejection sampling

• Rejection sampling is a general method for producing samples from a hard-to-sample distribution given an easy-to-sample distribution.
• In its simplest form, it can be used to compute conditional probabilities, that is, to determine P(X|e).
• The REJECTION-SAMPLING algorithm:
  – First, it generates samples from the prior distribution specified by the network.
  – Then, it rejects all those that do not match the evidence.
  – Finally, the estimate P(X = x | e) is obtained by counting how often X = x occurs in the remaining samples.
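On top of the prior_sample() sketch above, rejection sampling is just a filter plus a count; here for the query P(Rain | Sprinkler = true) used in the example below:

```python
def rejection_sample(n=100_000):
    """Estimate P(Rain = true | Sprinkler = true) by discarding mismatches."""
    kept = [rain for cloudy, sprinkler, rain, wet in
            (prior_sample() for _ in range(n)) if sprinkler]
    return sum(kept) / len(kept)

print(rejection_sample())  # converges to 0.3
```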


Rejection-sampling algorithm


Example

• Assume that we wish to estimate P(Rain | Sprinkler = true), using 100 samples.
• Of the 100 that we generate, suppose that 73 have Sprinkler = false and are rejected, while 27 have Sprinkler = true; of the 27, 8 have Rain = true and 19 have Rain = false.
• P(Rain | Sprinkler = true) ≈ NORMALIZE(<8, 19>) = <0.296, 0.704>.
• The true answer is <0.3, 0.7>.
• As more samples are collected, the estimate will converge to the true answer. The standard deviation of the error in each probability will be proportional to 1/sqrt(n), where n is the number of samples used in the estimate.
• The biggest problem with rejection sampling is that it rejects so many samples! The fraction of samples consistent with the evidence e drops exponentially as the number of evidence variables grows, so the procedure is simply unusable for complex problems.


iii. Likelihood weighting

• avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e.

• generates consistent probability estimates.
• fixes the values for the evidence variables E and samples only the remaining variables X and Y. This guarantees that each event generated is consistent with the evidence.

• Before tallying the counts in the distribution for the query variable, each event is weighted by the likelihood that the event accords to the evidence, as measured by the product of the conditional probabilities for each evidence variable, given its parents.

• Intuitively, events in which the actual evidence appears unlikely should be given less weight.


Example

• Query: P(Rain | Sprinkler = true, WetGrass = true).
• First, the weight w is set to 1.0. Then an event is generated:
• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sprinkler is an evidence variable with value true. Therefore, we set
  w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
• WetGrass is an evidence variable with value true. Therefore, we set
  w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.099.


iii. Likelihood weighting analysis

• The sampling probability for WEIGHTED-SAMPLE is
  S_WS(y, e) = ∏i=1..l P(yi | parents(Yi))
• Note: it pays attention to evidence only in ancestors, so it samples from a distribution somewhere "in between" the prior and the posterior.
• The weight for a given sample (y, e) is
  w(y, e) = ∏i=1..m P(ei | parents(Ei))
• The weighted sampling probability is
  S_WS(y, e) w(y, e) = ∏i=1..l P(yi | parents(Yi)) ∏i=1..m P(ei | parents(Ei)) = P(y, e)
  (by the standard global semantics of the network)
• Hence, likelihood weighting is consistent.
• But performance still degrades with many evidence variables.


iv. MCMC Example

• Estimate P(Rain | Sprinkler = true, WetGrass = true)
• Sample Cloudy, then Rain; repeat.
  – The Markov blanket of Cloudy is Sprinkler and Rain.
  – The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.


iv. MCMC Example – cont.

0. Random initial state: Cloudy = true and Rain = false
1. P(Cloudy | MB(Cloudy)) = P(Cloudy | Sprinkler, Rain); sample, say, false
2. P(Rain | MB(Rain)) = P(Rain | Cloudy, Sprinkler, WetGrass); sample, say, true

Visit 100 states: 31 have Rain = true, 69 have Rain = false.
P̂(Rain | Sprinkler = true, WetGrass = true) = NORMALIZE(<31, 69>) = <0.31, 0.69>
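A Gibbs-sampling sketch of exactly this loop; the Markov-blanket conditionals are computed from the sprinkler-network CPTs, and the names and structure are mine:

```python
import random

P_S = {True: 0.1, False: 0.5}     # P(Sprinkler = true | Cloudy)
P_R = {True: 0.8, False: 0.2}     # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def sample_cloudy(rain):
    # P(Cloudy | MB) ∝ P(Cloudy) P(Sprinkler = true | Cloudy) P(rain | Cloudy)
    s = {c: 0.5 * P_S[c] * (P_R[c] if rain else 1 - P_R[c]) for c in (True, False)}
    return random.random() < s[True] / (s[True] + s[False])

def sample_rain(cloudy):
    # P(Rain | MB) ∝ P(Rain | cloudy) P(WetGrass = true | Sprinkler = true, Rain)
    s = {r: (P_R[cloudy] if r else 1 - P_R[cloudy]) * P_W[(True, r)]
         for r in (True, False)}
    return random.random() < s[True] / (s[True] + s[False])

cloudy, rain = True, False                      # initial state
counts = {True: 0, False: 0}
for _ in range(100_000):
    cloudy = sample_cloudy(rain)
    rain = sample_rain(cloudy)
    counts[rain] += 1
print(counts[True] / sum(counts.values()))      # ≈ 0.32
```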


Probability of x, given MB(x)

P(x | mb(X)) = α P(x | parents(X)) × ∏_{Yj ∈ Children(X)} P(yj | parents(Yj))


MCMC algorithm


Performance of statistical algorithms

• Polytime approximation
• Stochastic approximation techniques such as likelihood weighting and MCMC
  – can give reasonable estimates of true posterior probabilities in a network, and
  – can cope with much larger networks


14-6 Skip


14-7 Other approaches to uncertain reasoning

• Different generations of expert systems:
  – Strict logical reasoning (ignoring uncertainty)
  – Probabilistic techniques using the full joint distribution
  – Default reasoning: a conclusion is believed until a better reason is found to believe something else
  – Rules with certainty factors
  – Handling ignorance: Dempster-Shafer theory
  – Vagueness: something is "sort of" true (fuzzy logic)
• Probability makes the same ontological commitment as logic: the event is true or false.


Default reasoning

• The conclusion that a car has four wheels is reached by default.

• New evidence can cause the conclusion to be retracted, whereas FOL is strictly monotonic.

• Representatives are default logic, nonmonotonic logic, and circumscription.

• There are problematic issues.


Rule-based methods

• Logical reasoning systems have properties like:
  – Monotonicity
  – Locality
    • In logical systems, whenever we have a rule of the form A ⇒ B, we can conclude B given evidence A, without worrying about any other rules.
  – Detachment
    • Once a logical proof is found for a proposition B, the proposition can be used regardless of how it was derived. That is, it can be detached from its justification.
  – Truth-functionality
    • In logic, the truth of complex sentences can be computed from the truth of the components.


Rule-based method

• These properties are good for their obvious computational advantages;

• but bad because they are inappropriate for uncertain reasoning.


Representing ignorance: Dempster-Shafer theory

• The Dempster-Shafer theory is designed to deal with the distinction between uncertainty and ignorance.
• Rather than computing the probability of a proposition, it computes the probability that the evidence supports the proposition.
• This measure of belief is called a belief function, written Bel(X).


Example

• Coin flipping is a standard example for belief functions.
• Suppose a shady character comes up to you and offers to bet you $10 that his coin will come up heads on the next flip.
• Given that the coin might or might not be fair, what belief should you ascribe to the event that it comes up heads?
• Dempster-Shafer theory says that because you have no evidence either way, you have to say that the belief Bel(Heads) = 0 and also that Bel(¬Heads) = 0.
• Now suppose you have an expert at your disposal who testifies with 90% certainty that the coin is fair (i.e., he is 90% sure that P(Heads) = 0.5).
• Then Dempster-Shafer theory gives Bel(Heads) = 0.9 × 0.5 = 0.45 and likewise Bel(¬Heads) = 0.45.
• There is still a 10 percentage point "gap" that is not accounted for by the evidence.
• "Dempster's rule" (Dempster, 1968) shows how to combine evidence to give new values for Bel, and Shafer's work extends this into a complete computational model.


Fuzzy set & fuzzy logic

• Fuzzy set theory is a means of specifying how well an object satisfies a vague description.
• For example, consider the proposition "Nate is tall." Is this true, if Nate is 5' 10"? Most people would hesitate to answer "true" or "false," preferring to say, "sort of."
• Note that this is not a question of uncertainty about the external world: we are sure of Nate's height. The issue is that the linguistic term "tall" does not refer to a sharp demarcation of objects into two classes; there are degrees of tallness.
• For this reason, fuzzy set theory is not a method for uncertain reasoning at all.
• Rather, fuzzy set theory treats Tall as a fuzzy predicate and says that the truth value of Tall(Nate) is a number between 0 and 1, rather than being just true or false. The name "fuzzy set" derives from the interpretation of the predicate as implicitly defining a set of its members, a set that does not have sharp boundaries.


Fuzzy logic

• Fuzzy logic is a method for reasoning with logical expressions describing membership in fuzzy sets.

• For example, the complex sentence Tall(Nate) ∧ Heavy(Nate) has a fuzzy truth value that is a function of the truth values of its components.

• The standard rules for evaluating the fuzzy truth, T, of a complex sentence are:
  – T(A ∧ B) = min(T(A), T(B))
  – T(A ∨ B) = max(T(A), T(B))
  – T(¬A) = 1 − T(A)
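The three rules are one-liners; a tiny sketch with illustrative membership degrees (the values below are hypothetical):

```python
# Hypothetical fuzzy truth values for Tall(Nate) and Heavy(Nate).
t_tall, t_heavy = 0.6, 0.4

print(min(t_tall, t_heavy))  # T(Tall ∧ Heavy) = 0.4
print(max(t_tall, t_heavy))  # T(Tall ∨ Heavy) = 0.6
print(1 - t_tall)            # T(¬Tall) = 0.4
```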


Summary

• Reasoning properly:
  – In FOL, it means conclusions follow from premises.
  – In probability, it means having beliefs that allow an agent to act rationally.
• Conditional independence information is vital.
• A Bayesian network is a complete representation of the full joint probability distribution, but it is often exponentially smaller in size.
• Bayesian networks can reason causally, diagnostically, intercausally, or by combining two or more of the three.
• For polytrees, the computation time is linear in the network size.