10/15: Homework 3 due 10/17; mid-term on 10/24; Project 2 accepted until 10/26 (4-day automatic extension)



Problem:

– Need too many numbers…
– The needed numbers are harder to assess

You can avoid assessing P(E=e) if you assess P(Y|E=e), since P(Y|E=e) must add up to 1
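The normalization trick above can be sketched in a few lines. The joint numbers below are made up for illustration (a toy cavity/toothache world); the point is that P(E=e) never has to be assessed separately, because it falls out of the normalization constant:

```python
# Hypothetical toy joint over Y (cavity) and E (toothache); numbers made up.
joint = {
    ("cavity", "toothache"): 0.12,
    ("cavity", "no_toothache"): 0.08,
    ("no_cavity", "toothache"): 0.08,
    ("no_cavity", "no_toothache"): 0.72,
}

def posterior_given_evidence(joint, e):
    """P(Y | E=e) via normalization; P(E=e) itself is never assessed."""
    unnormalized = {y: p for (y, ev), p in joint.items() if ev == e}
    z = sum(unnormalized.values())  # this *is* P(E=e), obtained for free
    return {y: p / z for y, p in unnormalized.items()}

post = posterior_given_evidence(joint, "toothache")
print(post)  # posterior sums to 1 by construction
```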

Relative ease/utility of assessing various types of probabilities

• A joint distribution requires us to assess probabilities of the type P(x1, ~x2, x3, …, ~xn)

• This means we have to look at all entities in the world and see what fraction of them have x1, ~x2, x3, …, ~xn true

• A difficult experiment to set up..

• Conditional probabilities of the type P(A|B) are relatively much easier to assess

– You just need to look at the set of entities having B true, and find the fraction of them that also have A true

• Among the conditional probabilities, causal probabilities of the form P(effect|cause) are better to assess than diagnostic probabilities of the form P(cause|effect)

– Causal probabilities tend to be more stable than diagnostic probabilities
– (For example, a textbook in dentistry can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors; e.g., in areas where people tend to eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.)

Doc, doc, I have flu. Can you tell if I have a runny nose?

Let A be Anthrax and Rn be Runny Nose:
P(A|Rn) = P(Rn|A) P(A) / P(Rn)

Get by with easier to assess numbers
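The Bayes-rule computation above is one line of arithmetic. The three input numbers below are hypothetical (the slide does not give values); the point is that the easy-to-assess causal probability P(Rn|A) and two priors yield the hard-to-assess diagnostic probability:

```python
# Bayes rule with easier-to-assess numbers; all values are hypothetical.
p_a = 1e-5            # prior P(Anthrax): rare
p_rn = 0.2            # P(RunnyNose) in the general population
p_rn_given_a = 0.9    # causal probability P(RunnyNose | Anthrax)

# Diagnostic probability P(Anthrax | RunnyNose) from Bayes rule.
p_a_given_rn = p_rn_given_a * p_a / p_rn
print(p_a_given_rn)   # tiny: a runny nose is very weak evidence for anthrax
```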

Generalized Bayes rule:

P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)

Think of this as analogous to inference rules (like modus-ponens)

What happens if there are multiple symptoms…?

Patient walked in and complained of toothache. You assess P(Cavity|Toothache). Now you try to probe the patient's mouth with that steel thingie, and it catches…

How do we update our belief in Cavity?

P(Cavity|TA, Catch) = P(TA, Catch|Cavity) * P(Cavity) / P(TA, Catch)
                    ∝ P(TA, Catch|Cavity) * P(Cavity)

Need to know this! With n evidence variables, we will need 2^n probabilities!

Conditional independence to the rescue! Suppose P(TA, Catch|Cavity) = P(TA|Cavity) * P(Catch|Cavity)

(Independence is written as A || B; so this assertion is TA || Catch | Cavity)
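With this assumption the two-symptom update needs only per-symptom conditionals instead of a joint table over all symptoms. A minimal sketch; all CPT numbers below are made up:

```python
# Updating belief in Cavity from two symptoms, assuming TA and Catch
# are conditionally independent given Cavity. Numbers are made up.
p_cav = 0.2
p_ta = {"cav": 0.9, "no_cav": 0.05}     # P(TA | Cavity)
p_catch = {"cav": 0.8, "no_cav": 0.1}   # P(Catch | Cavity)

# Unnormalized posterior: P(TA|Cavity) * P(Catch|Cavity) * P(Cavity)
u_cav = p_ta["cav"] * p_catch["cav"] * p_cav
u_no = p_ta["no_cav"] * p_catch["no_cav"] * (1 - p_cav)
p_cav_given_both = u_cav / (u_cav + u_no)
print(p_cav_given_both)  # both symptoms together push the belief up sharply
```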

Conditional Independence Assertions

• We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X,Y,Z are subsets of the set of all random variables in the domain model)

• We saw that Bayes rule computations can exploit conditional independence assertions. Specifically, X || Y | Z implies:

– P(X & Y|Z) = P(X|Z) * P(Y|Z)
– P(X|Y, Z) = P(X|Z)
– P(Y|X, Z) = P(Y|Z)

• Idea: Why not write down all conditional independence assertions that hold in a domain?

Cond. Indep. Assertions (Contd)

• Idea: Why not write down all conditional independence assertions (CIA) (X || Y | Z) that hold in a domain?

• Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables).

• Brilliant idea: Maybe we should implicitly specify the CIAs by writing down the "dependencies" between variables using a graphical model
– A Bayes network is a way of doing just this.

• The Bayes net is a directed acyclic graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs

– The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-sep idea)

CIA implicit in Bayes Nets

• So, what conditional independence assumptions are implicit in Bayes nets?
– Local Markov assumption:

• A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of ancestors and other non-descendants, then {N} || A | P.)

• (Equivalently) A node N is independent of all other nodes given its Markov blanket (parents, children, children's parents)

– Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-sep condition and/or Bayes Ball reachability

Topological Semantics

[Figure: two equivalent conditions. (1) Independence from non-descendants holds given just the parents. (2) Independence from every node holds given the Markov blanket.]

These two conditions are equivalent. Many other conditional independence assertions follow from these.

P(~B) P(~E|~B) P(A|~E,~B) P(J|A,~E,~B) P(M|A,~B,~E,J)   (chain rule)
= P(~B) P(~E) P(A|~E,~B) P(J|A) P(M|A)   (using the network's conditional independencies)

Local semantics: a node is independent of its non-descendants given its parents
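The factored form can be evaluated directly. The CPT values below are the standard textbook numbers for the burglary/earthquake alarm network (an assumption, since the slide does not list them):

```python
# Bayes-net factorization of P(~B, ~E, A, J, M) for the alarm network,
# using the standard textbook CPTs (assumed; not given on the slide).
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
p_j = {True: 0.90, False: 0.05}   # P(J | A)
p_m = {True: 0.70, False: 0.01}   # P(M | A)

# P(~B, ~E, A, J, M) = P(~B) P(~E) P(A|~B,~E) P(J|A) P(M|A)
p = (1 - p_b) * (1 - p_e) * p_a[(False, False)] * p_j[True] * p_m[True]
print(p)  # about 6.28e-4
```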

D-sep (direction-dependent separation)

• X || Y | E if every undirected path from X to Y is blocked by E

– A path is blocked if there is a node Z on the path s.t.

1. Z is in E and Z has one arrow coming in and another going out

2. Z is in E and Z has both arrows going out

3. Neither Z nor any of its descendants are in E, and both path arrows lead to Z

Examples (alarm network): B || M | A, (J,M) || E | A, and B || E hold; B || E | A and B || E | M do not (observing A, or its descendant M, unblocks the common-effect path).
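The common-effect behavior can be verified numerically by brute force over the B, E, A sub-network. The CPTs are the standard textbook values (assumed): B and E are independent marginally, but become dependent once A is observed.

```python
from itertools import product

# Brute-force check of two D-sep claims on the alarm sub-network B, E, A,
# using the standard textbook CPTs (assumed; not given on the slide).
p_b, p_e = 0.001, 0.002
p_a = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def joint(b, e, a):
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    return pb * pe * pa

def p(query, given):
    """P(query | given) by summing out the joint."""
    keep = {**query, **given}
    num = sum(joint(b, e, a) for b, e, a in product((0, 1), repeat=3)
              if all(dict(b=b, e=e, a=a)[k] == v for k, v in keep.items()))
    den = sum(joint(b, e, a) for b, e, a in product((0, 1), repeat=3)
              if all(dict(b=b, e=e, a=a)[k] == v for k, v in given.items()))
    return num / den

# B || E holds: the common-effect path is blocked while A is unobserved.
unconditional = abs(p({"b": 1}, {}) - p({"b": 1}, {"e": 1}))
# B || E | A fails: observing A unblocks the path (explaining away).
conditional = abs(p({"b": 1}, {"a": 1}) - p({"b": 1}, {"a": 1, "e": 1}))
print(unconditional, conditional)
```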

Bayes nets are not sufficient to model all sets of CIAs

• We said that a Bayes net implicitly represents a bunch of CIAs

• Qn: If I tell you exactly which CIAs hold in a domain, can you give me a Bayes net that models exactly those CIAs and only those?
– Unfortunately, NO. (See the example to the right.)
– This is why there is another type of graphical model called "undirected graphical models"

• In an undirected graphical model, also called a Markov random field, nodes correspond to random variables, and the immediate dependencies between variables are represented by undirected edges.

– The CIAs modeled by an undirected graphical model are different

» X || Y | Z in an undirected graph if every path from a node in X to a node in Y must pass through a node in Z (so if we remove the nodes in Z, then X and Y will be disconnected)

– Undirected models are good for representing "soft constraints" between random variables (e.g. the correlation between different pixels in an image), while directed models are good for representing causal influences between variables

[Figure: Give a Bayes net on X, Y, Z, W s.t. X || Y | (Z,W) and Z || W | (X,Y), and no other CIA. Impossible!]

10/17


Causal chains, common causes, common effects: another way of understanding D-sep

• Causal chain (X causes Y through Z): blocked if Z is given

• Common cause (X and Y are caused by Z): blocked if Z is given

• Common effect (X and Y cause Z): blocked only if neither Z nor its descendants are given

Causality

• Logic (both deterministic and probabilistic) doesn't by default encapsulate causality

– A logical dependency A => B is directionless (it is basically equivalent to ~A V B)
– A probabilistic dependency A || B | C is also directionless. It only states that A and B are correlated
– Causality, on the other hand, has direction.
– If the logical/probabilistic model ignores this, it can reach logically valid and yet silly conclusions

• E.g. If grass is wet, then it rained. If you break this bottle, then grass gets wet. Ergo: if you break this bottle, then it will rain.
• E.g. The probability of there being two bombs on a plane is exceedingly low. So, to increase the safety of the flight, I will take a bomb along with me.
• E.g.(?) Children whose homes have a lot of books do well in school. So go ahead and put a whole bunch of books in each home. (Actually, the state of Illinois had a program that sent books every month to all kids.)

– Learning "causal models" from data is thus harder than learning "correlation models"

– Which is why it is so very great when causal models are available for the asking, through domain experts.

See the IJCAI 99 Research Excellence lecture by Judea Pearl.

Blog Questions

• You have been given the topology of a Bayes network, but haven't yet gotten the conditional probability tables (to be concrete, you may think of the Pearl alarm/earthquake scenario Bayes net). Your friend shows up and says he has the joint distribution all ready for you. You don't quite trust your friend and think he is making these numbers up. Is there any way you can prove that your friend's joint distribution is not correct?

• Answer:
– Check whether the joint distribution given by your friend satisfies all the conditional independence assumptions.
– For example, in the Pearl network, compute P(J|A,M,B) and P(J|A). These two numbers should come out the same!

• Notice that your friend could pass all the conditional independence assertions and still be cheating re: the probabilities

– For example, he filled up the CPTs of the network with made-up numbers (e.g. P(B)=0.9, P(E)=0.7, etc.) and computed the joint probability by multiplying the CPTs. This will satisfy all the conditional independence assertions..!

– The main point to understand here is that the network topology does put restrictions on the joint distribution.
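The P(J|A,M,B) vs. P(J|A) check can be sketched directly: build a joint from the network's CPTs (the standard textbook values, assumed here) and compare the two conditionals. A joint that respects the topology must make them equal.

```python
from itertools import product

# Build a joint for the alarm network from its CPTs (standard textbook
# numbers, assumed), then verify P(J | A, M, B) == P(J | A).
p_b, p_e = 0.001, 0.002
p_a = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
p_j = {1: 0.90, 0: 0.05}
p_m = {1: 0.70, 0: 0.01}

def joint(b, e, a, j, m):
    val = (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
    val *= p_a[(b, e)] if a else 1 - p_a[(b, e)]
    val *= p_j[a] if j else 1 - p_j[a]
    val *= p_m[a] if m else 1 - p_m[a]
    return val

def cond_j(given):
    """P(J=1 | given), where `given` maps world-tuple index -> value."""
    worlds = list(product((0, 1), repeat=5))   # order: (B, E, A, J, M)
    den = sum(joint(*w) for w in worlds
              if all(w[i] == v for i, v in given.items()))
    num = sum(joint(*w) for w in worlds
              if w[3] == 1 and all(w[i] == v for i, v in given.items()))
    return num / den

lhs = cond_j({2: 1, 4: 1, 0: 1})   # P(J=1 | A=1, M=1, B=1)
rhs = cond_j({2: 1})               # P(J=1 | A=1)
print(lhs, rhs)  # equal, as the topology demands
```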

Blog Questions (2)

• Continuing bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities that you want to complete the specification of your Bayes net. You ask him for a CPT entry, and pat comes a response: some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified (i.e., can your friend give you numbers that can lead to an inconsistency)?

(To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and the numbers wind up not satisfying P(AVB) = P(A) + P(B) - P(A&B); or alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule.)

• Answer: No. As long as we only ask the friend to fill up the CPTs in the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution

– This should be seen as a feature..

• Personal probabilities

– John may be an optimist and believe that P(burglary)=0.01, and Tom may be a pessimist and believe that P(burglary)=0.99

– Bayesians consider both John and Tom to be fine (they don't insist on an objective frequentist interpretation for probabilities)

– However, Bayesians do think that John and Tom should act consistently with their own beliefs

• For example, it makes no sense for John to go about installing tons of burglar alarms given his belief, just as it makes no sense for Tom to put all his valuables on his lawn
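A quick numeric sketch of why CPT entries can never be inconsistent: whatever numbers in [0,1] go into the CPTs (here, random ones for a tiny three-node network), the product-form joint is nonnegative and sums to exactly 1.

```python
from itertools import product
import random

# Fill the CPTs of a B -> A <- E network with arbitrary made-up numbers
# and check that the resulting product-form joint still sums to 1.
random.seed(0)
p_b = random.random()                 # made-up P(B)
p_e = random.random()                 # made-up P(E)
p_a = {be: random.random() for be in product((0, 1), repeat=2)}  # P(A|B,E)

total = 0.0
for b, e, a in product((0, 1), repeat=3):
    val = (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
    val *= p_a[(b, e)] if a else 1 - p_a[(b, e)]
    total += val
print(total)  # 1.0 up to float rounding, no matter what the CPTs say
```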

Blog Questions (3)

• Your friend heard your claims that Bayes nets can represent any possible set of conditional independence assertions exactly. He comes to you and says he has four random variables, X, Y, W and Z, and only TWO conditional independence assertions:

X || Y | {W, Z}
W || Z | {X, Y}

He dares you to give him a Bayes network topology on these four nodes that represents exactly these and only these conditional independencies. Can you? (Note that you only need to look at 4-vertex directed graphs.)

• Answer: No, this is not possible.
• Here are two "wrong" answers:

– Consider a disconnected graph where X, Y, W, Z are all unconnected. In this graph, the two CIAs hold. Unfortunately, however, so do many other CIAs

– Consider a graph where W and Z are both immediate parents of X and Y. In this case, clearly, X || Y | {W,Z}. However, W and Z are definitely dependent given X and Y (explaining away).

• Undirected models can capture these CIAs exactly. Consider a graph where X is connected to W and Z, and Y is connected to W and Z (sort of a diamond).

– In undirected models, CIA is defined in terms of graph separability

– Since X and Y separate W and Z (i.e., every path between W and Z must pass through a node in {X, Y}), W || Z | {X,Y}. Similarly the other CIA

• Undirected graphs will be unable to model some scenarios that directed ones can; so you need both…

• If we can’t have perfect map, we can consider giving up either I-map or a d-map

– Giving up I-map leads to loss of accuracy (since your model assumes CIAs that don’t exist in the distribution). It can however increase efficiency (e.g. naïve bayes models)

– Giving up D- map leads to loss of efficiency but preserves accuracy (if you think more things are connected that really are, you will assess more probabilities—and some of them wind up being redundant anyway because of the CIAs that hold in the distribution)

Given a graphical model G, and a distribution D,G is an I-map of D if every CIA reflected in G actually holds in D [“soundness”]G is a D-map of D if every CIA of D is reflected in G [“completeness”]G is a perfect map of D if it is both I-map and D-map

[Figure: Venn diagram over all distributions; the set with a perfect map as a Bayes network (BN) and the set with a perfect map as a Markov network (MN) overlap but neither contains the other. The diamond X–W, X–Z, Y–W, Y–Z is an MN that no BN can represent; the v-structure X → Z ← Y is a BN that no MN can represent.]

[Figure: the burglary/earthquake alarm network. P(A|J,M) = P(A)?]

How many probabilities are needed? 13 for the new ordering; 10 for the old. Is this the worst? No: in the worst ordering the CPT sizes grow as 1, 2, 4, 8, 16, for 31 probabilities in all!

Introduce variables in the causal order
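The three counts can be reproduced by summing CPT sizes (a boolean node with k parents needs 2^k numbers). The parent counts below reflect one reading of the orderings behind 10/13/31 (causal B, E, A, J, M; the ordering M, J, A, B, E; and a fully connected worst case), so treat them as an assumption:

```python
# Parameter counts for different variable orderings in the alarm network.
# Each boolean node with k parents contributes 2**k CPT entries.
def cpt_count(parent_counts):
    return sum(2 ** k for k in parent_counts)

causal = cpt_count([0, 0, 2, 1, 1])   # B, E, A|B,E, J|A, M|A
mixed = cpt_count([0, 1, 2, 1, 2])    # assumed ordering M, J|M, A|J,M, B|A, E|A,B
worst = cpt_count([0, 1, 2, 3, 4])    # every node depends on all predecessors
print(causal, mixed, worst)  # 10 13 31
```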

Digression: Is assessing/learning numbers the only hard model-learning problem?

• We are making it sound as if assessing the probabilities is a big deal
• In doing so, we are taking into account model acquisition/learning costs.
• How come we didn't care about these issues in logical reasoning? Is it because acquiring logical knowledge is easy?
• Actually, if we are writing programs for worlds that we (the humans) already live in, it is easy for us (humans) to add the logical knowledge into the program. It is a pain to give the probabilities..
• On the other hand, if the agent is fully autonomous and is bootstrapping itself, then learning logical knowledge is actually harder than learning probabilities..
– For example, we will see that given the Bayes network topology ("logic"), learning its CPTs is much easier than learning both topology and CPTs

Making the network sparse by introducing intermediate variables

• Consider a network of boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child).
– You will need n + m*2^n conditional probabilities

• Suppose you realize that what is really influencing the child nodes is some single aggregate function of the parents' values (e.g. the sum of the parents).
– We can introduce a single intermediate node called "sum" which has links from all the n parent nodes, and which separately influences each of the m child nodes

• You will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network!
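The two formulas are easy to compare numerically. This sketch treats the intermediate node as a boolean (a simplification, since a true sum of n parents would have more than two values):

```python
# Parameter counts from the slide: n boolean parents, m boolean children.
def direct(n, m):
    # each parent: 1 prior; each child: 2**n CPT entries (n parents each)
    return n + m * 2 ** n

def with_sum_node(n, m):
    # n priors, one "sum" node with n parents (2**n entries),
    # each child now has just the sum node as parent (2 entries each)
    return n + 2 ** n + 2 * m

n, m = 5, 10
print(direct(n, m), with_sum_node(n, m))  # 325 vs 57
```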

Noisy-OR: we only consider the failure-to-cause probabilities of the causes that hold. If parents U_{j+1}, …, U_k are the ones that hold, then

P(~X | u_{j+1}, …, u_k) = ∏_{i=j+1}^{k} r_i

where r_i is the probability that X fails to hold even though the i-th parent holds.

How about Noisy-AND? (hint: A&B => ~(~A V ~B))

Constructing Belief Networks: Summary

• Decide on what sorts of queries you are interested in answering
– This in turn dictates what factors to model in the network

• Decide on a vocabulary of the variables and their domains for the problem
– Introduce "hidden" variables into the network as needed to make the network "sparse"

• Decide on an order of introduction of variables into the network
– Introducing variables in the causal direction leads to fewer connections (sparse structure) AND easier-to-assess probabilities

• Try to use canonical distributions to specify the CPTs
– Noisy-OR
– Parameterized discrete/continuous distributions
• Such as Poisson, Normal (Gaussian), etc.

Case Study: Pathfinder System

• Domain: lymph node diseases
– Deals with 60 diseases and 100 disease findings

• Versions:
– Pathfinder I: a rule-based system with logical reasoning
– Pathfinder II: tried a variety of approaches for uncertainty
• Simple Bayes reasoning outperformed the rest
– Pathfinder III: simple Bayes reasoning, but with reassessed probabilities
– Pathfinder IV: a Bayesian network was used to handle a variety of conditional dependencies.

• Deciding vocabulary: 8 hours
• Devising the topology of the network: 35 hours
• Assessing the (14,000) probabilities: 40 hours
– Physician experts liked assessing causal probabilities

• Evaluation: 53 "referral" cases
– Pathfinder III: 7.9/10
– Pathfinder IV: 8.9/10 [saves one additional life in every 1000 cases!]
– A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!!