
Local probabilistic models*

    Handout #10

    Daphne Koller

    Stanford University

    January 13, 2000

In the previous chapter, we discussed the representation of global properties of independence by graphs. These properties of independence specified a factorization of the joint distribution into a product of CPDs. Until now, we mostly ignored CPDs. However, it is clear that much of the representational power of networks lies in the ability to represent CPDs. In this chapter we will examine CPDs in more detail. We will describe a range of representations and consider their implications in terms of local properties of independence.

    1 Tabular CPDs

When dealing with joint probabilities of discrete random variables, we can always resort to a tabular representation. Simply put, we can represent P(X | Pa_X) as a table that contains an entry for each joint assignment to X and Pa_X. In order for this to be a proper CPD, we require that all the values are non-negative, and that for each value pa_X, we have

    ∑_{x ∈ Val(X)} P(x | pa_X) = 1.    (1)

It is quite clear that this representation is as general as possible. We can represent every possible discrete CPD using such a table. As we shall also see, tabular CPDs can be used in a natural way in inference algorithms.
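As a concrete illustration, here is a minimal Python sketch of a tabular CPD stored as a dictionary keyed by parent assignments, together with a check of the normalization requirement of Eq. (1). The variable names and the probability values are illustrative, not taken from the text.

    # Tabular CPD P(A | B, E) for a binary variable A with binary parents B and E.
    # Keys are joint parent assignments (b, e); each row maps a value of A to its probability.
    cpd_alarm = {
        (0, 0): {0: 0.999, 1: 0.001},
        (0, 1): {0: 0.71,  1: 0.29},
        (1, 0): {0: 0.06,  1: 0.94},
        (1, 1): {0: 0.05,  1: 0.95},
    }

    def is_valid_cpd(cpd):
        """Check non-negativity and that each row sums to 1, as required by Eq. (1)."""
        for row in cpd.values():
            if any(p < 0 for p in row.values()):
                return False
            if abs(sum(row.values()) - 1.0) > 1e-9:
                return False
        return True

    print(is_valid_cpd(cpd_alarm))   # True
    print(cpd_alarm[(1, 0)][1])      # the entry P(A = 1 | B = 1, E = 0)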

However, aside from having these desirable properties, the tabular representation also suffers from several disadvantages. The most obvious one is that the representation can become large and unwieldy. The number of values we need to describe a CPD is the number of joint assignments to X and Pa_X. Thus, we need |Val(Pa_X)| × |Val(X)| values in a tabular representation.¹ Thus, for example, if we have 5 binary parents of a binary variable X, we need to specify 2^5 = 32 values. Once we have 10 parents, we need to specify 2^10 = 1024 values. Clearly, the number of values grows exponentially in the number of parents.

This can quickly become a serious problem. Consider a medical domain where a symptom, say "high fever", depends on 10 diseases. It would be quite tiresome to ask our expert 1024 questions of the form: "What is the probability of high fever when the patient has disease A, does not have disease B, ...?" Clearly, our expert will lose patience with us at some point!

* Part of a draft for a textbook, co-authored with Nir Friedman, Hebrew University of Jerusalem.
¹ We can save some space by storing the values for |Val(X)| − 1 values of X and deducing the probability of the remaining value via Eq. (1).



This example shows another problem with the tabular representation: it ignores structure within the CPD. If the CPD is such that there is no similarity between the various cases, i.e., each combination of diseases has a drastically different probability of high fever, then the expert might be more patient. However, in this example, as in many others, there is regularity in the parameters for different values of the parents of X. For example, it might be that if the patient suffers from disease A, then she is certain to have high fever, and thus P(X | pa_X) is the same for all values pa_X in which A is true. Indeed, many of the representations we will consider below attempt to explicitly describe such regularities and exploit them to reduce the number of parameters needed to specify a CPD.

Finally, it is clear that if we consider random variables with infinite domains, we cannot store each possible conditional probability in a table.

To avoid these problems we should view CPDs not as tables with all of the conditional probabilities, but rather as functions that, given values of pa_X and x, return the conditional probability P(x | pa_X). This is all we need in order to have a well-defined representation of a BN. In the remainder of the chapter we will explore some of the possible representations of such functions.

    2 Deterministic nodes

One of the simplest types of regular CPDs is the one where X is a deterministic function of Pa_X. That is, there is a function f : Val(Pa_X) → Val(X) such that

    P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise.

For example, X might be the "or" of its parents. Or, in a continuous domain, we might represent P(X | Y, Z) by the function f(y, z) = sin(y + e^z). Of course, the extent to which this representation is more compact than a table (i.e., takes less space in the computer) depends on the expressive power that our language offers us for representing deterministic functions. That is, we might use expressions over a set of basic logical and arithmetic operations to represent f.

It is clear that deterministic relations are useful in modeling many domains. They often allow us to simplify the representation of dependencies (we will see such an example shortly). In addition, in some domains they are naturally occurring. This is particularly true in "artificial domains" such as models of machines and electrical circuits. However, we can also find them in so-called "natural" domains. A simple example is genetics. Recall that the genotype of a person is determined by two copies of each gene. The person's phenotypes are often functions of these values. For example, the gene responsible for determining blood type has three values a, b, and o. If we represent by G1 and G2 the two copies of the gene, and by T the blood type, then we have that:

    T = ab  if one of G1, G2 is a and the other is b
        a   if at least one of G1, G2 is equal to a and the other is either a or o
        b   if at least one of G1, G2 is equal to b and the other is either b or o
        o   if G1 = o and G2 = o
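To make the deterministic-CPD view concrete, here is a minimal Python sketch of this blood-type function and the induced CPD; the string encoding of the values is just for illustration.

    def blood_type(g1, g2):
        """Deterministic function from gene copies G1, G2 in {'a', 'b', 'o'} to blood type T."""
        genes = {g1, g2}
        if genes == {'a', 'b'}:
            return 'ab'
        if 'a' in genes:      # the other copy is a or o
            return 'a'
        if 'b' in genes:      # the other copy is b or o
            return 'b'
        return 'o'            # G1 = o and G2 = o

    def p_blood_type(t, g1, g2):
        """P(T = t | G1 = g1, G2 = g2): 1 if t matches the function value, 0 otherwise."""
        return 1.0 if t == blood_type(g1, g2) else 0.0

    print(blood_type('a', 'o'))          # 'a'
    print(p_blood_type('ab', 'b', 'a'))  # 1.0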

Aside from a compact representation, we get an additional advantage from making the structure explicit: we can represent additional properties of independence. Recall that conditional independence is a numerical property: it is defined using equality of probabilities.


Figure 1: A simple example of a network with a deterministic CPD. The double-line notation represents the fact that C is a deterministic function of A and B.

procedure det-sep(
    Graph,    // network structure
    D,        // set of deterministic nodes
    X, Y, Z   // query
)
    while there is an Xi such that
        (1) Xi ∈ D          // Xi has a deterministic CPD
        (2) Pa_Xi ⊆ Z
    do Z ← Z ∪ {Xi}
    return d-sep_G(X; Y | Z)

Figure 2: Procedure for computing d-separation in the presence of deterministic CPDs.

However, the graphical structure made certain properties of a distribution explicit. This allowed us to deduce that some independencies hold without looking at the numbers. By making structure explicit in the CPD, we can do even more of the same.

Example 2.1: Consider the simple network structure in Figure 1. If C is a deterministic function of A and B, what new conditional independencies do we have? Suppose that we are given the values of A and B. Then, since C is deterministic, we also know the value of C. As a consequence, we get that D and E are independent. Thus, we conclude that I(D; E | A, B) holds in the distribution.

Note that if C were not a deterministic function of A and B, then this independence would not necessarily hold. Indeed, d-separation would not deduce that D and E are independent given A and B.

Can we augment the d-separation procedure to discover independencies I(X; Y | Z) such as this? In our example, the fix is to consider C to be part of the evidence once we have A and B in the evidence. In some situations, we might have variables that are deterministic functions of variables that are themselves deterministic functions of the evidence. Thus, we have to iteratively extend the set of evidence variables to contain all the variables that are determined by it.

This discussion suggests the simple procedure shown in Figure 2. It is easy to convince ourselves that this algorithm is sound, in the same sense that d-separation is sound.
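The following Python sketch of det-sep assumes a helper d_separated(graph, X, Y, Z) that implements ordinary d-separation, and a graph object exposing parents(node); these helpers and data structures are hypothetical, not part of the text.

    def det_sep(graph, det_nodes, X, Y, Z, d_separated):
        """det-sep (Figure 2): grow the evidence set Z with deterministic nodes whose
        parents are all in Z, then run ordinary d-separation on the result."""
        Z = set(Z)
        changed = True
        while changed:
            changed = False
            for node in det_nodes:
                if node not in Z and set(graph.parents(node)) <= Z:
                    Z.add(node)      # node's value is determined by the evidence
                    changed = True
        return d_separated(graph, X, Y, Z)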


Figure 3: A slightly more complex example with deterministic CPDs.

Theorem 2.2: (Soundness of det-sep) Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns true, then P ⊨ I(X; Y | Z) for all distributions P such that P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.

Does this procedure capture all of the independencies implied by the deterministic functions? As with d-separation, the answer has to be qualified. Given only the graph structure and the set of deterministic CPDs, we cannot find additional independencies.

Theorem 2.3: (Completeness of det-sep) Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns false, then there is a distribution P such that P ⊭ I(X; Y | Z), but P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.

    Of course, particular deterministic functions can imply additional independencies.

Example 2.4: Consider the network of Figure 3, where C is the exclusive or of A and B. What additional independencies do we have here? In the case of XOR (although not for all other deterministic functions), the values of C and B fully determine that of A. Therefore, we have that I(D; E | B, C) holds in the distribution.

Specific deterministic functions can also induce other independencies, albeit of a different type than the ones we discussed in Chapter ??.

    Example 2.5: Consider the following Bayesian network:

[Figure: X and Y are parents of a deterministic node D, the OR of X and Y; D is in turn the parent of Z.]


and consider what happens if we are given that Y = y1. In this case, we also know that the deterministic node D necessarily has value d1. And, as the value of D is fixed, we can conclude that X and Z are independent. In other words, we have that

    P(Z | X, Y = y1) = P(Z | Y = y1).

On the other hand, if we are given Y = y0, the value of D is not determined, and it depends on the value of X. Hence, the corresponding statement conditioned on y0 is false.

This example shows that deterministic nodes induce a form of independence, but it is different from the standard notion on which we have focused so far. Up to now, we have restricted attention to independence properties of the form I(X; Y | Z), which imply that P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z. Deterministic functions imply a type of independence that only holds for particular values of some variables.

Definition 2.6: Let X, Y, Z be pairwise disjoint sets of variables, let C be a set of variables (that might overlap with X ∪ Y ∪ Z), and let c ∈ Val(C). We say that X and Y are contextually independent given Z and the context c, denoted Ic(X; Y | Z, c), if

    P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0.

We call this form of independence context-specific independence (CSI). In the example above, we would say that Ic(X; Z | y1).

Example 2.7: Consider again the network of Figure 3, but assume that C is the deterministic OR of A and B. In this case, knowing C and B does not always tell us the value of A. However, if C is known to be false, then A and B are both known to be false, and therefore they are independent. Thus, we have that Ic(A; B | c0). As a consequence, we also have that Ic(D; E | c0).

    3 Asymmetric dependencies

Aside from deterministic functions, what other types of regularity can we find in CPDs? A common type of regularity is when we have precisely the same effect in several contexts. We can see such a regularity in a modified version of the Alarm example.

Example 3.1: Suppose that the house owner often forgets to turn on the alarm. To model this, we add to the network of Example ?? an additional variable "On" that denotes whether the alarm was turned on on that day. The structure of the modified network is shown in Figure 4.

Now, we need to describe the CPD P(A | O, B, E). Clearly, if the alarm was not turned on, i.e., O = o0, then it would not be active regardless of the values of B and E. This implies that in the four cases corresponding to values of O, B, E in which O = o0, the probability of alarm is zero (or a very, very small number such as 10^-10 if we want to account for extremely unlikely occurrences such as lightning strikes temporarily powering the alarm). That is, P(a1 | o0, b, e) = ε for all values b and e.


Figure 4: (a) The Alarm example modified to consider the probability that the alarm was turned on. (b) The reduced graph after we remove spurious arcs given the context o0.

Figure 5: Two tree representations for CPDs of P(A | O, B, E). Internal nodes in the tree denote tests on parent variables. Leaves are annotated with the probability of A = a1.


In this simple example, we have a CPD in which four possible values of Pa_A describe the same conditional probability over A. How do we represent such regularity? A simple approach is to use a tree representation.

For example, Figure 5 shows two trees we might consider for the CPD of A in Example 3.1. Given a tree, we find P(A | o, b, e) by traversing the tree from the root downward. At each internal node, we see a test on one of the attributes. For example, at the root node of the tree in Figure 5(a) there is a test on the value of O. We then follow the branch labeled with the value of O given in the case we are interested in. Thus, if O = o0, we would reach the leaf labeled with ε. Once we reach a leaf, we return the conditional distribution associated with the leaf.

Formally, we use the following definition of trees.

Definition 3.2: A CPD-tree representing a CPD for variable X is a rooted tree; each t-node in the tree is either a leaf t-node or an interior t-node. Each leaf is labeled with a distribution P(X). Each interior t-node is (a) labeled with some variable Z ∈ Pa_X, and (b) associated with a set of arcs to its children, one arc for each zi ∈ Val(Z), with each arc labeled by some zi.

A branch through a CPD-tree is a sequence of t-nodes and arcs beginning at the root and proceeding to a leaf t-node. The assignment induced by a branch β is the assignment to the set Z ⊆ Pa_X where each element Z ∈ Z labels an interior t-node of β and is assigned the value z that labels the corresponding arc lying on β. We generally assume that a decision tree is irredundant, that is, no branch β contains two interior t-nodes labeled by the same variable.

Note that, to avoid confusion, we use t-nodes and arcs for a CPD-tree, as opposed to nodes and edges for a BN.

To illustrate this definition, consider the tree in Figure 5(a). There are five branches in this tree. One induces the assignment o0, and corresponds to the situation where the alarm was turned off. The other four induce complete assignments to all the parents of A: ⟨o1, b0, e0⟩, ⟨o1, b1, e0⟩, ⟨o1, b0, e1⟩, and ⟨o1, b1, e1⟩. Thus, this representation breaks down the conditional distribution of A given its parents into five conditions, by grouping some of the conditions into one.
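The following Python sketch (data structures and numbers illustrative, not taken from the text) encodes a tree of the shape of Figure 5(a) and the lookup described above: an interior t-node tests one parent and each leaf stores a distribution over A.

    # A leaf is a dict {value_of_A: probability}; an interior t-node is a pair
    # (tested_variable, {value: subtree}).  eps stands for the tiny probability
    # of a spontaneous alarm when the alarm is not turned on.
    eps = 1e-6
    tree_a = ('O', {
        0: {1: eps, 0: 1 - eps},                       # O = o0: alarm not turned on
        1: ('B', {
            0: ('E', {0: {1: 0.0001, 0: 0.9999}, 1: {1: 0.6, 0: 0.4}}),
            1: ('E', {0: {1: 0.9,    0: 0.1},    1: {1: 0.96, 0: 0.04}}),
        }),
    })

    def lookup(tree, assignment):
        """Traverse the CPD-tree with a dict of parent values; return the leaf distribution."""
        while isinstance(tree, tuple):
            var, children = tree
            tree = children[assignment[var]]
        return tree

    print(lookup(tree_a, {'O': 1, 'B': 1, 'E': 0})[1])   # P(a1 | o1, b1, e0)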

To elaborate the representation a little, consider a somewhat different example.

Example 3.3: Suppose we now have a different alarm system, where the wires cannot move and so an earthquake cannot directly cause a contact in the wires and trigger the alarm. This alarm system uses a sensitive motion sensor, one that is set off even by the motion of objects caused by the earthquake. A burglary causes the alarm if the burglar did not manage to disable the alarm. However, once the burglar disables the alarm, an earthquake can no longer set it off. What type of interaction do we get now? Now we know that P(A | o1, b1, e0) and P(A | o1, b1, e1) are the same: if a burglary attempt succeeded, the alarm is disabled and would not be triggered by an earthquake. On the other hand, if it failed, the alarm is set off, and again, an earthquake would not change the final outcome.

This type of regularity is represented by the tree in Figure 5(b). In this tree there is one branch that induces the assignment o1, b1. Thus, for both cases we mention above, we would use the same conditional distribution.

Regularities of this type occur in many other situations. As one example, we can have a Wet variable, denoting whether I get wet; that variable would depend on the Raining variable, but only in the context where I am outside. Another very common situation where this type of regularity occurs is when we have actions in our model; in these cases, the set of parents for a variable may vary considerably based on my action. For example, let us revisit the Travel-time example from


Figure 6: (a) A network for the travel-time example, and (b) a tree representation of the CPD for P(Time | Road, T101, T280).

Figure 7: Two equivalent trees.

the previous chapter. Recall that Time, the travel time for getting to work, may depend on both Traffic101 and Traffic280, but only on the one corresponding to the road I actually took. Figure 6 shows how we might represent this example. Configuration variables also result in such situations. As a real-life example, in a printer diagnosis BN, the printer can be hooked up to the net either via an ethernet cable or via a local cable. The status of the ethernet cable only affects the printer if the printer is hooked up to it.

What is the semantics of the tree representation? As we have seen, to compute P(X | pa_X) we need to find the unique branch that is consistent with pa_X and return the distribution associated with it. Thus, the form of the tree is not crucial; only the assignments defined by the branches are. The two trees in Figure 7 are equivalent in the sense that they define the same branches, and assign the same conditional probabilities to each branch.

If we abstract away from the details of the tree representation, we see that what we are representing are the partitions of Val(Pa_X) that are defined by the branches in a tree. This also allows us to see what can be represented as a tree. All the partitions defined by a tree must have a description via an assignment to a subset of the variables. Thus, we cannot represent the partition that contains only o1, b0, e1 and o1, b1, e0. (Of course, we can use two branches with the same conditional probability in this example, but then we are not capturing some parts of the structure of the CPD.)

This immediately suggests other possible representations of partitions. For example, we might use logical formulas to describe partitions. This is a very flexible representation that can describe any partition we might consider, but the formulas might get quite long.

Can we characterize the regularities represented by a tree, or more generally, by any representation of partitions? As for deterministic CPDs, these structures induce properties of context-specific independence. If we consider Example 3.1, then once we know that the alarm is off, A is independent of B and E. Again, this is a context-specific independence that holds only for a particular value of O. In other words, the CPD for P(A | O, B, E) satisfies Ic(A; B, E | o0). The tree of Figure 5(b) describes another CSI, Ic(A; E | o1, b1): if the alarm is on and there was a burglary, the alarm sound cannot be influenced by an earthquake.

These two examples might suggest that the only contexts that induce CSI are those defined by complete branches, ones that go all the way from the root of a CPD-tree to a leaf. This is not necessarily the case.


Consider the CPD of Figure 6. In this example, once we have decided to drive via highway 101, my travel time does not depend on the traffic load on highway 280. Thus, we have the property Ic(Time; T280 | Road = 101).

Of course, we want a systematic way of deducing CSI properties from a tree representation. To do so, we need to consider how a specific context influences a tree. Consider again the tree of Figure 5(b), and suppose we are given the context b1. Clearly, we now should focus only on branches that are consistent with this value. There are two such branches. One induces the assignment o0 and the other the assignment o1, b1. We can immediately see that the choice between these two branches does not depend on the value of E. Thus, we conclude that Ic(A; E | b1) holds in this case.

This line of reasoning can be generalized by using the following definition.

Definition 3.4: Let T be a decision tree over some set of variables Z, and let c ∈ Val(C), with C ⊆ Z, be a context. The reduced tree with respect to c, denoted T^c, is defined recursively as follows. Let r be the root of T.

- If r is a leaf t-node: T^c = T.
- If r is an interior t-node, then it is labeled with some variable Z, and T consists of r together with immediate subtrees T_1, ..., T_k, associated with the values z_1, ..., z_k of Val(Z):
  - if Z is not in C: we set T^c to be r together with the subtrees T_1^c, ..., T_k^c;
  - if Z is in C: we set T^c = T_j^c, where T_j is the subtree associated with the arc labeled with the value z_j ∈ c.

The reduced tree is the tree we need to traverse in order to get to the conditional probability if we know that C = c. If a variable does not appear in the reduced tree, then the choice of conditional distribution does not depend on it.
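A direct Python transcription of Definition 3.4, using the same tuple/dict tree encoding as the earlier sketch (a tuple is an interior t-node, anything else is a leaf); both helpers are illustrative only.

    def reduce_tree(tree, context):
        """Return the reduced tree T^c for a context given as a dict {variable: value}."""
        if not isinstance(tree, tuple):        # r is a leaf: T^c = T
            return tree
        var, children = tree
        if var in context:                     # Z is in C: keep only the chosen subtree
            return reduce_tree(children[context[var]], context)
        # Z is not in C: keep the root and reduce every subtree
        return (var, {z: reduce_tree(sub, context) for z, sub in children.items()})

    def tests(tree, var):
        """Does the (reduced) tree test the variable var anywhere?  See Proposition 3.5."""
        if not isinstance(tree, tuple):
            return False
        v, children = tree
        return v == var or any(tests(sub, var) for sub in children.values())

For the tree of Figure 5(b), reduce_tree applied with the context {'B': 1} (i.e., b1) would yield a tree that no longer tests E, witnessing Ic(A; E | b1).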

Proposition 3.5: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, let c ∈ Val(C) for C ⊆ Pa_X be a context, and let Z ⊆ Pa_X. If T^c does not test any variable in Z, then P ⊨ Ic(X; Z | Pa_X − Z, c).

This proposition gives us a computational tool for deducing "local" CSI relations from the tree representation. We can check in linear time whether a variable Z is tested in the reduced tree given a context. This procedure, however, is incomplete in two ways. First, since the procedure does not examine the actual parameter values, it can miss additional independencies that are true for the specific parameter assignments. However, as in the case of completeness for d-separation in BNs, this violation only occurs in degenerate cases. In this case, the degeneracy required to induce a violation of completeness is even more obvious than for BNs: if P satisfies an independence of the form Ic(X; Z | Pa_X − Z, c) which is not reported by this procedure, then two of the distributions at the leaves of the CPD-tree must be identical.

Proposition 3.6: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, where all of the distributions at the leaves of the tree are distinct. Then for any C, Z ⊆ Pa_X and c ∈ Val(C), we have that T^c does not test any variable in Z if and only if P ⊨ Ic(X; Z | Pa_X − Z, c).

The more severe limitation of this procedure is that it only tests for independencies between X and some of its parents given a context and the other parents. Are there other, more global, implications of such CSI relations?


procedure CSI-sep(
    Graph,    // network structure
    P,        // a distribution that satisfies Markov(G)
    c,        // a context
    X, Y, Z   // query
)
    let G' be a duplicate of G
    for each edge Y → X in G
        if Y → X is spurious given c in P then
            remove Y → X from G'
    return d-sep_G'(X; Y | Z, C)

Figure 8: Procedure for computing d-separation in the presence of asymmetric dependencies in CPDs.

Consider Example 3.1 again. Suppose we know that the alarm is off (i.e., O = o0). Then, our intuition is that hearing a radio report regarding an earthquake would not affect the probability of receiving a phone call from the neighbor: since the alarm is off, an earthquake cannot trigger it, and so the probability of alarm does not increase due to the higher probability that there was an earthquake. (Note that when the alarm is on, we should anticipate a phone call after hearing the news report; see Section ??.)

Can we capture this intuition formally? Consider the dependence structure in the context O = o0. Intuitively, in this context the edge E → A is redundant, since we know that Ic(A; E | o0). Thus, our intuition is that we should check for d-separation in the graph without this edge. Indeed, we can show that this is a sound check for CSI conditions.

We start by formally defining the set of parents that are irrelevant given a context. Intuitively, we want to say that Y is irrelevant if X is independent of Y given the context and the other parents. We have to be careful though, since the context might include other variables that are not in the family of X and that can cause X and Y to be dependent in a non-local fashion (e.g., c contains a common descendant of both X and Y). Thus, we use the following definition.

Definition 3.7: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), and let c be a context. Define c|Z to be the context restricted to the variables in Z. An edge Y → X in G is spurious in the context c if Ic(X; Y | Pa_X − Y, c|Pa_X) holds in P.

It is easy to see that if we represent CPDs with decision trees, then we can determine whether an edge is spurious or not by examining the reduced tree. An edge Y → X is spurious if Y does not appear in the reduced tree for P(X | Pa_X). Thus, for trees, this definition has an efficient procedural implementation. For many other representations of asymmetric CPDs we also have efficient procedures for identifying spurious edges.

Now we can define a variant of d-separation that takes CSI into account. This procedure is straightforward: we use local considerations to remove spurious edges, and then apply standard d-separation to the resulting graph. See Figure 8 for pseudo-code for this procedure.
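A minimal Python sketch of CSI-sep, under the same assumptions as the earlier sketches: a hypothetical d_separated helper, a graph object with copy(), edges(), parents() and remove_edge(), and per-node CPD-trees in the tuple/dict encoding, so that the spurious-edge test can reuse reduce_tree and tests from above.

    def csi_sep(graph, cpd_trees, context, X, Y, Z, d_separated):
        """CSI-sep (Figure 8): drop edges that are spurious in the context, then run d-separation."""
        g = graph.copy()
        for (parent, child) in list(graph.edges()):
            # Restrict the context to the parents of child, reduce the child's CPD-tree,
            # and call the edge spurious if the reduced tree no longer tests parent.
            local_ctx = {v: context[v] for v in graph.parents(child) if v in context}
            reduced = reduce_tree(cpd_trees[child], local_ctx)
            if not tests(reduced, parent):
                g.remove_edge(parent, child)
        return d_separated(g, X, Y, set(Z) | set(context))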

As an example, reconsider the modified Alarm example, with the context O = o0. In this case, we get that the arcs B → A and E → A are spurious, and thus the reduced graph is the one shown in Figure 4(b). As we can see, R and C are d-separated in the reduced graph. Thus, using CSI-separation we get that R and C are d-separated given the context o0.


An immediate question that we should address is whether this procedure is reliable. That is, is it sound? As expected, it is not hard to show that it is indeed sound.

Theorem 3.8: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), let c be a context, and let X, Y, Z be sets of variables. If CSI-sep(G, P, c, X, Y, Z) returns true, then P ⊨ Ic(X; Y | Z, c).

    Proof: See Exercise ??

Of course, we also want to know whether CSI-separation is complete. That is, does it report all the independencies in the distribution? Here the answer is more complex. In general, CSI-separation is not complete.

To see a simple counterexample, consider the example of Figure 6. In this example, CSI-separation will report that T101 and T280 are separated given Time and the context Road = 101. (To see this, note that T280 → Time is spurious given Road = 101, and thus there is no path between the two variables.) Similarly, if we consider the context Road = 280, we also have that T101 and T280 are separated given Time. Thus, reasoning by cases, we conclude that once we know the value of Road, we have that T101 and T280 are independent given Time.

Can we get this conclusion using CSI-separation? Unfortunately, in general, the answer is no. If we invoke CSI-separation with the empty context, then no edges are spurious and CSI-separation reduces to d-separation. Since both T101 and T280 are parents of Time, we conclude that they are not separated given Time and Road.

The problem here is that CSI-separation does not perform reasoning by cases. Of course, if we want to determine whether X and Y are independent given Z and a context c, we can invoke CSI-separation on the context c, z for each possible value z of Z, and see if X and Y are separated in all of these contexts. This procedure, however, is exponential in the number of variables in Z. Thus, it is practical only for small evidence sets. Can we do better than reasoning by cases? The answer is that sometimes we cannot. See Exercise ?? for a more detailed examination of this issue.

4 Independence of causal influence

We now describe another, very different, type of structure in the local probability model. Let us reconsider the Alarm example, but now make different assumptions about the alarm. Why does a burglary cause the alarm to go off? Perhaps because it activates the motion sensors. Why does an earthquake cause the alarm to go off? Perhaps because it jiggles some wires. But what happens if both occur? We can assume that these are two independent causal mechanisms, and that the alarm fails to go off only if neither of these two mechanisms works.

Assume that P(a1 | b1, e0) = 0.9 and P(a1 | b0, e1) = 0.6. In the case b1, e1, the burglary fails to set off the alarm with probability 0.1, the earthquake fails to set it off with probability 0.4, the alarm fails to go off only if both mechanisms fail, and these failures occur independently; hence, the alarm fails to go off with probability 0.1 × 0.4 = 0.04. In other words, our CPD for P(A | B, E) is:

          b0, e0   b0, e1   b1, e0   b1, e1
    a1      0        0.6      0.9      0.96
    a0      1        0.4      0.1      0.04

Here, we assume for simplicity that there are no spontaneous alarms that are not caused by one of these mechanisms. We relax this assumption later on.


Figure 9: Decomposition of the noisy-or model for Alarm.

An alternative way of understanding this interaction is by assuming that the behavior of the alarm is the one induced by a more elaborate probabilistic model, as represented by the network fragment in Figure 9. This figure represents the conditional distribution for the Alarm node given Burglary and Earthquake; it also uses two intermediate nodes that reveal the associated causal mechanisms. It is easy to verify that the conditional distribution P(A | B, E) induced by this network is precisely the one shown above.

The probability that B causes A (0.9 in this example) is called the noise parameter, and denoted λ_B. In the context of our decomposition, λ_B = P(m1 | b1). Similarly, we have a noise parameter λ_E, which in this context is λ_E = P(j1 | e1). We can also put in a leak probability that represents the probability that the alarm would go off spontaneously, by introducing another node into the network. This node has no parents, and is true with probability λ_0 = 0.0001. It is also a parent of the Alarm node, which remains a deterministic or.

The decomposition of this CPD clearly shows why this local probability model is called a noisy-or model. The basic interaction of the effect with its causes is that of an or, but there is some noise in the "effective value" of each cause.

We can define this model in the more general setting:

Definition 4.1: Let A be a binary-valued random variable with n binary-valued parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) is a noisy-or if there are n + 1 noise parameters λ_0, λ_1, ..., λ_n such that

    P(a0 | X1, ..., Xn) = (1 − λ_0) ∏_{i : Xi = xi1} (1 − λ_i)
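A small Python sketch of Definition 4.1; the call at the end reproduces the Alarm table above, with λ_B = 0.9, λ_E = 0.6, and no leak (λ_0 = 0).

    def noisy_or(lambdas, leak, parent_values):
        """Noisy-or CPD: return P(a1 | x1, ..., xn) for binary parents.

        lambdas[i] is the noise parameter of parent i, leak is lambda_0, and
        parent_values[i] is 1 if parent i is true.
        P(a0 | x) = (1 - lambda_0) * product over active parents of (1 - lambda_i).
        """
        p_a0 = 1.0 - leak
        for lam, x in zip(lambdas, parent_values):
            if x == 1:
                p_a0 *= 1.0 - lam
        return 1.0 - p_a0

    # Reproduce the Alarm table: parents (B, E) with lambda_B = 0.9, lambda_E = 0.6.
    for b, e in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((b, e), noisy_or([0.9, 0.6], 0.0, [b, e]))
    # -> 0.0, 0.6, 0.9, 0.96 (up to floating point)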

The noisy-or interaction is a special case of a general class of local probability models, called causal independence, or independence of causal influence. These models all share the property that the influence of multiple causes can be decomposed into separate influences of each one. More precisely:

Definition 4.2: Let A be a random variable with parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) exhibits independence of causal influence if it can be induced by a network fragment of the structure shown in Figure 10, where the CPD of A is a deterministic function.

Independence of causal influence is a very useful model, with many instantiations, corresponding to different noise models P(Yi | Xi) and different deterministic functions.


Figure 10: Independence of causal influence.

For example, the noisy-max model is very useful in medical diagnosis. Here, the parents Xi correspond to the severity of various diseases that the patient might have, the Yi correspond to the extent to which these diseases influence a particular symptom, and the severity of the symptom (e.g., Fever) is a max of these severities.

These types of models turn out to be very useful in practice, both because of their cognitive plausibility and because they provide a significant reduction in the number of parameters required to represent the distribution. The number of parameters in the CPD is linear in the number of parents, as opposed to the usual exponential. Consider, for example, the CPCS network, developed for the diagnosis of various internal diseases, as shown in Figure ??. The network is specified using 8254 parameters, as opposed to almost 134 million (133,931,430) for a network with full CPTs. Causal independence also has computational benefits, which we will discuss later.

5 Hierarchical Models

Another very useful type of local probability model is one where the CPD is itself defined via a Bayesian network fragment. As a very simple example, consider our decomposition of the noisy-or CPD for the Alarm variable. There, our decomposition represented a model of how the alarm really worked on the inside. The model included explicit variables for the various relevant attributes of the alarm, along with their dependency model. These internal variables and their dependencies were encapsulated inside the Alarm model. Externally, to the rest of the network, we could still view the alarm as a single node with its two inputs: Burglary and Earthquake. All of the internal structure was encapsulated within the alarm. The entire network fragment was a structured description of something which, for the rest of the network, behaved exactly like a simple CPD.

In general, we can have a hierarchical model where the CPD for a variable X given its parents Y1, ..., Yk is defined via a separate Bayesian network fragment. That fragment has Y1, ..., Yk as inputs; i.e., the fragment doesn't specify a distribution over these variables. Rather, the fragment specifies a conditional distribution over the rest of the variables in the fragment, given the inputs. The output of the fragment is the variable X itself. By marginalizing over all of these internal variables (all except the inputs and outputs), the network represents a conditional distribution P(X | Y1, ..., Yk), as desired. Again, the definition of the distribution is implicit, but can be computed when necessary. This implicit definition can be much more compact than a full CPT.

This type of hierarchical model is clearly useful in device diagnosis tasks. There, the device is composed of many other devices. As far as the rest of the model is concerned, the internals of each component are not relevant; only its external behavior is.


Figure 11: The CPCS network for diagnosis of internal diseases. The network contains 448 nodes and 906 links.


Figure 12: Four levels of hierarchy in an OOBN model of a computer system.

We can encapsulate the internal attributes of a component, making only its external behavior observable to the rest of the model. There is no reason to restrict our model to a single output attribute: the external model might depend on several aspects of the component's status.

Furthermore, by defining a probability model for a type of object, say a disk drive, we can reuse it several times, e.g., if we have several disk drives in our computer system.

In Figure 12 we show a simple hierarchical model for a computer system. This language contains probabilistic classes for Computer, Motherboard, OS, Hard-Drive, Drive-Mechanism, Drive-Motor, and Disk-Surface. The Computer model has an attribute Has-Hard-Drive of class Hard-Drive; the Hard-Drive class, in turn, has an attribute Has-Drive-Mechanism of class Drive-Mechanism. We can reuse our model to easily represent situations where an object has several components of the same type; for example, the Hard-Drive model contains attributes Has-Surface-1, Has-Surface-2, Has-Surface-3, and Has-Surface-4, all of class Disk-Surface. There are also a large number of simple attributes, such as Hard-Drive.Status with values { Good, Minor-Damage, Major-Damage, Unreadable }. Most of the different components, in fact, have a Status attribute. Although they have the same name, they are in fact different attributes, because they are attributes of different objects.

The Hard-Drive class has inputs Temperature, Age, and OS-Status, and outputs Status and Full. Although the hard drive has a rich internal state, the only aspects of its state that influence objects outside the hard drive are whether or not it is working properly and whether or not it is full. The value of the Temperature input of the hard drive in a computer will be obtained from the value of the Temperature attribute of the computer itself. A similar process happens for other inputs.

Besides showing the dependency graph for the classes Computer, Hard-Drive, Drive-Mechanism, and Drive-Motor, the figure also indicates other aspects of the class model.


Complex attributes (ones with a hierarchical model) are shown as rectangles, while simple attributes are ellipses. Each class model is contained in a box. Input attributes intersect the top edge of the box, indicating the fact that their values are received from outside the class, while output attributes intersect the bottom. The rectangles representing the complex components also have little bubbles on their borders, showing that attributes are passed into and out of those components.

    6 Continuous Variables

So far, we have restricted attention to discrete variables with finitely many values. What if one or more variables have infinitely many values? Clearly, we can't even consider the idea of using tables as a representation. This situation is quite common: many of the attributes we want to represent actually take values in a continuous space: temperature, velocity, location, pressure, etc. One solution, which is often used, is to discretize these variables. While this is often done, it is not ideal. In order to get a reasonably accurate model, we often have to use a fairly fine discretization, leading to very large CPTs. For example, in the application of probabilistic models to robot localization (which we will discuss later on), the resolution required for the discretized version was 2 degrees for the angle (resulting in 180 values for the variable) and 15cm for the x and y location variables. For a reasonably sized environment, the resulting representation had around 150 million states.

The view of a probability distribution as a function allows us to provide an alternative solution. All we need is a way of representing the CPD P(X | Pa_X) in some computer-readable form.

    6.1 Density functions

To understand this issue better, let's consider what a continuous distribution looks like. A probability density function (PDF) p is shown in Figure 13.

The probability of the variable being in some range [a, b] is simply

    P(X ∈ [a, b]) = ∫_a^b p(x) dx.

In particular,

    ∫_{−∞}^{+∞} p(x) dx = 1.

It is important to understand the difference between the density function p and the associated probability distribution P. At one level, we can view the height of the density function p at each point as representing the "probability" of the variable taking that value. However, that perspective is somewhat simplified. First, the actual probability of any given value x is 0. Furthermore, the value p(x) is not necessarily in the range [0, 1]. (The only requirement is that the function be non-negative and integrate to 1.) A somewhat more accurate intuition is that the "height" p(x) is the "contribution" that the value x adds to the integral that allows us to compute P.

As is usual for continuous functions, we represent them using some algebraic formula. There are many classes of density functions, each associated with some particular template for the algebraic formula. Specific densities are instantiations of this template, with actual values substituted for certain parameters in the template. The most commonly used density function is the Gaussian (normal) distribution. In the univariate case, the Gaussian distribution is parameterized by two parameters: a mean μ and a variance σ².


Figure 13: Three univariate Gaussians. (a) Mean 0 and variance 1. (b) Mean 1 and variance 1. (c) Mean 0 and variance 4.

In the univariate case, the Gaussian distribution is parameterized by two parameters: a mean $\mu$ and a variance $\sigma^2$. The template has the following form:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

We typically denote this density function using the notation $N(\mu; \sigma^2)$. Intuitively, we can view the expression inside the exponent as (half) the squared number of standard deviations $\sigma$ that $x$ lies away from the mean $\mu$. The more standard deviations $x$ is from the mean, the lower its density. In fact, the density decays exponentially as $x$ gets further away from the mean. Figure 13 shows three examples of Gaussian distributions, for different values of the parameters.
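As a quick check of the template, the following sketch (ours) evaluates the Gaussian density for the three parameter settings plotted in Figure 13.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(mu; sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The three densities shown in Figure 13: N(0,1), N(1,1), N(0,4).
for mu, sigma2 in [(0.0, 1.0), (1.0, 1.0), (0.0, 4.0)]:
    print(f"N({mu},{sigma2}) at x=0: {gaussian_pdf(0.0, mu, sigma2):.4f}")

# Increasing the variance spreads the mass out, so the peak height drops:
# N(0,4) peaks at about 0.20, versus about 0.40 for N(0,1).
```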

    6.2 Conditional distributions

Marginal PDFs are a useful building block, but a BN node is associated with a conditional distribution. In a hybrid probabilistic model, one involving both discrete and continuous variables, there are four types of dependencies we should think about representing:

• a discrete node with a discrete parent
• a continuous node with a discrete parent
• a continuous node with a continuous parent
• a discrete node with a continuous parent

Of these, the first is the case that we have been exploring until now. We give only one example of each of the others, simply to illustrate the basic principles.

Let us first consider a continuous node with a discrete parent. As we discussed above, one possible CPD for a single continuous node X is the Gaussian distribution; this can be represented using two parameters: the mean and the variance. The simplest way of making the continuous node X depend on a discrete node U is to define a different set of parameters for every value of the discrete parent. More precisely, for every value $u \in \mathrm{Val}(U)$, the CPD for X has parameters $\mu_u$ and $\sigma^2_u$. The CPD for X is then:
$$p(X \mid u) = N(\mu_u; \sigma^2_u).$$
It is clear that this model extends easily to multiple discrete parents $\mathbf{U}$: we simply have a different set of parameters for every instantiation of values $\mathbf{u} \in \mathrm{Val}(\mathbf{U})$.
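A minimal way to realize such a CPD in code is a table mapping each parent value to its (mean, variance) pair; the parent values and numbers below are invented purely for illustration.

```python
import random

# One (mu_u, sigma^2_u) pair for every value u of the discrete parent U.
# Hypothetical example: X is body temperature, U is presence of a disease.
params = {
    "disease_absent":  (37.0, 0.25),
    "disease_present": (39.5, 1.00),
}

def sample_x_given_u(u):
    """Sample X ~ N(mu_u; sigma^2_u) for the observed parent value u."""
    mu, sigma2 = params[u]
    return random.gauss(mu, sigma2 ** 0.5)

print(sample_x_given_u("disease_present"))
```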


Now, let's consider a continuous node X with a continuous parent Y. Again, one simple solution is to decide to model the distribution of X as a Gaussian, whose parameters depend on the value of Y. In this case, we need to have a set of parameters for every one of the infinitely many values $y \in \mathrm{Val}(Y)$. The simplest and most common solution is to decide that the mean of X is a linear function of Y, and that the variance of X does not depend on Y. For example, we might have that
$$p(X \mid y) = N(-2y + 0.9;\ 1).$$

This type of dependence is called a linear Gaussian model. It extends to multiple continuous parents in a straightforward way:

Definition 6.1: Let X be a continuous node with continuous parents $Y_1, \ldots, Y_k$. We say that X has a linear Gaussian model if there exist parameters $a_0, \ldots, a_k$ and $\sigma^2$ such that
$$p(X \mid y_1, \ldots, y_k) = N(a_0 + a_1 y_1 + \cdots + a_k y_k;\ \sigma^2).$$

We can easily extend this model, of course, to have the mean and variance of X depend on the value y of Y in any way we want. For example, we might have that the mean of X is $\sin(y)$ and the variance $y^2/7$. However, the linear Gaussian model is a very natural one, which is useful in many practical applications. One reason is that this type of linear dependence is often quite natural: the position of a robot at time t can often be viewed as a linear function of its position at time $t-1$ and its velocity at time $t-1$, with some white (Gaussian) noise. Another reason is that linear Gaussian dependencies give rise to multivariate Gaussian joint distributions.
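Here is a sketch of a linear Gaussian CPD in the spirit of Definition 6.1, using the robot-motion interpretation mentioned above; the coefficients and noise level are our own illustrative choices.

```python
import random

def linear_gaussian_sample(ys, a, sigma2):
    """Sample X ~ N(a_0 + a_1*y_1 + ... + a_k*y_k; sigma^2)."""
    mean = a[0] + sum(a_i * y_i for a_i, y_i in zip(a[1:], ys))
    return random.gauss(mean, sigma2 ** 0.5)

# Hypothetical robot-motion model: the position at time t is a linear
# function of the previous position and velocity, plus Gaussian noise.
prev_position, prev_velocity, dt = 2.0, 0.5, 1.0
new_position = linear_gaussian_sample(
    ys=[prev_position, prev_velocity],
    a=[0.0, 1.0, dt],     # mean = previous position + dt * velocity
    sigma2=0.01,
)
print(new_position)
```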

More precisely, let $X_1, \ldots, X_n$ be a set of random variables. We say that a joint PDF over $X_1, \ldots, X_n$ is a multivariate Gaussian if it has the form:
$$p(X_1, \ldots, X_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu}$ is the n-dimensional mean vector and $\Sigma$ is an $n \times n$ covariance matrix, in which $\Sigma_{i,i}$ represents the variance of $X_i$ and $\Sigma_{i,j}$ for $i \neq j$ represents the covariance of $X_i$ and $X_j$. Figure 14 shows two multivariate Gaussians, one where the covariances are zero, and one where they are positive.

It turns out that continuous Bayesian networks with linear Gaussian models are equivalent to multivariate Gaussians:

Theorem 6.2: Every continuous Bayesian network where all of the dependency models are linear Gaussian defines a multivariate Gaussian distribution. Conversely, every multivariate Gaussian distribution can be represented as a Bayesian network with linear Gaussian models.

In fact, every multivariate Gaussian distribution (except the one where all variables are independent) has multiple representations as a BN, with different structures. For example, the distribution in Figure 14(b) can be represented either as the network where $X \rightarrow Y$ or as the network where $Y \rightarrow X$.
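To see Theorem 6.2 at work on the two-variable network $X \rightarrow Y$, the following sketch (our own, with made-up parameters) computes the joint mean and covariance implied by the linear Gaussian CPDs and checks them against samples drawn from the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear Gaussian network X -> Y:
#   X ~ N(1, 4)   and   Y | x ~ N(0.5 * x + 2, 1)
mu_x, var_x = 1.0, 4.0
a0, a1, var_y = 2.0, 0.5, 1.0

# The joint distribution the network defines is a bivariate Gaussian:
mean = np.array([mu_x, a0 + a1 * mu_x])
cov = np.array([[var_x,      a1 * var_x],
                [a1 * var_x, a1 ** 2 * var_x + var_y]])

# Monte Carlo check: forward-sample the network and compare moments.
x = rng.normal(mu_x, np.sqrt(var_x), size=200_000)
y = rng.normal(a0 + a1 * x, np.sqrt(var_y))
samples = np.stack([x, y], axis=1)

print(mean, cov, sep="\n")
print(samples.mean(axis=0))            # close to `mean`
print(np.cov(samples, rowvar=False))   # close to `cov`
```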

We can extend the class of linear Gaussian networks to allow the continuous nodes to have discrete parents as well. The idea is the same as the one used above. If a node X has continuous parents $Y_1, \ldots, Y_k$ and discrete parents $\mathbf{U}$, we simply parameterize it as follows: for every $\mathbf{u} \in \mathrm{Val}(\mathbf{U})$, we have $a_{\mathbf{u},0}, \ldots, a_{\mathbf{u},k}$ and $\sigma^2_{\mathbf{u}}$. Then
$$p(X \mid \mathbf{y}, \mathbf{u}) = N\!\left(a_{\mathbf{u},0} + \sum_{i=1}^k a_{\mathbf{u},i}\, y_i;\ \sigma^2_{\mathbf{u}}\right).$$
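Putting the two previous pieces together, a conditional linear Gaussian CPD is just a table of linear Gaussian parameter sets indexed by the discrete parent instantiation; the parameter values below are again purely illustrative.

```python
import random

# For each value u of the discrete parent(s): ([a_{u,0}, ..., a_{u,k}], sigma^2_u).
clg_params = {
    "smooth_terrain": ([0.0, 1.0, 1.0], 0.05),   # hypothetical: reliable motion
    "rough_terrain":  ([0.0, 1.0, 0.8], 0.50),   # hypothetical: noisier motion
}

def clg_sample(u, ys):
    """Sample X ~ N(a_{u,0} + sum_i a_{u,i} * y_i; sigma^2_u)."""
    a, sigma2 = clg_params[u]
    mean = a[0] + sum(a_i * y_i for a_i, y_i in zip(a[1:], ys))
    return random.gauss(mean, sigma2 ** 0.5)

print(clg_sample("rough_terrain", [2.0, 0.5]))
```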



Figure 14: Gaussians over two variables X and Y. (a) X and Y uncorrelated. (b) X and Y correlated.

This dependency model is called a conditional linear Gaussian. It induces joint distributions that are mixtures (weighted averages) of Gaussians, with one component in the mixture for each value of the discrete network variables, and the weight of the component being the probability of that value. Note that the conditional linear Gaussian model does not allow continuous nodes to have discrete children.

Finally, we move to the case of a discrete child with a continuous parent. The simplest model is a threshold model. Assume we have a binary discrete node U with a continuous parent Y. We can define:
$$P(u^1 \mid y) = \begin{cases} 0.9 & y \le 65 \\ 0.05 & \text{otherwise} \end{cases}$$
Such a model may be appropriate, for example, if Y is the temperature (in degrees Fahrenheit) and U is the thermostat turning the heater on.
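A direct rendering of this threshold CPD as code, using the 65-degree threshold from the example above (the direction of the comparison follows the thermostat interpretation: the heater is probably on when it is cold):

```python
def p_heater_on(temperature_f):
    """Threshold CPD P(U = u1 | y): the heater is probably on when the
    temperature is at or below 65 degrees Fahrenheit, probably off above it."""
    return 0.9 if temperature_f <= 65 else 0.05

print(p_heater_on(60))  # 0.9
print(p_heater_on(72))  # 0.05
```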

The problem with the threshold model is that the change in probability is discontinuous as a function of Y. A somewhat more reasonable model is the following softmax model. Intuitively, the softmax CPD defines a set of R regions (for some parameter R of our choice). The regions are defined by a set of R linear functions over the continuous variables. A region is characterized as that part of the space where one particular linear function is higher than all the others. Each region is also associated with some distribution over the values of the discrete child; this distribution is the one used for the variable within this region. The actual CPD is a continuous version of this region-based idea, allowing for smooth transitions between the distributions in neighboring regions of the space.

More precisely, let U be a discrete variable, with continuous parents $\mathbf{Y} = \{Y_1, \ldots, Y_k\}$. Assume that U has m possible values, $\{u^1, u^2, \ldots, u^m\}$. Each of the R regions is defined via two vectors of parameters $\boldsymbol{\theta}^r$ and $\mathbf{p}^r$. The vector $\boldsymbol{\theta}^r$ is a vector of weights $\theta^r_0, \theta^r_1, \ldots, \theta^r_k$ specifying the linear function associated with the region. The vector $\mathbf{p}^r = \{p^r_1, \ldots, p^r_m\}$ is the probability distribution over $u^1, \ldots, u^m$ associated with the region (i.e., $\sum_{j=1}^m p^r_j = 1$). The CPD is now defined as:
$$P(U = u^j \mid \mathbf{Y}) = \sum_{r=1}^R w^r p^r_j, \qquad \text{where } w^r = \frac{\exp\left(\theta^r_0 + \sum_{i=1}^k \theta^r_i Y_i\right)}{\sum_{q=1}^R \exp\left(\theta^q_0 + \sum_{i=1}^k \theta^q_i Y_i\right)}.$$
In other words, the distribution is a weighted average of the region distributions, where the weight of each "region" depends exponentially on how high the value of its defining linear function is, relative to the rest. The choice of the $\boldsymbol{\theta}^r$ determines both the regions and the slope of the transitions between them; the choice of the $\mathbf{p}^r$ determines the distribution defining each region.

Figure 15: Expressive power of a generalized softmax CPD. (a) A CPD for a binary variable with R = 4 regions. (b) P(C=low | X), P(C=medium | X), and P(C=high | X) for a three-valued sensor.

The power to choose the number of regions R to be as large as we wish is the key to the rich expressive power of the generalized softmax CPD. Figure 15 demonstrates this expressivity. In Figure 15(a), we present an example CPD for a binary variable with R = 4 regions. In Figure 15(b), we show how this CPD can be used to represent a simple classifier. Here, U is a sensor with three values: low, medium, and high. The probability of each of these values depends on the value of the continuous parent Y. Note that we can easily accommodate a variety of noise models for the sensor: we can make it less reliable in borderline situations by making the transitions between regions more moderate; we can make it inherently more noisy by having the probabilities of the different values in each of the regions be farther away from 0 and 1.
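The following sketch (ours, with invented parameter values) implements the generalized softmax CPD exactly as defined above, for a single continuous parent and a three-valued sensor like the one in Figure 15(b).

```python
import numpy as np

def softmax_cpd(y, thetas, region_dists):
    """P(U | y) = sum_r w^r p^r, with w^r proportional to exp(theta^r_0 + sum_i theta^r_i y_i).

    y:            length-k vector of continuous parent values.
    thetas:       R x (k+1) array; row r holds (theta^r_0, theta^r_1, ..., theta^r_k).
    region_dists: R x m array; row r is the distribution p^r over the m values of U.
    """
    y = np.asarray(y, dtype=float)
    scores = thetas[:, 0] + thetas[:, 1:] @ y   # linear function for each region
    w = np.exp(scores - scores.max())           # subtracting the max leaves the weights unchanged
    w /= w.sum()
    return w @ region_dists                     # weighted average of region distributions

# Hypothetical 3-region CPD for a sensor U with values (low, medium, high)
# and one continuous parent Y.
thetas = np.array([[ 5.0, -4.0],     # region winning for small y
                   [ 0.0,  0.0],     # region winning for intermediate y
                   [-5.0,  4.0]])    # region winning for large y
region_dists = np.array([[0.90, 0.08, 0.02],
                         [0.10, 0.80, 0.10],
                         [0.02, 0.08, 0.90]])

for y in (-1.0, 0.5, 2.0):
    print(y, softmax_cpd([y], thetas, region_dists))
```

Making the rows of `thetas` less extreme produces more gradual transitions between regions, and pulling the entries of `region_dists` away from 0 and 1 makes the sensor inherently noisier, exactly as described above.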

As with the conditional linear Gaussian CPD, our softmax CPD will have a separate component for each instantiation of the discrete parents.

We have chosen to focus on a small set of models. Of course, there is an unlimited range of representations that we can use: any parametric representation for a function of the appropriate type is fine in principle. Indeed, the continuous distributions used for the robot localization application described at the beginning of this section were not all linear Gaussian models. The only difficulty, as far as representation is concerned, is in creating a language that allows for it. Other tasks, such as inference and learning, are a different issue. As we will see, these tasks are difficult even for very simple linear Gaussian hybrid models.