
Local probabilistic models*

    Handout #10

    Daphne Koller

    Stanford University

    January 13, 2000

In the previous chapter, we discussed the representation of global properties of independence by graphs. These properties of independence specified a factorization of the joint distribution into a product of CPDs. Until now, we mostly ignored CPDs. However, it is clear that much of the representational power of networks lies in the ability to represent CPDs. In this chapter we will examine CPDs in more detail. We will describe a range of representations and consider their implications in terms of local properties of independence.

    1 Tabular CPDs

When dealing with joint probabilities of discrete random variables, we can always resort to a tabular representation. Simply put, we can represent P(X | Pa_X) as a table that contains an entry for each joint assignment to X and Pa_X. In order for this to be a proper CPD, we require that all the values are non-negative, and that for each value pa_X, we have

    ∑_{x ∈ Val(X)} P(x | pa_X) = 1.    (1)

It is quite clear that this representation is as general as possible. We can represent every possible discrete CPD using such a table. As we shall also see, tabular CPDs can be used in a natural way in inference algorithms.
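As a concrete illustration, here is a minimal Python sketch of a tabular CPD stored as a dictionary keyed by parent assignments, together with a check of the normalization requirement of Eq. (1). The variable names and the probability values are illustrative, not taken from the text.

    # Tabular CPD P(A | B, E) for a binary variable A with binary parents B and E.
    # Keys are joint parent assignments (b, e); each row maps a value of A to its probability.
    cpd_alarm = {
        (0, 0): {0: 0.999, 1: 0.001},
        (0, 1): {0: 0.71,  1: 0.29},
        (1, 0): {0: 0.06,  1: 0.94},
        (1, 1): {0: 0.05,  1: 0.95},
    }

    def is_valid_cpd(cpd):
        """Check non-negativity and that each row sums to 1, as required by Eq. (1)."""
        for row in cpd.values():
            if any(p < 0 for p in row.values()):
                return False
            if abs(sum(row.values()) - 1.0) > 1e-9:
                return False
        return True

    print(is_valid_cpd(cpd_alarm))   # True
    print(cpd_alarm[(1, 0)][1])      # the entry P(A = 1 | B = 1, E = 0)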

However, aside from having these desirable properties, the tabular representation also suffers from several disadvantages. The most obvious one is that the representation can become large and unwieldy. The number of values we need to describe a CPD is the number of joint assignments to X and Pa_X. Thus, we need |Val(Pa_X)| × |Val(X)| values in a tabular representation.¹ Thus, for example, if we have 5 binary parents of a binary variable X, we need to specify 2^5 = 32 values. Once we have 10 parents, we need to specify 2^10 = 1024 values. Clearly, the number of values grows exponentially in the number of parents.

This can quickly become a serious problem. Consider a medical domain where a symptom, say "high fever", depends on 10 diseases. It would be quite tiresome to ask our expert 1024 questions of the form: "What is the probability of high fever when the patient has disease A, does not have disease B, ...?" Clearly, our expert will lose patience with us at some point!

* Part of a draft for a textbook, co-authored with Nir Friedman, Hebrew University of Jerusalem.
¹ We can save some space by storing the values for |Val(X)| − 1 values of X and deducing the probability of the remaining value via Eq. (1).



This example shows another problem with the tabular representation: it ignores structure within the CPD. If the CPD is such that there is no similarity between the various cases, i.e., each combination of diseases has a drastically different probability of high fever, then the expert might be more patient. However, in this example, as in many others, there is regularity in the parameters for different values of the parents of X. For example, it might be that if the patient suffers from disease A, then she is certain to have high fever, and thus P(X | pa_X) is the same for all values pa_X in which A is true. Indeed, many of the representations we will consider below attempt to explicitly describe such regularities and exploit them to reduce the number of parameters needed to specify a CPD.

Finally, it is clear that if we consider random variables with infinite domains, we cannot store each possible conditional probability in a table.

To avoid these problems we should view CPDs not as tables with all of the conditional probabilities, but rather as functions that, given values of pa_X and x, return the conditional probability P(x | pa_X). This is all we need in order to have a well-defined representation of a BN. In the remainder of the chapter we will explore some of the possible representations of such functions.

    2 Deterministic nodes

One of the simplest types of regular CPDs is the one where X is a deterministic function of Pa_X. That is, there is a function f : Val(Pa_X) → Val(X) such that

    P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise.

For example, X might be the "or" of its parents. Or, in a continuous domain, we might represent P(X | Y, Z) by the function f(y, z) = sin(y + e^z). Of course, the extent to which this representation is more compact than a table (i.e., takes less space in the computer) depends on the expressive power that our language offers us for representing deterministic functions. That is, we might use expressions over a set of basic logical and arithmetic operations to represent f.

It is clear that deterministic relations are useful in modeling many domains. They often allow us to simplify the representation of dependencies (we will see such an example shortly). In addition, in some domains they are naturally occurring. This is particularly true in "artificial domains" such as models of machines and electrical circuits. However, we can also find them in so-called "natural" domains. A simple example is genetics. Recall that the genotype of a person is determined by two copies of each gene. The person's phenotypes are often functions of these values. For example, the gene responsible for determining blood type has three values a, b, and o. If we represent by G1 and G2 the two copies of the gene, and by T the blood type, then we have that:

    T = ab  if one of G1, G2 is a and the other is b
        a   if at least one of G1, G2 is equal to a and the other is either a or o
        b   if at least one of G1, G2 is equal to b and the other is either b or o
        o   if G1 = o and G2 = o
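To make the deterministic-CPD view concrete, here is a minimal Python sketch of this blood-type function and the induced CPD; the string encoding of the values is just for illustration.

    def blood_type(g1, g2):
        """Deterministic function from gene copies G1, G2 in {'a', 'b', 'o'} to blood type T."""
        genes = {g1, g2}
        if genes == {'a', 'b'}:
            return 'ab'
        if 'a' in genes:      # the other copy is a or o
            return 'a'
        if 'b' in genes:      # the other copy is b or o
            return 'b'
        return 'o'            # G1 = o and G2 = o

    def p_blood_type(t, g1, g2):
        """P(T = t | G1 = g1, G2 = g2): 1 if t matches the function value, 0 otherwise."""
        return 1.0 if t == blood_type(g1, g2) else 0.0

    print(blood_type('a', 'o'))          # 'a'
    print(p_blood_type('ab', 'b', 'a'))  # 1.0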

Aside from a compact representation, we get an additional advantage from making the structure explicit: we can represent additional properties of independence. Recall that conditional independence is a numerical property: it is defined using equality of probabilities.


Figure 1: A simple example of a network with a deterministic CPD. The double-line notation represents the fact that C is a deterministic function of A and B.

procedure det-sep(
    Graph,    // network structure
    D,        // set of deterministic nodes
    X, Y, Z   // query
)
    while there is an Xi such that
        (1) Xi ∈ D          // Xi has a deterministic CPD
        (2) Pa_Xi ⊆ Z
    do Z ← Z ∪ {Xi}
    return d-sep_G(X; Y | Z)

Figure 2: Procedure for computing d-separation in the presence of deterministic CPDs.

However, the graphical structure made certain properties of a distribution explicit. This allowed us to deduce that some independencies hold without looking at the numbers. By making structure explicit in the CPD, we can do even more of the same.

Example 2.1: Consider the simple network structure in Figure 1. If C is a deterministic function of A and B, what new conditional independencies do we have? Suppose that we are given the values of A and B. Then, since C is deterministic, we also know the value of C. As a consequence, we get that D and E are independent. Thus, we conclude that I(D; E | A, B) holds in the distribution.

Note that if C were not a deterministic function of A and B, then this independence would not necessarily hold. Indeed, d-separation would not deduce that D and E are independent given A and B.

Can we augment the d-separation procedure to discover independencies I(X; Y | Z) such as this? In our example, the fix is to consider C to be part of the evidence once we have A and B in the evidence. In some situations, we might have variables that are deterministic functions of variables that are themselves deterministic functions of the evidence. Thus, we have to iteratively extend the set of evidence variables to contain all the variables that are determined by it.

This discussion suggests the simple procedure shown in Figure 2. It is easy to convince ourselves that this algorithm is sound, in the same sense that d-separation is sound.
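The following Python sketch of det-sep assumes a helper d_separated(graph, X, Y, Z) that implements ordinary d-separation, and a graph object exposing parents(node); these helpers and data structures are hypothetical, not part of the text.

    def det_sep(graph, det_nodes, X, Y, Z, d_separated):
        """det-sep (Figure 2): grow the evidence set Z with deterministic nodes whose
        parents are all in Z, then run ordinary d-separation on the result."""
        Z = set(Z)
        changed = True
        while changed:
            changed = False
            for node in det_nodes:
                if node not in Z and set(graph.parents(node)) <= Z:
                    Z.add(node)      # node's value is determined by the evidence
                    changed = True
        return d_separated(graph, X, Y, Z)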


Figure 3: A slightly more complex example with deterministic CPDs.

Theorem 2.2: (Soundness of det-sep) Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns true, then P ⊨ I(X; Y | Z) for all distributions P such that P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.

Does this procedure capture all of the independencies implied by the deterministic functions? As with d-separation, the answer has to be qualified. Given only the graph structure and the set of deterministic CPDs, we cannot find additional independencies.

Theorem 2.3: (Completeness of det-sep) Let G be a network structure, and let D, X, Y, Z be sets of variables. If det-sep(G, D, X, Y, Z) returns false, then there is a distribution P such that P ⊭ I(X; Y | Z), but P ⊨ Markov(G) and, for each X ∈ D, P(X | Pa_X) is a deterministic CPD.

    Of course, particular deterministic functions can imply additional independencies.

Example 2.4: Consider the network of Figure 3, where C is the exclusive or of A and B. What additional independencies do we have here? In the case of XOR (although not for all other deterministic functions), the values of C and B fully determine that of A. Therefore, we have that I(D; E | B, C) holds in the distribution.

Specific deterministic functions can also induce other independencies, albeit of a different type than the ones we discussed in Chapter ??.

    Example 2.5: Consider the following Bayesian network:

[Figure: X and Y are parents of a deterministic node D, the OR of X and Y; D is in turn the parent of Z.]


and consider what happens if we are given that Y = y1. In this case, we also know that the deterministic node D necessarily has value d1. And, as the value of D is fixed, we can conclude that X and Z are independent. In other words, we have that

    P(Z | X, Y = y1) = P(Z | Y = y1).

On the other hand, if we are given Y = y0, the value of D is not determined, and it depends on the value of X. Hence, the corresponding statement conditioned on y0 is false.

This example shows that deterministic nodes induce a form of independence, but it is different from the standard notion on which we have focused so far. Up to now, we have restricted attention to independence properties of the form I(X; Y | Z), which imply that P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z. Deterministic functions imply a type of independence that only holds for particular values of some variables.

Definition 2.6: Let X, Y, Z be pairwise disjoint sets of variables, let C be a set of variables (that might overlap with X ∪ Y ∪ Z), and let c ∈ Val(C). We say that X and Y are contextually independent given Z and the context c, denoted Ic(X; Y | Z, c), if

    P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0.

We call this form of independence context-specific independence (CSI). In the example above, we would say that Ic(X; Z | y1).

Example 2.7: Consider again the network of Figure 3, but assume that C is the deterministic OR of A and B. In this case, knowing C and B does not always tell us the value of A. However, if C is known to be false, then A and B are both known to be false, and therefore they are independent. Thus, we have that Ic(A; B | c0). As a consequence, we also have that Ic(D; E | c0).

    3 Asymmetric dependencies

Aside from deterministic functions, what other types of regularity can we find in CPDs? A common type of regularity is when we have precisely the same effect in several contexts. We can see such a regularity in a modified version of the Alarm example.

Example 3.1: Suppose that the house owner often forgets to turn on the alarm. To model this, we add to the network of Example ?? an additional variable "On" that denotes whether the alarm was turned on on that day. The structure of the modified network is shown in Figure 4.

Now, we need to describe the CPD P(A | O, B, E). Clearly, if the alarm was not turned on, i.e., O = o0, then it would not be active regardless of the values of B and E. This implies that in the four cases corresponding to values of O, B, E in which O = o0, the probability of alarm is zero (or a very, very small number such as 10^-10 if we want to account for extremely unlikely occurrences such as lightning strikes temporarily powering the alarm). That is, P(a1 | o0, b, e) = ε for all values b and e.


Figure 4: (a) The Alarm example modified to consider the probability that the alarm was turned on. (b) The reduced graph after we remove spurious arcs given the context o0.

Figure 5: Two tree representations for CPDs of P(A | O, B, E). Internal nodes in the tree denote tests on parent variables. Leaves are annotated with the probability of A = a1.


In this simple example, we have a CPD in which four possible values of Pa_A describe the same conditional probability over A. How do we represent such regularity? A simple approach is to use a tree representation.

For example, Figure 5 shows two trees we might consider for the CPD of A in Example 3.1. Given a tree, we find P(A | o, b, e) by traversing the tree from the root downward. At each internal node, we see a test on one of the attributes. For example, at the root node of the tree in Figure 5(a) there is a test on the value of O. We then follow the branch labeled with the value of O given in the case we are interested in. Thus, if O = o0, we would reach the leaf labeled with ε. Once we reach a leaf, we return the conditional distribution associated with the leaf.

Formally, we use the following definition of trees.

Definition 3.2: A CPD-tree representing a CPD for variable X is a rooted tree; each t-node in the tree is either a leaf t-node or an interior t-node. Each leaf is labeled with a distribution P(X). Each interior t-node is (a) labeled with some variable Z ∈ Pa_X, and (b) associated with a set of arcs to its children, one arc for each zi ∈ Val(Z), with each arc labeled by some zi.

A branch through a CPD-tree is a sequence of t-nodes and arcs beginning at the root and proceeding to a leaf t-node. The assignment induced by a branch β is the assignment to the set Z ⊆ Pa_X where each element Z ∈ Z labels an interior t-node of β and is assigned the value z that labels the corresponding arc lying on β. We generally assume that a decision tree is irredundant, that is, no branch β contains two interior t-nodes labeled by the same variable.

Note that, to avoid confusion, we use t-nodes and arcs for a CPD-tree, as opposed to nodes and edges for a BN.

To illustrate this definition, consider the tree in Figure 5(a). There are five branches in this tree. One induces the assignment o0, and corresponds to the situation where the alarm was turned off. The other four induce complete assignments to all the parents of A: ⟨o1, b0, e0⟩, ⟨o1, b1, e0⟩, ⟨o1, b0, e1⟩, and ⟨o1, b1, e1⟩. Thus, this representation breaks down the conditional distribution of A given its parents into five conditions, by grouping some of the conditions into one.
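The following Python sketch (data structures and numbers illustrative, not taken from the text) encodes a tree of the shape of Figure 5(a) and the lookup described above: an interior t-node tests one parent and each leaf stores a distribution over A.

    # A leaf is a dict {value_of_A: probability}; an interior t-node is a pair
    # (tested_variable, {value: subtree}).  eps stands for the tiny probability
    # of a spontaneous alarm when the alarm is not turned on.
    eps = 1e-6
    tree_a = ('O', {
        0: {1: eps, 0: 1 - eps},                       # O = o0: alarm not turned on
        1: ('B', {
            0: ('E', {0: {1: 0.0001, 0: 0.9999}, 1: {1: 0.6, 0: 0.4}}),
            1: ('E', {0: {1: 0.9,    0: 0.1},    1: {1: 0.96, 0: 0.04}}),
        }),
    })

    def lookup(tree, assignment):
        """Traverse the CPD-tree with a dict of parent values; return the leaf distribution."""
        while isinstance(tree, tuple):
            var, children = tree
            tree = children[assignment[var]]
        return tree

    print(lookup(tree_a, {'O': 1, 'B': 1, 'E': 0})[1])   # P(a1 | o1, b1, e0)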

To elaborate the representation a little, consider a somewhat different example.

Example 3.3: Suppose we now have a different alarm system, where the wires cannot move and so an earthquake cannot directly cause a contact in the wires and trigger the alarm. This alarm system uses a sensitive motion sensor, one that is set off even by the motion of objects caused by the earthquake. A burglary causes the alarm if the burglar did not manage to disable the alarm. However, once the burglar disables the alarm, an earthquake can no longer set it off. What type of interaction do we get now? Now we know that P(A | o1, b1, e0) and P(A | o1, b1, e1) are the same: if a burglary attempt succeeded, the alarm is disabled and would not be triggered by an earthquake. On the other hand, if it failed, the alarm is set off, and again, an earthquake would not change the final outcome.

This type of regularity is represented by the tree in Figure 5(b). In this tree there is one branch that induces the assignment o1, b1. Thus, for both cases we mention above, we would use the same conditional distribution.

Regularities of this type occur in many other situations. As one example, we can have a Wet variable, denoting whether I get wet; that variable would depend on the Raining variable, but only in the context where I am outside. Another very common situation where this type of regularity occurs is when we have actions in our model; in these cases, the set of parents for a variable may vary considerably based on my action. For example, let us revisit the Travel-time example from


Figure 6: (a) A network for the travel-time example, and (b) a tree representation of the CPD for P(Time | Road, T101, T280).

Figure 7: Two equivalent trees.

the previous chapter. Recall that Time, the travel time for getting to work, may depend on both Traffic101 and Traffic280, but only on the one corresponding to the road I actually took. Figure 6 shows how we might represent this example. Configuration variables also result in such situations. As a real-life example, in a printer diagnosis BN, the printer can be hooked up to the net either via an ethernet cable or via a local cable. The status of the ethernet cable only affects the printer if the printer is hooked up to it.

What is the semantics of the tree representation? As we have seen, to compute P(X | pa_X) we need to find the unique branch that is consistent with pa_X and return the distribution associated with it. Thus, the form of the tree is not crucial; only the assignments defined by the branches are. The two trees in Figure 7 are equivalent in the sense that they define the same branches, and assign the same conditional probabilities to each branch.

If we abstract away from the details of the tree representation, we see that what we are representing are the partitions of Val(Pa_X) that are defined by the branches in a tree. This also allows us to see what can be represented as a tree. All the partitions defined by a tree must have a description via an assignment to a subset of the variables. Thus, we cannot represent the partition that contains only o1, b0, e1 and o1, b1, e0. (Of course, we can use two branches with the same conditional probability in this example, but then we are not capturing some parts of the structure of the CPD.)

This immediately suggests other possible representations of partitions. For example, we might use logical formulas to describe partitions. This is a very flexible representation that can describe any partition we might consider, but the formulas might get quite long.

Can we characterize the regularities represented by a tree, or more generally, by any representation of partitions? As for deterministic CPDs, these structures induce properties of context-specific independence. If we consider Example 3.1, then once we know that the alarm is off, A is independent of B and E. Again, this is a context-specific independence that holds only for a particular value of O. In other words, the CPD for P(A | O, B, E) satisfies Ic(A; B, E | o0). The tree of Figure 5(b) describes another CSI, Ic(A; E | o1, b1): if the alarm is on and there was a burglary, the alarm sound cannot be influenced by an earthquake.

These two examples might suggest that the only contexts that induce CSI are those defined by complete branches, ones that go all the way from the root of a CPD-tree to a leaf. This is not necessarily the case.


Consider the CPD of Figure 6. In this example, once we have decided to drive via highway 101, my travel time does not depend on the traffic load on highway 280. Thus, we have the property Ic(Time; T280 | Road = 101).

Of course, we want a systematic way of deducing CSI properties from a tree representation. To do so, we need to consider how a specific context influences a tree. Consider again the tree of Figure 5(b), and suppose we are given the context b1. Clearly, we now should focus only on branches that are consistent with this value. There are two such branches. One induces the assignment o0 and the other the assignment o1, b1. We can immediately see that the choice between these two branches does not depend on the value of E. Thus, we conclude that Ic(A; E | b1) holds in this case.

This line of reasoning can be generalized by using the following definition.

Definition 3.4: Let T be a decision tree over some set of variables Z, and let c ∈ Val(C), with C ⊆ Z, be a context. The reduced tree with respect to c, denoted T^c, is defined recursively as follows. Let r be the root of T.

- If r is a leaf t-node: T^c = T.
- If r is an interior t-node, then it is labeled with some variable Z, and T consists of r together with immediate subtrees T_1, ..., T_k, associated with the values z_1, ..., z_k of Val(Z):
  - if Z is not in C: we set T^c to be r together with the subtrees T_1^c, ..., T_k^c;
  - if Z is in C: we set T^c = T_j^c, where T_j is the subtree associated with the arc labeled with the value z_j ∈ c.

The reduced tree is the tree we need to traverse in order to get to the conditional probability if we know that C = c. If a variable does not appear in the reduced tree, then the choice of conditional distribution does not depend on it.
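A direct Python transcription of Definition 3.4, using the same tuple/dict tree encoding as the earlier sketch (a tuple is an interior t-node, anything else is a leaf); both helpers are illustrative only.

    def reduce_tree(tree, context):
        """Return the reduced tree T^c for a context given as a dict {variable: value}."""
        if not isinstance(tree, tuple):        # r is a leaf: T^c = T
            return tree
        var, children = tree
        if var in context:                     # Z is in C: keep only the chosen subtree
            return reduce_tree(children[context[var]], context)
        # Z is not in C: keep the root and reduce every subtree
        return (var, {z: reduce_tree(sub, context) for z, sub in children.items()})

    def tests(tree, var):
        """Does the (reduced) tree test the variable var anywhere?  See Proposition 3.5."""
        if not isinstance(tree, tuple):
            return False
        v, children = tree
        return v == var or any(tests(sub, var) for sub in children.values())

For the tree of Figure 5(b), reduce_tree applied with the context {'B': 1} (i.e., b1) would yield a tree that no longer tests E, witnessing Ic(A; E | b1).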

Proposition 3.5: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, let c ∈ Val(C) for C ⊆ Pa_X be a context, and let Z ⊆ Pa_X. If T^c does not test any variable in Z, then P ⊨ Ic(X; Z | Pa_X − Z, c).

This proposition gives us a computational tool for deducing "local" CSI relations from the tree representation. We can check in linear time whether a variable Z is tested in the reduced tree given a context. This procedure, however, is incomplete in two ways. First, since the procedure does not examine the actual parameter values, it can miss additional independencies that are true for the specific parameter assignments. However, as in the case of completeness for d-separation in BNs, this violation only occurs in degenerate cases. In this case, the degeneracy required to induce a violation of completeness is even more obvious than for BNs: if P satisfies an independence of the form Ic(X; Z | Pa_X − Z, c) which is not reported by this procedure, then two of the distributions at the leaves of the CPD-tree must be identical.

Proposition 3.6: Let P(X | Pa_X) be a CPD that can be represented by a CPD-tree T, where all of the distributions at the leaves of the tree are distinct. Then for any C, Z ⊆ Pa_X and c ∈ Val(C), we have that T^c does not test any variable in Z if and only if P ⊨ Ic(X; Z | Pa_X − Z, c).

The more severe limitation of this procedure is that it only tests for independencies between X and some of its parents given a context and the other parents. Are there other, more global, implications of such CSI relations?


procedure CSI-sep(
    Graph,    // network structure
    P,        // a distribution that satisfies Markov(G)
    c,        // a context
    X, Y, Z   // query
)
    let G' be a duplicate of G
    for each edge Y → X in G
        if Y → X is spurious given c in P then
            remove Y → X from G'
    return d-sep_G'(X; Y | Z, C)

Figure 8: Procedure for computing d-separation in the presence of asymmetric dependencies in CPDs.

Consider Example 3.1 again. Suppose we know that the alarm is off (i.e., O = o0). Then, our intuition is that hearing a radio report regarding an earthquake would not affect the probability of receiving a phone call from the neighbor: since the alarm is off, an earthquake cannot trigger it, and so the probability of alarm does not increase due to the higher probability that there was an earthquake. (Note that when the alarm is on, we should anticipate a phone call after hearing the news report; see Section ??.)

Can we capture this intuition formally? Consider the dependence structure in the context O = o0. Intuitively, in this context the edge E → A is redundant, since we know that Ic(A; E | o0). Thus, our intuition is that we should check for d-separation in the graph without this edge. Indeed, we can show that this is a sound check for CSI conditions.

We start by formally defining the set of parents that are irrelevant given a context. Intuitively, we want to say that Y is irrelevant if X is independent of Y given the context and the other parents. We have to be careful though, since the context might include other variables that are not in the family of X and that can cause X and Y to be dependent in a non-local fashion (e.g., c contains a common descendant of both X and Y). Thus, we use the following definition.

Definition 3.7: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), and let c be a context. Define c|Z to be the context restricted to the variables in Z. An edge Y → X in G is spurious in the context c if Ic(X; Y | Pa_X − Y, c|Pa_X) holds in P.

It is easy to see that if we represent CPDs with decision trees, then we can determine whether an edge is spurious or not by examining the reduced tree. An edge Y → X is spurious if Y does not appear in the reduced tree for P(X | Pa_X). Thus, for trees, this definition has an efficient procedural implementation. For many other representations of asymmetric CPDs we also have efficient procedures for identifying spurious edges.

Now we can define a variant of d-separation that takes CSI into account. This procedure is straightforward: we use local considerations to remove spurious edges, and then apply standard d-separation to the resulting graph. See Figure 8 for pseudo-code for this procedure.
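A minimal Python sketch of CSI-sep, under the same assumptions as the earlier sketches: a hypothetical d_separated helper, a graph object with copy(), edges(), parents() and remove_edge(), and per-node CPD-trees in the tuple/dict encoding, so that the spurious-edge test can reuse reduce_tree and tests from above.

    def csi_sep(graph, cpd_trees, context, X, Y, Z, d_separated):
        """CSI-sep (Figure 8): drop edges that are spurious in the context, then run d-separation."""
        g = graph.copy()
        for (parent, child) in list(graph.edges()):
            # Restrict the context to the parents of child, reduce the child's CPD-tree,
            # and call the edge spurious if the reduced tree no longer tests parent.
            local_ctx = {v: context[v] for v in graph.parents(child) if v in context}
            reduced = reduce_tree(cpd_trees[child], local_ctx)
            if not tests(reduced, parent):
                g.remove_edge(parent, child)
        return d_separated(g, X, Y, set(Z) | set(context))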

As an example, reconsider the modified Alarm example, with the context O = o0. In this case, we get that the arcs B → A and E → A are spurious, and thus the reduced graph is the one shown in Figure 4(b). As we can see, R and C are d-separated in the reduced graph. Thus, using CSI-separation we get that R and C are d-separated given the context o0.


An immediate question that we should address is whether this procedure is reliable. That is, is it sound? As expected, it is not hard to show that it is indeed sound.

Theorem 3.8: Let G be a network structure, let P be a distribution such that P ⊨ Markov(G), let c be a context, and let X, Y, Z be sets of variables. If CSI-sep(G, P, c, X, Y, Z) returns true, then P ⊨ Ic(X; Y | Z, c).

    Proof: See Exercise ??

Of course, we also want to know whether CSI-separation is complete. That is, does it report all the independencies in the distribution? Here the answer is more complex. In general, CSI-separation is not complete.

To see a simple counterexample, consider the example of Figure 6. In this example, CSI-separation will report that T101 and T280 are separated given Time and the context Road = 101. (To see this, note that T280 → Time is spurious given Road = 101, and thus there is no path between the two variables.) Similarly, if we consider the context Road = 280, we also have that T101 and T280 are separated given Time. Thus, reasoning by cases, we conclude that once we know the value of Road, we have that T101 and T280 are independent given Time.

Can we get this conclusion using CSI-separation? Unfortunately, in general, the answer is no. If we invoke CSI-separation with the empty context, then no edges are spurious and CSI-separation reduces to d-separation. Since both T101 and T280 are parents of Time, we conclude that they are not separated given Time and Road.

The problem here is that CSI-separation does not perform reasoning by cases. Of course, if we want to determine whether X and Y are independent given Z and a context c, we can invoke CSI-separation on the context c, z for each possible value z of Z, and see if X and Y are separated in all of these contexts. This procedure, however, is exponential in the number of variables in Z. Thus, it is practical only for small evidence sets. Can we do better than reasoning by cases? The answer is that sometimes we cannot. See Exercise ?? for a more detailed examination of this issue.

4 Independence of causal influence

We now describe another, very different, type of structure in the local probability model. Let us reconsider the Alarm example, but now make different assumptions about the alarm. Why does a burglary cause the alarm to go off? Perhaps because it activates the motion sensors. Why does an earthquake cause the alarm to go off? Perhaps because it jiggles some wires. But what happens if both occur? We can assume that these are two independent causal mechanisms, and that the alarm fails to go off only if neither of these two mechanisms works.

Assume that P(a1 | b1, e0) = 0.9 and P(a1 | b0, e1) = 0.6. In the case b1, e1, the burglary fails to set off the alarm with probability 0.1, the earthquake fails to set it off with probability 0.4, the alarm fails to go off only if both mechanisms fail, and these failures occur independently; hence, the alarm fails to go off with probability 0.1 × 0.4 = 0.04. In other words, our CPD for P(A | B, E) is:

          b0, e0   b0, e1   b1, e0   b1, e1
    a1      0        0.6      0.9      0.96
    a0      1        0.4      0.1      0.04

Here, we assume for simplicity that there are no spontaneous alarms that are not caused by one of these mechanisms. We relax this assumption later on.


Figure 9: Decomposition of the noisy-or model for Alarm.

An alternative way of understanding this interaction is by assuming that the behavior of the alarm is the one induced by a more elaborate probabilistic model, as represented by the network fragment in Figure 9. This figure represents the conditional distribution for the Alarm node given Burglary and Earthquake; it also uses two intermediate nodes that reveal the associated causal mechanisms. It is easy to verify that the conditional distribution P(A | B, E) induced by this network is precisely the one shown above.

The probability that B causes A (0.9 in this example) is called the noise parameter, and denoted λ_B. In the context of our decomposition, λ_B = P(m1 | b1). Similarly, we have a noise parameter λ_E, which in this context is λ_E = P(j1 | e1). We can also put in a leak probability that represents the probability that the alarm would go off spontaneously, by introducing another node into the network. This node has no parents, and is true with probability λ_0 = 0.0001. It is also a parent of the Alarm node, which remains a deterministic or.

The decomposition of this CPD clearly shows why this local probability model is called a noisy-or model. The basic interaction of the effect with its causes is that of an or, but there is some noise in the "effective value" of each cause.

We can define this model in the more general setting:

Definition 4.1: Let A be a binary-valued random variable with n binary-valued parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) is a noisy-or if there are n + 1 noise parameters λ_0, λ_1, ..., λ_n such that

    P(a0 | X1, ..., Xn) = (1 − λ_0) ∏_{i : Xi = xi1} (1 − λ_i)
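A small Python sketch of Definition 4.1; the call at the end reproduces the Alarm table above, with λ_B = 0.9, λ_E = 0.6, and no leak (λ_0 = 0).

    def noisy_or(lambdas, leak, parent_values):
        """Noisy-or CPD: return P(a1 | x1, ..., xn) for binary parents.

        lambdas[i] is the noise parameter of parent i, leak is lambda_0, and
        parent_values[i] is 1 if parent i is true.
        P(a0 | x) = (1 - lambda_0) * product over active parents of (1 - lambda_i).
        """
        p_a0 = 1.0 - leak
        for lam, x in zip(lambdas, parent_values):
            if x == 1:
                p_a0 *= 1.0 - lam
        return 1.0 - p_a0

    # Reproduce the Alarm table: parents (B, E) with lambda_B = 0.9, lambda_E = 0.6.
    for b, e in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((b, e), noisy_or([0.9, 0.6], 0.0, [b, e]))
    # -> 0.0, 0.6, 0.9, 0.96 (up to floating point)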

The noisy-or interaction is a special case of a general class of local probability models, called causal independence, or independence of causal influence. These models all share the property that the influence of multiple causes can be decomposed into separate influences of each one. More precisely:

Definition 4.2: Let A be a random variable with parents X1, ..., Xn. The CPD P(A | X1, ..., Xn) exhibits independence of causal influence if it can be induced by a network fragment of the structure shown in Figure 10, where the CPD of A is a deterministic function.

Independence of causal influence is a very useful model, with many instantiations, corresponding to different noise models P(Yi | Xi) and different deterministic functions.


Figure 10: Independence of causal influence.

For example, the noisy-max model is very useful in medical diagnosis. Here, the parents Xi correspond to the severity of various diseases that the patient might have, the Yi correspond to the extent to which these diseases influence a particular symptom, and the severity of the symptom (e.g., Fever) is a max of these severities.

These types of models turn out to be very useful in practice, both because of their cognitive plausibility and because they provide a significant reduction in the number of parameters required to represent the distribution. The number of parameters in the CPD is linear in the number of parents, as opposed to the usual exponential. Consider, for example, the CPCS network, developed for the diagnosis of various internal diseases, as shown in Figure ??. The network is specified using 8254 parameters, as opposed to almost 134 million (133,931,430) for a network with full CPTs. Causal independence also has computational benefits, which we will discuss later.

5 Hierarchical Models

Another very useful type of local probability model is one where the CPD is itself defined via a Bayesian network fragment. As a very simple example, consider our decomposition of the noisy-or CPD for the Alarm variable. There, our decomposition represented a model of how the alarm really worked on the inside. The model included explicit variables for the various relevant attributes of the alarm, along with their dependency model. These internal variables and their dependencies were encapsulated inside the Alarm model. Externally, to the rest of the network, we could still view the alarm as a single node with its two inputs: Burglary and Earthquake. All of the internal structure was encapsulated within the alarm. The entire network fragment was a structured description of something which, for the rest of the network, behaved exactly like a simple CPD.

In general, we can have a hierarchical model where the CPD for a variable X given its parents Y1, ..., Yk is defined via a separate Bayesian network fragment. That fragment has Y1, ..., Yk as inputs; i.e., the fragment doesn't specify a distribution over these variables. Rather, the fragment specifies a conditional distribution over the rest of the variables in the fragment, given the inputs. The output of the fragment is the variable X itself. By marginalizing over all of these internal variables (all except the inputs and outputs), the network represents a conditional distribution P(X | Y1, ..., Yk), as desired. Again, the definition of the distribution is implicit, but can be computed when necessary. This implicit definition can be much more compact than a full CPT.

This type of hierarchical model is clearly useful in device diagnosis tasks. There, the device is composed of many other devices. As far as the rest of the model is concerned, the internals of each component are not relevant; only its external behavior is.


Figure 11: The CPCS network for diagnosis of internal diseases. The network contains 448 nodes and 906 links.


Figure 12: Four levels of hierarchy in an OOBN model of a computer system.

We can encapsulate the internal attributes of a component, making only its external behavior observable to the rest of the model. There is no reason to restrict our model to a single output attribute: the external model might depend on several aspects of the component's status.

Furthermore, by defining a probability model for a type of object, say a disk drive, we can reuse it several times, e.g., if we have several disk drives in our computer system.

In Figure 12 we show a simple hierarchical model for a computer system. This language contains probabilistic classes for Computer, Motherboard, OS, Hard-Drive, Drive-Mechanism, Drive-Motor, and Disk-Surface. The Computer model has an attribute Has-Hard-Drive of class Hard-Drive; the Hard-Drive class, in turn, has an attribute Has-Drive-Mechanism of class Drive-Mechanism. We can reuse our model to easily represent situations where an object has several components of the same type; for example, the Hard-Drive model contains attributes Has-Surface-1, Has-Surface-2, Has-Surface-3, and Has-Surface-4, all of class Disk-Surface. There are also a large number of simple attributes, such as Hard-Drive.Status with values { Good, Minor-Damage, Major-Damage, Unreadable }. Most of the different components, in fact, have a Status attribute. Although they have the same name, they are in fact different attributes, because they are attributes of different objects.

The Hard-Drive class has inputs Temperature, Age, and OS-Status, and outputs Status and Full. Although the hard drive has a rich internal state, the only aspects of its state that influence objects outside the hard drive are whether or not it is working properly and whether or not it is full. The value of the Temperature input of the hard drive in a computer will be obtained from the value of the Temperature attribute of the computer itself. A similar process happens for other inputs.

Besides showing the dependency graph for the classes Computer, Hard-Drive, Drive-Mechanism, and Drive-Motor, the figure also indicates other aspects of the class model.


Complex attributes (ones with a hierarchical model) are shown as rectangles, while simple attributes are ellipses. Each class model is contained in a box. Input attributes intersect the top edge of the box, indicating the fact that their values are received from outside the class, while output attributes intersect the bottom. The rectangles representing the complex components also have little bubbles on their borders, showing that attributes are passed into and out of those components.

    6 Continuous Variables

So far, we have restricted attention to discrete variables with finitely many values. What if one or more variables have infinitely many values? Clearly, we can't even consider the idea of using tables as a representation. This situation is quite common: many of the attributes we want to represent actually take values in a continuous space: temperature, velocity, location, pressure, etc. One solution, which is often used, is to discretize these variables. While this is often done, it is not ideal. In order to get a reasonably accurate model, we often have to use a fairly fine discretization, leading to very large CPTs. For example, in the application of probabilistic models to robot localization (which we will discuss later on), the resolution required for the discretized version was 2 degrees for the angle (resulting in 180 values for the variable) and 15cm for the x and y location variables. For a reasonably sized environment, the resulting representation had around 150 million states.

The view of a probability distribution as a function allows us to provide an alternative solution. All we need is a way of representing the CPD P(X | Pa_X) in some computer-readable form.

    6.1 Density functions

To understand this issue better, let's consider what a continuous distribution looks like. A probability density function (PDF) p is shown in Figure 13.

The probability of the variable being in some range [a, b] is simply

    P(X ∈ [a, b]) = ∫_a^b p(x) dx.

In particular,

    ∫_{−∞}^{+∞} p(x) dx = 1.

It is important to understand the difference between the density function p and the associated probability distribution P. At one level, we can view the height of the density function p at each point as representing the "probability" of the variable taking that value. However, that perspective is somewhat simplified. First, the actual probability of any given value x is 0. Furthermore, the value p(x) is not necessarily in the range [0, 1]. (The only requirement is that the function be non-negative and integrate to 1.) A somewhat more accurate intuition is that the "height" p(x) is the "contribution" that the value x adds to the integral that allows us to compute P.

As is usual for continuous functions, we represent them using some algebraic formula. There are many classes of density functions, each associated with some particular template for the algebraic formula. Specific densities are instantiations of this template, with actual values substituted for certain parameters in the template. The most commonly used density function is the Gaussian (normal) distribution. In the univariate case, the Gaussian distribution is parameterized by two parameters: a mean μ and a variance σ².


Figure 13: Three univariate Gaussians. (a) Mean 0 and variance 1. (b) Mean 1 and variance 1. (c) Mean 0 and variance 4.

In the univariate case, the Gaussian distribution is parameterized by two parameters: a mean $\mu$ and a variance $\sigma^2$. The template has the following form:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

We typically denote this density function using the notation $N(\mu; \sigma^2)$. Intuitively, we can view the expression inside the exponent as (half) the squared number of standard deviations $\sigma$ that $x$ lies away from the mean $\mu$. The more standard deviations $x$ is from the mean, the lower its density. In fact, the density decays exponentially as $x$ gets further away from the mean. Figure 13 shows three examples of Gaussian distributions, for different values of the parameters.
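As a quick check of the template, the following sketch (ours) evaluates the Gaussian density for the three parameter settings plotted in Figure 13.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(mu; sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The three densities shown in Figure 13: N(0,1), N(1,1), N(0,4).
for mu, sigma2 in [(0.0, 1.0), (1.0, 1.0), (0.0, 4.0)]:
    print(f"N({mu},{sigma2}) at x=0: {gaussian_pdf(0.0, mu, sigma2):.4f}")

# Increasing the variance spreads the mass out, so the peak height drops:
# N(0,4) peaks at about 0.20, versus about 0.40 for N(0,1).
```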

    6.2 Conditional distributions

Marginal PDFs are a useful building block, but a BN node is associated with a conditional distribution. In a hybrid probabilistic model, one involving both discrete and continuous variables, there are four types of dependencies we should think about representing:

• a discrete node with a discrete parent
• a continuous node with a discrete parent
• a continuous node with a continuous parent
• a discrete node with a continuous parent

Of these, the first is the case that we have been exploring until now. We give only one example of each of the others, simply to illustrate the basic principles.

Let us first consider a continuous node with a discrete parent. As we discussed above, one possible CPD for a single continuous node X is the Gaussian distribution; this can be represented using two parameters: the mean and the variance. The simplest way of making the continuous node X depend on a discrete node U is to define a different set of parameters for every value of the discrete parent. More precisely, for every value $u \in \mathrm{Val}(U)$, the CPD for X has parameters $\mu_u$ and $\sigma^2_u$. The CPD for X is then:
$$p(X \mid u) = N(\mu_u; \sigma^2_u).$$
It is clear that this model extends easily to multiple discrete parents $\mathbf{U}$: we simply have a different set of parameters for every instantiation of values $\mathbf{u} \in \mathrm{Val}(\mathbf{U})$.
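A minimal way to realize such a CPD in code is a table mapping each parent value to its (mean, variance) pair; the parent values and numbers below are invented purely for illustration.

```python
import random

# One (mu_u, sigma^2_u) pair for every value u of the discrete parent U.
# Hypothetical example: X is body temperature, U is presence of a disease.
params = {
    "disease_absent":  (37.0, 0.25),
    "disease_present": (39.5, 1.00),
}

def sample_x_given_u(u):
    """Sample X ~ N(mu_u; sigma^2_u) for the observed parent value u."""
    mu, sigma2 = params[u]
    return random.gauss(mu, sigma2 ** 0.5)

print(sample_x_given_u("disease_present"))
```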


Now, let's consider a continuous node X with a continuous parent Y. Again, one simple solution is to decide to model the distribution of X as a Gaussian, whose parameters depend on the value of Y. In this case, we need to have a set of parameters for every one of the infinitely many values $y \in \mathrm{Val}(Y)$. The simplest and most common solution is to decide that the mean of X is a linear function of Y, and that the variance of X does not depend on Y. For example, we might have that
$$p(X \mid y) = N(-2y + 0.9;\ 1).$$

This type of dependence is called a linear Gaussian model. It extends to multiple continuous parents in a straightforward way:

Definition 6.1: Let X be a continuous node with continuous parents $Y_1, \ldots, Y_k$. We say that X has a linear Gaussian model if there exist parameters $a_0, \ldots, a_k$ and $\sigma^2$ such that
$$p(X \mid y_1, \ldots, y_k) = N(a_0 + a_1 y_1 + \cdots + a_k y_k;\ \sigma^2).$$

We can easily extend this model, of course, to have the mean and variance of X depend on the value y of Y in any way we want. For example, we might have that the mean of X is $\sin(y)$ and the variance $y^2/7$. However, the linear Gaussian model is a very natural one, which is useful in many practical applications. One reason is that this type of linear dependence is often quite natural: the position of a robot at time t can often be viewed as a linear function of its position at time $t-1$ and its velocity at time $t-1$, with some white (Gaussian) noise. Another reason is that linear Gaussian dependencies give rise to multivariate Gaussian joint distributions.
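Here is a sketch of a linear Gaussian CPD in the spirit of Definition 6.1, using the robot-motion interpretation mentioned above; the coefficients and noise level are our own illustrative choices.

```python
import random

def linear_gaussian_sample(ys, a, sigma2):
    """Sample X ~ N(a_0 + a_1*y_1 + ... + a_k*y_k; sigma^2)."""
    mean = a[0] + sum(a_i * y_i for a_i, y_i in zip(a[1:], ys))
    return random.gauss(mean, sigma2 ** 0.5)

# Hypothetical robot-motion model: the position at time t is a linear
# function of the previous position and velocity, plus Gaussian noise.
prev_position, prev_velocity, dt = 2.0, 0.5, 1.0
new_position = linear_gaussian_sample(
    ys=[prev_position, prev_velocity],
    a=[0.0, 1.0, dt],     # mean = previous position + dt * velocity
    sigma2=0.01,
)
print(new_position)
```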

More precisely, let $X_1, \ldots, X_n$ be a set of random variables. We say that a joint PDF over $X_1, \ldots, X_n$ is a multivariate Gaussian if it has the form:
$$p(X_1, \ldots, X_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu}$ is the n-dimensional mean vector and $\Sigma$ is an $n \times n$ covariance matrix, in which $\Sigma_{i,i}$ represents the variance of $X_i$ and $\Sigma_{i,j}$ for $i \neq j$ represents the covariance of $X_i$ and $X_j$. Figure 14 shows two multivariate Gaussians, one where the covariances are zero, and one where they are positive.

It turns out that continuous Bayesian networks with linear Gaussian models are equivalent to multivariate Gaussians:

Theorem 6.2: Every continuous Bayesian network where all of the dependency models are linear Gaussian defines a multivariate Gaussian distribution. Conversely, every multivariate Gaussian distribution can be represented as a Bayesian network with linear Gaussian models.

In fact, every multivariate Gaussian distribution (except the one where all variables are independent) has multiple representations as a BN, with different structures. For example, the distribution in Figure 14(b) can be represented either as the network where $X \rightarrow Y$ or as the network where $Y \rightarrow X$.
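To see Theorem 6.2 at work on the two-variable network $X \rightarrow Y$, the following sketch (our own, with made-up parameters) computes the joint mean and covariance implied by the linear Gaussian CPDs and checks them against samples drawn from the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear Gaussian network X -> Y:
#   X ~ N(1, 4)   and   Y | x ~ N(0.5 * x + 2, 1)
mu_x, var_x = 1.0, 4.0
a0, a1, var_y = 2.0, 0.5, 1.0

# The joint distribution the network defines is a bivariate Gaussian:
mean = np.array([mu_x, a0 + a1 * mu_x])
cov = np.array([[var_x,      a1 * var_x],
                [a1 * var_x, a1 ** 2 * var_x + var_y]])

# Monte Carlo check: forward-sample the network and compare moments.
x = rng.normal(mu_x, np.sqrt(var_x), size=200_000)
y = rng.normal(a0 + a1 * x, np.sqrt(var_y))
samples = np.stack([x, y], axis=1)

print(mean, cov, sep="\n")
print(samples.mean(axis=0))            # close to `mean`
print(np.cov(samples, rowvar=False))   # close to `cov`
```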

We can extend the class of linear Gaussian networks to allow the continuous nodes to have discrete parents as well. The idea is the same as the one used above. If a node X has continuous parents $Y_1, \ldots, Y_k$ and discrete parents $\mathbf{U}$, we simply parameterize it as follows: for every $\mathbf{u} \in \mathrm{Val}(\mathbf{U})$, we have $a_{\mathbf{u},0}, \ldots, a_{\mathbf{u},k}$ and $\sigma^2_{\mathbf{u}}$. Then
$$p(X \mid \mathbf{y}, \mathbf{u}) = N\!\left(a_{\mathbf{u},0} + \sum_{i=1}^k a_{\mathbf{u},i}\, y_i;\ \sigma^2_{\mathbf{u}}\right).$$
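Putting the two previous pieces together, a conditional linear Gaussian CPD is just a table of linear Gaussian parameter sets indexed by the discrete parent instantiation; the parameter values below are again purely illustrative.

```python
import random

# For each value u of the discrete parent(s): ([a_{u,0}, ..., a_{u,k}], sigma^2_u).
clg_params = {
    "smooth_terrain": ([0.0, 1.0, 1.0], 0.05),   # hypothetical: reliable motion
    "rough_terrain":  ([0.0, 1.0, 0.8], 0.50),   # hypothetical: noisier motion
}

def clg_sample(u, ys):
    """Sample X ~ N(a_{u,0} + sum_i a_{u,i} * y_i; sigma^2_u)."""
    a, sigma2 = clg_params[u]
    mean = a[0] + sum(a_i * y_i for a_i, y_i in zip(a[1:], ys))
    return random.gauss(mean, sigma2 ** 0.5)

print(clg_sample("rough_terrain", [2.0, 0.5]))
```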



Figure 14: Gaussians over two variables X and Y. (a) X and Y uncorrelated. (b) X and Y correlated.

This dependency model is called a conditional linear Gaussian. It induces joint distributions that are mixtures (weighted averages) of Gaussians, with one component in the mixture for each value of the discrete network variables, and the weight of the component being the probability of that value. Note that the conditional linear Gaussian model does not allow continuous nodes to have discrete children.

Finally, we move to the case of a discrete child with a continuous parent. The simplest model is a threshold model. Assume we have a binary discrete node U with a continuous parent Y. We can define:
$$P(u^1 \mid y) = \begin{cases} 0.9 & y \le 65 \\ 0.05 & \text{otherwise} \end{cases}$$
Such a model may be appropriate, for example, if Y is the temperature (in degrees Fahrenheit) and U is the thermostat turning the heater on.
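A direct rendering of this threshold CPD as code, using the 65-degree threshold from the example above (the direction of the comparison follows the thermostat interpretation: the heater is probably on when it is cold):

```python
def p_heater_on(temperature_f):
    """Threshold CPD P(U = u1 | y): the heater is probably on when the
    temperature is at or below 65 degrees Fahrenheit, probably off above it."""
    return 0.9 if temperature_f <= 65 else 0.05

print(p_heater_on(60))  # 0.9
print(p_heater_on(72))  # 0.05
```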

The problem with the threshold model is that the change in probability is discontinuous as a function of Y. A somewhat more reasonable model is the following softmax model. Intuitively, the softmax CPD defines a set of R regions (for some parameter R of our choice). The regions are defined by a set of R linear functions over the continuous variables. A region is characterized as that part of the space where one particular linear function is higher than all the others. Each region is also associated with some distribution over the values of the discrete child; this distribution is the one used for the variable within this region. The actual CPD is a continuous version of this region-based idea, allowing for smooth transitions between the distributions in neighboring regions of the space.

More precisely, let U be a discrete variable, with continuous parents $\mathbf{Y} = \{Y_1, \ldots, Y_k\}$. Assume that U has m possible values, $\{u^1, u^2, \ldots, u^m\}$. Each of the R regions is defined via two vectors of parameters $\boldsymbol{\theta}^r$ and $\mathbf{p}^r$. The vector $\boldsymbol{\theta}^r$ is a vector of weights $\theta^r_0, \theta^r_1, \ldots, \theta^r_k$ specifying the linear function associated with the region. The vector $\mathbf{p}^r = \{p^r_1, \ldots, p^r_m\}$ is the probability distribution over $u^1, \ldots, u^m$ associated with the region (i.e., $\sum_{j=1}^m p^r_j = 1$). The CPD is now defined as:
$$P(U = u^j \mid \mathbf{Y}) = \sum_{r=1}^R w^r p^r_j, \qquad \text{where } w^r = \frac{\exp\left(\theta^r_0 + \sum_{i=1}^k \theta^r_i Y_i\right)}{\sum_{q=1}^R \exp\left(\theta^q_0 + \sum_{i=1}^k \theta^q_i Y_i\right)}.$$
In other words, the distribution is a weighted average of the region distributions, where the weight of each "region" depends exponentially on how high the value of its defining linear function is, relative to the rest. The choice of the $\boldsymbol{\theta}^r$ determines both the regions and the slope of the transitions between them; the choice of the $\mathbf{p}^r$ determines the distribution defining each region.

Figure 15: Expressive power of a generalized softmax CPD. (a) A CPD for a binary variable with R = 4 regions. (b) P(C=low | X), P(C=medium | X), and P(C=high | X) for a three-valued sensor.

The power to choose the number of regions R to be as large as we wish is the key to the rich expressive power of the generalized softmax CPD. Figure 15 demonstrates this expressivity. In Figure 15(a), we present an example CPD for a binary variable with R = 4 regions. In Figure 15(b), we show how this CPD can be used to represent a simple classifier. Here, U is a sensor with three values: low, medium, and high. The probability of each of these values depends on the value of the continuous parent Y. Note that we can easily accommodate a variety of noise models for the sensor: we can make it less reliable in borderline situations by making the transitions between regions more moderate; we can make it inherently more noisy by having the probabilities of the different values in each of the regions be farther away from 0 and 1.
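The following sketch (ours, with invented parameter values) implements the generalized softmax CPD exactly as defined above, for a single continuous parent and a three-valued sensor like the one in Figure 15(b).

```python
import numpy as np

def softmax_cpd(y, thetas, region_dists):
    """P(U | y) = sum_r w^r p^r, with w^r proportional to exp(theta^r_0 + sum_i theta^r_i y_i).

    y:            length-k vector of continuous parent values.
    thetas:       R x (k+1) array; row r holds (theta^r_0, theta^r_1, ..., theta^r_k).
    region_dists: R x m array; row r is the distribution p^r over the m values of U.
    """
    y = np.asarray(y, dtype=float)
    scores = thetas[:, 0] + thetas[:, 1:] @ y   # linear function for each region
    w = np.exp(scores - scores.max())           # subtracting the max leaves the weights unchanged
    w /= w.sum()
    return w @ region_dists                     # weighted average of region distributions

# Hypothetical 3-region CPD for a sensor U with values (low, medium, high)
# and one continuous parent Y.
thetas = np.array([[ 5.0, -4.0],     # region winning for small y
                   [ 0.0,  0.0],     # region winning for intermediate y
                   [-5.0,  4.0]])    # region winning for large y
region_dists = np.array([[0.90, 0.08, 0.02],
                         [0.10, 0.80, 0.10],
                         [0.02, 0.08, 0.90]])

for y in (-1.0, 0.5, 2.0):
    print(y, softmax_cpd([y], thetas, region_dists))
```

Making the rows of `thetas` less extreme produces more gradual transitions between regions, and pulling the entries of `region_dists` away from 0 and 1 makes the sensor inherently noisier, exactly as described above.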

As with the conditional linear Gaussian CPD, our softmax CPD will have a separate component for each instantiation of the discrete parents.

We have chosen to focus on a small set of models. Of course, there is an unlimited range of representations that we can use: any parametric representation for a function of the appropriate type is fine in principle. Indeed, the continuous distributions used for the robot localization application described at the beginning of this section were not all linear Gaussian models. The only difficulty, as far as representation is concerned, is in creating a language that allows for it. Other tasks, such as inference and learning, are a different issue. As we will see, these tasks are difficult even for very simple linear Gaussian hybrid models.