
    Notes on Bayesian Networks

    Stefano Nasini
    Dept. of Statistics and Operations Research
    Universitat Politècnica de Catalunya

    1 Concepts of Bayesian Networks

    1.1 Introduction

    A Bayesian network (from now on BN) over a set of variables X = {X1, . . . , Xn} represents a joint probability distribution over those variables. It consists of a network structure, which encodes assertions of conditional independence in the distribution, and a set of conditional probability distributions corresponding to that structure. It is graphically represented by a directed acyclic graph, whose nodes denote the random variables, which may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies, so that nodes that are not connected represent variables that are conditionally independent of each other.

    1.2 Easy example

    Suppose that there are two events which could cause a person to cry: pain or embarrassment. Also, suppose that pain has a direct effect on embarrassment (namely, when a person is in pain, he feels more embarrassed). The situation can then be modeled with a BN, as the one in Figure 1. All three variables have two possible values, 1 (indicating the realization of the event) and 0 (the opposite case). The joint probability function is: P(Cry, Pain, Embarrassment) = P(Cry | Pain, Embarrassment) P(Embarrassment | Pain) P(Pain).

    Figure 1: Small example of BN


    We can use this formalization to answer diagnostic inference questions such as "What is the probability that a person is experiencing pain, given that he/she is crying?" by using the conditional probability formula and summing over all nuisance variables:

    P(P | C) = P(P, C) / P(C)
             = [ P(P, C, E) + P(P, C, \bar{E}) ] / P(C)
             = [ P(C | P, E) P(P, E) + P(C | P, \bar{E}) P(P, \bar{E}) ] / P(C)
             = [ P(C | P, E) P(E | P) P(P) + P(C | P, \bar{E}) P(\bar{E} | P) P(P) ] / P(C).        (1)

    Note that P(C) can be decomposed in a similar way, summing over the four joint configurations of Pain and Embarrassment:

    P(C) = P(P, C, E) + P(P, C, \bar{E}) + P(\bar{P}, C, E) + P(\bar{P}, C, \bar{E})
         = P(C | P, E) P(E | P) P(P) + P(C | P, \bar{E}) P(\bar{E} | P) P(P)
         + P(C | \bar{P}, E) P(E | \bar{P}) P(\bar{P}) + P(C | \bar{P}, \bar{E}) P(\bar{E} | \bar{P}) P(\bar{P}).

    Substituting the conditional probability values provided in Figure 1 into (1), we have:

    P(P | C) = [ P(C | P, E) P(E | P) P(P) + P(C | P, \bar{E}) P(\bar{E} | P) P(P) ] / P(C)
             = (0.99 · 0.01 · 0.3 + 0.8 · 0.99 · 0.3) / (0.99 · 0.01 · 0.3 + 0.8 · 0.99 · 0.3 + 0.9 · 0.4 · 0.7)
             = 0.4884.        (2)

    (The fourth term of P(C), corresponding to neither pain nor embarrassment, is omitted because the probability of crying is zero in that case.)
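    The arithmetic in (2) can be verified directly in R, using the probability values exactly as they appear above:

    num <- 0.99 * 0.01 * 0.3 + 0.8 * 0.99 * 0.3   # joint probability P(P, C)
    den <- num + 0.9 * 0.4 * 0.7                  # P(C); the omitted fourth term is zero
    num / den                                     # approximately 0.4884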

    1.3 Conditional independence

    Two events A and B are conditionally independent given a third event C if the conditional probability of A and B given C is equal to the probability of A given C times the probability of B given C, i.e. P(A, B | C) = P(A | C) P(B | C). In other words, A and B are conditionally independent given C when, once C is known, the knowledge of B is irrelevant for the characterization of the probability of A.

    1.4 D-separation

    Under the assumption of conditional independence, each variable in a BN is independent of its ancestors given the values of its parents. With the causal Markov assumption, we can check some conditional independences in Bayesian networks by the d-separation criterion: if two sets of nodes X1 and X2 are d-separated in the BN by a third set Y (excluding X1 and X2), the corresponding variable sets X1 and X2 are independent given the variables in Y. The definition of d-separation is as follows: two sets of nodes X1 and X2 are d-separated in the BN by a third set Y (excluding X1 and X2) if and only if every path between X1 and X2 is blocked, where the term blocked means that there is an intermediate variable Y1 (distinct from X1 and X2) such that:

    - the connection through Y1 is tail-to-tail or tail-to-head and Y1 is instantiated;
    - or, the connection through Y1 is head-to-head and neither Y1 nor any of Y1's descendants have received evidence.

    The graph patterns of tail-to-tail, tail-to-head and head-to-head are shown in Figure 2.

    Figure 2: Divergent, Convergent, Serial pattern
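    These blocking rules can also be checked mechanically. A minimal sketch using the dsep() function of the bnlearn R package on a serial (tail-to-head) pattern, with node names made up purely for illustration:

    library(bnlearn)

    # serial pattern A -> B -> C (made-up node names)
    g <- model2network("[A][B|A][C|B]")

    dsep(g, "A", "C")         # FALSE: the path through B is not blocked
    dsep(g, "A", "C", "B")    # TRUE: instantiating B blocks the tail-to-head connection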


    1.5 Markov blanket

    The Markov blanket of a node X is the set of nodes M(X) composed of X's parents, its children, and its children's other parents. Every set of nodes in the network is conditionally independent of X when conditioned on the Markov blanket of the node: P(X | M(X), Y) = P(X | M(X)) for any Y in the BN which is not in the Markov blanket of X.

    Figure 3: The Markov blanket for a node X

    Thus, the Markov blanket of a node contains all the variables that shield the node from the rest of the network, so that it is the only knowledge needed to predict the behavior of that node.
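    In bnlearn the Markov blanket of a node in a given structure can be obtained with the mb() function. The sketch below uses a small made-up structure in which A and B are parents of X, D is a child of X, and E is D's other parent:

    library(bnlearn)

    # made-up structure: X has parents A and B; D is X's child, with other parent E
    g <- model2network("[A][B][E][X|A:B][D|X:E]")

    mb(g, "X")   # returns A, B, D and E: parents, children and the children's other parents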

    1.6 Parameter learning and structural learning

    There are two components of BN learning:

    i. A network with a defined structure but without specified parameters requires estimating the parameters from a data set containing samples of the network variables. This problem can be stated as maximizing the likelihood of the parameters given the observed data and the network structure.

    ii. Given only data and no network structure, we must learn the structure of the network and then learn the parameters. The structure learning problem is to maximize a score function, which depends both on the data and on the network structure. Search-and-score approaches are the most common methods for finding network structures that fit a data set. The search component is an algorithm whose goal is to identify high-scoring Bayesian network structures. The scoring function returns a score indicating how well a structure fits the given data.

    A common scoring function is the posterior probability of the structure given the training data. The time requirement of an exhaustive search returning a structure that maximizes the score is super-exponential in the number of variables. A local search strategy makes incremental changes aimed at improving the score of the structure.
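    As an illustration, structures can be scored in R with the score() function of the bnlearn package. The sketch below uses bnlearn's built-in learning.test data set and compares a non-trivial candidate structure with the empty graph; the "bde" score is related to the (log) posterior probability of the structure given the data:

    library(bnlearn)
    data(learning.test)   # small discrete data set shipped with bnlearn

    # two candidate structures over the same six variables
    g1 <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")   # a non-empty candidate structure
    g2 <- empty.graph(names(learning.test))               # the empty (independence) graph

    # log BDe scores: higher means the structure fits the data better
    score(g1, learning.test, type = "bde")
    score(g2, learning.test, type = "bde")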

    1.7 Probabilistic logic sampling


    The probabilistic logic sampling algorithm is a simulation procedure first proposed by Henrion in 1988.

    Each node is randomly instantiated to one of its possible states, according to the probability of that state given the instantiated states of its parents. This requires the instantiation to be performed in topological order, i.e., parents are sampled before their children. Nodes with observed states (evidence nodes) are also sampled, but if the outcome of the sampling process is inconsistent with the observed state, the entire sample is discarded.

    Essentially, the algorithm is based on forward generation (i.e., according to the weak ordering implied by the directed graph) of instantiations of nodes guided by their probability. If a generated instantiation of an evidence node is different from its observed value, then the entire sample is discarded. This makes the algorithm very inefficient if the prior probability of the evidence is low. The algorithm is very efficient when no evidence has been observed or the evidence is very likely.
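    In the bnlearn package this rejection scheme corresponds to the method = "ls" (logic sampling) option of cpquery(). A minimal sketch on a made-up two-node network (node names and probabilities invented purely for illustration):

    library(bnlearn)

    # made-up network A -> B with invented probabilities, just to have something to query
    g <- model2network("[A][B|A]")
    cpt.A <- matrix(c(0.3, 0.7), ncol = 2, dimnames = list(NULL, c("1", "0")))
    cpt.B <- matrix(c(0.9, 0.1, 0.2, 0.8), ncol = 2,
                    dimnames = list(B = c("1", "0"), A = c("1", "0")))
    fitted <- custom.fit(g, dist = list(A = cpt.A, B = cpt.B))

    # estimate P(A = 1 | B = 1) by logic sampling: samples with B != 1 are discarded
    cpquery(fitted, event = (A == "1"), evidence = (B == "1"), method = "ls", n = 10^5)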

    2 A computational example with R using the bnlearn package

    3 Sampling from a BN in R

    In this section we randomly generate N instances of 6 binary variables (X1, . . . , X6), in accordance with the structure of dependencies defined by the following BN:

    P(X1 = 1) = 0.3
    P(X2 = 1) = 0.7
    P(X3 = 1 | X1 = 1, X2 = 1) = 0.8
    P(X3 = 1 | X1 = 1, X2 = 0) = 0.6
    P(X3 = 1 | X1 = 0, X2 = 1) = 0.6
    P(X3 = 1 | X1 = 0, X2 = 0) = 0.1
    P(X4 = 1 | X1 = 1) = 0.1
    P(X4 = 1 | X1 = 0) = 0.8
    P(X5 = 1 | X2 = 1) = 0.1
    P(X5 = 1 | X2 = 0) = 0.5
    P(X6 = 1 | X1 = 1, X5 = 1) = 0.9
    P(X6 = 1 | X1 = 1, X5 = 0) = 0.5
    P(X6 = 1 | X1 = 0, X5 = 1) = 0.5
    P(X6 = 1 | X1 = 0, X5 = 0) = 0.1
                                                (3)

    To simulate from this probabilistic structure we use the probabilistic logic sampling algorithm described in the previous section.
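    The R code actually used is reproduced as images in Figures 4 and 5. A minimal sketch of the forward sampling step (there is no evidence here, so no sample is ever rejected), with the parameters of (3) hard-coded, could look as follows:

    set.seed(1)
    N  <- 100
    rb <- function(p) rbinom(N, 1, p)   # N Bernoulli draws with success probability p

    # nodes are sampled in topological order: parents before children
    X1 <- rb(0.3)
    X2 <- rb(0.7)
    X3 <- rb(ifelse(X1 == 1, ifelse(X2 == 1, 0.8, 0.6), ifelse(X2 == 1, 0.6, 0.1)))
    X4 <- rb(ifelse(X1 == 1, 0.1, 0.8))
    X5 <- rb(ifelse(X2 == 1, 0.1, 0.5))
    X6 <- rb(ifelse(X1 == 1, ifelse(X5 == 1, 0.9, 0.5), ifelse(X5 == 1, 0.5, 0.1)))

    sampled.data <- data.frame(X1, X2, X3, X4, X5, X6)
    # bnlearn expects discrete variables to be stored as factors
    sampled.data[] <- lapply(sampled.data, factor, levels = c(0, 1))

    An equivalent sample could also be drawn with bnlearn's custom.fit() and rbn() functions.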

    We generate 100 occurrences of the 6 variables specified in (3) and apply the hill-climbing greedy search to the sampled data.
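    Assuming sampled.data is the data frame generated by the sketch above, the hill-climbing search of bnlearn can be applied as follows:

    library(bnlearn)

    # greedy hill-climbing search over structures, scored on the sampled data
    learned <- hc(sampled.data)
    modelstring(learned)   # learned structure in "[X1][X3|X1:X2]..." string form
    plot(learned)          # to be compared with the generating structure of (3)

    With only 100 observations the recovered structure need not coincide exactly with the generating one.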


    Figure 4: R implementation of the Probabilistic Logic Sampling for the BN plotted in panel (a)


    Figure 5: R implementation of the Probabilistic Logic Sampling for the BN plotted in panel (a) (continued)

          X1 X2 X3 X4 X5 X6
      1    0  1  1  1  0  0
      2    0  1  0  1  0  0
      3    1  1  1  0  1  1
      4    0  1  0  1  0  0
      5    0  1  0  0  0  0
      6    0  0  1  0  1  0
      7    0  0  0  1  1  1
      8    0  0  0  1  0  0
      9    0  1  1  1  0  0
     10    1  1  1  0  0  0
    ...   .. .. .. .. .. ..
     90    0  0  0  1  0  1
     91    0  1  0  1  0  0
     92    0  1  1  1  0  0
     93    0  1  0  0  0  0
     94    0  0  0  1  1  1
     95    0  1  1  1  0  0
     96    0  0  0  1  1  0
     97    1  0  1  0  1  0
     98    0  1  1  0  0  0
     99    0  0  0  1  0  0
    100    0  1  1  1  0  0

    4 Estimating conditional probabilities under a known network structure

    As previously seen, a network with a defined structure but without specified parameters requires estimating the parameters from a data set. This problem can be stated as maximizing the likelihood of the parameters given the observed data and the network structure.

    In what follows we use the R function bn.fit() to estimate the parameters of the BN, assuming that the network structure is known to be the one we simulated from, i.e. (3). The R function bn.fit() fits the parameters of a BN given its structure and a data set.
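    A sketch of this step, again assuming the data frame sampled.data generated above and writing the known structure as a model string:

    library(bnlearn)

    # the DAG used to generate the data, i.e. the structure underlying (3)
    true.dag <- model2network("[X1][X2][X3|X1:X2][X4|X1][X5|X2][X6|X1:X5]")

    # maximum likelihood estimates of the conditional probability tables
    fitted.bn <- bn.fit(true.dag, data = sampled.data, method = "mle")
    fitted.bn$X3   # estimated CPT of X3 given X1 and X2, to be compared with (3)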

    We observe from the results below that, when the network structure is known, the ML estimates obtained are quite close to the real values of the parameters.


    P(X1 = 1) = 0.35
    P(X2 = 1) = 0.77
    P(X3 = 1 | X1 = 1, X2 = 1) = 0.76
    P(X3 = 1 | X1 = 1, X2 = 0) = 0.70
    P(X3 = 1 | X1 = 0, X2 = 1) = 0.30
    P(X3 = 1 | X1 = 0, X2 = 0) = 0.24
    P(X4 = 1 | X1 = 1) = 0.08
    P(X4 = 1 | X1 = 0) = 0.80
    P(X5 = 1 | X2 = 1) = 0.11
    P(X5 = 1 | X2 = 0) = 0.52
    P(X6 = 1 | X1 = 1, X5 = 1) = 0.83
    P(X6 = 1 | X1 = 1, X5 = 0) = 0.55
    P(X6 = 1 | X1 = 0, X5 = 1) = 0.28
    P(X6 = 1 | X1 = 0, X5 = 0) = 0.07
                                                (4)

    Figure 6: ML estimation of the parameters

    We have previously seen that, under the assumption of conditional independence, if two sets of nodes X1 and X2 are d-separated in the BN by a third set Y (excluding X1 and X2), the corresponding variable sets X1 and X2 are independent given the variables in Y.