
Introduction to Bayesian Networks

Practical and Technical Perspectives

Stefan Conrady, [email protected]

Dr. Lionel Jouffe, [email protected]

February 15, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting


Table of Contents

Introduction

Bayesian Networks from a Practitioner's Perspective
  Knowledge Unification
  Knowledge Representation & Communication
  Reasoning
  Summary

Technical Introduction
  Introduction
  Probabilistic Semantics
  Evidential Reasoning
  Learning Bayesian Networks
  Uncertainty Over Time
  Causal Networks
  Causal Discovery

References

Contact Information
  Conrady Applied Science, LLC
  Bayesia SAS


Introduction

A simplistic analogy may help to jump-start our introduction to Bayesian networks: in the same way one can use a phone book without having to memorize all the names and numbers, one can deliberately (and correctly) reason with the domain knowledge contained in a Bayesian network without having to become a domain expert.

Over the last 25 years, Bayesian networks have emerged as a practically feasible form of knowledge representation, primarily through the seminal works of UCLA Professor Judea Pearl. With ever-increasing computing power, Bayesian networks have become a powerful tool for deep understanding of very complex, high-dimensional problem domains. Their computational efficiency and inherently visual structure make Bayesian networks attractive for exploring and explaining complex problems.

However, Bayesian networks are somewhat of a disruptive technology, as they challenge a number of common practices in the world of business and science. So, beyond the world of academia, promoting Bayesian networks as a new tool for practical knowledge management and reasoning still requires significant persuasion efforts. With this short paper, we attempt to provide a concise justification, both from a practitioner's and a technical perspective,¹ of why Bayesian networks are so important.


¹ Author's note: portions of the technical chapter of this paper are adapted, with permission, from Pearl and Russell (2000).


Bayesian Networks from a Practitioner’s Perspective

In our quest to "evangelize" about Bayesian networks (and the BayesiaLab software package²), we are often limited to presenting our case in just a few PowerPoint slides, using only a few catchy bullet points. In this context, and obviously without claiming to be comprehensive, we selected the following headings to highlight the key benefits of Bayesian networks to research practitioners and business executives:

1. Knowledge Unification

2. Knowledge Representation & Communication

3. Reasoning

Under these headings, the following paragraphs provide a glimpse of the powerful properties and wide-ranging practical advantages of Bayesian networks.

Knowledge Unification

Many fields are characterized by the proverbial conflict between "art" and "science." This manifests itself in debates such as the one about evidence-based medicine versus the prevailing practice of physicians with years of experience. Even more common is the discrepancy between scientifically derived market research insights and the expertise-based marketing decisions of business executives. Traditional frameworks typically do not facilitate leveraging the knowledge available on both sides.

Bayesian networks can capture both qualitative knowledge (through their network structure) and quantitative knowledge (through their parameters). Expert knowledge from practitioners is mostly qualitative, and it can be used directly for building the structure of a Bayesian network. In addition, data mining algorithms can extract both qualitative and quantitative knowledge from data and encode both forms simultaneously in a Bayesian network. As a result, Bayesian networks can bridge the gap between different types of knowledge and serve to unify all available knowledge in a single form of representation.


² Developed by Bayesia SAS, BayesiaLab is a comprehensive software package designed for learning, editing, and analyzing Bayesian networks. It is available in North America from Conrady Applied Science, LLC.


Figure 1: Knowledge unification with Bayesian networks (diagram: qualitative expert knowledge, the "art" side of a domain, and a quantitative mathematical representation, the "science" side, are unified in a single Bayesian network)

Knowledge Representation & Communication

Relaying knowledge typically involves an array of factual and causal statements. In natural language communication, such statements often contain generalizations, approximations, and implicit assumptions regarding their probability. Such simplifications are widely accepted in casual conversation or in media headlines.

However, the more precise communication required in science or business makes it necessary to spell out exceptions, uncertainty, and conditions attached to statements about knowledge. With natural language expressions this can become very cumbersome, especially for a complex domain (hence the substantial girth of many textbooks).

The need for precision in describing complex domains is also often at odds with modern business culture, which, as mentioned in the introduction, dictates communication via PowerPoint in a few concise bullet points. Needless to say, the complex dynamics of a domain often cannot be relayed correctly to policy makers and other stakeholders this way.

Bayesian networks are very well suited for capturing probabilistic and incomplete causal knowledge regarding a domain. They can easily accommodate exceptions to a rule, e.g. "all swans are white, except for a certain species," as well as partial causal information, for instance "alcohol caused the accident," even though more factors may actually be involved, such as poor road conditions.

Through its structure and its parameters, a Bayesian network comprehensively describes what is known about a particular domain, and especially the interactions of all the variables contained within that domain. As such, a Bayesian network is a "portable knowledge format" that can succinctly and compactly communicate the state of the domain as well as its dynamics.


Reasoning

By representing these interactions, a (correctly formulated) Bayesian network can yield a deep understanding of a domain. Deep understanding means knowing not merely how things behaved yesterday, but also how things will behave under new hypothetical circumstances tomorrow. More specifically, a Bayesian network allows explicit reasoning, and deliberate reasoning allows us to anticipate the consequences of actions we have not yet taken. Bayesian networks thus become an instrument for formal reasoning that is entirely transparent to stakeholders, as opposed to a more opaque, internalized process in the decision maker's mind (or gut).

Figure 2: Using Bayesian networks for formal reasoning about the consequences of hypothetical actions (diagram: data from the domain under study yields a Bayesian network; manipulating the network mirrors manipulating a hypothetical domain)

Summary

In summary, Bayesian networks are a highly universal knowledge framework, and they provide a common reasoning language between stakeholders from different backgrounds, such as business executives and market research scientists. With all available knowledge unified, properly communicated, and quite literally put into a "reasonable" format, Bayesian networks are a powerful tool for making decisions and shaping policies.


Technical Introduction

For the technical portion of this introduction, we defer to the words of Judea Pearl, who originally coined the term "Bayesian network." We are grateful to him for allowing us to use and adapt large sections from one of his technical reports for our purposes (Pearl and Russell, 2000).

Introduction

Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence, they are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach.

Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes' theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B does not equal zero:

P(A|B) = P(B|A) P(A) / P(B)

In Bayes' theorem, each probability has a conventional name:

• P(A) is the prior probability (or "unconditional" or "marginal" probability) of A. It is "prior" in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes' rule was called the "antecedent" probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called "a priori" by Ronald A. Fisher.

• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B.

• P(B|A) is the conditional probability of B given A. It is also called the likelihood.

• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
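As a quick numeric illustration of how these pieces fit together, consider a classic diagnostic-test setting. The numbers below are invented for this sketch and do not come from the paper:

```python
# Bayes' theorem on a hypothetical diagnostic test.
# A = "patient has the disease", B = "test comes back positive".
p_a = 0.01              # prior P(A): assumed disease prevalence
p_b_given_a = 0.95      # likelihood P(B|A): assumed test sensitivity
p_b_given_not_a = 0.05  # assumed false-positive rate P(B|not A)

# Marginal P(B), the normalizing constant, by total probability:
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) via Bayes' theorem:
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161: the positive test lifts 1% to ~16%
```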

The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes.


The nodes in a Bayesian network represent propositional variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event), and the links represent statistical (informational)³ or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the posterior probabilities of any subset of variables given evidence about any other subset.

Figure 1 shows a very simple Bayesian network consisting of only two nodes and one link, representing the joint probability distribution of the variables Eye Color and Hair Color in a given population. In this case, the conditional probabilities of Hair Color given the values of its parent, Eye Color, are provided in a table. It is important to point out that this Bayesian network does not contain any causal assumptions, i.e. we have no knowledge of the causal order between the variables, so the interpretation here should be merely statistical (informational).

Figure 1: A Bayesian network representing the statistical relationship between two variables

Figure 2 illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in Figure 1, the diagram in Figure 2 describes the causal relationships among the season of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness; the influence is mediated by the wetness of the pavement (if freezing were a possibility, a direct link could be added).


³ "Informational" and "statistical" are treated here as equivalent concepts and can be used interchangeably.


Figure 2: A Bayesian network representing causal influences among five variables

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (prediction, simulation); if someone slips on the pavement, that provides evidence that it is wet (abduction, reasoning to a probable cause or diagnosis). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in any natural way, because it seems to require the propagation of information in two directions.
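The explaining-away pattern is easy to verify by brute-force enumeration. The following sketch uses a reduced three-node network, Rain → Wet ← Sprinkler, with CPT numbers invented for illustration:

```python
from itertools import product

# Invented parameters for the collider Rain -> Wet <- Sprinkler.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}
P_wet = {(True, True): 0.99, (True, False): 0.90,   # P(Wet=true | Rain, Sprinkler)
         (False, True): 0.90, (False, False): 0.01}

def joint(r, s, w):
    """Factored joint P(r, s, w) = P(r) P(s) P(w | r, s)."""
    pw = P_wet[(r, s)]
    return P_rain[r] * P_sprinkler[s] * (pw if w else 1 - pw)

def prob(query, evidence):
    """P(query | evidence) by enumerating all assignments of (r, s, w)."""
    fixed = {**evidence, **query}
    states = [dict(zip("rsw", v)) for v in product([True, False], repeat=3)]
    num = sum(joint(x["r"], x["s"], x["w"]) for x in states
              if all(x[k] == v for k, v in fixed.items()))
    den = sum(joint(x["r"], x["s"], x["w"]) for x in states
              if all(x[k] == v for k, v in evidence.items()))
    return num / den

print(prob({"r": True}, {"w": True}))              # ~0.46: wet pavement suggests rain
print(prob({"r": True}, {"w": True, "s": True}))   # ~0.22: the sprinkler explains it away
```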

Probabilistic Semantics

Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint probability distribution: the probability of every possible event as defined by the combination of the values of all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint distribution into local, conditional distributions for each variable given its parents. If xi denotes some value of the variable Xi and pai denotes some set of values for the parents of Xi, then P(xi|pai) denotes this conditional distribution. For example, P(x4|x2, x3) is the probability of wetness given the values of sprinkler and rain. The global semantics of Bayesian networks specifies that the full joint distribution is given by the product

P(x1, ..., xn) = ∏i P(xi|pai)   (1)

In our example network, we have

P(x1, x2, x3, x4, x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2, x3) P(x5|x4).   (2)

It becomes clear that the number of parameters grows linearly with the size of the network, i.e. with the number of variables, whereas the size of each conditional probability table grows exponentially with the number of parents of the corresponding node. Further savings can


be achieved using compact parametric representations for the conditional distributions, such as noisy-OR models, decision trees, or neural networks.

There is also an entirely equivalent local semantics, which asserts that each variable is independent of its nondescendants in the network given its parents. For example, the parents of X4 in Figure 2 are X2 and X3, and they render X4 independent of the remaining nondescendant, X1. That is,

P(x4|x1, x2, x3) = P(x4|x2, x3).

Figure 3: Variable X4 is independent of its nondescendants, here X1, given its parents, X2 and X3

The collection of independence assertions formed in this way suffices to derive the global assertion of Equation 1, and vice versa. The local semantics is most useful in constructing Bayesian networks, because selecting as parents all the direct causes (or direct relationships) of a given variable invariably satisfies the local conditional independence conditions. The global semantics leads directly to a variety of algorithms for reasoning.

Evidential Reasoning

From the product specification in Equation 1, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the sprinkler is on given that the pavement is slippery is


P(X3 = on | X5 = true) = P(X3 = on, X5 = true) / P(X5 = true)

= Σ_{x1,x2,x4} P(x1, x2, X3 = on, x4, X5 = true) / Σ_{x1,x2,x3,x4} P(x1, x2, x3, x4, X5 = true)

= Σ_{x1,x2,x4} P(x1) P(x2|x1) P(X3 = on|x1) P(x4|x2, X3 = on) P(X5 = true|x4) / Σ_{x1,x2,x3,x4} P(x1) P(x2|x1) P(x3|x1) P(x4|x2, x3) P(X5 = true|x4)
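This computation can be reproduced directly by enumerating the factored joint of Equation 2. In the sketch below, all CPT numbers are invented and the season X1 is collapsed to two states to keep the tables small:

```python
from itertools import product

p_x1 = 0.5                                       # P(X1): assumed "wet season" probability
P2 = {True: 0.7, False: 0.1}                     # P(X2 = rain | X1)
P3 = {True: 0.1, False: 0.6}                     # P(X3 = sprinkler on | X1)
P4 = {(True, True): 0.99, (True, False): 0.90,   # P(X4 = wet | X2, X3)
      (False, True): 0.90, (False, False): 0.01}
P5 = {True: 0.8, False: 0.05}                    # P(X5 = slippery | X4)

def bern(p, v):
    """Probability that a binary variable with P(true) = p takes value v."""
    return p if v else 1 - p

def joint(x1, x2, x3, x4, x5):
    """The factored joint distribution of Equation 2."""
    return (bern(p_x1, x1) * bern(P2[x1], x2) * bern(P3[x1], x3)
            * bern(P4[(x2, x3)], x4) * bern(P5[x4], x5))

# Numerator: sum out x1, x2, x4 with X3 = on and X5 = true held fixed.
num = sum(joint(x1, x2, True, x4, True)
          for x1, x2, x4 in product([True, False], repeat=3))
# Denominator: sum out everything except X5 = true.
den = sum(joint(x1, x2, x3, x4, True)
          for x1, x2, x3, x4 in product([True, False], repeat=4))
print(f"P(X3 = on | X5 = true) = {num / den:.3f}")
```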

These expressions can often be simplified in ways that reflect the structure of the network itself. The first algorithms proposed for probabilistic calculations in Bayesian networks used a local, distributed message-passing architecture, typical of many cognitive activities. Initially this approach was limited to tree-structured networks, but it was later extended to general networks in Lauritzen and Spiegelhalter's (1988) method of junction tree propagation. A number of other exact methods have been developed and can be found in recent textbooks.

It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and hence is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually improving estimates as sampling proceeds. Unlike junction tree methods, these methods use local message propagation on the original network structure. Alternatively, variational methods provide bounds on the true probability.

Learning Bayesian Networks

The conditional probabilities P(xi|pai) of a given structure can be estimated from data using the maximum likelihood approach (observed frequencies). They can also be updated continuously from observational data using gradient-based or EM methods that use just local information derived from inference, in much the same way as weights are adjusted in neural networks.
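A minimal sketch of the frequency-counting (maximum likelihood) estimate, with invented data records of the form (rain, sprinkler, wet):

```python
from collections import Counter

# Invented observations: (rain, sprinkler, wet)
data = [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0), (0, 0, 0)]

parent_counts = Counter((r, s) for r, s, w in data)   # N(x2, x3)
full_counts = Counter(data)                           # N(x2, x3, x4)

# P(x4 | x2, x3) as observed relative frequencies:
for (r, s), n in sorted(parent_counts.items()):
    p_wet = full_counts[(r, s, 1)] / n
    print(f"P(wet=1 | rain={r}, sprinkler={s}) = {p_wet:.2f}")
```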

It is also possible to machine-learn the structure of a Bayesian network, and two families of methods are available for that purpose. The first, constraint-based algorithms, is based on the probabilistic semantics of Bayesian networks: links are added or deleted according to the results of statistical tests that identify marginal and conditional independencies. The second, score-based algorithms, is based on a metric that measures the quality of candidate networks with respect to the observed data. This metric trades off network complexity against degree of fit to the data, typically expressed as the likelihood of the data given the network; a brief sketch of this idea follows.
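One common instance of such a metric is the Bayesian Information Criterion (BIC); the paper does not name a specific score, so the sketch below should be read as one illustrative choice. It compares two candidate structures over two binary variables, with an invented data set:

```python
import math
from collections import Counter

# Invented sample of (A, B) pairs in which A and B are strongly dependent.
data = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 15 + [(1, 1)] * 35

def log_lik(data, linked):
    """Log-likelihood under ML parameters; linked=True models A -> B."""
    n = len(data)
    ca = Counter(a for a, b in data)
    cb = Counter(b for a, b in data)
    cab = Counter(data)
    ll = sum(math.log(ca[a] / n) for a, b in data)                  # P(A) terms
    if linked:
        ll += sum(math.log(cab[(a, b)] / ca[a]) for a, b in data)   # P(B|A) terms
    else:
        ll += sum(math.log(cb[b] / n) for a, b in data)             # P(B) terms
    return ll

def bic(data, linked, n_params):
    """Fit minus a complexity penalty of (k/2) log N."""
    return log_lik(data, linked) - 0.5 * n_params * math.log(len(data))

# No link: P(A), P(B) -> 2 free parameters. A -> B: P(A), P(B|A) -> 3.
print("BIC, A and B independent:", round(bic(data, False, 2), 1))
print("BIC, A -> B             :", round(bic(data, True, 3), 1))
```

With the strongly dependent data above, the gain in likelihood outweighs the penalty for the extra parameter, so the linked structure scores higher; with independent data, the penalty would tip the balance the other way.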

As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior knowledge in network form, either by fixing portions of the structure or by using prior distributions over the network parameters. Such prior knowledge can allow a system to learn accurate models from much less data than is required by tabula rasa approaches.

Uncertainty Over Time

Entities that live in a changing environment must keep track of variables whose values change over time. Dynamic Bayesian networks capture this process by representing multiple copies of the state variables, one for each time step. A set of variables Xt denotes the world state at time t, and a set of sensor variables Et denotes the observations available at time t. The sensor model P(Et|Xt) is encoded in the conditional probability distributions for the observable variables, given the state variables. The transition model P(Xt+1|Xt) relates the state at time t to the state at time t+1. Keeping track of the world means computing the current probability distribution over world states given all past observations, i.e.


P(Xt|E1, ..., Et). Dynamic Bayesian networks are strictly more expressive than other temporal probability models, such as hidden Markov models and Kalman filters.
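A minimal filtering sketch under these definitions, with an invented two-state transition model and sensor model (in this reduced form the computation coincides with the forward algorithm for hidden Markov models):

```python
# Invented transition model P(X_{t+1} | X_t) and sensor model P(E_t | X_t)
# for a single binary state variable.
T = {True: {True: 0.7, False: 0.3},
     False: {True: 0.3, False: 0.7}}
S = {True: {True: 0.9, False: 0.1},
     False: {True: 0.2, False: 0.8}}

def filter_step(belief, evidence):
    """One update of P(X_t | E_1..E_t): predict with T, weight by S, normalize."""
    predicted = {x: sum(belief[x0] * T[x0][x] for x0 in belief) for x in belief}
    unnorm = {x: S[x][evidence] * predicted[x] for x in predicted}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

belief = {True: 0.5, False: 0.5}   # prior over the initial state
for e in [True, True, False]:      # an invented observation stream
    belief = filter_step(belief, e)
    print(belief)
```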

Causal Networks

Most probabilistic models, including general Bayesian networks, describe a distribution over possible observed events, as in Equation 1, but say nothing about what will happen if a certain intervention occurs. For example, what if I turn the sprinkler on? What effect does that have on the season, or on the connection between wetness and slipperiness? A causal network, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its direct causes, as in Figure 2. In such a network, the result of an intervention is obvious: the sprinkler node is set to X3 = on, and the causal link between the season X1 and the sprinkler X3 is removed (see Figure 4). All other causal links and conditional probabilities remain intact, so the new model is

P(x1, x2, x4, x5) = P(x1) P(x2|x1) P(x4|x2, X3 = on) P(x5|x4).

Notice that this differs from observing that X3 = on, which would result in a new model that included the term P(X3 = on|x1). This mirrors the difference between seeing and doing: after observing that the sprinkler is on, we wish to infer that the season is dry, that it probably did not rain, and so on; an arbitrary decision to turn the sprinkler on should not result in any such beliefs.

Figure 4: A causal network reflecting the intervention X3 = on
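The seeing/doing contrast can be made concrete on the single link X1 → X3 (season → sprinkler); the numbers below are invented:

```python
# Invented fragment: P(X1 = wet season) and P(X3 = on | X1).
P1 = {True: 0.5, False: 0.5}
P3 = {True: 0.1, False: 0.6}

# Seeing: condition on X3 = on via Bayes' theorem; belief about X1 shifts.
unnorm = {x1: P1[x1] * P3[x1] for x1 in P1}
z = sum(unnorm.values())
print("P(X1 | X3 = on)     =", {x1: round(p / z, 3) for x1, p in unnorm.items()})

# Doing: delete the link X1 -> X3 and set X3 = on; P(X1) is untouched.
print("P(X1 | do(X3 = on)) =", P1)
```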

Causal networks are more properly defined, then, as Bayesian networks in which the correct probability model after intervening to fix any node's value is given simply by deleting the links from the node's parents. For example, Fire → Smoke is a causal network, whereas Smoke → Fire is not, even though both networks are equally capable of representing any joint distribution on the two variables. Causal networks model the environment as a collection of stable component mechanisms. These mechanisms may be reconfigured locally by interventions, with correspondingly local changes in the model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering various courses of action.


Causal Discovery

One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover causal structures in raw statistical data, a task previously considered impossible without controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, B and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the example would invariably portray A and C as two independent causes and B as their common effect, namely, A → B ← C. (For instance, A and C could be the outcomes of two fair coins, and B could represent a bell that rings whenever either coin comes up heads.)

Figure 5: Causal model for variables A, C, and B, representing two fair coins and a bell, respectively
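This coin-and-bell pattern is easy to verify by simulation; the sketch below checks both the marginal independence of A and C and their dependence once B is observed:

```python
import random

random.seed(0)
# A and C are fair coins; B rings when either comes up heads.
trials = [(a, a or c, c) for a, c in
          ((random.random() < 0.5, random.random() < 0.5) for _ in range(100_000))]

def p(pred, cond=lambda t: True):
    """Empirical P(pred | cond) over the simulated trials."""
    selected = [t for t in trials if cond(t)]
    return sum(pred(t) for t in selected) / len(selected)

# Marginally, C tells us nothing about A: both estimates are ~0.5.
print(p(lambda t: t[0]), p(lambda t: t[0], cond=lambda t: t[2]))
# Given that the bell rang but C was tails, A must have been heads: ~1.0.
print(p(lambda t: t[0], cond=lambda t: t[1] and not t[2]))
```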

Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically feasible but very unnatural, because it must entail fine-tuning of the probabilities involved; the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change.

Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal information, are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible with the data must be less stable than the one(s) inferred; that is, slight fluctuations in parameters will render the alternative structure incompatible with the data.


References

Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011. http://www.cs.ucl.ac.uk/staff/d.barber/brml.

Barnard, G. A., and T. Bayes. "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances." Biometrika 45, no. 3 (1958): 293-315.

Darwiche, Adnan. "Bayesian Networks." Communications of the ACM 53, no. 12 (December 2010): 80.

Hilbert, M., and P. Lopez. "The World's Technological Capacity to Store, Communicate, and Compute Information." Science (February 2011). http://www.sciencemag.org/cgi/doi/10.1126/science.1200970.

Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

Lauritzen, Steffen L., and David J. Spiegelhalter. "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems." Journal of the Royal Statistical Society, Series B 50, no. 2 (1988): 157-224.

Neapolitan, Richard E., and Xia Jiang. Probabilistic Methods for Financial and Marketing Informatics. 1st ed. Morgan Kaufmann, 2007.

Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Pearl, Judea. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.

Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000. http://bayes.cs.ucla.edu/csl_papers.html.

Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. 2nd ed. The MIT Press, 2001.


Contact Information

Conrady Applied Science, LLC
312 Hamlet's End Way
Franklin, TN 37067
USA
+1 888-386-8383
[email protected]
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
[email protected]
www.bayesia.com
