Overview of Bayesian Networks With Examples in R (Scutari and Denis 2015)



Page 1

Overview of Bayesian Networks With Examples in R

(Scutari and Denis 2015)

Page 2

Overview

• Please install bnlearn in R: install.packages("bnlearn")

• Theory

• Types of Bayesian networks

• Learning Bayesian networks
  • Structure learning
  • Parameter learning

• Using Bayesian networks
  • Queries
    • Conditional independence
    • Inference based on new evidence
      • Hard vs. soft evidence
      • Conditional probability vs. most likely outcome (a.k.a. maximum a posteriori)
      • Exact vs. approximate inference

• R packages for Bayesian networks

• Case study: protein signaling network

Page 3

Theory

Page 4

Bayesian networks (BNs)

• Represent a probability distribution as a probabilistic directed acyclic graph (DAG)

• Graph = nodes and edges (arcs) denote variables and dependencies, respectively

• Directed = arrows represent the directions of relationships between nodes

• Acyclic = if you trace arrows with a pencil, you cannot traverse back to the same node without picking up your pencil

• Probabilistic = each node has an associated probability that can be influenced by values other nodes assume based on the structure of the graph

• The node at the tail of a connection is called the parent, and the node at the head of the connection is called its child
  • Ex. A → B: A is the parent, B is its child

Page 5

Which is a BN?

[Figure: two candidate directed graphs, labeled A and B]

Taken from http://www.cse.unsw.edu.au/~cs9417ml/Bayes/Pages/Bayesian_Networks_Definition.html

Page 6

Factorization into local distributions

• Factorization of the joint distribution of all variables (global distribution) into local distributions encoded by the DAG is:

1) intuitively appealing

2) reduces the number of variables/computational requirements when using the BN for inference

3) increases power for parameter learning
  • The dimensions of the local distributions usually do not scale with the size of the BN

• Each variable (node) depends only on its parents

• For the BN here:

P(A, S, E, O, R, T) = P(A) P(S) P(E | A, S) P(O | E) P(R | E) P(T | O, R)
  • A and S are often referred to as root nodes

Page 7

Fundamental connections

1) Serial connection, e.g. A → E → O

2) Divergent connection, e.g. O ← E → R

3) Convergent connection, e.g. A → E ← S (also referred to as a v-structure)
  • The child of a convergent connection is often referred to as a collider

• Only v-structures (immoralities) uniquely define probabilistic relationships
  • Ex. The serial connection B → A → C and the divergent connection B ← A → C are indistinguishable, since their factorizations are equal:

    P(B) P(A|B) P(C|A) = P(A,B) P(C|A) = P(B|A) P(A) P(C|A)

Page 8

d-separation

• "d" stands for "directional" (direction-dependent separation)

• Defines conditional independencies/dependencies

• Determines whether a set of variables X is independent of another set Y, given a third set Z

• Intuitively important because it reveals how variables are related

• Computationally important because it provides a means for efficient inference
  • Reduces the effective dimension of inference problems

Page 9

d-separation

• Formal definition: If A, B, and C are three disjoint subsets of nodes in a DAG G, then C is said to d-separate A from B if along every path between a node in A and a node in B there is a node v satisfying one of the following two conditions:

1) v has converging arcs (i.e. there are two arcs pointing to v from the adjacent nodes in the path) and neither v nor any of its descendants (i.e. the nodes that can be reached from v) are in C

or

2) v is in C and does not have converging arcs

Page 10

d-separation practice

• In R:

> library(bnlearn)

> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R][Z|T]")

> dsep(bn = dag, x = "A", y = "O", z = "E")

[1] TRUE

• What this says is that given E, A and O are independent

> dsep(bn = dag, x = "A", y = "S")

[1] TRUE

> dsep(bn = dag, x = "A", y = "S", z = "E")

[1] FALSE

• Conditioning on a collider or its descendants (here, Z) makes the parent nodes dependent
  • Intuitively, if we know E, then certain combinations of A and S are more likely, hence A and S are conditionally dependent

• Note that it is impossible for nodes directly linked by an edge to be independent conditional on any other node

Page 11

Equivalence class = CPDAG

• Two DAGs defined over the same set of variables are equivalent if and only if they:
  1) have the same skeleton (i.e. the same underlying undirected graph)
  and
  2) have the same v-structures

• Compelled edges: edges whose directions are oriented in the equivalence class because assuming the opposite direction would:
  1) introduce new v-structures (and thus a different DAG)
  or
  2) introduce cycles (and thus the resulting graph would no longer be a DAG)

• Note that DAGs can be probabilistically equivalent but encode very different causal relationships!
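As a small sketch (not on the original slide), equivalence can be checked in bnlearn with cpdag(), using the three-node DAGs from the fundamental-connections example:

library(bnlearn)

# serial and divergent connections over {A, B, C}: same skeleton, no v-structures
dag1 <- model2network("[B][A|B][C|A]")   # B -> A -> C
dag2 <- model2network("[A][B|A][C|A]")   # B <- A -> C
all.equal(cpdag(dag1), cpdag(dag2))      # TRUE: same CPDAG, same equivalence class

# the convergent connection introduces a v-structure and is not equivalent
dag3 <- model2network("[B][C][A|B:C]")   # B -> A <- C
all.equal(cpdag(dag1), cpdag(dag3))      # reports differences: different equivalence class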

Page 12

Markov blankets

• Information (evidence) on the values of the parents, children, and nodes sharing a child with a given node gives information on that node
  • Inference is most powerful when considering all these nodes (due to the use of Bayes' theorem when querying)

• The Markov blanket defines this set of nodes and effectively d-separates a given node from the rest of the graph

• Symmetry of Markov blankets
  • If node A is in the Markov blanket of node B, then B is in the Markov blanket of A

[Figure: the Markov blanket of node X9]
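A minimal sketch (using the DAG from the d-separation example) of extracting and checking a Markov blanket with bnlearn:

library(bnlearn)

dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")

# Markov blanket of E: its parents (A, S) and children (O, R); E has no spouses here
mb(dag, node = "E")

# conditioning on the blanket d-separates E from the rest of the graph (here, T)
dsep(dag, x = "E", y = "T", z = mb(dag, node = "E"))   # TRUE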

Page 13

Beyond dependencies: causal inference

• While a directed graph seems to suggest causality, in reality additional criteria must be met

• Specially designed perturbation experiments can be employed to characterize causal relationships

• Algorithms also exist that attempt to elucidate causal relationships from observational data
  • Often, the "high p, small n" nature of the data results in sets (equivalence classes) of possible causal networks

• “If conditional independence judgments are byproducts of stored causal relationships, then tapping and representing those relationships directly would be a more natural and more reliable way of expressing what we know or believe about the world”

• Ex. The presence of a latent variable significantly altered the DAG used to represent the relationships between test scores

Page 14

Types of BNs

Page 15

Discrete BNs

• All variables contain discrete data

• Ex. Multinomial distribution
  • A = age: young, adult, or old
  • S = gender: male or female
  • E = education: high or uni
  • R = residence: small or big

Conditional probability table of A

          A = young   A = adult   A = old
          0.3         0.5         0.2

Conditional probability table of E (given A and S)

  Gender = M
                A = young   A = adult   A = old
  E = high      0.75        0.72        0.88
  E = uni       0.25        0.28        0.12

  Gender = F
                A = young   A = adult   A = old
  E = high      0.64        0.70        0.90
  E = uni       0.36        0.30        0.10

Conditional probability table of R (given E)

              E = high   E = uni
  R = small   0.25       0.20
  R = big     0.75       0.80
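As a sketch of how such tables can be entered by hand with bnlearn's custom.fit(), here is the E → R fragment only; the marginal table for E below is an assumed placeholder, not taken from the slide:

library(bnlearn)

dag <- model2network("[E][R|E]")

E.lv <- c("high", "uni")
R.lv <- c("small", "big")

# marginal table for E: placeholder values, assumed for illustration only
cptE <- matrix(c(0.70, 0.30), ncol = 2, dimnames = list(NULL, E.lv))

# conditional probability table of R given E, from the slide above
cptR <- matrix(c(0.25, 0.75, 0.20, 0.80), ncol = 2,
               dimnames = list(R = R.lv, E = E.lv))

bn <- custom.fit(dag, dist = list(E = cptE, R = cptR))
bn.fit.barchart(bn$R)   # plot the local distribution of R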

Page 16

Gaussian BNs (GBNs)

• Assumptions
  • Each node follows a normal distribution
  • Root nodes are described by their respective marginal distributions
  • The conditioning effect of the parent nodes is given by an additive linear term in the mean, and does not affect the variance
    • In other words, each node has a variance that is specific to that node and does not depend on the values of the parents

• The local distribution of each node can be equivalently expressed as a Gaussian linear model which includes an intercept and the node’s parents as explanatory variables, without any interaction terms

• Based on these assumptions, the joint distribution of all nodes (global distribution) is multivariate normal

Page 17

Gaussian BNs (GBNs)

E ~ N(50, 10²)

V | G, E ~ N(-10.35 + 0.5·G + 0.77·E, 5²)

W | V ~ N(15 + 0.7·V, 5²)
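These local distributions can be encoded directly with bnlearn's custom.fit(); a sketch follows, where the marginal distribution of G is not shown on the slide and the values used for it below are assumed placeholders:

library(bnlearn)

dag <- model2network("[G][E][V|E:G][W|V]")

distE <- list(coef = c("(Intercept)" = 50), sd = 10)    # E ~ N(50, 10^2)
distG <- list(coef = c("(Intercept)" = 50), sd = 10)    # assumed placeholder for G
distV <- list(coef = c("(Intercept)" = -10.35, E = 0.77, G = 0.5), sd = 5)
distW <- list(coef = c("(Intercept)" = 15, V = 0.7), sd = 5)

gbn <- custom.fit(dag, dist = list(E = distE, G = distG, V = distV, W = distW))
gbn$V   # inspect the local Gaussian linear model of V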

Page 18

Hybrid BNs

• Contain both discrete and continuous variables

• One common class of hybrid BNs is conditional Gaussian BNs
  • Continuous variables cannot be parents of discrete variables
  • The Gaussian distribution of a continuous variable is conditional on the configuration of its discrete parent(s)
    • In other words, the variable can have a unique linear model (i.e. mean and variance) for each configuration of its discrete parent(s)

CL ~ Beta(3, 1)

G1 | PR, CL ~ Pois(CL * g(PR))

TR | G1 ~ Ber(logit⁻¹[(G1 - 5)/2.5])

Page 19

Comparison of BNs

• Discrete BNs
  • Local probability distributions can be plotted using the bn.fit.barchart function from bnlearn
  • The iss argument, which includes a weighted prior for parameter learning with the bn.fit function from bnlearn, only works with discrete data
  • Discretization produces better BNs than misspecified distributions, at the cost of a coarse approximation of the conditional probabilities

• GBNs
  • Perform better than hybrid BNs when few observations are available
  • Greater accuracy than discretization for continuous variables
  • Computationally more efficient than hybrid BNs

• Hybrid BNs
  • Greater flexibility
  • No dedicated R package
  • No structure learning

Page 20

Learning BNs

Page 21

Structure learning

• All structure learning methods boil down to three approaches:

1) Constraint-based

2) Score-based

3) Hybrid-based

Page 22

1) Constraint-based

• Constraint-based algorithms rely on conditional independence tests

• All modern algorithms first learn Markov blankets
  • Simplifies the identification of neighbors and in turn reduces computational complexity
  • Symmetry of Markov blankets is also leveraged

• Discrete BNs
  • Tests are functions of observed frequencies

• GBNs
  • Tests are functions of partial correlation coefficients

• For both cases:
  • We are checking the independence of two sets of variables given a third set
  • The null hypothesis is conditional independence
  • Test statistics are utilized

• Functions in bnlearn include gs, iamb, fast.iamb, inter.iamb, mmpc, and si.hiton.pc
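A quick sketch on learning.test, a small synthetic discrete data set shipped with bnlearn:

library(bnlearn)
data(learning.test)

bn.gs <- gs(learning.test)      # Grow-Shrink
bn.ia <- iamb(learning.test)    # Incremental Association

# both graphs are learned from conditional independence tests
all.equal(bn.gs, bn.ia)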

Page 23

2) Score-based

• Candidate BNs are assigned a goodness-of-fit "network score" that heuristic algorithms then attempt to maximize

• Due to the difficulty of assigning scores, only two options are common:

1) BDe (discrete case) / BGe (continuous case)

2) BIC

• Larger values = better fit

• Classes of heuristic algorithms include greedy search, genetic, and simulated annealing

• Functions in bnlearn include hc and tabu
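A sketch of score-based learning on the same learning.test data:

library(bnlearn)
data(learning.test)

bn.hc <- hc(learning.test, score = "bde", iss = 10)   # greedy hill climbing with the BDe score
bn.tabu <- tabu(learning.test, score = "bic")         # tabu search with the BIC score

# network scores of the learned structures (larger values = better fit)
score(bn.hc, learning.test, type = "bde", iss = 10)
score(bn.tabu, learning.test, type = "bic")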

Page 24

3) Hybrid-based

• Combine constraint-based and score-based algorithms to offset their respective weaknesses

• Two steps:

1) Restrict
  • Constraint-based algorithms are used to reduce the set of candidate DAGs

2) Maximize
  • Score-based algorithms are used to find the optimal DAG within the reduced set

• Functions in bnlearn include mmhc and rsmax2; for rsmax2 you can specify your own combination of restrict and maximize algorithms
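A sketch of both hybrid functions on the same data:

library(bnlearn)
data(learning.test)

bn.mmhc <- mmhc(learning.test)   # Max-Min Hill Climbing

# rsmax2: pick your own restrict (constraint-based) and maximize (score-based) algorithms
bn.rs <- rsmax2(learning.test, restrict = "si.hiton.pc", maximize = "tabu")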

Page 25

Parameter learning

• Once the structure (DAG) has been determined, the parameters can be estimated as well

• Two most common approaches are maximum likelihood estimation and Bayesian estimation (not available for GBNs in bnlearn)

• Parameter estimates are based only on the subset of data spanning the considered variable and its parents

• The bn.fit function from bnlearn will automatically determine the type of data and fit parameters
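A sketch of both estimators on discrete data (in bnlearn, Bayesian parameter estimation is only available for discrete BNs):

library(bnlearn)
data(learning.test)

dag <- hc(learning.test)   # learn the structure first

fit.mle <- bn.fit(dag, learning.test, method = "mle")                 # maximum likelihood estimates
fit.bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)   # posterior estimates, imaginary sample size 10

fit.bayes$A   # inspect the conditional probability table of node A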

Page 26

Notes on learning

• Three learning techniques:

1) unsupervised, i.e. from the data set

2) supervised, i.e. from experts in the field of the phenomenon being studied

3) a combination of both

• The arguments blacklist and whitelist can be specified in structure learning functions to force the absence and presence of specific edges, respectively (see the sketch after this list)

• For GBNs, you can easily replace parameter estimates with your own regression fit
  • Ex. The penalized package in R can be used to perform ridge, lasso, or elastic net regression for biased (regularized) coefficient estimates
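A sketch of the blacklist/whitelist mechanism mentioned above, again on learning.test; the forced and forbidden arcs are chosen arbitrarily for illustration:

library(bnlearn)
data(learning.test)

wl <- data.frame(from = "A", to = "D")   # force the arc A -> D to be present
bl <- data.frame(from = "B", to = "C")   # forbid the arc B -> C

dag <- hc(learning.test, whitelist = wl, blacklist = bl)
arcs(dag)   # A -> D is guaranteed to appear; B -> C is guaranteed to be absent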

Page 27

Using BNs

Page 28

Querying

• Once a BN has been constructed, it can be used to answer questions about the variables it models

• The term query is derived from computer science terminology and means to ask questions

• Two main types of queries:

1) conditional independence
  • Uses only the DAG structure to explain how variables are associated with one another, i.e. d-separation

2) inference, a.k.a. probabilistic reasoning or belief updating
  • Uses the local distributions

Page 29

2) Inference

• Investigates the distribution of one or more variables under non-trivial conditioning
  • The variable(s) being conditioned on are the new evidence
  • The probabilities of the variable(s) of interest are then re-evaluated

• Works in the framework of Bayesian statistics because it focuses on the computation of posterior probabilities or densities

• Based on the basic principle of modifying the joint distribution of the nodes to incorporate a new piece of information
  • Uses the fundamental property of BNs that only local distributions need to be considered when computing posterior probabilities, reducing dimensionality

• The network structure and distributional assumptions of a BN are treated as fixed when performing inference

Page 30

Types of evidence

• Hard evidence
  • Instantiation of one or more variables in the network

• Soft evidence
  • A new distribution for one or more variables in the network, i.e. a new set of parameters

Page 31

Types of queries

• Conditional probability
  • Interested in the marginal posterior probability distribution of variables given evidence on other variables

• Most likely outcome (a.k.a. maximum a posteriori)
  • Interested in finding the configuration of the variables that has the highest posterior probability (discrete) or maximum posterior density (continuous)
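A sketch of both query types with cpquery() and cpdist() on a fitted discrete BN (assuming the learning.test variables, whose factor levels are a/b/c):

library(bnlearn)
data(learning.test)

fitted <- bn.fit(hc(learning.test), learning.test)

# conditional probability query: P(A = "a" | B = "b"), estimated by sampling
cpquery(fitted, event = (A == "a"), evidence = (B == "b"), n = 10^5)

# a crude most-likely-outcome query: sample A given the evidence and take the modal level
simA <- cpdist(fitted, nodes = "A", evidence = (B == "b"), n = 10^5)
names(which.max(table(simA$A)))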

Page 32

Types of inference

• Exact inference
  • Repeated applications of Bayes' theorem with local computations to obtain exact probability values
  • Feasible only for small or very simple graphs

• Approximate inference
  • Monte Carlo simulations are used to sample from the global distribution and thus estimate probability values
  • Several approaches can be used for both random sampling and weighting

• There are functions in bnlearn to generate random observations and calculate probability distributions given evidence using these techniques
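A sketch contrasting the two approaches: exact inference through gRain (via bnlearn's as.grain bridge) and approximate inference with cpquery(); variable names again come from learning.test:

library(bnlearn)
library(gRain)   # exact inference via junction trees
data(learning.test)

fitted <- bn.fit(hc(learning.test), learning.test)

# exact: convert to a gRain object, set hard evidence, and query
junction <- compile(as.grain(fitted))
jev <- setEvidence(junction, nodes = "B", states = "b")
querygrain(jev, nodes = "A")

# approximate: likelihood weighting with the same evidence
cpquery(fitted, event = (A == "a"), evidence = list(B = "b"), method = "lw")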

Page 33

R packages for BNs

Page 34

R packages

• Two categories:
  1) those that implement structure and parameter learning
  2) those that focus on parameter learning and inference

• Some packages of note:
  • bnlearn (developed by the authors)
  • deal
    • Can handle conditional Gaussian BNs
  • pcalg
    • Focuses on causal inference (implements the PC algorithm)

• Other packages include catnet, gRbase, gRain, and rbmn
  • Some of these packages augment bnlearn

Page 35

Questions?

Page 36

Case study: protein signaling network

Page 37

Overview

• Analysis published in Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D.A., Nolan, G.P. (2005). Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308(5721):523-529.

• Hypothesis: Machine learning for the automated derivation of a protein signaling network will elucidate many of the traditionally reported signaling relationships and predict novel causal pathways

• Methods
  • Measure concentrations of pathway molecules in primary immune system cells
  • Perturbation experiments to confirm causality

Page 38

The data

> sachs <- read.table("sachs.data.txt", header = TRUE)

> head(sachs)

• Continuous data

Raf Mek Plcg PIP2 PIP3 Erk Akt PKA PKC P38 Jnk

1 26.4 13.2 8.82 18.3 58.8 6.61 17 414 17 44.9 40

2 35.9 16.5 12.3 16.8 8.13 18.6 32.5 352 3.37 16.5 61.5

3 59.4 44.1 14.6 10.2 13 14.9 32.5 403 11.4 31.9 19.5

4 73 82.8 23.1 13.5 1.29 5.83 11.8 528 13.7 28.6 23.1

5 33.7 19.8 5.19 9.73 24.8 21.1 46.1 305 4.66 25.7 81.3

6 18.8 3.75 17.6 22.1 10.9 11.9 25.7 610 13.7 49.1 57.8

Page 39

Data exploration

• Violations of the assumptions of GBNs
  • Highly skewed
  • Concentrations cluster around 0
  • Nonlinear correlations
    • Difficult for accurate structure learning

• What can we do?
  • Data transformations (log)
  • Hybrid network: specify an appropriate conditional distribution for each variable
    • Requires extensive prior knowledge of the signaling pathway
  • Discretize

[Figure: concentration of PKA vs. concentration of PKC, with the fitted regression line]

[Figure: densities of Mek, P38, PIP2, and PIP3, with overlaid normal distribution curves]

Page 40

Discretizing the data

• Information-preserving discretization algorithm introduced by Hartemink (2001)

1) Discretizes each variable into a large number of intervals
  • idisc argument = type of intervals
  • ibreaks argument = number of intervals

2) Iterates over the variables and collapses, for each of them, the pair of adjacent intervals that minimizes the loss of pairwise mutual information

• Basically does its best to reflect the dependence structure of the original data

> dsachs <- discretize(sachs, method = "hartemink", breaks = 3, ibreaks = 60, idisc = "quantile")

• breaks = number of desired levels (“low”, “medium”, and “high” concentrations)

Page 41

Model averaging

• The quality of the structure learned from the data can be improved by averaging multiple CPDAGs

• Bootstrap resampling as described in Friedman et al. (1999)
  • "Perturbs" the data
  • The frequencies of edges and their directions across the bootstrap networks are their confidence measures

> boot <- boot.strength(dsachs, R = 500, algorithm = "hc", algorithm.args = list(score = "bde", iss = 10))

• R = number of network structures

Page 42

Model averaging results

> boot[boot$strength > 0.85 & boot$direction >= 0.5, ]

• strength = frequency of edge

• direction = frequency of edge direction conditional on the edge’s presence

• Many score-equivalent edges
  • This means the directions are not well established

> avg.boot <- averaged.network(boot, threshold = 0.85)

      from   to     strength   direction
1     Raf    Mek    1          0.518
23    Plcg   PIP2   1          0.509
24    Plcg   PIP3   1          0.519
34    PIP2   PIP3   1          0.508
56    Erk    Akt    1          0.559
57    Erk    PKA    0.984      0.568089
67    Akt    PKA    1          0.566
89    PKC    P38    1          0.508
90    PKC    Jnk    1          0.509
100   P38    Jnk    0.95       0.505263

Note: your numbers may differ since no seed was set but you should still have the same edges passing the threshold

Page 43

The network

> avg.boot

• Network learned from the discretized, observational data

• Since we are not confident in the directions of any of the edges, we remove them by constructing the skeleton

> avg.boot <- skeleton(avg.boot)