Using Bayesian Networks to Analyze Expression Data
N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem
Central Dogma
Gene -> (Transcription) -> mRNA -> (Translation) -> Protein
Cells express different subsets of the genes in different tissues and under different conditions
Microarrays (aka “DNA chips”)
New technological breakthrough:
  Measure RNA expression levels of thousands of genes in one experiment
  Measure expression on a genomic scale
  Opens up new experimental designs
Many major labs are using, or will use, this technology in the near future
The Problem
[Figure: expression matrix A, with one axis indexed by experiments i and the other by genes j]
A_ij: the mRNA level of gene j in experiment i
Goal:
  Learn regulatory/metabolic networks
  Identify causal sources of the biological phenomena of interest
Our Approach
Characterize statistical relationships between expression patterns of different genes
Beyond pair-wise interactions:
  Many interactions are explained by intermediate factors
  Regulation involves combined effects of several gene products
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
  Example: Pedigree
  A node represents an individual’s genotype
Modeling assumptions:
  Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations
[Figure: pedigree network over Homer, Marge, Bart, Lisa, and Maggie]
Network Structure
Generalizing to DAGs:
  A child is conditionally independent of its non-descendants, given the value of its parents
  Often a natural assumption for causal processes, if we believe that we capture the relevant state of each intermediate stage
[Figure: DAG containing nodes X, Y1, and Y2, with nodes labeled Ancestor, Parent, Descendant, and Non-descendant]
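The independence assumption above can be written symbolically. The block below is a sketch in notation that is not on the slide: Pa(X_i) stands for the parents of X_i and ND(X_i) for its non-descendants.

```latex
% Local Markov assumption: each variable is independent of its
% non-descendants given its parents
X_i \perp \mathrm{ND}(X_i) \mid \mathrm{Pa}(X_i)

% Consequence: the joint distribution factorizes along the DAG
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```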
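Local Probabilities
Associated with each variable Xi is a conditional probability distribution P(Xi | Pai), where Pai denotes the parents of Xi
  Discrete variables: multinomial distribution
  Continuous variables: a choice of model, for example linear Gaussian
[Figure: two-node network X -> Y; for the discrete case, a table of P(Y | X) with rows (0.9, 0.1) and (0.3, 0.7), one per value of X; for the continuous case, a plot of P(Y | X) over X and Y]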
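The two kinds of local model can be sketched in a few lines of Python. This is only an illustration: the table rows reuse the numbers 0.9/0.1 and 0.3/0.7 from the slide, but which row belongs to which value of X is an assumption, and the linear Gaussian coefficients (`intercept`, `slope`, `sigma`) are made up.

```python
import random

# Hypothetical CPT for a binary child Y given a binary parent X.
# The rows reuse the slide's numbers; the assignment of rows to values
# of X is an assumption made for this sketch.
CPT_Y_GIVEN_X = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.3, 1: 0.7},
}

def sample_discrete_child(x, rng):
    """Draw Y from the multinomial distribution P(Y | X = x)."""
    values = list(CPT_Y_GIVEN_X[x].keys())
    weights = list(CPT_Y_GIVEN_X[x].values())
    return rng.choices(values, weights=weights, k=1)[0]

def sample_linear_gaussian_child(x, rng, intercept=0.0, slope=1.0, sigma=0.5):
    """Draw Y ~ Normal(intercept + slope * x, sigma^2); coefficients are illustrative."""
    return rng.gauss(intercept + slope * x, sigma)

rng = random.Random(0)

# Discrete case: the fraction of Y = 1 given X = 1 should be close to 0.7.
discrete_draws = [sample_discrete_child(1, rng) for _ in range(10000)]
fraction_y1 = sum(discrete_draws) / len(discrete_draws)

# Continuous case: the mean of Y given X = 2.0 should be close to 2.0.
continuous_draws = [sample_linear_gaussian_child(2.0, rng) for _ in range(10000)]
mean_y = sum(continuous_draws) / len(continuous_draws)
```

In both cases the child depends on the rest of the network only through its parent's value, which is what makes the parameters local.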
Bayesian Network Semantics
Qualitative part: DAG specifies conditional independence statements
+ Quantitative part: local probability models
= Unique joint distribution over domain
Compact & efficient representation:
  k parents: O(2^k n) vs. O(2^n) params
  parameters pertain to local interactions
Chain rule: P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)
versus
Network: P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)
[Figure: DAG over B, E, R, A, C with edges E -> R, B -> A, E -> A, A -> C]
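The saving in parameters can be checked directly for the five-variable example. The sketch below assumes all variables are binary and uses made-up probabilities; only the parent sets are taken from the factorization P(B) P(E) P(R|E) P(A|B,E) P(C|A).

```python
from itertools import product

# Parent sets read off the factorization P(B) P(E) P(R|E) P(A|B,E) P(C|A).
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

# Hypothetical values of P(X = 1 | parents), keyed by the tuple of parent values.
P_ONE = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.7},
}

def joint_probability(assignment):
    """Multiply the local conditional probabilities for one full assignment."""
    prob = 1.0
    for var, parents in PARENTS.items():
        key = tuple(assignment[p] for p in parents)
        p_one = P_ONE[var][key]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

variables = list(PARENTS)

# The local models define a proper joint distribution: it sums to 1.
total = sum(
    joint_probability(dict(zip(variables, values)))
    for values in product([0, 1], repeat=len(variables))
)

# A binary variable needs one free parameter per configuration of its parents.
network_params = sum(2 ** len(parents) for parents in PARENTS.values())  # 1+1+2+4+2
full_joint_params = 2 ** len(variables) - 1
```

Here the network needs 10 free parameters where an unrestricted joint table over five binary variables needs 31, and the gap widens quickly as the number of variables grows.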
Why Bayesian Networks?
Bayesian Networks:
  Flexible representation of dependency structure of multivariate distributions
  Natural for modeling processes with local interactions
Learning of Bayesian Networks:
  Can learn dependencies from observations
  Handles stochastic processes: “true” stochastic behavior, noise in measurements
Modeling Regulatory Interactions
Variables of interest:
  Expression levels of genes
  Concentration levels of proteins (proteomics!)
  Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information …
Bayesian Network Structure:
  Capture dependencies among these variables
Examples
Interactions are represented by a graph:
  Each gene is represented by a node in the graph
  Edges between the nodes represent direct dependency
Measured expression level of each gene: random variables
Gene interaction: probabilistic dependencies
[Figure: example edges between genes A and B]
More Complex Examples
Dependencies can be mediated through other nodes
Common effects can imply conditional dependence
[Figure: three-node networks over A, B, and C, with panels labeled "Common cause" and "Intermediate gene", plus a common-effect structure]
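The first point, that a dependency can be mediated through another node, can be illustrated with a small simulation. This is a sketch under assumed parameters: a chain A -> B -> C of binary variables in which each child copies its parent with a hypothetical fidelity of 0.9.

```python
import random

def noisy_copy(bit, rng, fidelity=0.9):
    """Return `bit` with probability `fidelity`, otherwise the flipped bit."""
    return bit if rng.random() < fidelity else 1 - bit

def rate_c(samples, **fixed):
    """Empirical P(C = 1) among the samples that match the fixed values."""
    matching = [s for s in samples if all(s[k] == v for k, v in fixed.items())]
    return sum(s["C"] for s in matching) / len(matching)

rng = random.Random(1)
samples = []
for _ in range(50000):
    a = rng.randint(0, 1)
    b = noisy_copy(a, rng)  # A -> B
    c = noisy_copy(b, rng)  # B -> C
    samples.append({"A": a, "B": b, "C": c})

# Marginally, C clearly depends on A ...
marginal_gap = abs(rate_c(samples, A=1) - rate_c(samples, A=0))

# ... but once B is fixed, A carries almost no further information about C.
conditional_gap = abs(rate_c(samples, A=1, B=1) - rate_c(samples, A=0, B=1))
```

A pairwise analysis would report an A-C interaction here, while a network model can attribute it to the intermediate gene B.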
Outline of Our Approach
[Figure: Expression data -> Bayesian Network Learning Algorithm -> learned network over E, R, B, A, C]
Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
Contains 76 samples covering the entire yeast genome:
  Different methods for synchronizing the cell cycle in yeast
  Time series at intervals of a few minutes (5-20 min)
Spellman et al. identified 800 cell-cycle regulated genes
Methods
Treat samples as IID (ignoring temporal order)
Experiment 1: Discretized into three levels of expression
  Learn multinomial probabilities
Experiment 2: Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
[Figure: discretization of log(ratio to control) into three levels: "-" below -0.5, "0" between -0.5 and 0.5, "+" above 0.5]
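The three-level discretization used in Experiment 1 can be sketched as follows. The thresholds of -0.5 and 0.5 come from the slide; the function name, the handling of values exactly on a threshold, and the sample values are assumptions, and the input is taken to be already on the log scale.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to -1 (under-expressed), 0 (baseline), or +1 (over-expressed)."""
    if log_ratio < -threshold:
        return -1
    if log_ratio > threshold:
        return 1
    return 0  # values exactly on a threshold are treated as baseline (an assumption)

# Hypothetical log(ratio to control) values for one gene across experiments.
log_ratios = [-1.2, -0.5, -0.1, 0.0, 0.4, 0.5, 0.9]
levels = [discretize(v) for v in log_ratios]
```

Each gene then becomes a three-valued variable, so its local model is a multinomial table over the values of its parents.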
Network Learned
Challenge: Statistical Significance
Sparse data:
  Small number of samples
  “Flat posterior”: many networks fit the data
Solution: estimate confidence in network features
Two types of features:
  Markov neighbors: X directly interacts with Y
  Order relations: X is an ancestor of Y
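Both kinds of feature can be read off a given DAG. The sketch below uses the five-node network with edges E -> R, B -> A, E -> A, A -> C as its example, and it assumes that "Markov neighbors" means Markov-blanket membership (a pair linked by an edge, or sharing a child); that reading is an assumption, not something the slide spells out.

```python
# Example DAG given as child -> list of parents
# (edges E -> R, B -> A, E -> A, A -> C).
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

def ancestors(node, parents):
    """All nodes with a directed path into `node`."""
    found = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

def markov_neighbors(parents):
    """Unordered pairs linked by an edge or sharing a child (assumed definition)."""
    pairs = set()
    for child, pas in parents.items():
        for p in pas:
            pairs.add(frozenset((p, child)))  # parent-child edge
        for i, p in enumerate(pas):
            for q in pas[i + 1:]:
                pairs.add(frozenset((p, q)))  # co-parents of `child`
    return pairs

# Order relations: (X, Y) means X is an ancestor of Y.
order_features = {(anc, node) for node in PARENTS for anc in ancestors(node, PARENTS)}
markov_features = markov_neighbors(PARENTS)
```

For this network the order relations include (E, C) and (B, C) even though neither pair is joined by an edge, and the Markov pairs include {B, E} because both are parents of A.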
Confidence Estimates
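Bootstrap approach [FGW, UAI99]
[Figure: the data set D is resampled into D1, D2, ..., Dm; a network over E, R, B, A, C is learned from each resampled data set]
Estimate: $C(f) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{f \in G_i\}$, where G_i is the network learned from D_i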
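The bootstrap loop itself is short: resample the data with replacement, learn a model from each resample, and report for each feature f the fraction of the m models that contain it. The sketch below shows only that loop. Its `learn_features` function is a deliberately simple stand-in (pairs whose correlation exceeds a made-up cutoff), not Bayesian network structure learning, and the data are synthetic.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def learn_features(rows, names, cutoff=0.6):
    """Stand-in learner (NOT network learning): pairs with |correlation| above cutoff."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    return {
        (a, b)
        for a, b in combinations(names, 2)
        if abs(pearson(columns[a], columns[b])) > cutoff
    }

def bootstrap_confidence(rows, names, m=200, seed=0):
    """Fraction of the m resampled data sets whose learned model contains each feature."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = rng.choices(rows, k=len(rows))  # sample rows with replacement
        for feature in learn_features(resampled, names):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / m for feature, count in counts.items()}

# Synthetic data: Y follows X closely, Z is independent noise.
data_rng = random.Random(42)
names = ["X", "Y", "Z"]
rows = []
for _ in range(60):
    x = data_rng.gauss(0, 1)
    rows.append((x, x + data_rng.gauss(0, 0.3), data_rng.gauss(0, 1)))

confidence = bootstrap_confidence(rows, names)
```

Features that never appear in any resampled model are simply absent from the returned dictionary, which corresponds to a confidence of 0.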
Testing for Significance
We run our procedure on randomized data, in which the order of values for each gene was reshuffled
Histograms of the number of Markov features at each confidence level: original (real) data vs. randomized data
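The randomization step can be sketched as below: each gene's values are permuted independently across experiments, so every gene keeps its own distribution of values while any relationship between genes within an experiment is destroyed. The function names, the toy expression values, and the toy confidence values are all hypothetical.

```python
import random

def reshuffle_per_gene(expression, seed=0):
    """Independently permute each gene's values across experiments.

    `expression` maps gene name -> list of values, one per experiment.
    """
    rng = random.Random(seed)
    shuffled = {}
    for gene, values in expression.items():
        permuted = list(values)
        rng.shuffle(permuted)
        shuffled[gene] = permuted
    return shuffled

def count_above(confidences, threshold):
    """Number of features whose confidence is above the threshold t."""
    return sum(1 for c in confidences.values() if c > threshold)

# Hypothetical toy data: two genes measured in five experiments.
expression = {"g1": [0.1, 0.4, -0.3, 1.2, 0.0], "g2": [0.2, 0.5, -0.2, 1.1, 0.1]}
randomized = reshuffle_per_gene(expression)

# Hypothetical confidence values, only to show how the plotted counts are formed.
toy_confidences = {("g1", "g2"): 0.85, ("g1", "g3"): 0.40, ("g2", "g3"): 0.15}
counts = {t: count_above(toy_confidences, t) for t in (0.2, 0.5, 0.8)}
```

Running the same learning and confidence procedure on `randomized` gives the baseline against which the counts from the real data are compared.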
[Figure: Markov w/ Gaussian Models. Number of features with confidence above t (y-axis) against the threshold t from 0.1 to 1 (x-axis), for real and random data; two panels at different y-axis scales]
[Figure: Markov w/ Multinomial Models. Number of features with confidence above t (y-axis) against the threshold t from 0.1 to 1 (x-axis), for real and random data; two panels at different y-axis scales]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
  YLR183C
  MCD1: Mitotic Chromosome Determinant
  RAD27: DNA repair protein
  CLN2: role in cell cycle START
  SRO4: involved in cellular polarization during budding
  YOX1: Homeodomain protein that binds leu-tRNA gene
  POL30: required for DNA replication and repair
  YLR467W
  CDC5
  MSH6: Homolog of the human GTBP protein
  YML119W
  CLN1: role in cell cycle START
Future Work
Finding suitable local distribution models
Correct handling of hidden variables
  Can we recognize hidden causes of coordinated regulation events?
Incorporating prior knowledge
  Incorporate large mass of biological knowledge, and insight from sequence/structure databases
Abstraction
  Combine with cluster analysis of higher confidence conclusions