
Using Bayesian Networks to Analyze Expression Data

N. Friedman, M. Linial, I. Nachman, D. Pe’er

Hebrew University, Jerusalem

Central Dogma

Gene → (transcription) → mRNA → (translation) → Protein

Cells express different subsets of their genes in different tissues and under different conditions

Microarrays (aka “DNA chips”)

New technological breakthrough:

Measure RNA expression levels of thousands of genes in one experiment

Measure expression on a genomic scale

Opens up new experimental designs

Many major labs are using, or will use, this technology in the near future

The Problem

(Figure: data matrix A with experiments as rows, indexed by i, and genes as columns, indexed by j)

Aij: the mRNA level of gene j in experiment i

Goal:

Learn regulatory/metabolic networks

Identify causal sources of the biological phenomena of interest

Our Approach

Characterize statistical relationships between expression patterns of different genes

Go beyond pairwise interactions:

Many interactions are explained by intermediate factors

Regulation involves combined effects of several gene products

We build on the language of Bayesian networks

Network: Example

Noisy stochastic process

Example: Pedigree

A node represents an individual’s genotype

Modeling assumptions: Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations

(Figure: pedigree network with nodes Homer, Marge, Bart, Lisa, Maggie)

Network Structure

Generalizing to DAGs: a child is conditionally independent of its non-descendants, given the value of its parents

Often a natural assumption for causal processes, if we believe that we capture the relevant state of each intermediate stage

(Figure: DAG with nodes X, Y1, and Y2, and nodes labeled Ancestor, Parent, Descendant, and Non-descendant)

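Local Probabilities

Associated with each variable Xi is a conditional probability distribution P(Xi | Pai), where Pai denotes the parents of Xi

Discrete variables: multinomial distribution, given as a table P(Y | X) with one row per value of X; the rows shown are (0.9, 0.1) and (0.3, 0.7)

Continuous variables: a choice of model, for example linear Gaussian

(Figure: plot of P(Y | X) over X and Y for the continuous case)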
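The two kinds of local model on this slide can be sketched in a few lines of Python. This is only an illustration: the row and column labels (`x0`, `x1`, `y0`, `y1`), the assignment of the two probability rows to values of X, and the linear Gaussian coefficients `a`, `b`, `sigma` are assumptions, not values taken from the slides.

```python
import math

# Discrete case: a conditional probability table P(Y | X), one row per value of X.
# The numbers come from the slide; which row belongs to which value of X is assumed.
cpt = {
    "x0": {"y0": 0.9, "y1": 0.1},
    "x1": {"y0": 0.3, "y1": 0.7},
}

def linear_gaussian_pdf(y, x, a=0.0, b=1.0, sigma=1.0):
    """Density of Y given X = x when Y ~ Normal(a + b * x, sigma^2).

    The coefficients a, b and sigma are hypothetical defaults.
    """
    mean = a + b * x
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

Each row of `cpt` is a distribution over Y, so it must sum to one; the linear Gaussian density peaks where y equals a + b * x.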

Bayesian Network Semantics

Qualitative part (DAG specifies conditional independence statements) + Quantitative part (local probability models) = Unique joint distribution over the domain

Example: the chain rule gives

P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E)

versus the network factorization

P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A)

(Figure: network over the nodes E, R, B, A, C)

Compact & efficient representation: with at most k parents per node, O(2^k n) vs. O(2^n) parameters for n binary variables

Parameters pertain to local interactions
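To make the parameter saving concrete, here is a small sketch that builds the joint distribution from the network factorization P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A) for binary variables and counts free parameters. All probability values are invented for illustration.

```python
from itertools import product

# Hypothetical local models for binary variables (values 0/1).
P_B1 = 0.01                                   # P(B=1)
P_E1 = 0.02                                   # P(E=1)
P_R1_GIVEN_E = {0: 0.001, 1: 0.9}             # P(R=1 | E)
P_A1_GIVEN_BE = {(0, 0): 0.001, (0, 1): 0.29,
                 (1, 0): 0.94, (1, 1): 0.95}  # P(A=1 | B, E)
P_C1_GIVEN_A = {0: 0.05, 1: 0.7}              # P(C=1 | A)

def bern(p_one, value):
    """Probability of a binary value, given the probability that it is 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(c, a, r, e, b):
    """P(C,A,R,E,B) as a product of the five local terms."""
    return (bern(P_B1, b) * bern(P_E1, e) * bern(P_R1_GIVEN_E[e], r)
            * bern(P_A1_GIVEN_BE[(b, e)], a) * bern(P_C1_GIVEN_A[a], c))

# The factorized joint is a proper distribution: it sums to 1 over all 32 states.
total = sum(joint(*state) for state in product((0, 1), repeat=5))

# Free parameters: a full joint table vs. one number per parent configuration.
full_table_params = 2 ** 5 - 1        # 31
factored_params = 1 + 1 + 2 + 4 + 2   # B, E, R|E, A|B,E, C|A = 10
```

With five binary variables the gap is 31 versus 10 parameters; it widens exponentially as the number of variables grows while the number of parents stays bounded.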

Why Bayesian Networks?

Bayesian networks:

Flexible representation of the dependency structure of multivariate distributions

Natural for modeling processes with local interactions

Learning of Bayesian networks:

Can learn dependencies from observations

Handles stochastic processes: “true” stochastic behavior and noise in measurements

Modeling Regulatory Interactions

Variables of interest:

Expression levels of genes

Concentration levels of proteins (proteomics!)

Exogenous variables: nutrient levels, metabolite levels, temperature, phenotype information, …

Bayesian network structure: capture dependencies among these variables

Examples

Interactions are represented by a graph:

Each gene is represented by a node in the graph

Edges between the nodes represent direct dependency

Measured expression level of each gene ↔ Random variables

Gene interaction ↔ Probabilistic dependencies

(Figure: example graphs over genes A and B)

More Complex Examples

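Dependencies can be mediated through other nodes: a common cause or an intermediate gene

Common effects can imply conditional dependence

(Figure: three-node graphs over A, B, C illustrating a common cause, an intermediate gene, and a common effect)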
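The point about common effects can be checked by exact enumeration on a tiny network. The sketch below assumes a structure A → B ← C with invented probabilities: A and C are independent a priori, but become dependent once their common effect B is observed.

```python
from itertools import product

# Hypothetical numbers: A and C are independent; B is their common effect.
P_A1 = 0.5
P_C1 = 0.5
P_B1_GIVEN_AC = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(c) * P(b | a, c)."""
    pa = P_A1 if a else 1.0 - P_A1
    pc = P_C1 if c else 1.0 - P_C1
    pb1 = P_B1_GIVEN_AC[(a, c)]
    pb = pb1 if b else 1.0 - pb1
    return pa * pc * pb

def prob(**fixed):
    """Probability of a partial assignment, e.g. prob(a=1, b=1)."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Without observing B, learning C tells us nothing about A.
p_a_given_c = prob(a=1, c=1) / prob(c=1)

# Once B is observed, C changes our belief about A.
p_a_given_b = prob(a=1, b=1) / prob(b=1)
p_a_given_b_and_c = prob(a=1, b=1, c=1) / prob(b=1, c=1)
```

Here `p_a_given_c` equals the prior P(A=1), while `p_a_given_b` and `p_a_given_b_and_c` differ, which is the conditional dependence the slide refers to.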

Outline of Our Approach

Expression data → Bayesian network learning algorithm → learned network

Use the learned network to make predictions about the structure of the interactions between genes

(Figure: example learned network over the nodes E, R, B, A, C)

Experiment

Data from Spellman et al. (Mol. Biol. of the Cell, 1998)

Contains 76 samples covering the whole yeast genome:

Different methods for synchronizing the cell cycle in yeast

Time series at intervals of a few minutes (5-20 min)

Spellman et al. identified 800 cell-cycle regulated genes.

Methods

Treat samples as IID (ignoring temporal order)

Experiment 1: Discretized into three levels of expression (-, 0, +), with thresholds at -0.5 and 0.5 on log(ratio to control)

Learn multinomial probabilities

Experiment 2: Learn linear interactions (w/ Gaussian noise)

No prior biological knowledge was used
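The three-level discretization of Experiment 1 can be sketched as follows. The thresholds -0.5 and 0.5 on the log ratio come from the slide; the treatment of values exactly on a threshold, the encoding of levels as -1, 0, +1, and the helper names are assumptions.

```python
from collections import Counter

def discretize(log_ratio, low=-0.5, high=0.5):
    """Map a log(ratio to control) value to -1 (under), 0 (normal) or +1 (over).

    Values exactly on a threshold are assigned to the middle level (assumed).
    """
    if log_ratio < low:
        return -1
    if log_ratio > high:
        return 1
    return 0

def level_frequencies(levels):
    """Relative frequency of each level: a plain maximum-likelihood
    estimate of a multinomial distribution with no parents."""
    counts = Counter(levels)
    n = len(levels)
    return {level: count / n for level, count in counts.items()}

levels = [discretize(v) for v in (-1.2, -0.5, 0.0, 0.5, 0.9)]
```

For a gene with parents, the same counting would be done separately for each configuration of the parents' levels.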

Network Learned

Challenge: Statistical Significance

Sparse data:

Small number of samples

“Flat posterior”: many networks fit the data

Solution: estimate confidence in network features

Two types of features:

Markov neighbors: X directly interacts with Y

Order relations: X is an ancestor of Y
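Both feature types can be read off a network structure mechanically. The sketch below assumes a network is given as a dictionary from each node to its set of parents, and it interprets "Markov neighbors" as pairs joined by an edge or sharing a child (Markov blanket membership); that interpretation, the data layout, and the function names are assumptions.

```python
def markov_pairs(parents):
    """Unordered pairs {X, Y} where X and Y share an edge or a common child."""
    pairs = set()
    for child, pas in parents.items():
        ordered = sorted(pas)
        for p in ordered:
            pairs.add(frozenset((p, child)))      # parent-child edge
        for i, p in enumerate(ordered):
            for q in ordered[i + 1:]:
                pairs.add(frozenset((p, q)))      # co-parents of the same child
    return pairs

def ancestor_pairs(parents):
    """Ordered pairs (X, Y) such that X is an ancestor of Y."""
    def collect(node, seen):
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                collect(p, seen)
        return seen
    return {(a, node) for node in parents for a in collect(node, set())}

# Example structure: E -> R, E -> A, B -> A, A -> C.
example = {"B": set(), "E": set(), "R": {"E"}, "A": {"B", "E"}, "C": {"A"}}
```

On `example`, B and E form a Markov pair through their common child A, and both are ancestors of C.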

Confidence Estimates

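Bootstrap approach [FGW, UAI99]

Resample the data set D to obtain D1, D2, ..., Dm

Learn a network G_i from each resampled data set D_i

Estimate: $C(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$, where f(G_i) is 1 if feature f holds in G_i and 0 otherwise

(Figure: D is resampled into D1, D2, ..., Dm, and a network over the nodes E, R, B, A, C is learned from each)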
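The bootstrap estimate of feature confidence can be sketched generically. In this sketch `learn` and `feature` are caller-supplied functions, and `toy_learn` is a deliberately trivial stand-in (it links X and Y when their sample correlation is strong); it is not the structure-learning algorithm used in the talk.

```python
import math
import random

def bootstrap_confidence(data, learn, feature, m=200, seed=0):
    """Fraction of m bootstrap networks in which `feature` holds."""
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(m):
        # Resample n rows with replacement from the original data set.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        network = learn(resample)
        hits += 1 if feature(network) else 0
    return hits / m

def toy_learn(rows, threshold=0.5):
    """Hypothetical learner: returns {("X", "Y")} if |correlation| > threshold."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    n = len(rows)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return set()
    r = sxy / math.sqrt(sxx * syy)
    return {("X", "Y")} if abs(r) > threshold else set()

rows = [(i, 2 * i) for i in range(20)]
confidence = bootstrap_confidence(rows, toy_learn, lambda g: ("X", "Y") in g, m=50)
```

Any feature test can be plugged in as `feature`, for example a check for a Markov pair or an order relation in the learned network.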

Testing for Significance

We run our procedure on randomized data where we reshuffled the order of values for each gene

Histograms of number of Markov features at each confidence level

(Figure: original data vs. randomized data. x-axis: confidence threshold t, from 0.1 to 1; y-axis: number of features with confidence above t; curves labeled Random and Real)

(Figure: Markov w/ Gaussian Models, Random vs. Real, same axes)

(Figure: Markov w/ Multinomial Models, Random vs. Real, same axes)
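The randomization described on this slide, reshuffling the order of values for each gene, can be sketched as an independent permutation of every column of the expression matrix. The row-per-experiment layout and the function name are assumptions.

```python
import random

def shuffle_each_gene(matrix, seed=0):
    """Return a copy of `matrix` with every gene's values permuted independently.

    matrix[i][j] is the expression level of gene j in experiment i.
    Each gene keeps its own set of values, but the association between
    genes within an experiment is destroyed.
    """
    rng = random.Random(seed)
    n_experiments = len(matrix)
    n_genes = len(matrix[0]) if matrix else 0
    columns = []
    for j in range(n_genes):
        column = [matrix[i][j] for i in range(n_experiments)]
        rng.shuffle(column)
        columns.append(column)
    # Reassemble rows from the shuffled columns.
    return [[columns[j][i] for j in range(n_genes)] for i in range(n_experiments)]

original = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]]
randomized = shuffle_each_gene(original)
```

Running the same learning and confidence procedure on `randomized` gives a baseline for how many high-confidence features appear by chance.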

Local Map

Finding Key Genes

Key gene: a gene that precedes many other genes

YLR183C

MCD1: Mitotic Chromosome Determinant

RAD27: DNA repair protein

CLN2: role in cell cycle START

SRO4: involved in cellular polarization during budding

YOX1: Homeodomain protein that binds leu-tRNA gene

POL30: required for DNA replication and repair

YLR467W

CDC5

MSH6: Homolog of the human GTBP protein

YML119W

CLN1: role in cell cycle START
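One simple way to rank candidate key genes, given confidence estimates for order relations, is to count how many other genes each gene precedes with high confidence. This is a simplified sketch, not necessarily the scoring used to produce the list above; the threshold, the input format, and the gene names `g1` to `g3` are hypothetical.

```python
from collections import Counter

def rank_key_genes(order_confidence, threshold=0.8):
    """Rank genes by the number of genes they precede with confidence >= threshold.

    order_confidence maps (ancestor, descendant) pairs to a confidence in [0, 1].
    Returns a list of (gene, count) pairs, highest count first.
    """
    counts = Counter()
    for (ancestor, _descendant), confidence in order_confidence.items():
        if confidence >= threshold:
            counts[ancestor] += 1
    return counts.most_common()

ranking = rank_key_genes({
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.85,
    ("g2", "g3"): 0.95,
    ("g3", "g1"): 0.1,
})
```

In this toy input `g1` precedes two genes with high confidence and `g2` precedes one, so `g1` ranks first.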

Future Work

Finding suitable local distribution models

Correct handling of hidden variables: can we recognize hidden causes of coordinated regulation events?

Incorporating prior knowledge: incorporate the large mass of biological knowledge, and insight from sequence/structure databases

Abstraction: combine with cluster analysis for higher-confidence conclusions