From Sequence to Expression: A Probabilistic Framework Eran Segal (Stanford) Joint work with: Yoseph...
- Slide 1
- From Sequence to Expression: A Probabilistic Framework Eran
Segal (Stanford) Joint work with: Yoseph Barash (Hebrew U.) Itamar
Simon (Whitehead Inst.) Nir Friedman (Hebrew U.) Daphne Koller
(Stanford)
- Slide 2
- Understanding Cellular Processes. Complex biological processes
(e.g. cell cycle): coordination of multiple events; each event
requires different modules. [Diagram: cell-cycle phases S, G2, M,
G1.] Can we recover the regulatory circuits that control such
processes?
- Slide 3
- Gene Structure. [Diagram labels: Coding Region; Promoter Region
(CTAGTAGATATCGATCAG); mRNA; Protein.]
- Slide 4
- Gene Regulation. [Diagram: Genes 1 to 5; labels: A, AGACTTCAGA,
Sequence Motif, mRNA.]
- Slide 5
- Gene Regulation. [Diagram: Genes 1 to 5; motif A marked three
times; Swi5 labeled as Transcription Factor; mRNA.]
- Slide 6
- Gene Regulation. [Diagram: Genes 1 to 5; motif A marked three
times; Swi5; A labeled Activated; mRNA.] More mRNA (higher
expression).
- Slide 7
- Gene Regulation. [Diagram: Genes 1 to 5; motif A marked three
times; Swi5; A labeled Activated; motif B (AGTTGA) marked four
times; mRNA.]
- Slide 8
- Gene Regulation. [Diagram: Genes 1 to 5; motif A marked three
times; Swi5; motif B marked four times; Ndd1; labels: Activated, B,
A, +mRNA.]
- Slide 9
- Goal. [Figure: short sequences (ACTAGTGCTGA, CTATTATTGCA,
CTGATGCTAGC) plus a block of promoter sequences as input; labels on
the output side: R(t1), G1, t1, Motif and R(t2), G2, t2, Motif.]
- Slide 10
- Model of Gene Regulation. Probabilistic Relational Models (PRMs):
Pfeffer and Koller (1998), Friedman et al (1999), Segal et al
(2001). [Diagram classes: Gene, Experiment, Expression. Labels:
Sequence; Promoter sequences; Regulation by transcription factors;
Expression measurements; Context; Cluster.]
- Slide 11
- Regulation to Expression. [Diagram: Gene, Experiment, Expression;
Level, R(t1), R(t2), Exp. type, Exp. cluster.] R(t1) = yes: t1
regulates the gene. R(t1) = no: t1 does not regulate the gene.
- Slide 12
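- Regulation to Expression. [Diagram: Gene, Experiment, Expression;
Level, R(t1), R(t2), Exp. type, Exp. cluster.] CPD for Level, with
rows given as (R(t1), R(t2), Exp. type) followed by two values
whose column headers are not shown: (0, 0, I): -0.7, 1.2; (0, 1,
II): 0.8, 0.6. [Plots: P(Level) vs. Level, marked at -0.7 and
0.8.]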
- Slide 13
- Modeling Context Specificity. [Diagram: Gene, Experiment,
Expression; Level, R(t1), R(t2), Exp. type, Exp. cluster. Decision
tree with tests Exp. type = G1, R(t1) = Yes and R(t2) = Yes on
true/false branches; leaf plots of P(Level) vs. Level marked 3, 0
and 2.] Gaussian decision tree; t1 only relevant in G1; t2 only
relevant in G2.
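
The Gaussian decision tree on Slide 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the leaf names, the assignment of the marked values 3, 0 and 2 to particular leaves, and the unit standard deviations are hypothetical, since the slide shows only the tree tests and the leaf plots.

```python
import math

# Hypothetical leaves as (mean, std). The slide marks the leaf plots at
# 3, 0 and 2; which leaf gets which value, and the std of 1.0, are assumed.
LEAVES = {
    "g1_and_t1": (3.0, 1.0),
    "g2_and_t2": (2.0, 1.0),
    "default": (0.0, 1.0),
}


def leaf_for(exp_type, r_t1, r_t2):
    """Walk the tree: t1 is consulted only in G1, t2 only in G2."""
    if exp_type == "G1":
        return "g1_and_t1" if r_t1 else "default"
    if exp_type == "G2":
        return "g2_and_t2" if r_t2 else "default"
    return "default"


def level_density(level, exp_type, r_t1, r_t2):
    """Gaussian density P(Level | Exp. type, R(t1), R(t2)) at the chosen leaf."""
    mean, std = LEAVES[leaf_for(exp_type, r_t1, r_t2)]
    z = (level - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
```

Context specificity shows up in `leaf_for`: in a G1 experiment the value of R(t2) never changes the leaf, so the tree needs fewer parameters than a full table over all parent combinations.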
- Slide 14
- Sequence Model. [Diagram: Gene, Experiment, Expression; Level,
R(t1), R(t2), Exp. type, Exp. cluster, Sequence.] Assumptions:
binding site is of length k; binding may occur at any k-mer; TF
regulates gene if binding occurs anywhere.
- Slide 15
- From Sequence to Regulation. Assumptions: binding site is of
length k; binding may occur at any k-mer; TF regulates gene if
binding occurs anywhere. PSSM: background distribution; motif
distribution. Discriminative training, where [equation not
preserved].
- Slide 16
- From Sequence to Regulation. Model for one gene g, promoter
region of length 5 and k = 2. [Diagram: sequence residues S1, S2,
S3, S4, S5; k-mer binding events m[1].B, m[2].B, m[3].B, m[4].B;
g.R(t), the variable for "t regulates g"; labels: logistic
function, motif model.]
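
A minimal sketch of how the pieces on Slides 15 and 16 could fit together, assuming a log-odds PSSM score per k-mer, a logistic function with a bias term, and independent k-mer binding events combined as "binding occurs anywhere". The function names, the bias parameter and the example PSSM are hypothetical, and the exact parameterization used in the talk is not recoverable from the transcript.

```python
import math


def kmer_log_odds(kmer, pssm, background):
    """Sum over positions of log(motif probability / background probability)."""
    return sum(math.log(pssm[i][base] / background[base])
               for i, base in enumerate(kmer))


def binding_prob(kmer, pssm, background, bias):
    """Logistic function of the log-odds score plus a bias (assumed form)."""
    score = kmer_log_odds(kmer, pssm, background) + bias
    return 1.0 / (1.0 + math.exp(-score))


def regulation_prob(promoter, pssm, background, bias):
    """P(R(t) = yes): binding at one or more k-mers, assumed independent."""
    k = len(pssm)
    p_no_binding = 1.0
    for start in range(len(promoter) - k + 1):
        kmer = promoter[start:start + k]
        p_no_binding *= 1.0 - binding_prob(kmer, pssm, background, bias)
    return 1.0 - p_no_binding


# Illustrative values only: a k = 2 motif that prefers "AC".
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
EXAMPLE_PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
```

With a promoter of length 5 and k = 2, as on Slide 16, the loop in `regulation_prob` visits four k-mers, matching m[1].B through m[4].B. For example, `regulation_prob("ACGTA", EXAMPLE_PSSM, UNIFORM, -2.0)` is larger than the same call on `"GGGGG"`, because only the first promoter contains the preferred k-mer.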
- Slide 17
- Joint Probabilistic Model. [Diagram: Gene, Experiment, Expression;
Level, R(t1), R(t2), Exp. type, Exp. Cluster; k-mer residues s1 ...
sk; B(t1), B(t2).] Discriminative model: maximizes P(Expr. | Seq.).
- Slide 18
- Localization Assay
- Slide 19
- [Diagram: Swi5, DNA.] Induce TF protein level.
- Slide 20
- Localization Assay. [Diagram: DNA; Swi5; Gene Bound; Gene Not
Bound.] Induce TF protein level; TF binds to targets.
- Slide 21
- Localization Assay. [Diagram: DNA; Swi5; Gene Bound; Gene Not
Bound.] Induce TF protein level; TF binds to targets; measure TF
binding to the promoter of every gene and assign a confidence for
each binding.
- Slide 22
- Localization Assay, Simon et al (2001). Localization data:
measure TF binding to the promoter of each gene (assign binding
confidence).
- Slide 23
- Is Regulation Observed? Not quite: localization is measured for
specific conditions; localization is measured for large DNA
regions; localization is noisy.
- Slide 24
- Incorporating Localization. [Diagram: Gene, Experiment,
Expression; Level, R(t1), R(t2), Exp. type, Exp. Cluster; L(t1),
L(t2) marked as observed localization.] The localization p-value is
a noisy sensor of actual regulation: if regulation occurs, the
p-value is likely to be low; if there is no regulation, the p-value
is likely to be high.
- Slide 25
- Localization Model. [Diagram: Gene; R(t1); L(t1), marked
Observed.] The localization p-value is a noisy sensor of actual
regulation: if regulation occurs, the p-value is likely to be low;
if there is no regulation, the p-value is likely to be high.
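
One way to picture the "noisy sensor" idea is a Bayes update of the regulation variable from an observed p-value. The sketch below is an assumption-laden illustration: the slides do not give the sensor distribution, so the Beta-shaped densities, the parameter names `shape_yes` and `shape_no`, and the prior of 0.1 are all hypothetical choices that merely respect "low p-value if regulated, high p-value if not".

```python
def pvalue_likelihood(p_value, regulated, shape_yes=0.2, shape_no=2.0):
    """Density of the observed p-value given the regulation state.

    Assumed form: Beta(shape, 1), whose density is shape * p ** (shape - 1).
    shape_yes < 1 puts mass near 0; shape_no > 1 puts mass near 1.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    shape = shape_yes if regulated else shape_no
    return shape * p_value ** (shape - 1.0)


def posterior_regulated(p_value, prior=0.1, shape_yes=0.2, shape_no=2.0):
    """P(R(t) = yes | L(t) = p_value) by Bayes' rule."""
    like_yes = pvalue_likelihood(p_value, True, shape_yes, shape_no)
    like_no = pvalue_likelihood(p_value, False, shape_yes, shape_no)
    evidence = prior * like_yes + (1.0 - prior) * like_no
    return prior * like_yes / evidence
```

Under these assumed parameters, a p-value of 0.001 pushes the posterior well above the prior and a p-value of 0.9 pushes it below, while leaving room for expression and sequence evidence to overrule the assay in the joint model.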
- Slide 26
- Joint Probabilistic Model. [Diagram: Gene, Experiment, Expression;
Level, R(t1), R(t2), Exp. type, Exp. Cluster; promoter residues s1
... sk; L(t1), L(t2).]
- Slide 27
- Learning the Models. [Diagram: the LEARNER takes promoter
sequence (ACGCCTA), experimental details and localization data, and
outputs a model over Gene, Level, R(t1), R(t2), Exp. Phase,
Cluster, s1 ... sk, B(t1), B(t2). Example outputs: a tree CPD with
tests Exp. Phase = IV, R(t1) = Yes and R(t2) = Yes on true/false
branches, and a table CPD with rows (R(t1), R(t2), Exp. Phase)
followed by two values: (0, 0, I): 0.8, 1.2; (0, 1, II): -0.7,
0.6.]
- Slide 28
- Learning the Models. [Diagram: the LEARNER takes promoter
sequence (ACGCCTA), experimental details and localization data, and
outputs a model over Gene, Level, R(t1), R(t2), Exp. Phase,
Cluster, s1 ... sk, B(t1), B(t2).] Ndd1 activates Ace2 and Swi5 in
G1, which together activate in S. Mcm1 activates the DNA repair
pathway in S.
- Slide 29
- Model Learning. Tasks: structure learning (tree structure);
missing data (experiment cluster, regulation variables); motif
model (parameter estimation). Methods: Expectation Maximization;
Bayesian score; heuristic search; discriminative training
(conjugate gradient).
- Slide 30
- Model Learning. [Diagram: the model (Gene, Experiment, Expression;
Level, R(t1), R(t2), L(t1), Exp. type, Exp. cluster, promoter
residues s1 ... sk) combined with the data: experimental details,
localization data and promoter sequence (ACGCCTA).]
- Slide 31
- Resulting Bayesian Network. [Diagram: the model unrolled over
three genes and two experiments: Level i,j for genes i = 1, 2, 3
and experiments j = 1, 2; per gene i, the variables R(t1), R(t2),
L(t1), L(t2) and promoter residues s_1i ... s_ki; Exp. type per
experiment; Exp. cluster.]
- Slide 32
- Model Learning: E-Step. [Diagram: same unrolled network as on
Slide 31.] Loopy belief propagation.
- Slide 33
- Model Learning: M-Step. [Diagram: same unrolled network as on
Slide 31.] Standard ML estimation; conjugate gradient.
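
To make the E-step/M-step alternation concrete, here is a deliberately reduced sketch: one hidden binary regulation variable per gene and one Gaussian expression level per gene. It is not the procedure from the talk. The E-step here is an exact posterior rather than loopy belief propagation, the M-step uses closed-form ML updates with no conjugate-gradient motif training, and the function names, the fixed standard deviation and the data are all hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_state(levels, std=0.3, iterations=50):
    """EM for a hidden yes/no regulation variable with Gaussian levels.

    Returns (mean if not regulated, mean if regulated, P(regulated)).
    """
    mu_no, mu_yes, prior_yes = min(levels), max(levels), 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each gene is regulated.
        resp = []
        for x in levels:
            yes = prior_yes * normal_pdf(x, mu_yes, std)
            no = (1.0 - prior_yes) * normal_pdf(x, mu_no, std)
            resp.append(yes / (yes + no))
        # M-step: maximum-likelihood updates weighted by the posteriors.
        total_yes = sum(resp)
        total_no = len(levels) - total_yes
        mu_yes = sum(r * x for r, x in zip(resp, levels)) / total_yes
        mu_no = sum((1.0 - r) * x for r, x in zip(resp, levels)) / total_no
        prior_yes = total_yes / len(levels)
    return mu_no, mu_yes, prior_yes


# Made-up expression levels for six genes.
EXAMPLE_LEVELS = [-0.9, -0.7, -0.5, 0.6, 0.8, 1.0]
```

On `EXAMPLE_LEVELS` the two means settle near the centers of the two groups. In the full model the same alternation applies, but the posterior over each R(t) also depends on localization and sequence, which is why approximate inference is needed there.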
- Slide 34
- Experimental Results: Yeast. Cell cycle expression data (Spellman
et al); localization data for 9 TFs (Simon et al); yeast genome
(promoters).
- Slide 35
- Generalization. [Diagram: Gene, Experiment, Expression; Level,
R(t1), R(t2), Exp. Cluster.] Gene log-likelihood: clustering genes
-112.24.
- Slide 36
- Generalization. [Diagram: Gene, Experiment, Expression; Level,
L(t1), L(t2), Exp. type.] Gene log-likelihood: clustering genes
-112.24; localization -121.48.
- Slide 37
- Generalization. [Diagram: Gene, Experiment, Expression; Level,
R(t1), R(t2), Exp. type, Exp. Cluster, L(t1), L(t3).] Gene
log-likelihood: clustering genes -112.24; localization -121.48;
localization + exp. cluster -103.76.
- Slide 38
- Generalization. [Diagram: Gene, Experiment, Expression; Level,
R(t1), R(t2), promoter residues s1 ... sk, Exp. type, Exp. Cluster,
L(t1), L(t3).] Gene log-likelihood: clustering genes -112.24;
localization -121.48; localization + exp. cluster -103.76; +
sequence -94.59.
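
The gene log-likelihood figures on Slide 38 are easier to compare as differences from the clustering baseline. The snippet below does only that arithmetic, using the four numbers from the slide; the variable names are illustrative, and higher (less negative) log-likelihood is read as better generalization.

```python
# Gene log-likelihood values as listed on Slide 38.
scores = {
    "Clustering genes": -112.24,
    "Localization": -121.48,
    "Localization + exp. cluster": -103.76,
    "+ Sequence": -94.59,
}

baseline = scores["Clustering genes"]

# Positive gain means the variant scores better than the clustering baseline.
gains = {name: round(value - baseline, 2) for name, value in scores.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

Localization alone scores 9.24 below the baseline, adding the experiment cluster brings the model 8.48 above it, and adding sequence brings it 17.65 above it.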
- Slide 39
- Generating Hypotheses. Example: genes regulated by Swi6, but not
by Mcm1 and not by Fkh2, exhibit a unique expression pattern in
phase G1 of the cell cycle. Gene functions: DNA repair [P 3e-09];
DNA synthesis [P 7e-05].
- Slide 40
- Expression vs Regulation. [Plot: expression (axis from -0.5 to 1)
across the alpha, cdc15, cdc28 and elu time courses, with a Phase
annotation; series: Swi5 regulated, Swi5 expression.] Genes
predicted to be regulated by Swi5 are probably real Swi5 targets.
- Slide 41
- Combinatorial Effects. [Plot: expression (axis from -0.5 to 1)
across the alpha, cdc15, cdc28 and elu time courses, with a Phase
annotation; series: Fkh2 & Swi4, Fkh2 & Ndd1.]
- Slide 42
- Combinatorial Effects. [Plot: expression (axis from -0.5 to 1)
across the alpha, cdc15, cdc28 and elu time courses, with a Phase
annotation; series: Mcm1 & Ndd1, Mcm1 & Ace2, Mcm1 & Swi5.]
- Slide 43
- Localization Assignment Changes
- Slide 44
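- Motifs Found: Ndd1. [Diagram: gene sets labeled Simon et al.,
Expanded Set and Remaining Genes, with counts 17, 1 and 28.] The
expanded set identified additional genes regulated by Ndd1.
- Slide 45
- Table with columns TF, Simon, Expanded, Rest, P-value. The three
count columns are fused in this transcript and are left as
extracted, except for Ndd1, whose counts match Slide 44.
  Ace2: counts 1091; P = 1.4e-6
  Fkh1: counts 29258; P = 4.4e-10
  Fkh2: counts 29 10; P = 5.4e-11
  Mbp1: counts 66568; P = 1.9e-45
  Mcm1: counts 28242; P = 4.2e-18
  Ndd1: Simon 17, Expanded 28, Rest 1; P = 1.9e-24
  Swi4: counts 41375; P = 6.4e-26
  Swi5: counts 28232; P = 4.9e-15
  Swi6: counts 50526; P = 2.3e-48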
- Slide 46
- Induced Interaction Network: TF pairs whose regulation predicts
expression of the same gene cluster. [Diagram: network over Ace2,
Swi5, Ndd1, Fkh2, Fkh1, Swi4, Swi6, Mcm1 and Mbp1, annotated with
cell-cycle phases G1, S, G2, M and M/G1.]
- Slide 47
- Conclusions. Unified probabilistic model explaining gene
regulation using sequence, localization and expression data. Models
complex interactions between regulators. Discriminative model
maximizing P(Expr. | Seq.). Sequence data helps explain expression
patterns.
- Slide 48
- Big Picture. Goal: a unified probabilistic framework that models
complex biological domains and incorporates heterogeneous data. The
framework explicitly incorporates basic biological building blocks
within the model: genes, TFs, proteins, patients, cells, species.
Much closer connection between biology and model: can read biology
directly from the model; can incorporate prior knowledge easily.
Can explicitly represent and learn biological models.