Transcriptional and post-transcriptional regulation of gene expression
[Figure: DNA (activation, repression; Pol II) → mRNA (translation, localization, stability; 3’UTR) → protein]
Using gene expression to identify regulatory elements
[Figure: 5’ upstream regions alongside gene expression]
Feb 2007: ~110,000 arrays in NCBI GEO
Feb 2008: ~202,000 arrays
Allen Institute Mouse Brain Gene Expression Atlas (in situ hybridization, ~23,000 genes)
Roth et al., 1998; Tavazoie et al., 1999: co-expressed genes often share the same regulatory elements
[Figure: 5’ upstream regions alongside expression]
How do you identify regulatory elements from gene expression?
Motif finding programs:
- AlignACE (Hughes et al, 2000)
- MEME
- REDUCE (Bussemaker et al, 2001)
and many others …
Problems with current motif finding approaches
Different approaches for different types of expression data
Single microarray (e.g. log-ratios): REDUCE
Co-expression clusters (C0, C1, C2, C3, C4): AlignACE
Problems with current motif finding approaches
Many unrealistic assumptions, e.g., zeroth-order background sequence model in AlignACE
1 kb upstream region of Plasmodium falciparum PF11_0108:
TTTAAAAAAAAAAAAAAAAAGAGAAAAACCATATTTATATGGATATAATATTTTTAAAGTATAGAAAAAATAATATATATTTATATACATTTATATTAATGAAAAAGCAAACAGCTAAATTACAAAAAAAAAAAAAAATTAGATTATCTCAATTAAAAGAACAATATATAAATAATTAATCCATGCTATTTTTTGATATATATAAGAATTTAATGCCTTATATTATAAATAGAGAAATAAATAAATAAATAAATATATAAACATATATATTATATATATATATATATATATAGTTATACATTATGATTTTGAAAAAATAGATATATACTATTAATTGTATATGTTTATACATAAAGCATATTTTTATTAATTGTAATATATAGATTTTTTATTATAATAATATTATATATATATATATATATATATATATTTTTTTTTTTTTGTTAAATAGCGAAATAAAAATACCTGACCTTTGTAATCTTTATTTGATTACTTCCTTCTTCATTCCTTCTTTGTTTGTTTGTTTGTTTCCCTTTTTTTTTTTTTTTTTTTTTTTTTTAGTTAATTCTTTTATATGTATAATAATATTATAAGACAATTGGACAATGATTACAAAAAGGTAAAAGTAATAATTTTCTAAAGTATAATATAATATTATAATAATATAATATAAATTTTTAATAAAATTTAATAAAAAAGTTTATAAATACTTATCGACCATAAGTCGTTTAAGAAAAAAAAAAAAAAGAAAAAAAAAAAAAAATATTACAAAAATATTATAGTGTATTATATTTTATCATATCATCTTTTTTTTATTTTTTTTTATATTTTTGTTACGGCACATCAAGCAACTATAAATATTTAAGATCAACCACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACATTATTTATGGTATTTTAAA
Problems with current motif finding approaches
Elevated false positive rate
k-means clustered gene expression → AlignACE → many motifs
randomly clustered gene expression → AlignACE → many motifs
Re-thinking motif discovery from expression data
• One approach for all types of expression data
• Make as few a priori assumptions as possible
• Very low false positive rate
• Scale to complex metazoan and plant genomes
Noam Slonim
[Figure: expression matrix (all genes on the array × microarray conditions) grouped into clusters of co-expressed genes (cluster index 0 to 4); each gene's 5’ upstream region is paired with its cluster index]
Correlation is quantified using the mutual information
Joint probabilities of motif presence and expression cluster index:
Motif      Cluster 0   Cluster 1   Cluster 2
Absent     0.27        0.07        0.33
Present    0.07        0.27        0.00
Our approach: look for motifs whose profile of
presence/absence is informative about the
expression profile
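$$I(\text{motif};\ \text{expression}) = \sum_{i=1}^{2}\sum_{j=1}^{3} P(i,j)\,\log\frac{P(i,j)}{P(i)\,P(j)}$$
where i indexes motif absence/presence and j indexes the expression clusters.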
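The mutual information of the example table can be checked numerically. This is a minimal sketch, assuming base-2 logarithms (bits) and assuming the table's rounded entries come from 15 genes (4/15 ≈ 0.27, 1/15 ≈ 0.07, 5/15 ≈ 0.33); the function name is illustrative and not taken from the FIRE software.

```python
import math

def mutual_information(joint):
    """Mutual information (bits) of a joint table; joint[i][j] ~ P(motif state i, cluster j)."""
    total = sum(sum(row) for row in joint)
    joint = [[p / total for p in row] for row in joint]  # turn counts into probabilities
    p_i = [sum(row) for row in joint]                    # marginal over motif states
    p_j = [sum(col) for col in zip(*joint)]              # marginal over clusters
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                                    # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (p_i[i] * p_j[j]))
    return mi

# Assumed counts behind the table: rows = motif absent/present, columns = clusters 0, 1, 2.
counts = [[4, 1, 5],
          [1, 4, 0]]
mi = mutual_information(counts)

# A motif spread evenly over the clusters carries no information.
independent = mutual_information([[2, 2, 2],
                                  [1, 1, 1]])
```

With these assumed counts the motif is fairly informative about the cluster index, while the evenly spread motif scores zero.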
[Figure: genes grouped by cluster membership (clusters 0, 1, 2, ...)]
Continuous expression variables (e.g. microarray log-ratios)
[Figure: 5’ upstream regions paired with log-ratios: 0.45, 0.12, 0.01, -0.08, -0.87, -1.56, -2.32, -2.89, -5.65, 1.54, 1.98, 3.50, 4.39, 6.45, -8.90]
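The slides do not say how continuous values are turned into a discrete variable for the mutual information. One possible sketch, assuming equal-frequency (equally populated) bins and a hypothetical helper name, applied to the 15 log-ratios shown on this slide:

```python
def quantize_equal_frequency(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so that bins hold (nearly) equal numbers of genes."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # gene indices, lowest value first
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

log_ratios = [0.45, 0.12, 0.01, -0.08, -0.87, -1.56, -2.32, -2.89, -5.65,
              1.54, 1.98, 3.50, 4.39, 6.45, -8.90]
bins = quantize_equal_frequency(log_ratios, 3)  # 0 = lowest third, 2 = highest third
```

The resulting bin indices can then play the same role as cluster indices in the discrete case.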
Finding all informative DNA and RNA motifs
[Figure: 5’ upstream regions (DNA) and 3’UTRs (RNA), each paired with the expression cluster index of its gene]
How do we do it?
Possible motif representations
Representation     Example            Accuracy     Search space
Weight matrices                       very good    very large
Degenerate code    [AC]CGATGAG[TC]    good         large
Words (k-mers)     GCGATGAG           acceptable   small
Motif Search Algorithm
k-mer     MI
CTCATCG   0.0618   (highly informative)
TCATCGC   0.0485
AAAATTT   0.0438
GATGAGC   0.0434
AAAAATT   0.0383
ATGAGCT   0.0334
TTGCCAC   0.0322
TGCCACC   0.0298
ATCTCAT   0.0265
......
ACGCGCG   0.0018
CGACGCG   0.0012
TACGCTA   0.0011
ACCCCCT   0.0010
CCACGGC   0.0009
TTCAAAA   0.0005
AGACGCG   0.0004
CGAGAGC   0.0003
CTTATTA   0.0002   (not informative)
[Figure: the most informative k-mers are optimized into motifs with MI=0.081, MI=0.045, MI=0.040, ...]
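The ranking step can be sketched as follows: score every k-mer by the mutual information between its presence/absence profile and the cluster indices, then sort. This is an illustrative sketch on hypothetical toy sequences; it scans only the given strand and uses base-2 logarithms, both of which are simplifying assumptions.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def rank_kmers(sequences, clusters, k):
    """Return (k-mer, MI) pairs, most informative first."""
    scores = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        present = [kmer in seq for seq in sequences]   # presence/absence profile
        scores[kmer] = mutual_information(present, clusters)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: only the cluster-1 promoters carry GATG.
sequences = ["AAGATGCC", "TTGATGAA", "CGATGTTT", "ACACACAC", "TTTTCCCC", "CCAACCAA"]
clusters = [1, 1, 1, 0, 0, 0]
ranking = rank_kmers(sequences, clusters, k=4)
top_kmer, top_mi = ranking[0]
```

Enumerating all 4^k words is only practical for short k, which is one reason to start from k-mers and refine them afterwards.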
Optimizing k-mers into more informative degenerate motifs
ATCCGTACA → ATCC[C/G]TACA
Which character increases the mutual information by the largest amount?
[Figure: 5’ upstream regions paired with cluster indices; candidate characters at the tested position: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T]
Optimizing k-mers into more informative degenerate motifs
ATCC[C/G]TACA
[Figure: 5’ upstream regions paired with cluster indices; candidate characters at the next tested position: A/C, T/C, C/G, A/C/G, A/T/C, C/G/T; and so on for further changes]
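A greedy version of this optimization can be sketched as below: at each round, try every two- or three-letter character set that extends the current character at each position, and keep the single change that raises the mutual information the most. The data are hypothetical, the function names are illustrative, and the stopping rule (stop when no change improves the score) is an assumption rather than the published procedure.

```python
import math
import re
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def to_regex(motif):
    """A motif is a list of character sets, e.g. [{'A'}, {'C', 'G'}] -> '[A][CG]'."""
    return "".join("[" + "".join(sorted(s)) + "]" for s in motif)

def score(motif, sequences, clusters):
    pattern = re.compile(to_regex(motif))
    present = [pattern.search(seq) is not None for seq in sequences]
    return mutual_information(present, clusters)

def supersets(current):
    """Character sets of size 2 or 3 that contain the current set."""
    others = [b for b in "ACGT" if b not in current]
    for r in range(1, 4 - len(current)):
        for extra in combinations(others, r):
            yield current | frozenset(extra)

def optimize(kmer, sequences, clusters):
    motif = [frozenset(b) for b in kmer]
    best = score(motif, sequences, clusters)
    while True:
        best_candidate = None
        for pos in range(len(motif)):
            for chars in supersets(motif[pos]):
                candidate = motif[:pos] + [chars] + motif[pos + 1:]
                s = score(candidate, sequences, clusters)
                if s > best + 1e-12:             # keep the largest improvement of this round
                    best, best_candidate = s, candidate
        if best_candidate is None:               # no change helps: stop
            return motif, best
        motif = best_candidate

# Hypothetical promoters: cluster 1 carries ATCCGTACA or ATCCCTACA.
sequences = ["TTATCCGTACATT", "GGATCCCTACAGG", "ATCCGTACAAAA", "CCCATCCCTACA",
             "TTTTTTTTTTTTT", "ACACACACACACA", "GGGGAAAACCCC", "CATCATCATCAT"]
clusters = [1, 1, 1, 1, 0, 0, 0, 0]
motif, mi = optimize("ATCCGTACA", sequences, clusters)
```

On this toy input the seed ATCCGTACA is relaxed at its fifth position only, mirroring the ATCC[C/G]TACA example.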
[Figure panels: Mutual information; Motif conservation with S. bayanus; Similarity to ChIP-chip RAP1 motif. Reference motif: RAP1 binding site (ChIP-chip)]
Highly informative k-mers
k-mer     MI
CTCATCG   0.0618
TCATCGC   0.0485
AAAATTT   0.0438
GCTCATC   0.0434
AAAAATT   0.0383
ATGAGCT   0.0334
TTGCCAC   0.0322
TGCCACC   0.0298
ATCTCAT   0.0265
...
Motifs optimized so far: MI=0.081, MI=0.045
Optimize the next k-mer?
Only optimize a k-mer if I(k-mer; expression | motif) is large enough (for all motifs optimized so far)
Conditional mutual information: I(X;Y|Z)
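The conditional mutual information can be estimated from three discrete profiles. In this sketch X is the k-mer's presence profile, Y the expression cluster index, and Z the presence profile of an already optimized motif; base-2 logarithms and the function name are assumptions, and the threshold for "large enough" is not given in the slides.

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) in bits, estimated from three equal-length discrete sequences."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    # p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], written with raw counts
    return sum((c / n) * math.log2(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
               for (x, y, z), c in n_xyz.items())

expression = [1, 1, 1, 1, 0, 0, 0, 0]
motif = [1, 1, 1, 1, 0, 0, 0, 0]       # motif optimized earlier
redundant_kmer = list(motif)           # k-mer that occurs in exactly the same genes

# The redundant k-mer adds nothing once the motif is known.
redundant = conditional_mutual_information(redundant_kmer, expression, motif)

# With an uninformative Z, I(X;Y|Z) reduces to the plain mutual information I(X;Y).
plain = conditional_mutual_information(redundant_kmer, expression, [0] * 8)
```

A k-mer whose conditional information is near zero for some existing motif is essentially a variant of that motif and need not be optimized again.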
Each motif is subjected to a stringent statistical significance test
The real mutual information value is compared with the maximum of 10,000 expression-shuffled mutual information values
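A randomization test of this kind can be sketched as follows. The pass criterion used here (the real value must exceed every shuffled value) is one reading of the slide, and the function names and toy data are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def passes_shuffle_test(present, expression, n_shuffles=10_000, seed=0):
    """True if the real MI exceeds the MI obtained in every expression-shuffled replicate."""
    rng = random.Random(seed)
    real = mutual_information(present, expression)
    shuffled = list(expression)
    max_random = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)          # break the link between genes and expression
        max_random = max(max_random, mutual_information(present, shuffled))
    return real > max_random

# Toy profiles: a motif found only in cluster 1, and a motif found in every gene.
expression = [1] * 20 + [0] * 20
significant = passes_shuffle_test([True] * 20 + [False] * 20, expression, n_shuffles=1000)
uninformative = passes_shuffle_test([True] * 40, expression, n_shuffles=1000)
```

Shuffling the expression labels keeps both marginal distributions intact, so the shuffled values show how large the mutual information gets by chance for a motif with that frequency.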
The regulation of gene expression is highly combinatorial
[Figure: DNA, Pol II, and expression patterns 1, 2, 3]
Can we group our predicted motifs into modules of combinatorially acting regulatory elements?
Predicting combinatorial regulation using mutual information
Joint probabilities of motif presence in 5’ upstream regions:
                   Motif 2 absent   Motif 2 present
Motif 1 absent     0.43             0.07
Motif 1 present    0.07             0.43
$$I(\text{motif 1};\ \text{motif 2}) = \sum_{i=1}^{2}\sum_{j=1}^{2} P(i,j)\,\log\frac{P(i,j)}{P(i)\,P(j)}$$
Is the presence of motif 1 informative about the presence of motif 2?
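As a worked check of the 2×2 example (joint probabilities 0.43, 0.07, 0.07, 0.43), assuming base-2 logarithms: each marginal probability is 0.43 + 0.07 = 0.5, so every product P(i)P(j) equals 0.25.

```latex
\begin{aligned}
I(\text{motif 1};\ \text{motif 2})
  &= 2 \times 0.43 \,\log_2\frac{0.43}{0.25} \;+\; 2 \times 0.07 \,\log_2\frac{0.07}{0.25} \\
  &\approx 0.673 - 0.257 \\
  &\approx 0.42 \text{ bits}
\end{aligned}
```

The maximum for two balanced binary variables is 1 bit, so in this example the presence of one motif says a good deal about the presence of the other.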
Discovering modules of combinatorially acting motifs
[Figure: motifs grouped into modules]
Results
Yeast, P. falciparum (malaria parasite), Human
Yeast stress gene expression program (Gasch et al, 2000)
• 173 microarray conditions
• ~ 5,500 genes
• 80 co-expression clusters
• Runtime ~ 1h (standard PC)
[Figure: predicted motifs versus expression clusters]
17 motifs in 5’ upstream regions
6 motifs in 3’UTRs
0 “motifs” when shuffling the gene labels of the clustering partition
1129 motifs when applying AlignACE (with default parameters) to each cluster independently
880 “motifs” when applying AlignACE to the same shuffled clusters as above
Predicted Motifs
13 modules of co-occurring motifs
All 23 motifs are highly conserved with S. bayanus
14 previously known motifs: PAC, RRPE, PUF4, PUF3, MSN2/4, RAP1, RPN4, REB1, MBP1, HAP4, XBP1, BAS1, CBF1, SWI4
[Figure: motifs versus expression clusters; color scale from over-representation to under-representation]
PAC is under-represented in cluster 13 (p<1e-5)
PAC is highly over-represented in cluster 66 (p<1e-20)
Predicted cooperation between DNA and RNA motifs
[Figure: motifs versus expression clusters for PAC (5’), RRPE (5’), and Puf4 (3’UTR)]
Functional enrichments
[Figure: predicted motifs versus expression clusters for the 14 previously known motifs (PAC, RRPE, PUF4, PUF3, MSN2/4, RAP1, RPN4, REB1, MBP1, HAP4, XBP1, BAS1, CBF1, SWI4), annotated with enrichments]
Enrichment labels: Mitochondrial ribosome, p<1e-33; Mitochondrial ribosome, p<1e-29; Cytosolic ribosome, p<1e-18; Proteasome complex (p<1e-44); DNA replication (p<1e-7); Oxidative phosphorylation (p<1e-17)
Motif labels: Puf3, Rpn4, Mbp1, novel motif
Beer and Tavazoie, 2004; Elemento and Tavazoie, 2005
We also use mutual information to discover …
• Non-random spatial distribution
• Orientation preferences
• Co-localization
[Figure: motif positions in 5’ upstream regions paired with cluster indices; distance bins: close, far, very far]
Distance to TSS is informative about expression
Distance between two motifs is informative about expression
[Figure: predicted motifs versus clusters, with “Y” marks next to some motifs]
> 50% of our predicted motifs have a non-random spatial distribution
PAC has a non-random spatial distribution
[Figure: motif positions from -600bp to ATG, in clusters where the motif is over-represented versus clusters where the motif is NOT over-represented]
RAP1 motif has a different kind of non-random spatial distribution
[Figure: the unique cluster where the motif is over-represented versus clusters where the motif is NOT over-represented]
RAP1 motif also has a strong orientation preference
PAC and RRPE tend to co-localize on the DNA
[Figure: positions from -600bp to ATG, in clusters where the TWO motifs are both over-represented versus clusters where the motifs are NOT over-represented]
PAC and the Msn2/4 binding site tend to avoid being in the same promoters
[Figure: positions from -600bp to ATG]
Single array analysis
H2O2 treatment in ΔMsn2/ΔMsn4 background
[Figure: Cy3/Cy5 expression log-ratios from down-regulated to up-regulated; motifs: PAC, Rpn4, Yap1, Puf3]
Bozdech, Llinás, et al., PLoS Biol, 2003
P. falciparum intra-erythrocytic developmental cycle
[Figure: ~2,700 periodically expressed genes; time axis from 0h to 48h]
Associate a “phase” to each gene, which reflects the timing of maximal expression
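The slides do not say how the phase is computed. One possible sketch, assuming an evenly sampled time course that spans exactly one cycle, takes the phase of the one-cycle Fourier component; the profile below is hypothetical.

```python
import cmath
import math

def expression_phase(profile):
    """Phase (radians, between -pi and pi) of the one-cycle Fourier component of a time course."""
    n = len(profile)
    coeff = sum(x * cmath.exp(-2j * math.pi * t / n) for t, x in enumerate(profile))
    return -cmath.phase(coeff)         # sign chosen so later peaks give larger phases

# Hypothetical gene sampled hourly over one 48h cycle, peaking at 12h.
profile = [math.cos(2 * math.pi * (t - 12) / 48) for t in range(48)]
phase = expression_phase(profile)
peak_hour = (phase % (2 * math.pi)) / (2 * math.pi) * 48
```

Because the phase is circular, -π and +π describe the same timing, which matters if the phase is later divided into bins.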
[Figure: 5’ upstream regions paired with phases: -0.25, -0.12, 0.01, 0.08, 0.34, 0.67, 2.32, 2.89, 3.01, -0.38, -1.68, -2.34, -2.56, -3.14, 3.14]
Discovering motifs that are informative about the expression phase
21 motifs in 5’ upstream regions
0 motifs in 3’UTRs
0 “motifs” when shuffling the gene labels of the phase profile
[Figure axis: phase from -π to +π]
71% highly conserved with P. yoelii
Functional enrichments: DNA replication, p<1e-4; plastid, p<0.01; ribosome, p<0.001
Independent biochemical validation
- Purified 3/26 predicted TFs in P. falciparum
- Identified DNA-binding specificities using protein binding microarrays
Bulyk lab, Harvard
Llinás lab, Princeton University, submitted
Motifs Match Predictions
[Figure: protein binding microarray motifs compared with FIRE predictions; construct labels: GST, AP2; phase axis from -π to +π]
Independent biochemical validation for 3/21 motifs
More TFs being purified ...
bound by MAL6P1.44
bound by PF11_0404
bound by PF14_0633
Human gene expression atlas (Su et al, 2004, PNAS)
• 79 human tissues
• >17,000 genes
• 120 co-expression clusters
• Runtime ~ 24h (standard PC)
73 DNA motifs
42 RNA motifs
71 modules
DNA motif labels: ELK4, Sp1, AhR, bZIP911, NF-Y, E2F1, TCF11-MafG, Pax2, E2F, v-Myb, TEAD, Dof2, CHOP-C/EBPalpha, HAND1-TCF3, GBP, Skn-1, HFH-3, Sox17
RNA motif labels:
- miR-499/miR-505/miR-200a/miR-141
- miR-525/miR-518f*/miR-526c/miR-526a/miR-520a*
- miR-380-3p/miR-215/miR-485-3p/hsa-let-7g/miR-610/hsa-let-7i/hsa-let-7b/hsa-let-7a
- miR-30d/miR-30c/miR-30a-5p/miR-30e-5p/miR-30B
- miR-200b/miR-429/miR-200c
- miR-663
[Figure labels: NF-Y (NF-Y binding site); novel; M phase (p<1e-43); TCF11-MafG; novel; Olfactory receptor activity (p<1e-43); miR-525, miR-518f*, miR-526c; Sp1]
let-7b over-expression in human fibroblasts
let-7 microRNAs are up-regulated when fibroblasts enter quiescence
Let-7 target genes?
[Figure: ~14,000 genes grouped into clusters C1 to C5 over a time course: 0h, 12h, 24h, 36h, 48h]
A. Lagesse-Miller, O. Elemento, …, and H. Coller, submitted
let-7b over-expression in human fibroblasts
               UACCUC
               ||||||
uugguguguuggaugAUGGAGu-5’ let-7b (seed in capitals)
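The pairing shown above can be reproduced in a few lines: reverse the let-7b sequence to 5’→3’ order, take the six capitalized seed bases, and reverse-complement them to get the site expected in a target 3’UTR. The helper names and the example UTR are hypothetical; only the let-7b sequence and the UACCUC site come from the slide.

```python
def reverse_complement_rna(seq):
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[b] for b in reversed(seq.upper()))

# let-7b is written 3'->5' on the slide; reverse it to the usual 5'->3' order.
let7b_3to5 = "uugguguguuggaugAUGGAGu"
let7b = let7b_3to5[::-1].upper()
seed = let7b[1:7]                       # the six capitalized bases, read 5'->3'
site = reverse_complement_rna(seed)     # sequence expected in the target 3'UTR

def count_seed_sites(utr, site=site):
    """Count (possibly overlapping) occurrences of the seed-complementary site in a 3'UTR."""
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA alphabets
    return sum(utr.startswith(site, i) for i in range(len(utr)))

n_sites = count_seed_sites("AAATACCTCGGGUACCUC")  # hypothetical UTR with two sites
```

An informative 3’UTR motif that equals such a site is what the slides refer to as an RNA motif matching a miRNA target.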
Arabidopsis thaliana
[Figure axis: -1000bp to TSS]
Experimental testing:
Ken Birnbaum (NYU)
Phil Benfey (Duke)
~22,300 genes on Affy chip
Biological insights
• Importance of RNA motifs in shaping transcriptomes: ~30% of yeast, worm, human, and Arabidopsis motifs are RNA motifs
• In worm/human/mouse, many RNA motifs match miRNA targets
• “Cooperation” between DNA and RNA motifs
                UGUGAU
                ||||||
cgaguaguuucgaccgACACUAu
Yeast Puf4 motif
Novel worm 3’UTR motif
Biological insights
• Avoidance of joint-presence for certain motifs
• Under-representation of certain motifs
FIRE (Finding Informative Regulatory Elements)
Elemento, Slonim and Tavazoie, 2007, Molecular Cell
• It works with any type of gene expression data: single microarray (e.g. log-ratios), clustered microarrays (C0 to C4), gene expression phase
• It looks for both DNA and RNA motifs (5’ upstream regions and 3’UTRs)
• It is fast and scales well to large metazoan and plant genomes
• It yields few or no false positives: 115 motifs with real clustered gene expression, 0 motifs with randomly clustered gene expression (human tissue microarray dataset)
• It automatically evaluates: functional coherence (e.g. Defense response, p<1e-32), inter-species conservation, spatial and orientation biases, similarity to known motifs (JASPAR), cooperativity and co-localization
FIRE (Finding Informative Regulatory Elements)
Usage:
fire --expfile=human_clusters.txt --exptype=discrete --species=human
Expression file (--expfile), e.g.:
NM_000030 0
NM_000040 0
NM_000042 0
NM_000045 0
NM_000046 1
NM_000053 1
NM_000065 1
NM_000066 1
...
Expression type (--exptype): discrete, continuous
Species (--species): human, mouse, arabidopsis, drosophila, worm, plasmodium, budding yeast, fission yeast, sea squirt
http://tavazoielab.princeton.edu/FIRE/
~250 downloads since Nov 2007
~1500 queries since Nov 2007
Bambi Tsui
Acknowledgements
Saeed Tavazoie Noam Slonim
Sasan Amini
Chang Chan
Gordon Freckleton
Hany Girgis
Yir-Chung Liu
Ilias Tagkopoulos
Tiffany Vora
Scott Breunig
Anand Dharan
Hani Goodarzi
Danny Lieber
Yael Marshall
Kellen Olszewski
Bambi Tsui
Eric Wieschaus
Manuel Llinás
Hilary Coller
Aster Lagesse-Miller
Xuemin Lu
Erandi De Silva
Collaborators at
Princeton
Additional slides
…Chan*, Elemento*, Tavazoie, PLoS Computational Biology, 2005
k-mer     MI       Test
CTCATCG   0.0618   PASS
TCATCGC   0.0485   PASS
AAAATTT   0.0438   PASS
GATGAGC   0.0434   PASS
AAAAATT   0.0383   PASS
ATGAGCT   0.0334   PASS
TTGCCAC   0.0322   PASS
TGCCACC   0.0298   PASS
CATCGCA   0.0293   PASS
AGATGAG   0.0288   PASS
TTTTTCA   0.0280   PASS
ATCTCAT   0.0265   PASS
...
ACGCGCG   0.0168   PASS
CGACGCG   0.0167   DON’T PASS
TACGCTA   0.0167   DON’T PASS
ACCCCCT   0.0167   DON’T PASS
CCACGGC   0.0164   DON’T PASS
TTCAAAA   0.0163   DON’T PASS
AGACGCG   0.0163   DON’T PASS
CGAGAGC   0.0163   DON’T PASS
GATAGAG   0.0155   DON’T PASS
GTAGCTC   0.0143   DON’T PASS
CTTATTA   0.0142   DON’T PASS
...
The list runs from the most informative k-mers (top) to less informative ones (bottom)
After 10 consecutive “don’t pass”: optimize the “seeds” into more degenerate motifs
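The stopping rule on this slide can be sketched as a walk down the MI-ranked list that keeps the k-mers passing the test and stops after 10 consecutive failures. The function name and the stand-in test are hypothetical; in practice the test would be the shuffle-based significance test described earlier.

```python
def select_seeds(ranked_kmers, passes_test, max_failures=10):
    """Keep k-mers that pass, scanning from most to least informative,
    and stop once max_failures consecutive k-mers have failed."""
    seeds, failures = [], 0
    for kmer in ranked_kmers:
        if passes_test(kmer):
            seeds.append(kmer)
            failures = 0               # a pass resets the run of failures
        else:
            failures += 1
            if failures == max_failures:
                break
    return seeds

# Stand-in data: the first 13 ranked k-mers pass, as does one far down the list.
ranked = ["kmer%02d" % i for i in range(40)]
passing = set(ranked[:13]) | {ranked[30]}
seeds = select_seeds(ranked, passing.__contains__)
```

Stopping early avoids running the expensive test on the long tail of uninformative k-mers, at the cost of never reaching a rare passing k-mer far down the list.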
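Optimizing seeds into more informative degenerate motifs
Single nucleotides: A, T, C, G
Two-nucleotide codes: S=C/G, M=A/C, W=A/T, R=A/G, K=T/G, Y=T/C
Three-nucleotide codes: V=A/C/G, H=A/C/T, D=A/G/T, B=C/G/T
Any nucleotide: N=A/C/G/T
TCC[C/G]TAC matches TCCCTAC and TCCGTAC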
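The degenerate code maps directly onto regular-expression character classes. A minimal sketch, using the code table from this slide and a hypothetical helper name:

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "M": "AC", "W": "AT", "R": "AG", "K": "GT", "Y": "CT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

def iupac_to_regex(motif):
    """Translate a degenerate motif such as TCCSTAC into a regular expression."""
    return "".join(IUPAC[c] if len(IUPAC[c]) == 1 else "[" + IUPAC[c] + "]"
                   for c in motif.upper())

regex = iupac_to_regex("TCCSTAC")      # S stands for C/G
pattern = re.compile(regex)
matches = [s for s in ("TCCCTAC", "TCCGTAC", "TCCATAC") if pattern.fullmatch(s)]
```

Here TCCSTAC is the one-letter spelling of TCC[C/G]TAC, so the first two test words match and the third does not.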
Another example of predicted cooperation between DNA and RNA motifs
[Figure: predicted motifs versus clusters for the 14 previously known yeast motifs (PAC, RRPE, PUF4, PUF3, MSN2/4, RAP1, RPN4, REB1, MBP1, HAP4, XBP1, BAS1, CBF1, SWI4); highlighted: RAP1 (5’), Novel (3’UTR), Novel (3’UTR)]
A gene expression map of Arabidopsis thaliana
development (Schmid et al, 2005)
• 79 different tissue samples
• 22,300 genes
• 140 clusters
Schmid et al, 2005, Nature Genetics
114 motifs in 5’ upstream regions
66 motifs in 3’UTRs
0 “motifs” when shuffling the gene labels of the clustering partition
[Figure color scale: -Log(p) for over-representation, Log(p) for under-representation]
Few of these motifs are known
Known motifs: telo-box, ABRE-like, W-box, I-box, DRE-core
Many have a non-random spatial distribution
[Figure: motif positions from -1000bp to TSS, in clusters where the motif is over-represented versus clusters where the motif is NOT over-represented]
Functional enrichments
Defense response (p<1e-32)
Ribosome (p<1e-84)
Localized to chloroplast (p<1e-35)
[Figure: motifs in 5’ upstream regions and 3’UTRs versus clusters; color scale: -Log(p) for over-representation, Log(p) for under-representation]
These two motifs are predicted to co-localize extensively
[Figure: positions from -1000bp to TSS, in clusters where the motifs are over-represented versus clusters where the motifs are NOT over-represented]
These two motifs co-localize on the DNA
Examples of tissue-specific motifs
... Collaborations with Phil Benfey (Duke), Ken Birnbaum
(NYU)
Other data-types
• ChIP-chip, e.g. HNF6 (in human islet cells) [Figure: p-values from strong binding to no binding]
• Enhancers (Drosophila): 20 Bicoid-bound vs 100 non-bound enhancers; 20 Dorsal-bound vs 100 non-bound enhancers
Binary expression variables (e.g. tissue specific expression)
[Figure: 5’ upstream regions paired with tissue-specific expression (Yes/No): tissue-specific genes versus all other genes]