Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Predicting the targets of mRNA-binding proteins
+ GeneMANIA & ISOLATE
Quaid Morris
Banting and Best Department of Medical Research,
Departments of Computer Science and Molecular Genetics
The Donnelly Centre, University of Toronto
3
Xiao Li Hilal Kazan
mRNA-binding
proteins
ISOLATE for in
silico purification of
tumour samples
Gerald Quon
Outline
Collaborators: Paul Boutros & Syed Haider (OICR), Blaise Clarke (UHN)
Ang Cui,
Amit Deshwar
(no pictures avail)
Collaborators: Tim Hughes & Deb Ray (Donnelly), Howard Lipshitz (U of T, MoGen),
Wei Jiao (U of T, MoGen)
4
GeneMANIA: gene function prediction for the masses
Gary
Bader
Sara Mostafavi David Warde-Farley Khalid Zuberi
Christian Lopes Jason Montojo Harold Rodriguez
Max Franz
Deb Ray Chris
Grouios
Farzana
Kazi
5
Other projects in the lab
Cancer networks (w/ Sara and Gerald)
GWAS & QTLs of complex disease (w/ David)
Anna Goldenberg, PhD
Hossein Radfar
microRNA target prediction
mRNA-binding proteins
Post-transcriptional regulation (PTR)
transcription
poly-adenylation
export
mRNA stability
AAAAAA
Nucleus
Cytoplasm
pre-mRNA
mature mRNA
genomic DNA
splicing vs
vs AAA
vs
localization mRNA translation rate
vs
vs
spliced mRNA
vs
C C
Patterns of mRNA localization before the MZT
Adapted from LeCuyer et al, (2008) Cell 131
Red: Nucleus
Green: mRNA
9
Recipe for modeling PTR
• Identify the trans-acting regulators:
– RNA-binding proteins and other ncRNAs
• Identify sequences recognized by ncRNAs (e.g. miRNAs):
– TargetScan, PicTar, miRanda, GenMiR, PITA, rna22, etc.
• Identify sequences recognized by RNA-binding proteins:
– RBPDB (http://rbpdb.ccbr.utoronto.ca) is a start
• Design algorithms to predict post-transcriptional fate given cis-elements and the mRNA sequence.
9
✔
✔
✔
10
RNAcompete Method
dsDNA Pool
Cleave linker/ RNA Transcription/
DNase
ssDNA
Linker Ligation
Primer Extension
PCR
Label pulldown
RNA
Hybridize To Array/ Computational
Analysis
RNA Pool
Pulldown RNA
Affinity Matrix
Epitope Tag
RBP
RNA
Agilent microarray
RNA Pool
Strip ssDNA
RNA Pulldown
Unstructured RNA
Structured RNA (8loop)
Structured RNA (7loop)
Label RNA pool
Deb Ray and Tim Hughes
Hilal Kazan
Validation of RNAcompete
RNAcompete 7mers
0 6
0
6
Log affinity from Set A (a.u.) 2 4
2
4
8 10
8
10 UUUUUUU
UUUAUUU
UUUGUUU
10 highest ratio probes
(RED indicates motif match)
AGACCUUCAUGUUCUGUUUCUUUUACUGCUUUGGUUGU
AGAAGUUUGUUUUGGUUUUUGCUUUCUCCUCGUGUGCA
AGAUUUUUUACUUUCAUGUCUAACUCACCUUUGGGGAA
AGGUUUCGUUUUGUGUACUUCUCCCUAGUUUCUUUGGC
AGGAGUUAGUUUGGUUUUACUUUGUGUCCGCCUACGGU
AGAAGUAGGGAAGUUUUUUUCGUUGAUUUAGUGGGCCU
AGAUUAUUUCUCGUUAUAAUAAUGCGUUUAGUGCCCAC
AGAUUCAGAUUUAGCACCACUGUUGUAGUAUGUUUUGU
AGAAUUGAUUUAUUUUUGUUGUCUCACCCAUCUGCGUU
AGACUGCUAACUUAUUUUUUUGUCGUCCUAACGUUGCA
AGGGCCAAUGCGGUGCAGGCGCUGCGUUGGC
AGGUUCAAACGCAUGCGGGUCAAGCUACGAACGCCACU
AGAGAUGGAUUGGGCUGGCACCGGUCCAUC
AGGGCGUUUAACGGCGGGUCCCGUUAGGCGC
AGAGAGUGCAUAGCUGGCUUCUAUGCGCUC
AGGGCCAAUGCGGUGCAGGCGCUGCGUUGGC
AGAGUGUCAGGUACAUAAAGUAGCUCAAUGAGUUCGUU
AGAGUCCAGAACGGCAGGCACGUUUUGGAC
AGAGACAGGAGCUUAUUCAUUGAUCAGCGAUCGCG
AGAGGUUGAGACGGCAGGUCCCGUCUCGGCC
(RED indicates motif match BLUE indicates intended paired bases GREEN indicates predicted paired bases)
Log
affi
nit
y fr
om
Set
B (
a.u
.)
0 15
0
15
5 10
5
10
GCUGGCC
GCUGGUC
GCGGGCC
RNAcompete 7mers
10 highest ratio probes
Vts1 HuR
Log affinity from Set A (a.u.)
Log
affi
nit
y fr
om
Set
B (
a.u
.)
FUSIP1
AAAGAAA
GAAAGAA
AAAGAAG
HuR/ELAV1
UUUGUUU
UUUUUUU
UUUAUUU
PTB
UCUUUCU
CUUUCUU
RBM4
GCGCGCG
GCGCGGG
CGCGCGG
GCGAGCG
GAGCGAG
CGAGCGG
SFRS1/SF2/ASF
SLM2 AGAUAAA
AUUAAAC
AUAAACA
U1A
UUGCACU
AUUGCAC
UUGCACG
Vts1
GCUGGUC
CGCUGGC
YB1
CCUGCGG
0
5 10 15
5
10
15
GCGGGCC
0
-2
0 2 4
0
2
4
-2 6 8
6
8
0 2 4
0
5
-2 6 8
10
10
GCCUGCG
CUGCGGU
-2
0 2 4
0
2
4
-2 6 8
6
-2
0 2 4
0
2
4
-2 6 8
6
-2
0 2 4
0
2
4
-2 6
6
0
2 4 6
2
4
6
0 8 10
8
10
-5
0 5
0
5
-5 10
10
-4
0 4
-2
0
2
-4 8
4 CUUUCCU
SetA and SetB 7-mers are highly correlated for all 9 RBPs profiled and reflect known sequence binding preferences where known.
Log affinity from Set B (a.u.)
Log
affi
nit
y fr
om
Set
A (
a.u
.)
Gerber et. 2004, PLoS Biology
RIP-chip for detecting in vivo RBP targets
YB1 RNAcompete predicts its in vivo targets
Skabkina 2005 (50.4)
PWM A (96.6) PWM B (58.6)
7mers A (98.8) 7mers B (98.7)
1-Specificity 1 0.8 0.6 0.4 0.2 0
0
Sen
siti
vity
1
0.8
0.6
0.4
0.2
UCCA(A/G)CAA Skabkina 2005
CCUGCGG GCCUGCG CUGCGGU GGUCUGC CCCUGCG
Top five 7-mers
RNAcompete PWM A
YB1 co-immunoprecipitates (cDNAs) from (Dong, J. et al. RNA Biol 2009), 95 positives, 99 negatives
SF2/ASF RNAcompete predicts its in vivo targets
1-Specificity 1 0.8 0.6 0.4 0.2 0
0
Sen
siti
vity
1
0.8
0.6
0.4
0.2
Sanford 2009
Sanford 2008
Clip-seq data from (Sanford et al Genome Research 2009) 296 positives, 314 negatives
RNAcompete PWM A
Sanford 2009 A (87.9) Sanford 2009 B (90.3) Tacke 1995 A (47.7) Tacke 1995 B (54.4) Liu 1998 A (51.6) Liu 1998 B (58.3)
PWM A (90.7) PWM B (90.6)
7mers A (91.2) 7mers B (91.5)
17
RNAcompete summary
• Measures in vitro affinity of an RBP for ~250k custom-designed 35-mer RNA sequences,
• These affinities can be summarized into highly reproducible 7-mer scores (or affinities),
• RNAcompete 7-mer scores predict in vivo targets for the RBPs that we’ve looked at.
17
Good news:
We’re doing
RNAcompete for all
Drosophila and human
RBPs!
Bad news:
RBP sequence binding
preferences alone might
not be good enough.
Most potential Puf3p sites are not Puf3p-bound
mRNAs whose
3’UTR
contains a
Puf3p site
158
244
Puf3p bound
mRNAs
mRNAs in control
but not Puf3p
bound
Puf3p RIP-chip data from Gerber et. 2004, PLoS Biology
Sequence-specific RBPs bind ssRNA
DNA: B-Form Helix
Xiao Li
Transcription
factor goes
here
RNA: A-Form Helix
22
Other RBPs recognize different RNA structures
C
N G
G
N(0-3)
VTS1
Staufen
HIV-1 TAT
A
G
A
C
U C
U
A
G
U
C
Aviv et al, Nat Struct Biol (2003), 10:614
Wickham et al, Mol Cell Biol (1999) 19:2220
Weeks et al, Cell (1992) 70:1069 Hilal Kazan
5’ 3’
1. Does the secondary structure of naked mRNA
constrain RBP binding?
2. Are mRNA secondary structure predictions made with
current tools useful?
Accessibility: Probability being unpaired in the thermodynamic ensemble
5’ 3’
Ensemble:
5’ 3’
1 0.5 0
Targ
et S
ite
Accessib
ility
mRNA
26
RNAplfold: local thermodynamic folding
mRNA transcript
Local folding windows
w
l u
Proportion of unbound sequences
(false positive rate)
Pro
port
ion o
f bound s
equences
(tru
e p
ositiv
e r
ate
)
0.2 0.4 0.6 0.8 1.0
0.15 0.10 0.05 0
Me
dia
n s
ite
acce
ssib
ility
Area under the
ROC curve: 74%
(P < 2 x 10-14,
ranksum test)
Accessibility
# of target sites
1.0
0.8
0.6
0.4
0.2
0
Target site accessibility predicts Puf3p binding
Bound
No
t b
ou
nd
Bo
un
d
Ranking by site number (#TS)
3 and 3 2 and 2
Area under curve (AUROC): 1. 0 vs. 0.5
FP rate
TP r
ate
Ranking by total accessibility (#ATS)
2.7 1.7 1.3 1.0
1 0.7
1
0.8 0.9
0.7 0.6
Bound
Unbound
#ATS
#TS
Consensus sites
RIP-chip result
0.1
0
0.9
ROC curve for predicting in vivo binding
Scoring accessibility for RBPs with >1 binding sites
0 0 1
1
#ATS Score
#TS Score
Accessibility significantly increases predictive
accuracy for 71% of RBPs tested
Pub1 Nab2
Puf3
Yll032c
Vts1 PTB
PTB PTB
Puf4
Msl5 Puf3
Pumilio
Puf4
HuR
Khd1
*
*
*
* * *
* *
* * *
*
* *
60%
Ran
kin
g b
y to
tal a
cces
sib
ility
(A
UR
OC
)
Ranking by # of target sites (AUROC)
50%
70%
80%
50% 60% 70% 80%
* indicates a significant
improvement in AUROC
(FDR < 0.05, Delong-Delong-
Clarke-Pearson test, BH
correction)
30
32
34
1. Does the secondary structure of naked mRNA
constrain RBP binding?
Yes.
2. Are mRNA secondary structure predictions made with
current tools useful?
Yes, much more useful than expected.
Also, we have developed an in silico benchmark for
testing predictions of mRNA secondary structure.
Citation: Li, Quon, Lipshitz, Morris, RNA 2010 (16)
37
38
Representing structural context by annotating bases
hairpin
loop
hairpin
loop
multiloop
external
region
internal loop
bulge loop
stem P: paired
U: unpaired
P: paired
L: loop
U: unstructured
M: miscellaneous
Representing ensembles of structures using our structural annotation (i)
…
AGACGCGCGCGUUCGCCGCGCUCGGCGCAUGC
1 UPPMPPPPPPLLLLPPMPPPPPPMPPLLLLPP
2 UUUPPMPPPPPLLLLLPPPPPMPPUUUUUUUU
3 UUUUPPPPMPPMMMPPLLLPPMPPPPPPUUUU
4 UPPPPLLLLPPPPMPPPPLLLLPPPPUUUUUU
5 UUUUUUUUPPPPMPPPPPLLLLPPPPPMPPPP
...
AGACGCGCGCGUUCGCCGCGCUCGGCGCAUGC
P: paired L: loop M: miscellaneous U: unstructured
Annotation
profile
6.1
5.6
.
.
0.5
0.1
RBP
affinities
sequences and
annotation profiles
RNAcontext inputs and outputs
RNAcontext
..GGGAUACCC..
..GGGCUACCG..
.
.
.
..CCCAUACCC..
..GGGAAACCC..
.
.
.
P
L
M
U
0
1 relative
preference
base
preference
structural
preference
RNAcontext motif model
binding site score
structural context
base identity × =
Roider et al (2006) Bioinformatics 23:134
*
*
44
46
Comparison with other motif models
Proteins RNAcontext MEMERIS MatrixREDUCE Precision
improvement
Reduction
in error
RBM4 91% 43% 63% 28% 75.7%
Fusip 53% 31% 32% 21% 30.9%
VTS1 65% 58% 56% 7% 16.7%
YB1 17% 7% 11% 6% 6.8%
SLM2 81% 49% 77% 4% 17.4%
SFRS1 (SF2/ASF)
70% 50% 66% 4% 11.8%
HuR 96% 74% 94% 2% 33.3%
PTB 69% 26% 67% 2% 6.1%
U1A 30% 27% 21% 3% 4.1%
Average precision on held-out data
Bold
indicates stat.
sig. reduction
in error
i) Predicted motifs and structural contexts ii) Relative structural
preferences
0 0.5 1
Legend
Paired
Loop
Misc
Unstructured
49
RNA-map from SFRS1 knock-down
more exclusion no change more inclusion
Alternatively spliced exon sequences
RNAcontext predicted sequence preferences
Mea
n 7
-mer
aff
init
y
% exon length
structural preference: Loop 1.0 Multi/Internal 0.5
0-10% 20-30% 40-50% 60-70% 80-90%
Change in splicing level
Summary
• is a new discriminative motif model for finding RBP binding preferences
• incorporates structural information with a new way of representing secondary structure
• recovers known sequence and structure binding preferences of most RBPs
RNAcontext:
Acknowledgements: mRNA
Morris Lab Hilal Kazan Xiao Li, Gerald Quon Timothy Hughes Debashish Ray Kate Cook Esther Chan: PWMs Shaheynoor Talukder
Benjamin Blencowe Arneet Saltzman: SF2 knockdown Sidharth Chaudhry: cloning Helpful Feedback: Craig Smibert (U Toronto) Howard Lipshitz (U Toronto) Frank Sicheri (U Toronto) Mariana Kekis (Hughes Lab) Reagent gifts: Carol Lutz: U1A (UMDNJ) Craig Smibert: Vts1 Mariano Garcia-Blanco: PTB (Duke) Kimitoshi Kohno: YB1 (U of OEH)
Howard Lipshitz
The GeneMANIA project: gene function prediction for the
masses
Genomics revolution, the bad news
• noisy,
• redundant,
• incomplete,
• mysterious,
• massive
Functional genomic datasets are:
56
How can genomic data be used in the lab?
?!?
Microarray expression profiles
Genetic interactions
CHiP-chip regulatory networks
Protein-protein interactions
Use cases:
1. What is the function of this gene?
2. Give me more genes like these.
3. Here’s a list of genes -- what does it mean?
58
Recommender systems
Google can’t do biology
62
Label propagation
MCA1
CDC48
CPR3
TDH2
GeneMANIA gene scoring via label propagation
MCA1 CDC48
CPR3
TDH2
MCA1
CDC48
CPR3
TDH2
Guilt-by-association
-1 …
……
…...
.+1
Discriminant Value
The problem of gene multi-functionality
• Gene function could be a/the:
– Biological process,
– Biochemical/molecular function,
– Subcellular/Cellular localization,
– Regulatory targets,
– Temporal expression pattern,
– Phenotypic effect of deletion.
Some networks may be better for some types of gene function than others
64
65
Three rules for network weighting
– Reliability
• Noisy networks get low weight
– Relevance
• Irrelevant networks get low weight
– Redundancy
• Redundant weights share their weights
66
68
http://www.genemania.org/plugin/
69
70
71
http://cytoscapeweb.cytoscape.org/
Data providers
+ MODs + ENSEMBL + Gene Ontology
Who?
Gary Bader Quaid Morris
Sara Mostafavi
David Warde-Farley
Ovi Comes
Khalid Zuberi
Christian Lopes
Jason Montojo
Harold Rodriguez Max Franz
Sylva Donaldson
Farzana Kazi
Principal
Investigators
Students
Developers
Outreach
GeneMANIA: the next steps
1. More data, organisms, and functions
2. Grouping functions together
3. Better handling of profiling data
ISOLATE: a tool for purifying heterogeneous tumour array data
Tumour samples are heterogeneous
A B
A)An endometrial endometrioid carcinoma with significant amounts of healthy tissue.
B) A sample from the same endometrial endometrioid carcinoma, but almost exclusively tumor.
77
78
Acknowledgements
Morrislab
Gerald Quon
Ang Cui
Khalid Zuberi
Amit Deshwar
Paul Boutros (OICR)
Syed Haider
Peter Zandstra (Donnelly)
Wendy Qiao
Rae Yeang (Sick Kids)
Blaise Clarke (UHN)
Funding
Questions?
RNA probe set / pool design
81% of all 10-mers >= 12 repeats of all 8-mers >= 64 repeats of all 7-mers
Unstructured or weakly structured sequences
Stem-loops containing loops with <= 7 bases
99 – 100% represented, some 6-loops are repeated. 10bp length stems
Stem-loops containing loops with 8 bases
59% represented 10bp length stems
Two independent 35nt probe sets (SetA and SetB), each probe set contains:
Sequence type Representation
+ ~30k replicate probes + controls on an Agilent 244k format
Now also available: 8-mer design
PTB 7mers A (74.8)
7mers B (74.6)
Perez 1997 UCUU (65.8) Perez 1997 UCUUC (62.6)
PWM A (71.8) PWM B (74.7)
Oberstrass 2005 CUNN (73.5) Oberstrass 2005 YCN (67.3) Oberstrass 2005 YCU (75.9)
7-mer scores more accurately predict transcripts that co-purify with PTB
1-Specificity
0 1 0.8 0.6 0.4 0.2 0
Sen
siti
vity
1
0.8
0.6
0.4
0.2
CUUUCUU UUUCUUU UCUUUCU UUACUUU UCUAUCU
PWM A
5 highest scoring 7-mers
PTB co-immunoprecipitates (cDNAs) from (Gama-Carvalho, M et al GB 2006)
2,951 positives, 2,052 negatives
83
84
85
86
87
88
89