The modern RNA world: computational screens for noncoding RNA genes
Eddy lab, HHMI/Washington University, Saint Louis
The human genome sequence is (almost) done
The genome, famously, is digital
1892: Miescher postulates that genetic information may be encoded in a linear form using a few different chemical units:
“...just as all the words and concepts in all languages can find expression in twenty-four to thirty letters of the alphabet.”
Symbolic texts can be cracked
“Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....” - John Chadwick, The Decipherment of Linear B, Cambridge Univ. Press, 1958
Michael Ventris and John Chadwick, 1953
The phylogenetic history of life
Comparative genome analysis
VISTA plot; I. Dubchak, E. Rubin, et al.
human, mouse, dog genomes
Estimates of human gene number
www.ensembl.org/Genesweep/
mean: 61,710
high: 153,478
low: 27,462
Want to place a bet? The book is held by the bartender at Cold Spring Harbor Laboratory.
Life with 6000 Genes
A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver
Science 274:546, 1996
but besides the ~6000 large protein-coding genes, there’s also:
• 140 ribosomal RNA genes
• 275 transfer RNA genes
• ~40 small nuclear RNA genes
• ~100 small nucleolar RNA genes
• ... and ... ?
The yeast genome completed
where “gene” = ORF of 100 amino acids or more.
Structure of the large ribosomal subunit
Haloarcula marismortui
Ban et al., Science 289:905, 2000
inside-out genes
Human UHG (U22 host gene)
no significant ORFs; not conserved with mouse; rapidly degraded
Eight intron-encoded snoRNAs
conserved with mouse; stable
Tycowski, Shu, and Steitz, Nature 379:464, 1996
An RNA motor
Simpson et al., Nature 408:745, 2000
“Structure of the bacteriophage φ29 DNA packaging motor”
Cartilage-hair hypoplasia mapped to an RNA
M. Ridanpää et al., Cell 104:195, 2001
RMRP: Human RNase MRP, 267 nt
microRNAs (miRNAs) in metazoa
~22-mer processed from ~70-mer precursor by the RNAi pathway
lin-4 acts as a translational repressor by binding the 3’ UTR
T. Tuschl; D. Bartel; V. Ambros
RNA genes can be hard to detect
UGAGGUAGUAGGUUGUAUAGU
C. elegans let-7; 21 nt
Pasquinelli et al., Nature 408:86, 2000
• often small
• sometimes multicopy and redundant
• often not polyadenylated (and remember EST libraries are poly-A selected)
• immune to frameshift and nonsense mutation
• no open reading frame or codon bias
• relatively little information in primary sequence consensus
Two computational analysis problems
1. Similarity search (e.g. BLAST): I give you a query; you find sequences in a database that look like the query.
For RNA, you want to take the secondary structure of the query into account.
2. Genefinding (e.g. GENSCAN): Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence.
For RNA – with no open reading frame and no codon bias – what do you look for?
RNA structure: nested pairwise correlations
Context-free grammars
Noam Chomsky, 1956
a CFG “derivation”
Basic CFG “production rules”
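A minimal sketch of what a CFG derivation for a nested RNA structure can look like. The grammar, the rule names, and the `derive` helper are illustrative assumptions, not the rules shown on the original slide. Lowercase letters are emitted bases (terminals) and `S` is the single nonterminal; pair rules emit two correlated bases at once, which is how a grammar captures nested pairwise correlations.

```python
# Toy context-free grammar for a hairpin (hypothetical rules, for illustration).
# Each rule rewrites the nonterminal "S"; lowercase letters are emitted bases.
RULES = {
    "pair_gc": "gSc",     # S -> g S c : emit a G-C base pair around S
    "pair_au": "aSu",     # S -> a S u : emit an A-U base pair around S
    "unpaired_a": "aS",   # S -> a S   : emit one unpaired A to the left of S
    "end": "",            # S -> (empty): terminate the derivation
}


def derive(rule_names):
    """Apply the named rules in order; return every intermediate string."""
    form = "S"
    steps = [form]
    for name in rule_names:
        if "S" not in form:
            raise ValueError("derivation already finished before rule " + name)
        # Rewrite the (only) nonterminal with the right-hand side of the rule.
        form = form.replace("S", RULES[name], 1)
        steps.append(form)
    return steps


steps = derive(["pair_gc", "pair_au", "unpaired_a", "unpaired_a", "unpaired_a", "end"])
print(" -> ".join(step if step else "(empty)" for step in steps))
```

The final string is a two-pair stem closing a three-base loop; the two ends of each pair are produced by the same rule application, so they are never generated independently.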
Sequence vs. secondary structure alignment
R Durbin, SR Eddy, GJ Mitchison, A Krogh
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Cambridge Univ. Press, 1998
Goal | HMM algorithm (sequence) | SCFG algorithm (structure)
optimal alignment | Viterbi | CYK
P(sequence | model) | Forward | Inside
EM parameter estimation | Forward-Backward | Inside-Outside
memory complexity | O(MN) | O(MN^2)
time complexity (general) | O(M^2 N) | O(M^3 N^3)
time complexity (as used) | O(MN) | O(MN^3)
• we can analyze target sequences with secondary structure models;
• but the algorithms are computationally expensive.
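To make the cost concrete, here is a small sketch of the Nussinov base-pair maximization recursion. It is a stand-in, not the SCFG CYK algorithm from the talk: it has no probabilities and no model states (no M term). It does share the same shape, filling a table over all subsequences (i, j) and trying every split point, so it needs memory quadratic and time cubic in the sequence length N. The function name and the `min_loop` default are assumptions for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    seq is assumed to be uppercase RNA (A, C, G, U). min_loop is the minimum
    number of unpaired bases enclosed by a hairpin-closing pair.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = max pairs in seq[i..j]; O(N^2) memory.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # increasing subsequence length
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j left unpaired
            if (seq[i], seq[j]) in allowed:               # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # bifurcation: O(N) splits
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

The innermost loop over split points `k` is what pushes the time from quadratic to cubic; an SCFG parser pays the same price, multiplied by the size of the model.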
SCFG-based RNA similarity search
C/D methylation guide snoRNA consensus:
Graphical model, prior to conversion to probabilistic model:
the program snoscan was used to detect C/D snoRNA homologues in Archaea;Omer et al., Science 288:517-522, 2000
SCFGs for RNA folding
Full SCFG analogue of Michael Zuker’s minimum energy RNA folding –
means we can apply statistical models to any RNA structure (e.g., what’s the probability that this is a plausible RNA structure?)
Elena Rivas and S.R. Eddy, Bioinformatics 16:573, 2000
Genefinding by comparative analysis
Jonathan Badger, Gary Olsen: CRITICA, Mol. Biol. Evol. 16:512, 1999
The OTHER model:
score with terms P(a,b | OTH)
models divergence only
The CODING model:
score with terms P(aaa,bbb | COD)
models divergence, constrained by amino acid substitution matrix and codon bias
Most comparative analysis relies just on differential rates of evolution. However, the pattern of mutation is also informative.
add: a comparative model of structural RNAs
The RNA model:
terms: P(a-a’, b-b’ | RNA)
models DNA divergence constrained by a secondary structure
Elena Rivas, S.R. Eddy: QRNA, BMC Bioinformatics 2:8, 2001
Some technical issues
- The structure is unknown; must do ensemble averaging.
- model must deal with gapped alignments.
- bounds of conservation or alignment don’t correspond to bounds of RNA.
- evolutionary divergence times of the three models must be the same.
We use a form of probabilistic model called “pair-SCFGs”.
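A hedged sketch of how an alignment might be assigned to one of the three models once each model has produced a likelihood. It assumes natural-log likelihoods and equal prior weight on the models unless priors are supplied; the function name and the example numbers are invented for illustration and are not QRNA's actual scoring code or output.

```python
import math


def model_posteriors(log_likelihoods, priors=None):
    """Posterior probability of each model given log P(alignment | model).

    log_likelihoods: dict such as {"OTH": ..., "COD": ..., "RNA": ...} (natural logs).
    priors: optional dict of prior probabilities; uniform if omitted (assumption).
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    joint = {name: log_likelihoods[name] + math.log(priors[name]) for name in names}
    # Log-sum-exp keeps the normalization stable for very negative scores.
    top = max(joint.values())
    log_total = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {name: math.exp(joint[name] - log_total) for name in names}


# Hypothetical scores for one pairwise alignment window.
posteriors = model_posteriors({"OTH": -120.0, "COD": -118.0, "RNA": -110.0})
winner = max(posteriors, key=posteriors.get)
print(winner, round(posteriors[winner], 4))
```

Comparing the models on the same alignment is what lets the pattern of mutation, not just the amount of conservation, drive the call.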
Three models – examples of their scores
A screen for novel ncRNAs in E. coli
Elena Rivas et al., Curr Biol 11:1369, 2001
2367 E. coli intergenic sequences >50 nt in length
WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity
QRNA classified:
556 candidate RNA loci
160 candidate small ORFs (not examined further)
281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements
leaves 275 candidate ncRNA gene loci
Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
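The alignment filter at the start of this screen can be sketched as below. The thresholds (E < 0.01, length > 50 nt, identity > 65%) are the ones stated above; the `Alignment` record and its field names are hypothetical and do not correspond to any particular BLAST output parser.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One pairwise alignment hit (hypothetical record layout)."""
    e_value: float
    length: int               # aligned length in nucleotides
    percent_identity: float   # 0-100


def passes_screen(hit, max_e=0.01, min_length=50, min_identity=65.0):
    """True if the hit meets all three cutoffs used to select QRNA input."""
    return (
        hit.e_value < max_e
        and hit.length > min_length
        and hit.percent_identity > min_identity
    )


hits = [
    Alignment(e_value=1e-10, length=120, percent_identity=82.0),  # kept
    Alignment(e_value=0.5, length=120, percent_identity=82.0),    # E-value too high
    Alignment(e_value=1e-10, length=40, percent_identity=82.0),   # too short
    Alignment(e_value=1e-10, length=120, percent_identity=60.0),  # too divergent
]
kept = [hit for hit in hits if passes_screen(hit)]
print(len(kept))
```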
Northern blots confirming E. coli RNAs
The Altuvia screen
“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”
Argaman et al., Current Biology 11:941, 2001
“Novel small RNA-encoding genes in the intergenic regions of E. coli”
sraA 120 nt
sraB 149-168 nt
rprA 105 nt
sraC 234-249 nt
sraD 70 nt
gcvB 205 nt
sraE 88 nt
sraF 189 nt
sraG 146-174 nt
sraH 88-108 nt
sraI 91-94 nt
sraJ 172 nt
sraK 245 nt
sraL 140 nt
• start w/ “intergenic” regions
• computational identification of putative promoter and terminator, 50-400 nt apart
• select regions conserved with other bacteria by BLAST
The Gottesman screen
Wassarman et al., Genes Dev. 15:1637, 2001
“Identification of novel small RNAs using comparative genomics and microarrays”
rydB 60 nt
ryeE 86 nt
ryfA 320 nt
ryhA 45 nt (sraH)
ryhB 90 nt (sraI)
ryiA 210 nt
ryjA 92 nt
rybB 80 nt
ryiB 270 nt (sraK, csrC)
rybA 205 nt
rygA 89 nt (sraE)
rygB 83 nt
ryeA 275 nt
ryeB 100 nt
ryeC 107, 143 nt
ryeD 102, 137 nt
rygC 107, 139 nt
• intergenic regions >= 180 nt
• conserved w/ other bacteria by BLAST
• manual inspection of location & sequence
• expression detected on high-density oligo probe array
“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”
Summary of three E. coli screens
31 different new RNAs found and confirmed by the three screens:
Altuvia: 14
Gottesman: 19 (1 showed no expression; 1 untested)
Rivas: 22 (1 showed no expression; 10 untested)
Conclusions: Sensitivity of QRNA is respectable; most E. coli ncRNAs conserve secondary structure
Only 4/11 of our confirmed ncRNAs are among the genes found by the Altuvia or Gottesman screens
Conclusions: These screens have not saturated E. coli for new ncRNAs; We have >200 other candidates in testing; We have confirmed transcripts as short as 40 nt; The functions of these RNAs are unknown.
Pyrococcus: three hyperthermophile genomes
A “black smoker” – deep sea hydrothermal vent
photo: American Natural History Museum
• P. horikoshii: 1.8 Mb, complete; isolated off Okinawa, 1400 m depth; Kawarabayasi et al. (NITE, Tokyo)
• P. furiosus: 1.9 Mb, complete; from Vulcano Island, Italy; Robb et al. (Utah Genome Center)
• P. abyssi: 1.8 Mb, complete; from South Pacific vent, 3500 m depth; Genoscope (France)
G/C composition detects RNAs in Pyrococcus
RNAs stand out in AT-rich hyperthermophiles
Organism | growth temp (C) | %GC (genome) | %GC (RNA) | %RNA - %genome | % known RNAs detected
Methanococcus | 85 | 31% | 67% | 36% | 97%
Pyrococcus | 98 | 42% | 71% | 29% | 52%
Borrelia | 37 | 29% | 54% | 25% | 29%
Aquifex | 90 | 44% | 68% | 24% | 14%
Archaeoglobus | 83 | 48% | 68% | 20% | 2%
S. cerevisiae | 30 | 38% | 54% | 16% | 0
E. coli | 37 | 51% | 59% | 8% | 0
The G/C computational screen
Implemented as a 2-state hidden Markov model, using Viterbi or posterior decoding algorithms.
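A minimal sketch of a 2-state HMM with Viterbi decoding for this kind of screen. Every parameter here (the G+C fraction of each state, the self-transition probability, the uniform start) is an assumption chosen for illustration, not a value from the actual screen. State `R` stands for a GC-rich, RNA-like region and `B` for the AT-rich genomic background.

```python
import math


def viterbi_gc(seq, gc_rich=0.70, gc_background=0.40, stay=0.99):
    """Label each base 'R' (GC-rich) or 'B' (background) by Viterbi decoding.

    seq is assumed to be uppercase DNA; any base other than G or C is treated
    as A/T. All default parameters are illustrative guesses.
    """
    states = ("B", "R")
    p_gc = {"B": gc_background, "R": gc_rich}

    def log_emit(state, base):
        # G and C split the state's G+C mass evenly; A and T split the rest.
        prob = p_gc[state] / 2 if base in "GC" else (1 - p_gc[state]) / 2
        return math.log(prob)

    def log_trans(prev, cur):
        return math.log(stay if prev == cur else 1 - stay)

    if not seq:
        return ""
    score = {s: math.log(0.5) + log_emit(s, seq[0]) for s in states}
    back = []  # back[t][cur] = best previous state for position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda s: score[s] + log_trans(s, cur))
            new_score[cur] = score[prev] + log_trans(prev, cur) + log_emit(cur, base)
            pointers[cur] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


labels = viterbi_gc("AT" * 50 + "GC" * 50 + "AT" * 50)
print(labels.count("R"), "bases labelled GC-rich")
```

The high self-transition probability is what keeps a short run of G/C from being reported as a region; posterior decoding, the alternative named above, would instead report a per-base probability of being in the GC-rich state.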
Methanococcus jannaschii: (Viterbi parse alone)
43 regions detected (some span multiple RNAs)
includes 36/37 tRNAs; SSU and LSU rRNA; 5S, 7S, RNase P.
9 unassigned candidates.
4/9 express small RNAs detectable on Northern.
Pyrococcus furiosus: (posterior decoding, plus conservation w. P.a., P.h.)
51 regions detected (some span multiple RNAs)
includes 46/46 tRNAs, SSU and LSU rRNA; 2 5S, 7S, and RNase P.
8 unassigned candidates.
4/8 express small RNAs detectable on Northern.
Robbie Klein et al., manuscript submitted
Pyrococcus genome comparisons
Comparison of G/C to QRNA screen
Robbie Klein et al., PNAS, in press
 | G/C screen | QRNA screen | Both
Candidate loci: | 51 | 73 | n.d.
known tRNAs detected (of 46): | 46 | 45 | 45
novel loci: | 8 | 17 | 4
Confirmed by Northern: | 4 | 4 | 3
• Like the E. coli screen, about 25% of QRNA candidates were confirmed by Northern (again in a single growth condition only).
• QRNA is detecting most novel structural RNA genes.
P. furiosus – screened by QRNA by comparison to P. horikoshii, P. abyssi
Archaeal RNA Northerns
human/mouse ncRNA detection
the cartilage-hair hypoplasia region:
QRNA is a general genefinder for structural ncRNA genes.
The ancient RNA World
Gesteland, Cech, Atkins: The RNA World, CSHL Press, 1999
RNA is very good at recognizing RNA
Ha, Wightman, Ruvkun; Genes Dev. 10:3041, 1996
A closing idea: The modern RNA world
Hypothesis: When a cell needs a molecule that specifically recognizes a target RNA molecule, and the function is either:
- catalytically unsophisticated
- something that can be abstracted onto a shared protein (e.g. many guide snoRNAs, one methylase)
then RNA may be the material of choice. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.
In fact, an old idea...
Jacob and Monod, JMB 3:318, 1961
Summary
• There appear to be many noncoding RNA genes.
• Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context free grammar algorithms.
• Methods to find novel RNAs by de novo genefinding have finally become possible, for instance by using comparative genome analysis.
[SR Eddy, Nature Reviews Genetics, 2:919, 2001]
[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998]
[E Rivas, RJ Klein, TA Jones, SR Eddy, Curr Biol 11:1369, 2001; E Rivas, SR Eddy, BMC Bioinformatics 2:8, 2001]
Acknowledgements
the Eddy lab: http://www.genetics.wustl.edu/eddy/
senior scientist: Elena Rivas
students: Zhirong Bao, Christian Zmasek, Robin Dowell, Robbie Klein, Steve Johnson, Shawn Stricklin, John McCutcheon
systems: Goran Ceric
webmaster: Ajay Khanna
wet lab: Ziva Misulovin
secret agent man: Tom Jones
funding: HHMI, NIH NHGRI, NSF, Monsanto