GeneSweep; or, Ewan’s definition of a gene
http://www.ensembl.org/Genesweep
Footnote 1.
“We are restricting ourselves to protein coding genes to allow an effective assessment. RNA genes were considered too difficult to assess by 2003.”
Rule 2.
“A gene is a set of connected transcripts.... A transcript is a set of exons.... one transcript must encode a protein (see footnotes).”
Life with 6000 Genes
A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston,
E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver
Science 274:546-567, 1996
but besides the ~6000 protein-coding genes, there’s also:
140 ribosomal RNA genes,
275 transfer RNA genes,
40 small nuclear RNA genes,
~100 small nucleolar RNA genes,
... ?
life with 6000 genes
Structure of the large ribosomal subunit
Haloarcula marismortui
Ban et al., Science 289:905-920, 2000
inside-out genes?
Human UHG (U22 host gene)
no significant ORFs; not conserved with mouse; rapidly degraded
Eight intron-encoded snoRNAs
conserved with mouse; stable
Tycowski, Shu, and Steitz
Nature 379:464-466, 1996
pRNA in φ29 rotary packaging motor
Simpson et al., Nature 408:745-750, 2000
“Structure of the bacteriophage φ29 DNA packaging motor”
Cartilage-hair hypoplasia mapped to an RNA
Ridanpaa et al., Cell 104:195-203, 2001
RMRP: Human RNase MRP, 267 nt
RNA genes can be hard to detect
UGAGGUAGUAGGUUGUAUAGU
C. elegans let-7; 21 nt
Pasquinelli et al., Nature 408:86-89, 2000
• often small
• sometimes multicopy and redundant
• often not polyadenylated (and remember EST libraries are poly-A selected)
• immune to frameshift and nonsense mutation
• no open reading frame or codon bias
• often rapidly evolving in primary sequence
The Altuvia screen
“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”
Argaman et al., Current Biology 11:941-50, 2001
“Novel small RNA-encoding genes in the intergenic regions of E. coli”
sraA 120 nt
sraB 149-168 nt
rprA 105 nt
sraC 234-249 nt
sraD 70 nt
gcvB 205 nt
sraE 88 nt
sraF 189 nt
sraG 146-174 nt
sraH 88-108 nt
sraI 91-94 nt
sraJ 172 nt
sraK 245 nt
sraL 140 nt
• start w/ “intergenic” regions
• computational identification of putative promoter and terminator, 50-400 nt apart
• select regions conserved with other bacteria by BLAST
The Gottesman screen
Wassarman et al., Genes Dev. 15:1637-51, 2001
“Identification of novel small RNAs using comparative genomics and microarrays”
rydB 60 nt
ryeE 86 nt
ryfA 320 nt
ryhA 45 nt (sraH)
ryhB 90 nt (sraI)
ryiA 210 nt
ryjA 92 nt
rybB 80 nt
ryiB 270 nt (sraK, csrC)
rybA 205 nt
rygA 89 nt (sraE)
rygB 83 nt
ryeA 275 nt
ryeB 100 nt
ryeC 107,143 nt
ryeD 102,137 nt
rygC 107,139 nt
• intergenic regions >= 180 nt
• conserved w/ other bacteria by BLAST
• manual inspection of location & sequence
• expression detected on high-density oligo probe array
“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”
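The first step of both screens, collecting intergenic regions above a length cutoff, can be sketched roughly as follows. This is only an illustration: the function name, the 1-based inclusive coordinates, and the toy gene list are assumptions, and the sketch ignores strand and the circularity of the E. coli chromosome. The 180 nt default is the cutoff quoted for the Gottesman screen.

```python
def intergenic_regions(genes, genome_length, min_length=180):
    """Return (start, end) gaps between annotated genes.

    genes: iterable of (start, end) pairs, 1-based inclusive (assumed layout).
    Overlapping genes are handled by tracking the furthest end seen so far.
    """
    regions = []
    prev_end = 0
    for start, end in sorted(genes):
        gap = start - prev_end - 1
        if gap >= min_length:
            regions.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    # Trailing gap after the last gene (linear genome assumed).
    if genome_length - prev_end >= min_length:
        regions.append((prev_end + 1, genome_length))
    return regions


# Hypothetical annotation: two overlapping genes, then a third.
genes = [(100, 400), (350, 900), (1200, 1500)]
regions = intergenic_regions(genes, genome_length=1600)
print(regions)
```

Only the gap between the second and third gene is long enough here; the later steps of each screen (conservation by BLAST, promoter/terminator prediction, array expression) would then be applied to regions like this one.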
Two computational analysis problems
1. Similarity search (e.g. BLAST): I give you a query; you find sequences in a database that look like the query.
For RNA, you want to take the secondary structure of the query into account.
2. Genefinding (e.g. GENSCAN): Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence.
For RNA – with no open reading frame and no codon bias – what do you look for?
Context-free grammars
pioneered in comp bio by David Searls
a CFG “derivation”
Basic CFG “production rules”
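The original figures for these two captions are not in the text, so here is a small stand-in: a sketch of a CFG derivation that generates an RNA stem-loop. The rule set (pair rules such as S → gSc, single-sided emissions, and an end rule S → ε) is the usual textbook style of RNA grammar and is an assumption here, not a copy of the slide.

```python
NUCS = "acgu"
PAIRS = [("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")]

# Right-hand sides allowed for the single nonterminal S (assumed rule set).
RULES = (
    {x + "S" + y for x, y in PAIRS}   # S -> x S y : emit a base pair
    | {x + "S" for x in NUCS}         # S -> x S   : emit an unpaired base on the left
    | {"S" + x for x in NUCS}         # S -> S x   : emit an unpaired base on the right
    | {""}                            # S -> epsilon : end
)


def derive(steps, start="S"):
    """Rewrite the leftmost S with each right-hand side in turn.

    Returns every intermediate string of the derivation.
    """
    forms = [start]
    current = start
    for rhs in steps:
        if rhs not in RULES:
            raise ValueError(f"not a production of this grammar: S -> {rhs!r}")
        i = current.index("S")
        current = current[:i] + rhs + current[i + 1:]
        forms.append(current)
    return forms


# Three nested pairs, then a four-base loop, then the end rule.
forms = derive(["gSc", "gSc", "aSu", "gS", "aS", "aS", "aS", ""])
print(" => ".join(forms))
```

The nesting of the pair rules is what lets the grammar express base pairing: each `xSy` step wraps a complementary pair around whatever S generates later, which a regular grammar or HMM cannot do.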
HMM and SCFG algorithms
R Durbin, SR Eddy, GJ Mitchison, A Krogh
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
Cambridge Univ. Press, 1998
Goal                        HMM algorithm      SCFG algorithm
optimal alignment           Viterbi            CYK
P(sequence | model)         Forward            Inside
EM parameter estimation     Forward-Backward   Inside-Outside
memory complexity:          O(MN)              O(MN^2)
time complexity (general):  O(M^2 N)           O(M^3 N^3)
time complexity (as used):  O(MN)              O(MN^3)
• we can analyze target sequences with secondary structure models;
• but the algorithms are computationally expensive.
SCFG-based RNA similarity search
COVE (Eddy and Durbin, 1994): structural profiles of RNA sequence families
tRNAscan-SE (Lowe and Eddy, 1997): fast prescreens + COVE model of tRNA – large scale tRNA detection
snoscan (Lowe and Eddy, 1998): C/D box small nucleolar RNA detection in yeast genome
snoRNAs detected in Archaea (Omer et al. 2000): C/D box snoRNA homologues detected in many Archaea; collaboration with Pat Dennis’ lab at UBC Vancouver
FOLDALIGN (Jan Gorodkin) – automatic recognition and alignment of consensus secondary structure elements
also Y. Sakakibara, F. Lefebvre, B. Knudsen, I. Holmes, others...
analogous to profile HMMs (from A. Krogh, D. Haussler): “profile SCFGs”, or “covariance models”
SCFGs for RNA folding
• Minimum energy RNA folding by dynamic programming – Michael Zuker
• Partition function calculations (weighted summations over ensemble of all structures) – J. McCaskill
• SCFG analogue of the Zuker program; maximum likelihood folds by the CYK algorithm; summations by the Inside algorithm – E. Rivas and S.R. Eddy, 2000
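To make the CYK / Inside distinction concrete, here is a minimal sketch on a deliberately tiny SCFG. It is not the Rivas and Eddy grammar: the single-nonterminal grammar, the rule probabilities, and the function names are all assumptions chosen so the example stays short. The grammar has no bifurcation rule, so it can only describe a single hairpin-like nest of pairs, and the fill is O(N^2) rather than the O(N^3) needed for a general RNA grammar.

```python
# Toy SCFG with one nonterminal S (all probabilities are illustrative):
#   S -> x S y   for canonical pairs (x, y), total mass 0.40
#   S -> x S     unpaired base on the left,  total mass 0.25
#   S -> S x     unpaired base on the right, total mass 0.25
#   S -> epsilon                             mass 0.10
CANONICAL = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}
P_PAIR = 0.40 / len(CANONICAL)
P_LEFT = 0.25 / 4
P_RIGHT = 0.25 / 4
P_END = 0.10


def fill(seq, combine):
    """Dynamic programming over subsequences seq[i:j].

    combine=sum gives the Inside algorithm (total probability over all parses);
    combine=max gives CYK (probability of the single best parse).
    """
    n = len(seq)
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][i] = P_END                      # empty span: S -> epsilon
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            terms = [
                P_LEFT * table[i + 1][j],        # seq[i] unpaired
                P_RIGHT * table[i][j - 1],       # seq[j-1] unpaired
            ]
            if span >= 2 and (seq[i], seq[j - 1]) in CANONICAL:
                terms.append(P_PAIR * table[i + 1][j - 1])  # seq[i] pairs with seq[j-1]
            table[i][j] = combine(terms)
    return table[0][n]


def inside(seq):
    return fill(seq, sum)


def cyk(seq):
    return fill(seq, max)


hairpin = "ggagaaaucc"
print(inside(hairpin), cyk(hairpin))
```

The two algorithms share one recursion and differ only in replacing a sum with a max, which mirrors the Forward / Viterbi relationship for HMMs in the table above.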
Genefinding by comparative analysis
Jonathan Badger, Gary Olsen: CRITICA
The OTHER model:
score with terms P(a,b | OTH)
models divergence only
the CODING model:
score with terms P(aaa,bbb | COD)
models divergence, constrained by amino acid substitution matrix and codon bias
Most comparative analysis relies just on differential rates of evolution.
However, the pattern of mutation is also informative.
add: a comparative model of structural RNAs
The RNA model:
terms: P(a-a’, b-b’ | RNA)
models DNA divergence constrained by a secondary structure
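A toy calculation may help show why the pattern of mutation is informative. The sketch below scores one pair of aligned columns, assumed to be base-paired, under a pair-aware RNA term and under the position-independent OTH term, and reports the log-odds. Every number in it (the identity level, the weights for conserved, compensated, and broken pairs) is an invented illustration, not a QRNA parameter, and the real model sums over unknown structures rather than assuming the pairing.

```python
import math
from itertools import product

NUCS = "ACGU"
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def p_oth(a, b, identity=0.7):
    """OTH term P(a,b | OTH): one aligned column, no structure (toy values)."""
    return 0.25 * (identity if a == b else (1 - identity) / 3)


def p_first_pair(a, a2, pair_mass=0.9):
    """Probability of the two paired residues in sequence 1 (toy values)."""
    if (a, a2) in CANONICAL:
        return pair_mass / 6          # 6 canonical pairs share pair_mass
    return (1 - pair_mass) / 10       # 10 non-canonical pairs share the rest


def p_second_pair(b, b2, a, a2, stay=20.0, compensated=4.0, broken=1.0):
    """Probability of the aligned residues in sequence 2, given sequence 1's pair."""
    def weight(x, y):
        if (x, y) == (a, a2):
            return stay
        return compensated if (x, y) in CANONICAL else broken

    total = sum(weight(x, y) for x, y in product(NUCS, repeat=2))
    return weight(b, b2) / total


def log_odds_rna_vs_oth(a, a2, b, b2):
    """log2 of P(a-a', b-b' | RNA) over P(a,b | OTH) * P(a',b' | OTH)."""
    p_rna = p_first_pair(a, a2) * p_second_pair(b, b2, a, a2)
    p_null = p_oth(a, b) * p_oth(a2, b2)
    return math.log2(p_rna / p_null)


compensatory = log_odds_rna_vs_oth("G", "C", "A", "U")  # G-C became A-U: pair kept
single = log_odds_rna_vs_oth("G", "C", "A", "C")        # G-C became A-C: pair broken
print(compensatory, single)
```

Under these toy values a compensatory double change favors the RNA term even though it looks like more divergence than the single change, which is the kind of signal a rate-only comparison would miss.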
Technical issues
- The structure is unknown; must do weighted sum over all possible structures. We extended an SCFG model of RNA folding (Rivas and Eddy, 2000) to a pair-SCFG, and we use an Inside algorithm to score it.
- model must deal with gapped alignments that are heterogeneous w.r.t. models – e.g. BLAST may align beyond the edge of the real RNA. We use pair-grammar formalisms for all three models, and include flanking models of conserved nonstructured alignment.
- though we want to sum over all structures, we also want to recover maximum likelihood start/end points of an RNA within a longer alignment. We use the generalized HMM parsing trick introduced by Stormo and Haussler (aka “semi Markov models” in Burge’s GENSCAN), and treat our RNA model as an i,j feature score in a generalized HMM.
- divergence times of the three models must be the same. We tie all model parameters to a choice of amino acid substitution matrix.
A screen for novel ncRNAs in E. coli
E. Rivas, R. Klein, T. Jones, S.R. Eddy, submitted
2367 E. coli intergenic sequences >50 nt in length
WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae
gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity
QRNA classified:
556 candidate RNA loci
160 candidate small ORFs (not examined further)
281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements
leaves 275 candidate ncRNA gene loci
Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
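The alignment filter at the start of this screen (E<0.01, length >50 nt, >65% identity) can be sketched as a simple predicate. The `Hit` record, its field names, and the example hits are hypothetical; a real pipeline would parse these values from the WUBLASTN output, whose format is not described in the slides.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise alignment (hypothetical record layout)."""
    query: str        # E. coli intergenic region identifier
    subject: str      # comparison genome
    evalue: float
    length: int       # alignment length in nt
    identity: float   # percent identity


def passes(hit, max_evalue=0.01, min_length=50, min_identity=65.0):
    """Apply the three thresholds quoted for the screen."""
    return (
        hit.evalue < max_evalue
        and hit.length > min_length
        and hit.identity > min_identity
    )


# Invented example hits, one failing each threshold in turn.
hits = [
    Hit("ig1", "S. typhi", 1e-10, 120, 82.0),       # passes
    Hit("ig1", "K. pneumoniae", 0.5, 120, 82.0),    # E-value too high
    Hit("ig2", "S. typhi", 1e-5, 40, 90.0),         # too short
    Hit("ig3", "S. paratyphi", 1e-4, 75, 60.0),     # identity too low
]
kept = [h for h in hits if passes(h)]
print(kept)
```

The surviving alignments are what would be handed to QRNA for classification as RNA, coding, or other.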
Summary of three E. coli screens
10/14 of the RNAs found by the Altuvia screen are in QRNA candidate list
3 are just below 5 bit cutoff; one (sraI) completely missed
14/17 of the RNAs found by the Gottesman screen are in candidate list
2 are just below cutoff; 1 was thrown out mistakenly (QRNA found it, we thought it was just a terminator)
Conclusions: Sensitivity of QRNA is respectable; most E. coli ncRNAs conserve secondary structure
Only 4/11 of our confirmed ncRNAs are in the Altuvia or Gottesman genes
Conclusions: These screens have not saturated E. coli for new ncRNAs; A total of 34 new ncRNAs confirmed. We have >200 other candidates in testing; We have confirmed transcripts as short as 40 nt; The functions of these RNAs are unknown.
human/mouse ncRNA detection
the cartilage-hair hypoplasia region:
QRNA is a general genefinder for structural ncRNA genes.
RNA is very good at recognizing RNA
RA Lease & M Belfort, PNAS 97:9919-24, 2000
“A trans-acting RNA as a control switch in Escherichia coli...”
A closing idea: The modern RNA world
Hypothesis:
When a cell needs to make a molecule X that specifically recognizes a target RNA molecule, and the function of X is either:
- catalytically unsophisticated (e.g. steric repression of translation); or
- something that can be abstracted onto a single protein (e.g. many guide snoRNAs, one catalytic methylase)
then RNA may be the material of choice. Small, highly specific complementary RNAs can be generated by simply duplicating part of the antisense strand of the target RNA. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.
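The "duplicate part of the antisense strand" step is cheap to express in code, which is part of the argument: a complementary RNA is just the reverse complement of a stretch of the target. A minimal sketch, with the function name as an assumption and the let-7 sequence from the earlier slide reused purely as an arbitrary 21 nt example string:

```python
# Watson-Crick complement for RNA; G-U wobble pairs are ignored in this sketch.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense(rna):
    """Return the reverse complement of an RNA sequence, written 5' to 3'."""
    return rna.upper().translate(COMPLEMENT)[::-1]


target = "UGAGGUAGUAGGUUGUAUAGU"   # arbitrary example target (let-7 sequence from the slides)
guide = antisense(target)
print(guide)
```

Applying `antisense` twice returns the original sequence, so the same operation describes both the target and the RNA that recognizes it.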
Summary
• Noncoding RNAs are genes too.
• Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context free grammar algorithms.
• Methods to find novel RNAs by de novo genefinding are our current aim. Two different screens detect new structural RNAs:
- a simple GC screen in AT-rich hyperthermophile genomes;
- QRNA, an RNA genefinder using comparative sequence analysis.
[SR Eddy, Curr Opin Genet Dev 9:965, 1999]
[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998]
[RJ Klein, Z Misulovin, SR Eddy, in preparation]
[E Rivas, RJ Klein, TA Jones, SR Eddy, submitted]
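The "simple GC screen" named in the summary is not described further in the slides. One plausible reading, offered only as a sketch, is a sliding-window scan for unusually GC-rich stretches in an otherwise AT-rich genome. The window size, step, threshold, and function names below are all assumptions.

```python
def gc_fraction(seq):
    """Fraction of G or C residues in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_rich_windows(genome, window=100, step=50, threshold=0.6):
    """Return (start, gc) for each window whose GC fraction reaches threshold.

    start is a 0-based offset; sequences shorter than one window yield nothing.
    """
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if gc >= threshold:
            hits.append((start, gc))
    return hits


# Synthetic AT-rich "genome" with one 100 nt GC-rich island at offset 200.
genome = "AT" * 100 + "GC" * 50 + "AT" * 100
hits = gc_rich_windows(genome)
print(hits)
```

Overlapping windows that only partly cover the island fall below the threshold here, so only the window aligned with it is reported; a real screen would presumably merge adjacent hits and then test the candidates experimentally.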