Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Evidence of abundant stop codon readthrough in Drosophila and other metazoa Irwin Jungreis1,2, Michael F. Lin1,2, Clara S. Chan3, Manolis Kellis1,2,4 1MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA; 2Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA; 3MIT Biology Department, Cambridge, Massachusetts 02139, USA.
4Corresponding author. E-mail [email protected]; fax 617-452-5034
Running Title: Abundant stop codon readthrough in fly and metazoa Keywords: Translational read-through; termination codon suppression; comparative genomics; regulation; evolution; UTRs; editing; recoding; dicistronic; selenocystein SECIS. Abstract: While translational stop codon readthrough is often utilized by viral genomes it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting coding regions, and new comparative information from additional species. We report an expanded set of 283 readthrough candidates, including 16 double readthroughs, that were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report mass spectrometry evidence of translation for at least 9 readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem loops, providing clues about their regulation and potential mechanisms of readthrough. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in mosquito and silkworm, and several readthrough genes in bee, beetle, nematode, and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized. Supplementary Material: Supplementary Figures S1-S6 Supplementary Text S1-S12 Supplementary File DmelReadthroughCandidates.csv containing list of candidate readthrough
transcripts with information about the readthrough region and indication of double readthrough, readthrough in other insects, and peptide matches.
Abundant stop codon readthrough in fly and metazoa
2
Introduction:
While the central dogma of molecular biology, DNA makes RNA makes protein, has elegantly
described the basic information transfer of almost every living cell, we are continuing to discover
fascinating ways nature has found to regulate these processes. From chromatin regulation to
transcription initiation and elongation, splicing, export, degradation, translation initiation and
elongation, and post-translational modifications, each layer of regulation has been optimized for specific
situations and conditions.
Translational stop codon readthrough, also known as stop codon suppression, provides a compelling
regulatory mechanism for exposing additional C-terminal domains of a protein at a lower abundance,
which in contrast to alternative splicing can be regulated at the translational level. Viruses use this
mechanism not only to increase functional versatility in a compact genome but also to control the ratio
of two protein isoforms (Namy and Rousset 2010; Beier and Grimm 2001). In eukaryotes the ratio
between readthrough and non-readthrough proteins can vary based on tissue (Robinson and Cooley
1997) and development stage (Samson et al. 1995), and can be controlled for many genes
simultaneously by regulation of release factor proteins, which can itself be post-translational (von der
Haar and Tuite 2007). Readthrough interacts with many cellular processes: it is enhanced by factors that
reduce translational elongation and fidelity; it can also affect mRNA abundance by slowing down
transcript deadenylation, reducing nonsense-mediated decay, and triggering no-go decay or non-stop
decay (von der Haar and Tuite 2007). In yeast, epigenetic control of readthrough via a prion in response
to stress conditions has been proposed as an evolutionary catalyst, enabling the adaptation of new
domains that are translated at a very low rate in normal growth conditions but at a high rate in periods of
stress when they might provide an advantage (True and Lindquist 2000). Lastly, better understanding of
readthrough regulation could provide potential new therapeutic avenues for patients with nonsense
mutations (Keeling and Bedwell 2010).
Mechanistically, readthrough occurs when a ribosome fails to terminate translation upon encountering a
stop codon, instead inserting an amino acid and continuing to translate in the same reading frame, so that
a fraction of the resulting proteins include the downstream peptide (Namy and Rousset 2010; Doronina
and Brown 2006). The tRNA that inserts the amino acid can be a cognate of the stop codon (a stop
suppressor tRNA). Alternatively, a selenocysteine tRNA can insert a selenocysteine amino acid for
Abundant stop codon readthrough in fly and metazoa
3
UGA stop codons, if a selenocysteine insertion sequence (SECIS element) is present in the 3'-
untranslated region (UTR). Lastly, for certain "leaky" stop codon contexts (more frequently subject to
readthrough), a near-cognate tRNA can insert its amino acid instead (Poole et al. 1998; Bonetti et al.
1995), which can result in a readthrough level of more than 5% (Namy et al. 2001).
In eukaryotic genomes, readthrough has been thought to play only a minor role outside selenocysteine
incorporation, experimentally observed for only six wild-type genes in three species. In D. melanogaster
readthrough has been observed for syn (Klagges et al. 1996), kel (Robinson and Cooley 1997), and hdc
(Steneberg and Samakovlis 2001), two nonsense alleles of elav (Samson et al. 1995), and a nonsense
allele of wg (Chao et al. 2003); two additional candidates, Sxl and oaf, have been proposed based on
long ORFs downstream of the stop codon (Samuels et al. 1991; Bergstrom et al. 1995). Readthrough
has also been observed for two genes in S. cerevisiae (Namy et al. 2002; Namy et al. 2003) and for the
rabbit beta-globin gene (Chittum et al. 1998). According to the Recode-2 database (Bekaert et al. 2010),
the only other known cases of readthrough in eukaryotes are in transposable elements that could be
endogenous retroviruses.
The story changed dramatically with the sequencing of 12 Drosophila genomes (Drosophila 12
Genomes Consortium et al. 2007; Stark et al. 2007), which enabled a search for readthrough genes
through the evolutionary lens of comparative genomics analysis. While evolutionary signatures
associated with protein-coding selection usually showed sharp boundaries coinciding with the exact
boundaries of protein-coding genes we found a striking 149 readthrough candidates for which the
protein-coding constraint extends past the stop codon until the next in-frame stop codon (Figure 1) (Lin
et al. 2007). The evolutionary evidence suggested that the downstream region was specifically selected
for its translated polypeptide sequence and biochemical properties. These candidates are unlikely to be
selenocysteine genes because only three selenocysteine genes have been found in a directed search
(Martin-Romero et al. 2001). Furthermore, no cognate stop suppressor tRNA genes have been found in
the Drosophila genome (Chan and Lowe 2009). At the time, we postulated that perhaps Adenosine-to-
Inosine (A-to-I) editing was responsible for the observed signature, by editing away stop codons into
tryptophan codons, since readthrough candidates were enriched for nervous system genes where editing
is most active. However, we acknowledged that alternative explanations were possible, such as
precisely-positioned alternative splicing, that would also explain the observed signatures without
readthrough.
Abundant stop codon readthrough in fly and metazoa
4
For the three years since our initial findings the phenomenon has remained a mystery, with no resolution
about what underlying mechanism may enable translation of our candidate regions, no direct evidence of
translation, and no evidence of how extensive readthrough is across the animal kingdom. To address
these questions, we exploit the vast new transcriptional evidence provided by modENCODE datasets
(Graveley 2010) and new phylogenetic methods for detecting protein-coding selection (Lin et al. 2010).
We show that translational readthrough is the only plausible explanation for the observed signatures, and
use a mass spectrometry database to obtain experimental evidence of downstream translation for some
of our candidates. We also investigate additional genomic properties of readthrough candidates that
yield insights into their mechanisms of function. Lastly, we apply several genomic techniques to search
for readthrough in related insect species, and more broadly across the animal kingdom. The result is an
expanded picture, with a manually-curated list nearly doubling the number of candidates in Drosophila,
many new insights into their mechanism and evolution, and a dramatically expanded list of species with
translational readthrough, including human.
Results:
Comparative evidence and list of candidates
PhyloCSF provides a high resolution metric for coding regions
We first sought to revise and expand our previous comparative genomics approach to discovering
candidate readthrough genes (Lin et al. 2007; Lin et al. 2008) using PhyloCSF, an improved version of
the method we previously applied to identify short, conserved coding regions (Lin et al. 2010). Our new
metric uses the topology and branch lengths of the phylogenic tree, weighing all possible ancestral
codons and sequences of substitutions. It computes a log-likelihood ratio of the observed aligned codons
occurring in a coding versus a non-coding region, with positive score indicating that a sequence of
substitutions is more likely to occur in a coding region. The increased evidence used enables
significantly higher resolution at detecting constraint in small regions given a sufficient number of
substitutions. For example, the sequence of substitutions in the second codon in Figure 1B (headed by
AGG) is 1,145 times more likely to occur in a coding region than in a non-coding region.
For each D. melanogaster annotated protein-coding transcript we computed the PhyloCSF score of the
region from the annotated stop codon ('first stop codon') until the next in-frame stop codon ('second stop
codon'), which we refer to as the 'second ORF', or, for readthrough candidates, as the 'readthrough
Abundant stop codon readthrough in fly and metazoa
5
region'. We refer to the codon position aligned to one of these stop codons in each species as the first or
second 'stop locus', respectively.
Manual curation distinguishes readthrough candidates from alternative
mechanisms
There are 750 transcripts with positive second-ORF PhyloCSF score. We excluded 339 with low scores
as potentially due to chance (see Methods) and then examined the remaining ones both through
computational tests and manual curation, eliminating those for which there is a plausible explanation for
the observed protein-coding selection other than translational stop codon readthrough (Figure 2A).
We used deep next-generation transcript sequencing data (RNA-seq) to detect alternative splicing events
or RNA editing that could potentially explain the observed protein-coding selection in the absence of
readthrough. The modENCODE project has provided 2.3 billion reads across 30 stages of development
which recover 93% of known splice junctions across all genes we were evaluating. We used this data,
supplemented with EST and mRNA evidence reported by the UCSC genome browser and with two
genomic splice prediction algorithms, to look for possible 3'-splice sites shortly downstream of the first
stop codon. This revealed likely splice sites in only 40 of the remaining 411 candidates, suggesting that
splicing is not a likely explanation for the observed wide-spread phenomenon, and leaving 371
candidates that are not explained by alternative splicing. We also used RNA-seq data to rule out RNA
editing (no edited stops were found among our candidates, discussed below).
We next excluded 42 readthrough regions that overlap possible second cistrons or other genes in the
same frame by examining existing annotations and by looking for readthrough regions that contain a
low-scoring region followed by an ATG codon that is in turn followed by a high-scoring region. Finally,
we excluded 17 recent nonsense mutations in D. melanogaster, 5 cases where the positive PhyloCSF
score could be due to overlap with coding regions on the opposite strand ('antisense'), and 24 for other
reasons such as incorrect alignment. Additional details of manual curation are provided in
Supplementary Text S1.
We used an existing algorithm, SECISearch (Kryukov et al. 2003) to rule out selenocysteine insertion,
as no SECIS elements were found. As further evidence, we note that D. willistoni lacks the
selenocysteine incorporation machinery, and provides a more specific characteristic evolutionary
Abundant stop codon readthrough in fly and metazoa
6
signature found at all known selenocysteine insertion sites, with TGA selenocysteine codons aligned in
all other species, and sense codons for cysteine (or arginine) in D. willistoni (Drosophila 12 Genomes
Consortium et al. 2007). In contrast, for almost all readthrough candidates with TGA first stop codons,
D. willistoni retains the TGA stop, and shows continued protein-coding signatures in the second ORF.
We were left with 283 transcripts (Supplementary File DmelReadthrough2010.csv) that we believe are
examples of conserved, functional, translational stop codon readthrough, henceforth referred to as the
'readthrough candidates'. The readthrough regions vary in length from 4 to 788 codons, with mean of 67
codons. While the longest second ORFs are more likely to be readthrough regions, only 51 (25%) of 202
in-frame second ORFs longer than 100 codons are readthrough candidates (Figure 2B), whereas 232
(1.5%) of 15031 shorter in-frame second ORFs are readthrough candidates, emphasizing the importance
of using comparative evidence and our additional metrics. We have applied similar curation to the
subsequent open reading frame ('third ORF') of each readthrough candidate, and found 16 transcripts
that appear to have double readthrough (Figure 1C). We did not find any examples of triple readthrough.
Single-species evidence of protein translation
Several additional lines of genomic evidence support that the predicted readthrough regions are protein-
coding without relying on comparative evidence.
Nucleotide k-mer composition matches protein-coding regions
First, we found that the nucleotide k-mer composition of readthrough regions matched that of protein-
coding regions. We quantified protein-coding-like compositional properties using the Z curve score
(Gao and Zhang 2004) which provides a single-species test for protein-coding regions using mono-, di-,
and tri-nucleotide frequencies. Although most individual readthrough regions are too short to reliably
call as protein-coding using the Z curve score alone, we found that it provided very strong group
discrimination between coding regions (immediately upstream of stop codons) and non-coding regions
(immediately downstream of stop codons) for non-readthrough genes (Figure 3A, top two panels). For
readthrough candidates, the distributions of scores upstream of the first stop was indistinguishable from
that of coding regions for non-readthrough genes, and the distribution downstream of the second stop
was indistinguishable from that of non-coding regions, as expected. Strikingly, the distribution of scores
for the readthrough portion of readthrough candidates (Figure 3A, bottom panel) matched that of coding
Abundant stop codon readthrough in fly and metazoa
7
regions. We conclude that single-species sequence composition strongly supports the comparative
evidence that the candidate readthrough regions are protein-coding.
Periodic secondary structure matches protein-coding regions
As a second line of evidence, we found that the mRNA secondary structures of the readthrough regions
exhibit a statistical periodicity property that distinguishes coding from non-coding regions. In coding
regions of humans, the fraction of genes for which the base at a given position is paired in the predicted
least energy mRNA secondary structure varies periodically, with bases in the third codon position more
likely to be paired than bases in the other two positions, and this 3-periodicity does not continue past the
stop codon into the 3'-UTR (Shabalina et al. 2006), a result recently reproduced experimentally in yeast
(Kertesz et al. 2010). We reproduced the result for predicted structure in D. melanogaster and found that
among our readthrough candidates the 3-periodicity continues beyond the annotated stop codon and
abruptly ends after the second stop codon, consistent with the intervening region being protein-coding
(Figure 3B and Supplementary Figure S1).
Mass spectrometry evidence of protein translation
To obtain direct evidence that readthrough regions are translated, we searched a mass spectrometry
database for sequenced Drosophila peptides that match within a readthrough region. We used the
Drosophila PeptideAtlas (Loevenich et al. 2009), which contains 75,753 distinct peptides of length 4 to
55 amino acids that have been analyzed using mass spectrometry and sequenced by matching to
annotated coding regions, and an additional 971 peptides that have been sequenced by comparing to the
whole genome in all six reading frames.
We found 14 distinct peptides that match within one of our readthrough regions (Figure 3C,
Supplementary File DmelReadthrough2010.csv). We used BLAST to verify that these peptides do not
match elsewhere within the D. melanogaster genome. Since most peptides in the database were
sequenced by comparison to annotated coding regions, it is not surprising that there were no matches in
most of our readthrough regions.
These sequenced peptides provide experimental evidence of translation within the readthrough regions
of nine of our readthrough candidates, only two of which, syn and kel, had been previously
experimentally observed. We found strong evidence that downstream translation in these cases is not
Abundant stop codon readthrough in fly and metazoa
8
due to alternative splicing, based on the lack of any splice variants from RNA-Seq data across at least
139 and typically several thousand reads (Supplementary Text S11). We also provide evidence that they
are not due to dicistronic translation, by specifically evaluating the PhyloCSF protein-coding selection
score between the stop codon and the first in-frame ATG, which we find to be positive (in two cases, no
ATG was present, ruling out dicistronic translation altogether).
We used the peptide matches to estimate the readthrough rate for syn, the only gene with more than one
sequenced peptide in its readthrough region. There are 13 observed peptides in syn's 444-codon
readthrough region (six of them distinct), while there were 71 in the 538-codon coding region before the
stop codon. Under the simplifying assumption that the number of peptide matches is proportional to the
peptide length and to the expression level, we can estimate the rate of readthrough for syn to be 22%.
This agrees with the 20-25% readthrough efficiency of this gene determined using antibodies (Klagges
et al. 1996).
Protein Domain evidence of protein translation
To obtain further evidence of translation, we searched for homology between the readthrough regions
and known protein domains. As such homology is more likely to be found in protein-coding regions
than non-coding ones, this strategy was used in an earlier attempt to identify readthrough genes
computationally (Sato et al. 2003), and provided further support for our candidates here. Using Search
Pfam (Finn et al. 2010) we found 13 significant matches to Pfam protein domain families within the
readthrough regions, compared to only 5 in non-readthrough controls with similar second ORF length (p
= 0.045). For example, the readthrough region of CG14669 shows a match to a RAS family GTPase
domain in the curated Pfam-A database (e-value = 6.9 10-21), suggesting roles in vesicle formation,
motility and fusion.
Discerning readthrough mechanism
Several lines of evidence support that the observed protein-coding translation is the result of stop codon
readthrough rather than alternative mechanisms.
Reading frame offset bias confirms readthrough mechanism
First, comparing the PhyloCSF scores of regions downstream of the stop codon in different frames
provided evidence that the observed phenomenon is not due to splicing, dicistronic translation,
Abundant stop codon readthrough in fly and metazoa
9
ribosomal bypassing ("hopping") (Wills 2010), or chance, and also provided an independent estimate of
the number of readthrough genes.
We found that many more in-frame second ORFs (662) have positive PhyloCSF score than do the ORFs
starting 1 or 2 bases after the annotated stop codon (219 and 225, respectively) (Figure 4A). These
counts exclude transcripts for which the region overlaps a known splice variant or other annotated
coding exon, since that could account for any protein-coding signature found.
The strong excess for in-frame ORFs (frame 0, as expected for readthrough) argues for over 400
readthrough genes (Figure 4B). On one hand, unannotated splice variants, unannotated second
independent cistrons, overlapping genes, antisense genes, ribosomal hopping, and regions with high
score by chance would all be equally likely in any of the 3 frames (and in fact there is a bias away from
frame 0 for many of these, Supplementary Text S2). On the other hand, our manual curation determined
that nonsense mutations, A->I editing, and selenocysteine insertion, which would explain a positive
PhyloCSF score specifically in frame 0, account for less than 4% of the observed excess. Finally,
programmed frame shifting would lead to a positive score in frame 1 or frame 2, but not frame 0. Thus,
only translational stop codon readthrough could explain the excess of high scoring downstream regions
in frame 0.
RNA-seq evidence rules out stop codon editing
We evaluated whether translation beyond the stop codon is a result of Adenosine-to-Inosine RNA
editing (Aphasizhev 2007), which would convert stop codons to UGG tryptophan codons, as we had
originally hypothesized (Lin et al. 2007). We analyzed deep RNA-seq data from the modENCODE
project (Graveley 2010) for A-to-G single-nucleotide differences between the DNA sequence and the
mRNA reads, which provide a median coverage of 1500 reads in the regions evaluated.
Among all annotated stop codons we found only three that are edited to UGG: DopEcR, qvr, and
CG8481. The first results in a previously-reported (Stapleton et al. 2002) two-codon extension with
positive PhyloCSF score, but below our significance threshold. The other two were also not in our
candidate list, as their second ORFs have negative PhyloCSF scores and are unlikely to correspond to
functionally-selected translation (Supplementary Text S6).
Abundant stop codon readthrough in fly and metazoa
10
Outside these three stop codons, showing 193, 708 and 601 edited reads respectively, and corresponding
to 51%, 41% and 13% of all reads at those loci, we found no evidence of editing. At every other stop
codon there were fewer than 4 A-to-G mismatched reads, or such mismatched reads accounted for fewer
than 4% of all reads at that locus, and we believe these are due to sequencing errors. Other (non-A-to-I)
mismatches affecting stop codons also appear to be due to sequencing or mapping errors as well
(Supplementary Text S7). We conclude that downstream translation of our readthrough candidates is not
due to RNA editing.
Stop codon context supports translational leakage
Translation termination efficiency is known to be affected by several signals including the stop codon
itself, the 3 nucleotide positions downstream, the final two amino acids in the nascent peptide, and the
tRNA in the ribosomal p-site (Bertram et al. 2001). Of these signals, the strongest determinants consist
of the 4-nucleotide sequence that includes the stop codon and one downstream position, which we will
refer to as the stop codon ‘context’. In both yeast and mammals, the contexts most efficient for
termination are the most common in the genome, while leaky contexts are the most rare, especially
among highly expressed genes (Bonetti et al. 1995; McCaughan et al. 1995). Thus, the frequency of
different contexts may be an indication of their efficiency at translation termination.
We found a striking inverse correlation between the usage of 4-base stop codon contexts among
readthrough candidates and non-readthrough transcripts (Figure 4C). For example, TGA-C, the least
common context among non-readthrough transcripts (3.1%) is by far the most common among
readthrough candidates (32.2%). In fact, genes containing a TGA-C stop context are nearly 10-fold more
likely to be readthrough candidates (16.4% vs. 1.9%), suggesting that perhaps TGA-C is a leaky stop
codon context enabling read-through. Indeed, we found that non-readthrough genes are more likely to
contain an additional in-frame stop codon after a TGA-C context than after other contexts (8.4% vs.
5.5% in the second in-frame codon after the stop), and this could be a result of selection to limit the
number of extraneous amino acids added after accidental readthrough (Supplementary Text S3).
The relative frequency of various contexts suggests that the stop codon and the subsequent base
contribute independently to the potentially leaky context. For example, the frequency of TGA-C is not
significantly different from the product of independent frequencies of TGA and C. Indeed, TGA is the
most frequent stop codon among readthrough candidates and least frequent among non-readthrough
Abundant stop codon readthrough in fly and metazoa
11
transcripts when coupled with any subsequent downstream base. Similarly, C is the most frequent
downstream base in readthrough and least frequent in non-readthrough, coupled with any stop codon.
Our results suggest the most leaky stop codons are TGA > TAG > TAA and the most leaky downstream
position is C > T > G > A. These results are consistent with TGA being the leakiest of stop codons in
natural stop suppression in yeast (Firoozan et al. 1991), and stop codons followed by C showing
preferential readthrough in eRF mutants in Drosophila (Chao et al. 2003).
The unusual stop codon contexts of the readthrough candidates provide further evidence that
downstream translation requires identification of the stop codon, which occurs at the ribosome, and
cannot be due to alternatives such as splicing (Supplementary Figure S3).
Conservation pattern suggests stop codons encode functional amino acids
For non-readthrough transcripts, the three stop codons appear to have little functional difference besides
the predicted readthrough rates, and substitutions between them are frequent. However, the situation is
strikingly different for readthrough candidates. While different readthrough genes utilize all three stop
codons (182 use TGA, 73 TAG and 28 TAA), any given readthrough candidate rarely showed
substitutions between the three stop codons at the readthrough locus (Figure 4D). Overall, 83% of
readthrough stop codons show perfect conservation across the 12 Drosophila species, compared to 13%
for all genes. If we consider only transcripts that contain stop codons in all species, the difference
remains striking, with 97% of readthrough stops perfectly conserved, compared to 39% for all genes.
This suggests that the readthrough stop codons are not functionally equivalent and may encode
functional amino-acids.
Indeed, experiments in mammals and yeast have shown that specific amino acids can be inserted when a
stop codon is read through, with glutamine incorporated for UAA or UAG and arginine, cysteine, or
tryptophan for UGA (Feng et al. 1990; Lin et al. 1986; Pure et al. 1985; Weiss and Friedberg 1986;
Chittum et al. 1998) (Supplementary Text S10). Two alternative mechanisms for stop codon translation
are consistent with these observations, both translating stop codons into functional amino-acids using a
single codon-anti-codon mismatch. The first mechanism would use a G-U pair in the first codon
position, which has been demonstrated as a stop-codon suppression mechanism (Lin et al. 1986), and
would result in UGA translation to Arginine and UAA and UAG translation to Glutamine (Figure 4E).
Alternatively, non-canonical pairing in the third codon position would lead to Cysteine or Tryptophan
Abundant stop codon readthrough in fly and metazoa
12
translation for UGA and Tyrosine translation for both UAA and UAG. It is also possible that both
mechanisms are at play.
The few substitutions that have occurred in readthrough positions convert between UAA and UAG stop
codons, consistent with their translation into identical amino-acids, while UGA would in fact lead to a
different translation. However, the specific tRNA used would likely affect rate of readthrough, providing
a potential explanation for the observed low rate of substitution between UAA and UAG stop codons.
Thus, we conclude that both the amino-acid translation and the tRNA-dependent rate of read-through
likely contribute to the observed conservation patterns of readthrough stop codons.
We ruled out two alternative explanations for the observed conservation. Although there is experimental
evidence that at certain Drosophila loci readthrough is dependent on the choice of stop codon, in all
known cases where UAA or UAG is read through UGA can also be read through (Supplementary Text
S12) so this would not explain the high conservation of TAA and TAG stop codons. It is also unlikely
that conservation of readthrough stop codons is due to mRNA secondary structure because the second or
third base of the stop codon is predicted to be paired for only about half of the readthrough candidates.
Regulation and functional enrichments
Several conserved stem loops could enhance readthrough
To determine if mRNA secondary structures participate in the readthrough mechanism, we looked for
conserved stem loops in the readthrough regions of some of our candidates. We first studied the
conservation of an 80 nucleotide stem loop immediately downstream of the stop codon in hdc that is
known to trigger readthrough, possibly through interference with the ribosome or release factors, even if
the TAA stop codon is changed to a TAG or TGA (Steneberg and Samakovlis 2001). We found that the
energy of this stem loop is preferentially conserved among the 12 Drosophila species by numerous
conservative substitutions (p = 0.05), suggesting functional selection of the secondary structure. At least
two other readthrough candidates, CG3638 and Mctp, have somewhat similar stem loops soon after the
stop codon that are also significantly conserved (p = 0.016 and p = 0.0006, respectively).
However, many of the other readthrough candidates do not have conserved stem loops immediately
following the readthrough stop codon, suggesting stable RNA structures regulate only some readthrough
genes.
Abundant stop codon readthrough in fly and metazoa
13
Long UTRs and enriched 8-mer suggest expanded readthrough regulation
We observed that readthrough candidates are considerably longer on average than other transcripts
(Figure 5A), suggesting they may be subject to diverse forms of post-transcriptional regulation. The
average overall pre-mRNA length (including both introns and exons) of a readthrough candidate is
19,087 nt, as compared to 5,805 for non-readthrough. Readthrough candidates are significantly larger in
every gene component, including total exon length, coding length, total intron length, 5'-UTR length,
and adjusted 3'-UTR length (after subtracting the readthrough region). They also have more exons, and
their genes have more splice variants than non-readthrough genes. All four components that make up the
primary transcript (5'-UTR, coding sequence, 3'-UTR, and introns) make independent contributions to
the excess length, as controlling for any three of these components still results in greater lengths for the
fourth component among the readthrough candidates (Supplementary Text S4).
The longer 5'- and 3'-UTRs in the readthrough candidates could contain target sites for factors mediating
translational regulation, which often bind in UTRs (Wilkie et al. 2003). Furthermore, the long 3'-UTRs
of the readthrough candidates could be directly involved in the readthrough mechanism by physically
separating the stop codon from the poly(A) tail. Indeed, termination efficiency can be affected by
interaction between eRF3, a component of the termination complex, and the poly(A)-binding protein
(Amrani et al. 2004; Kobayashi et al. 2004), and thus by separating the two, long 3'-UTRs might
decrease termination efficiency.
To find potential binding sites of RNA or protein cofactors that may participate in the readthrough
mechanism, we searched for common sequence motifs in the readthrough regions of the readthrough
candidates (Supplementary Text S8). The most significantly enriched n-mer is the 8-mer CAGCAGCA
which is found in 22.6% of readthrough regions (64) but only 4.2% of similar-length coding regions
before the stop codon of non-readthrough candidates and only 1.4% of regions after the stop (Figure
S5A). These 8-mers are distributed more or less uniformly through the readthrough regions, but the
enrichment stops at the second stop codon. The readthrough candidates are also enriched for
CAGCAGCA in same-sized regions before the first stop codon, appearing in 43 of them (15.19%), of
which 22 include the 8-mer after the stop as well. The enrichment is concentrated in the last 10% of the
coding portion of these transcripts, around 250 nt. Overall, this 8-mer appears within a few hundred nt
on one side or the other of the stop codon of about 30% of readthrough candidates.
Abundant stop codon readthrough in fly and metazoa
14
Readthrough candidates, and other long transcripts, also showed higher than average content in their
coding regions for C and G nucleotides. However, readthrough candidates are still enriched for these
nucleotides even after correcting for gene length, and, conversely, readthrough candidates are still
enriched for length even after correcting for nucleotide frequencies. After correcting for nucleotide
frequencies, readthrough candidates are enriched for the di-nucleotide CA and 5 tri-nucleotides, and
depleted for GA and 5 tri-nucleotides (Supplementary Figure S4). Moreover, the 8-mer enrichment for
CAGCAGCA remains highly significant even compared to transcripts with similar frequencies of the
three tri-nucleotides within it.
Since CAGCAGCA often codes for QQQ, we wondered whether its enrichment could be related to
polyQ repeats, and found that readthrough regions are also highly enriched for long polyQ repeats
(Figure S5B). The CAGCAGCA motif is highly enriched in all 3 frames but slightly more so in frame 0,
so its effect is probably at the nucleotide level, with, perhaps, a small additional effect at the protein
level.
Positions likely relevant to readthrough mechanism pinpointed by conservation
and base frequencies
In order to obtain clues as to which base positions might play a role in the readthrough mechanism, we
analyzed evolutionary conservation across the 12 Drosophila species at base positions before the stop
codon. The first, second, third, and fifth positions before the stop codon are significantly more
conserved among readthrough candidates than among non-readthrough transcripts (restricting to
transcripts with perfectly conserved stop codons to remove biases arising from the high conservation of
the stop codon itself). A similar analysis of the positions after the stop codon would not be meaningful
because this region is coding in the readthrough candidates but not in the background transcripts.
Some positions near the stop codon also exhibit unusual base frequencies among the readthrough
candidates. Among positions upstream of the stop codon, position -10 shows high G content, -5 low A,
and -4 high A (where the stop codon is at positions -3, -2, and -1). Among the first six downstream
positions, +1 showed high C/low A, +2 high A, +5 low T/high G, and +4 showed the least difference in
base frequencies (where +1 is the position immediately downstream of the stop codon). These
differences are significant even when comparing to non-readthroughs with similar overall base
Abundant stop codon readthrough in fly and metazoa
15
composition, or when comparing the positions after the stop to typical coding regions. Indeed, base
frequencies in the 6 positions downstream of a stop codon are known to correlate with experimentally
determined levels of readthrough (in yeast), and the 4th position after the stop has the least value in
predicting readthrough level (Williams et al. 2004). Our findings are thus consistent with previous
experimental observations, and provide a global view of the positions and nucleotides likely involved in
readthrough regulation.
Readthroughs are enriched for some functional roles
To study potential functional roles for readthrough proteins, we studied the enrichment of our candidates
for gene ontology (GO) functional terms. We found that readthrough candidates are significantly
enriched for 110 GO categories (p < 0.05, Bonferroni corrected), including nervous system
development, as we had previously noted (Lin et al. 2007). However, genes in enriched GO categories
tend to be longer than average (Figure 5B), and among genes of similar length, the only enrichments that
remain are for transcription factor activity, DNA binding, transcription regulator activity (p = 0.0016,
0.0019, and 0.016, respectively, with Bonferonni correction for number of GO categories tested). After
also correcting for UTR lengths, no significant GO enrichments remain.
Thus, while these functional enrichment provide potential clues into the neuronal and regulatory
functions of readthrough proteins, their interpretation should be taken with the understanding of the
observed length biases. The longer lengths of enriched GO categories could be an indication that both
neuronal and regulatory genes are overall highly regulated, and readthrough might be only one means of
several regulatory mechanisms. Alternatively, neuronal genes and readthrough genes could both be long
for reasons unrelated to readthrough, and the observed enrichments could be an artifact of this
confounding factor.
Evolution of readthrough across animal species
Readthrough candidates evolved as extensions, not truncations
We next studied the evolutionary history of readthrough loci to understand the origins of readthrough
genes. As the readthrough candidates were defined based on the locations of consecutive in-frame stop
codons in D. melanogaster, we used the conservation patterns of the first and the second stop codon loci
across the 12 Drosophila species spanning 60 million years of evolution, to distinguish between two
Abundant stop codon readthrough in fly and metazoa
16
ancestry possibilities. (i) Extension scenario: If the first stop codon was deeply conserved but the second
stop codon was seemingly more recently evolved, we reasoned the readthrough transcript could have
evolved as a recent extension of an ancestrally shorter gene into a formerly untranslated region. (ii)
Truncation scenario. Alternatively, if the second stop codon was deeply conserved and the first stop
codon had recently replaced an ancestral sense codon, we reasoned that the ancestral transcript likely
underwent a nonsense mutation at the first stop locus, with the readthrough mechanism partially
rescuing the ancestral, longer version of the protein.
We found that for more than 90% of readthrough candidates, the readthrough state appears to be
ancestral to the divergence of the 12 species, with ancestral stop codons at both stop loci. Of the
remaining cases, several appeared to have evolved through the extension scenario, even though such
transcripts would not usually be found by our method because when the second ORF is non-coding in a
subset of the species we would expect lower PhyloCSF scores. In contrast, no cases were found that
matched the truncation scenario, even though we would expect to find any such transcripts by our
method because the second ORF would be coding in all 12 species. Lastly, for 9 cases which were not
part of our readthrough candidate list as they were attributed to recent nonsense mutations, we could not
unambiguously argue presence of readthrough as they occurred since the more recent D.
melanogaster/D. simulans split and may have retained protein-coding evolutionary signatures solely due
to lack of sufficient divergence to accumulate additional substitutions (Supplementary Text S8).
Thus, it seems likely that most or all of the readthrough candidates evolved by extension rather than
truncation. Extension-by-readthrough provides a compelling mechanism for evolving new domains that
extend existing proteins, as the extension can evolve modularly without affecting the independent
functionality of the shorter protein. In contrast, in a truncation-with-readthrough scenario, mutations that
enhance the truncated protein would also affect the longer protein, providing little opportunity to
optimize the short protein in a modular way, without potentially damaging the previously-adapted longer
protein.
Striking examples of readthrough in humans, nematodes, and other insects
We used PhyloCSF to look for stop codon readthrough in other sets of closely related species whose
genomes have been sequenced and aligned.
Abundant stop codon readthrough in fly and metazoa
17
Using mammalian alignments, we found four non-selenocysteine readthrough candidates in the human
genome: OPRK1, OPRL1, SACM1L, and BRI3BP (Figure 6A). Although rabbit beta-globin is a known
readthrough gene (Chittum et al. 1998) it was not found by our method as it is not conserved, even in
other rodents (Supplementary Figure S6).
Using a 6-way nematode alignment we found five C. elegans readthrough candidates, namely shk-1,
F38E11.5, F38E11.6, K07C11.4, and C18B2.6 (Figure 6B). There is one known case of selenocysteine
insertion in C. elegans (Buettner et al. 1999) and a search for SECIS elements and homology to
selenocysteine genes in other species found no others (Taskov et al. 2005), so we consider it is unlikely
that these 5 readthrough candidates are selenocysteine genes. Surprisingly, F38E11.5 and F38E11.6 are
neighboring genes on opposite strands; we have no explanation for this coincidence.
Although several readthrough genes have been identified in Saccharomyces cerevisiae (Namy et al.
2002) (Namy et al. 2003), our method did not find any conserved readthrough genes in Saccharomyces
or Candida.
In order to find readthrough candidates in A. gambiae (mosquito), A. mellifera (honey bee), and T.
castaneum (red flour beetle), we instead looked for a preponderance of synonymous substitutions
specifically in these other species within the orthologs of our readthrough regions, using a 15 way
alignment with the 12 Drosophila species. We found that 17 of our candidates appear to be readthrough
in mosquitoes, 7 of which appear to be readthrough in beetles, and 4 of those appear to be readthrough
in bees as well (Figure 6C; Supplementary File DmelReadthrough2010.csv).
Estimation of readthrough abundance across animal species.
Beyond these compelling candidates, we sought to estimate the abundance of readthrough across
sequenced animal genomes.
To estimate the overall number of readthrough genes in C. elegans we again compared the PhyloCSF
scores of post-stop regions in 3 frames, but unlike our finding for Drosophila there was no excess in
frame 0, suggesting that despite the five examples we found, stop codon readthrough is not very
abundant in C. elegans (Figure 7A).
Abundant stop codon readthrough in fly and metazoa
18
To recognize other species with potentially abundant readthrough even in absence of comparative
information, we created a 3-frame comparison test that requires an annotated genome sequence but no
alignment. We first determined the number of second ORFs in each frame with positive Z curve scores,
indicating that they are likely protein coding (excluding very short regions and regions that overlap the
coding portion of another annotated transcript). Since recent nonsense mutations could contribute to the
number of positive scores in frame 0, we subtracted an estimate for the number of such mutations. A
large excess of positive scores for frame 0 in the remainder would suggest abundant readthrough
transcripts, though it would not tell us which ones they are.
We applied this test to four additional insect species, and estimated that A. gambiae (mosquito) and B.
mori (silkworm) have approximately 70 readthrough transcripts, while A. mellifera (bee) and T.
castaneum (beetle) do not have an abundance of readthrough transcripts (Figure 7B). Additional
comparative genomics resources will be needed to resolve individual cases and also investigate an
observed difference between frames 1 and 2 in silkworm.
We also applied the test to 8 other eukaryotic species, including one plant, two fungi, humans, and four
other animals (Supplementary Text S9), but did not find a statistically significant excess in frame 0 in
any of these species, suggesting that they do not have an abundance of readthrough transcripts.
In summary, while we have found several striking examples of conserved readthrough transcripts in
many animals spanning large evolutionary distances, the abundant readthrough that we discovered in
Drosophila exists only in some related insect species including mosquito and silkworm, but does not
appear to extend elsewhere in the tree of life.
By examining the stop codons of the readthrough transcripts in the various species we observed another
distinguishing characteristic of the species with abundant readthrough. All of the readthrough stop
codons we found in humans, nematodes, bees, and beetles are TGA, and 13 out of 16 have context
TGA-C. The readthrough stop codon in the rabbit beta-globin gene, the only non-Drosophila
readthrough previously discovered in animals, is also TGA. In contrast, the readthroughs we found
common to fruit fly and mosquito include all 3 stop codons, suggesting that Diptera are distinguished
among animals not only by the large number of readthrough transcripts but also by readthrough of non-
TGA stop codons. The TAA readthroughs are particularly notable because there are no others known in
Abundant stop codon readthrough in fly and metazoa
19
any other eukaryotic organism, and a search for conserved readthrough stop codons in prokaryotes using
homology and dN/dS analysis did not find any TAA readthroughs either (Fujita et al. 2007).
Discussion:
By eliminating all other known mechanisms, we have left translational stop codon readthrough as the
only plausible explanation for the evolutionary signatures in 283 D. melanogaster protein-coding genes.
For three of these genes with very long readthrough regions (hdc, kel, and syn), the extended form of the
protein had been experimentally observed before we began our investigations, but the comparative
genomic method enabled us to go much further than these three genes, finding abundant readthrough
and also indicating that the amino acid sequence downstream of the stop codon is not only expressed but
serves a conserved function, even for very short peptides. These genomic methods indicate that such
functional stop codon readthrough is considerably more prevalent among Metazoa than previously
recognized, affecting many genes in mosquitoes and silkworms and several in other insects, nematodes,
and mammals, including humans.
Double readthrough, context variety, and 3'-UTR length variability suggest
readthrough regulation is complex
Readthrough in Drosophila is considerably more complex than one would expect under a simple model
in which each ribosome reads through a stop codon according to some probability, up to perhaps 5%,
that depends only on the stop codon context. The stop codons that are affected, the readthrough rate,
and readthrough regulation all suggest regulatory complexity beyond this simple model.
As noted above, termination efficiency is affected by 3'-UTR length as well as context, but even that is
not enough to identify the readthrough stop codons. While the majority of our readthrough candidates
have rare stop codon context and unusually long 3'-UTRs, there are also many that do not. For example,
AlCR2, one of our double readthrough candidates, has both a frequent and presumably less leaky context
(TAA-G) and a short 3'-UTR (227 bp). Hence, readthrough targeting must depend on other unidentified
characteristics as well.
The readthrough rate in Drosophila exceeds the 5% level found in context studies in yeast (Namy et al.
2001). Readthrough rates as high as 20%, 50%, and 90% have been observed in Drosophila for syn, kel,
and elav nonsense alleles, respectively (Klagges et al. 1996; Robinson and Cooley 1997; Samson et al.
Abundant stop codon readthrough in fly and metazoa
20
1995). Existence of 16 double readthroughs provides further evidence of high readthrough rate, as
independent readthrough of both stop codons at the 5% level would result in translation of this ORF
only 0.0025 of the time.
Finally, we note that readthrough is known to be subject to tissue-specific regulation (kel),
developmental stage-specific regulation (elav nonsense alleles), and temperature-specific regulation
(Chao et al. 2003), and post-transcriptional modification of a tRNA anticodon offers one possible
mechanism for regulating readthrough of UAG stop codons (Bienz and Kubli 1981).
Readthrough machinery is distinct from selenocysteine insertion
The lack of selenocysteine insertion machinery in D. willistoni shows that the machinery for
readthrough is distinct from that for selenocysteine insertion. Like all other 11 species, D. willistoni
shows the signature of translational readthrough, with conservation of both first and second stop codons
for readthrough candidates, and protein-coding selection in readthrough regions. Moreover, readthrough
in kel has been experimentally observed in D. willistoni, confirming experimentally that the readthrough
mechanism is functional. However, the genes implicated in selenoprotein synthesis are absent in D.
willistoni (Drosophila 12 Genomes Consortium et al. 2007), and thus could not be required for the
observed readthrough.
Furthermore, this shows that the conserved readthrough of our candidates is independent of yet another
Drosophila mechanism that relies on a SECIS element and proteins involved in selenocysteine insertion
to read through TGA stop codons without inserting selenocysteine (Hirosawa-Takamori et al. 2009).
This conclusion is further supported by the lack of SECIS elements in our readthrough candidates and
the fact that many of them have TAA and TAG stop codons.
Readthrough could be a competitor to NMD for rescuing premature stop codons
Conserved readthrough in Drosophila could be one manifestation of a more general stop codon
readthrough mechanism whose primary function is to partially rescue nonsense mutations until a sense
mutation at the same locus restores full function to the gene. Under this scenario, the readthrough
candidates are genes whose stop codon was originally misidentified as premature by this mechanism,
but for which the downstream peptide evolved to offer a fitness advantage.
Abundant stop codon readthrough in fly and metazoa
21
Such nonsense rescue has been observed at several naturally occurring and artificially introduced
premature stop codons in Drosophila (Samson et al. 1995; Yildiz et al. 2004; Washburn and O'Tousa
1992; Chao et al. 2003). Furthermore, suppressor tRNAs, whose mutated anticodons are cognate to a
stop codon, can rescue nonsense mutations in strains of E. coli, yeast, and C. elegans, but artificially
introduced suppressor tRNAs do not work well in Drosophila, suggesting there could be an alternative
mechanism for rescuing nonsense mutations (Washburn and O'Tousa 1992).
A rescue origin would predict that the conditions that trigger readthrough would be ones that typically
distinguish premature from normal stop codons. Two of the characteristics we have observed in
readthrough candidates, unusual stop codon context and long 3'-UTRs, are in fact shared with premature
stop codons: premature stop codons can have an arbitrary context and would not favor contexts that
maximize termination efficiency, and of course a transcript with a premature stop will usually appear to
have a long 3'-UTR.
The system for identifying which stop codons to read through might share some components with
nonsense-mediated decay (NMD), another pathway that identifies premature stop codons, in order to
degrade their transcripts. How NMD recognizes premature stop codons in Drosophila is not well
understood, but there is evidence that NMD is more likely to be triggered if the 3'-UTR length is greater
than around 742 nt (Hansen et al. 2009). The average 3'-UTR length for readthrough transcripts is 1079
nt compared to 381 for non-readthrough, so perhaps readthrough and NMD share this part of the
recognition mechanism.
This raises the question of how readthrough transcripts avoid NMD, as they must for the downstream
amino acid sequence to be functionally conserved. NMD prevents formation of potentially toxic
truncated proteins but does not restore production of the full length protein, so a cell would benefit from
distinguishing mRNAs with premature stops that do not code for toxic proteins and invoking
readthrough on these rather than NMD, though we do not know how a cell could make this distinction.
Understanding of Diptera readthrough could have medical applications
One approach to treating genetic diseases caused by nonsense mutations has been to develop small
molecules that induce stop codon readthrough (Keeling and Bedwell 2010; Schmitz and Famulok 2007).
Abundant stop codon readthrough in fly and metazoa
22
Ideally such a drug would trigger readthrough and disable the NMD pathway for a specific nonsense
mutation, while allowing translation to terminate normally at other stop loci.
Readthrough also has methodological applications, such as selecting cells expressing a protein at a
desired level by fusing the protein with a transmembrane anchor separated by a leaky stop codon
(Jostock 2010).
The prevalence of conserved readthrough in Drosophila offers a rich opportunity to enhance our
understanding of the underlying molecular mechanisms, which could lead to advances in these
applications. Furthermore, abundant readthrough in mosquitoes distinguishes that vector from most
other species and offers a possible target for further research.
Methods:
Transcripts
We obtained the D. melanogaster genome and annotations from flybase.org (Tweedie et al. 2009),
version 5.13, release FB2008-10, excluding mitochondrial DNA, trans-spliced transcripts, and
transcripts whose final CDS does not end in a stop codon. We used the Drosophila tree and divergences
from (Stark et al. 2007). We used nematode genome and annotations, from http://www.wormbase.org,
release WS190, 16-May-2008. The 15-way dm3 insect alignments, the 29-way hg19 alignments, and the
6-way ce6 nematode MULTIZ alignments (Blanchette et al. 2004) were obtained from the UCSC
genome browser (Kent et al. 2002; Kuhn et al. 2009). Genomes and annotations for other species came
from the UCSC genome browser, except that the following species came from the specified web sites:
B. mori, silkdb.org/silkdb; C. albicans, candidagenome.org; G. max, phytozome.net; T. castaneum 3.0,
beetlebase.org.
Z curve score
Z curve score was computing as described in (Gao and Zhang 2004) and trained using all coding exons
and similar sized intergenic regions as positive and negative examples. We offset the score so that the
minimum average error threshold on the training examples was at 0. For the 3-frames comparison test
for abundant readthrough we computed the linear discriminant using a prior probability for coding of
0.02, to account for the expectation that most second ORFs are non-coding, and excluded second ORFs
less than 10 codons because the Z curve score has high variance for such short regions. To estimate the
Abundant stop codon readthrough in fly and metazoa
23
number of recent nonsense mutations for the 3-frames comparison test, we used an upper estimate for
the number (n=17) of such mutations in D. melanogaster found during manual curation of the
readthrough candidates list, and then multiplied by the ratio of the number of annotated transcripts in the
species being tested to the number in D. melanogaster. The error bars in the charts show the square root
of the number of transcripts with positive score in that frame, which is good approximation to the
standard deviation if we assume all transcripts are equally likely to have a positive score.
Using PhyloCSF to find protein-coding second ORFs in D. melanogaster
We computed the PhyloCSF score for the second ORF of each transcript, excluding the final stop codon.
For transcripts with no annotated 3'-UTR, or if the second ORF extended beyond the end of the
annotated 3'-UTR, we defined the second ORF as continuing beyond the end of the annotated transcript
without splicing. Among transcripts with identical second ORFs we considered only one.
To obtain a p-value for a non-coding region of length L codons to have a PhyloCSF score above some
threshold, we approximated the score distribution with a normal of mean and standard deviation C1 L
and C2 LEX where the coefficients C1 = 11.55, C2 = 10.81, and EX = 0.7184 were obtained
empirically from second ORFs of all transcripts, excluding very high scores as likely coming from
coding regions. For multiple hypothesis correction we computed the Local FDR (Efron et al. 2001).
We excluded any transcript whose second ORF has PhyloCSF score (in decibans) less than 16,
indicating it is less than 40 times as likely to occur in a protein-coding region as in a non-coding one.
We included as protein-coding any of the remaining transcripts for which the p-value is less than 0.001;
each of these second ORFs has a Local FDR less than around 0.33. We individually examined each of
the approximately 100 remaining transcripts whose second ORFs have PhyloCSF score >= 16 and p-
value >= 0.001, and considered it protein-coding or not based on various factors, including the presence
of frame shifting insertions and deletions (indels).
RNA-seq
RNA-seq data was obtained from the October 2009 data freeze on modencode.org (Graveley 2010). We
used development timecourse data from submissions 2245-2259, 2262-2263, 2265-2267, and 2388-
2397, consisting of 76-base reads from poly-A RNA extracted from 30 development stages of D.
melanogaster, mapped using Tophat, and processed using samtools (Li et al. 2009). When looking for
Abundant stop codon readthrough in fly and metazoa
24
reads overlapping a particular locus, we excluded reads for which the locus was within 7 nt of the end of
the read, to avoid splice-mapping errors. When examining candidates for RNA editing we checked for
erroneous mapping to paralogs using BLASTN from blast.ncbi.nlm.nih.gov.
Splice prediction
We did not consider a transcript to be readthrough if any modENCODE RNA-seq reads or any EST or
cDNA in the UCSC genome browser support a 3'-splice site early in the second ORF. We also used two
splice prediction algorithms to look for additional splice sites. We used the web interface on fruitfly.org
for splice site prediction using neural networks (Reese et al. 1997) and a maximum entropy method
(Yeo and Burge 2004). If a candidate 3'-splice site has a neural network score above around 0.7 or a
maximum entropy score above around 6 we considered the transcript to be readthrough or not based on
a combination of factors including these scores and the PhyloCSF score of the region before the
potential splice site.
SECIS elements
We searched for SECIS elements in the 3'-UTRs of D. melanogaster readthrough candidates using
SECISearch (Kryukov et al. 2003). If the 3'-UTR was not annotated, we searched the 1000 nt
downstream of the stop codon. If the annotated 3'-UTR was less than 1000 nt long, we extended it
without splicing to 1000 nt.
Base pairing
To calculate frequency of mRNA base pairing we used RNAfold (Zuker and Stiegler 1981) from the
Vienna RNA Package version 1.8.2 (Hofacker et al. 1994) with the default parameters to calculate the
minimum free energy secondary structure of the 403-base region of each transcript centered at the stop
codon (extending without splicing if there was no annotated 3'-UTR or it was less than 200 nt long).
Each distinct region was counted only once even if it is contained in multiple transcripts. To decrease
the variance arising from the relatively small number of readthrough candidates, codons before the stop
have been averaged with the previous four codons, and codons after the stop have been averaged with
the next four codons.
Length, composition, and enrichment statistics
Abundant stop codon readthrough in fly and metazoa
25
In computing length and composition statistics for readthrough candidates and non-readthrough
transcripts we eliminated duplicates (transcripts with identical second ORFs), leaving 15211 transcripts.
For computing UTR lengths we included only transcripts that have annotated UTRs.
To adjust some statistic for one or more other statistics (for example, computing average 5'-UTR length
of transcripts with similar length and 3'-UTR length as readthrough candidates) we divided transcripts
into bins and weighted the non-readthrough transcripts so as to make the total weight in each bin equal
to the fraction of readthrough candidates in that bin. Bin size was 300 when adjusting for one statistic.
When adjusting for two statistics we used bins of size 750 for the first statistic and a bin of size 150 for
the second statistic for each bin of the first statistic. For three statistics we used bins of size 750, 150,
and 15, and averaged the results over all permutations to make the result order independent.
GO classifications were obtained from http://www.sb.cs.cmu.edu/stem/ (Ernst and Bar-Joseph 2006).
Stem loop conservation
We tested for conservation of stem loops as follows. For every possible substitution within the stem
loop, we calculated how much that hypothetical substitution would perturb the energy of the most stable
secondary structure. We considered a stem loop conserved if the substitutions that actually occurred
within the stem loop in a parsimonious reconstruction of the 12 genomes perturbed the energy
significantly less than an equal number of random substitutions. For the stem loop investigation, RNA
folding was done using quickfold on the DINAMelt Server, http://dinamelt.bioinfo.rpi.edu/quikfold.php
(Markham and Zuker 2005).
Homology to known protein domains
We searched for homology to known protein domain families using Search Pfam from
http://pfam.sanger.ac.uk/search?tab=searchProteinBlock#tabview=tab1 (Finn et al. 2010). We excluded
regions less than 10 codons long, and regions that overlapped a known CDS, both for readthroughs and
controls. We also excluded control regions with positive PhyloCSF, since these could be actual
readthroughs excluded from our readthrough list by overly conservative curation. We considered a
match to be significant if e-value < 0.05 / (number of regions tested).
Abundant stop codon readthrough in fly and metazoa
26
Acknowledgements We would like to thank William M. Gelbart for suggesting this line of investigation during the 12-flies
project, the members of the modENCODE project for early release of data, Erich Brunner for help
analyzing mass spectrometry data, Jade Vinson for use of his splice prediction code, and Matt
Rasmussen, Loyal Goff, Jason Ernst, Rogerio Candeias, Alex Lancaster, Pouya Kheradpour, Michal
Rabani, Stefan Washietl, Grigoriy Kryukov, Jessica Wu, Rintaro Saito, and Peter Everett for helpful
discussions and suggestions. This work was supported by the National Institutes of Health (U54
HG00455-01), NSF CAREER 0644282, and a Sloan Foundation award.
Abundant stop codon readthrough in fly and metazoa
27
Figure Legends Figure 1. Evolutionary signatures of typical and readthrough stop codons. Alignments of 12 Drosophila
species and inferred parsimonious common ancestor surrounding the annotated stop codons of three genes. The
color coding of substitutions and insertions/deletions (indels) relative to common ancestor is a simplification for
visualization purposes, as the actual PhyloCSF score sums over all possible ancestral sequences and weighs every
codon substitution by its probability. Insertions in other species relative to D. melanogaster are not shown. (A)
Alignment of a typical gene (bw), shows abundant synonymous substitutions upstream of the stop codon, and
many non-conservative substitutions, frame-shifting indels, and in-frame stop codons downstream of the stop
codon. The stop codon position shows several substitutions between different stop codons. (B) Alignment of a
readthrough candidate gene, CG17319 (one of 283 cases). The region between the annotated stop codon and the
next in-frame stop codon shows mostly synonymous substitutions, and lacks frame-shifting indels, while the
region downstream of the second stop shows substitutions and indels typical of non-coding regions downstream
of typical stop codons. The first stop codon position is perfectly conserved, while the second stop codon position
shows substitutions between different stop codons. (C) Alignment of a double readthrough candidate, Glu-RIB
(one of 16 cases). Both the second ORF and the third ORF show protein-coding signatures, indicating that both
stop codons are likely readthrough. Both readthrough stop codon positions show no substitutions.
Figure 2. Manual curation distinguishes 283 readthrough candidates. (A) Steps of filtering method used to
eliminate transcripts with other plausible explanations for the observed second-ORF protein-coding selection,
leading to the final list of 283 unambiguous readthrough candidates. (B) Fraction of in-frame second ORFs of
given length (x axis) that are readthrough candidates (red), that have positive PhyloCSF scores with alternative
plausible explanations (blue), or that have negative PhyloCSF score (green). Counts for each category indicated
for each interval. Even among second ORFs greater than 100 amino-acids, only a quarter are readthrough
candidates, highlighting the need for comparative evidence to distinguish them. Overall, 2% of second ORFs
tested are readthrough candidates.
Figure 3. Non-comparative evidence of protein translation. (A) Most readthrough regions show codon
composition typical of protein-coding regions, measured as positive Z curve scores (x-axis). Top panel: Coding
regions before the first stop show positive scores for both read-through (squares) and non-readthrough (crosses)
typical of protein-coding regions. Middle panel: Non-coding regions after the second stop for readthrough
candidates (squares) and after the first stop for typical transcripts (crosses) show negative scores typical of non-
coding regions. Bottom panel: Score distribution for read-through regions matches that of protein-coding regions,
providing single-species evidence that readthrough regions are protein-coding. Evaluated regions before the first
stop, after the second stop for readthrough, and after the first stop for non-readthrough selected to match length
distribution of readthrough regions. (B) Periodic base-pairing frequency in readthrough regions (red) matches that
Abundant stop codon readthrough in fly and metazoa
28
of known coding regions (blue) but is different from that of UTRs (green). Fraction of transcripts for which a
given nucleotide is paired in predicted RNA secondary structures (y-axis) at each position relative to a stop codon
(x-axis). Third codon positions (purple) are paired more frequently than first or second positions, and stop codons
(positions -3, -2, and -1) show decreased pairing, as previously observed computationally in humans and
experimentally in yeast (top panel). Transition from periodic to non-periodic pairing happens at second stop
codon for readthrough candidates (bottom panel). Signal is averaged over four codon positions (see Methods),
with raw data shown in Supplementary Figure S1. Error bars show Standard Error of Mean (SEM). (C) Example
of readthrough region (gish) supported by a 22 amino acid peptide match (red rectangle) to mass spectrometry
Drosophila PeptideAtlas (one of nine cases). With no ATG codon between the stop codon and the peptide, and no
observed alternative splicing events despite thousands of RNA-Seq reads overlapping this region, readthrough
seems the only plausible explanation for translation of this peptide.
Figure 4. Evidence of read-through mechanism. (A,B) Excess of high-scoring regions in-frame (frame 0)
compared to out of frame (frame 1, frame 2) suggests readthrough as the likely mechanism and provides an
estimate of readthrough count. (A) PhyloCSF score per-codon (x-axis) of the regions starting 0, 1, or 2 bases after
the stop codon (red, green, purple, respectively) and continuing until the next stop codon in that frame. All
annotated protein-coding transcript of D. melanogaster are included except for duplicate regions, regions
overlapping the coding portion of another annotated transcript, and regions with no alignment. Frame 0 shows an
excess of over 400 predicted protein-coding transcripts compared to the other reading frames suggesting
widespread readthrough. (B) Possible mechanisms associated with protein-coding function downstream of D.
melanogaster stop codons (rows) and associated reading frame offsets where corresponding protein-coding
function is expected (columns). Unannotated alternative splice variants, unannotated second cistrons, and random
fluctuations would lead to an even distribution among the three frames, while frame-shift events and overlapping
coding regions would bias away from frame 0. A bias for in-frame protein-coding selection is expected only for
stop codon readthrough, recent nonsense mutations, A-to-I editing, and selenocysteine, the latter three together
accounting for only 17 cases. This leaves readthrough as the only plausible explanation for an excess of
approximately 420 frame 0 regions with positive PhyloCSF scores. (C) Usage of stop codon context (stop codon
and subsequent base) provides additional evidence of readthrough mechanism. The 4-base contexts are sorted in
order of decreasing frequency among the 14,928 non-readthrough stop codons (red), with less frequent stop
codons (top, e.g. TGA-C) experimentally associated with translational leakage in other species and most frequent
associated with efficient termination (bottom, e.g. TAA-A). Context frequencies for readthrough candidates (blue)
are opposite of non-readthrough genes, suggesting a preference for leaky context, with one third using TGA-C
and almost none using TAA-A. (D-E) Stop codon conservation patterns in readthrough candidates suggest
functional amino-acid translation by tRNA wobble pairing. (D) Increased stop codon conservation in readthrough
candidates (blue) suggests the three stop codons are not functionally equivalent. Only about 1/3 of all annotated
Abundant stop codon readthrough in fly and metazoa
29
D. melanogaster stop codons have aligned stops in all 12 species, and only about 1/3 of those are perfectly
conserved, i.e., have the same stop codon in all 12 species. In contrast, 83% of candidate readthrough stop codons
have an aligned stop in all 12 species, and 97% of those are perfectly conserved. While all three stop codons are
involved in readthrough of different genes, individual genes rarely switch between different amino acids, with no
substitutions observed between TGA and other stops, and only 7 substitutions between TAA and TAG (not
shown). (E) Proposed amino acids inserted at stop codon position and likely wobble mechanism of incorporation.
For each stop codon are listed near-cognate tRNA anti-codons and the type of wobble resulting in incorporation.
GU pairing in the 1st codon position has previously been proposed in yeast and AA, AC, and GA pairing have all
been experimentally observed in the 3rd codon position. In both scenarios, UAG and UAA are synonymous with
each other but not with UGA, consistent with observed substitution frequencies at readthrough stop codon
positions.
Figure 5. Functional and length enrichments. (A) Readthrough candidates as a group (blue) show increased
length (in nucleotides) and splicing complexity (number of transcript variants) compared to other transcripts (red).
Overall Length: sum of all introns, coding exons, and UTRs. Coding Length: length of first ORF. Adjusted 3'-
UTR Length excludes second ORF for readthrough candidates. Splice Variants: number of annotated transcripts
of the same gene. Numbers show actual size and bars denote size relative to non-readthrough genes. Error bars
show standard error of mean (SEM) for readthrough candidates. Enrichments persist even after correcting for total
intron length (green), 5’UTR length, 3’UTR length, or coding exon length (not shown). The greater lengths may
be due to increased need for regulatory signals in each part of the transcript, or may be directly involved in
readthrough regulation, for example due to increased spacing between the stop codon and poly(A) signals that
have been shown to interact with ribosomal release factors. (B) Length distribution biases may confound GO
enrichment analysis. Distribution of total gene length (log scale) for readthrough candidates (red), non-
readthrough transcripts (blue), and genes with the GO category "neurogenesis" (green), one of the most enriched
among readthrough candidates. It is possible that neurogenesis genes show increased regulation of many types,
only one of which is readthrough. Alternatively, readthrough genes and neurogenesis genes could both have
greater lengths for independent reasons, leading to artifactual functional enrichments.
Figure 6. Examples of readthrough in other species. (A) Alignment across 29 mammals for predicted
readthrough in human gene SACM1L, one of four mammalian candidates. (B) Alignment across 5 worm species
for predicted readthrough in C. elegans gene C18B2.6, one of five nematode candidates. Stop codon context in all
five is TGA-C, and is perfectly conserved among Caenorhabditis species. P. pacificus is not aligned in any of the
five readthrough regions. (C) Alignment across 12 Drosophila and 3 other insect species, A. gambiae (mosquito),
A. mellifera (honey bee), and T. castaneum (red flour beetle), for the predicted readthrough region of the D.
melanogaster slo gene, one of 17 readthrough candidates conserved in mosquitoes, and one of 4 conserved across
Abundant stop codon readthrough in fly and metazoa
30
all 15 aligned insects. Although PhyloCSF cannot tell us whether the region is protein-coding in a particular
subset of species, the large number of synonymous substitutions specifically in the other 3 insects, lack of non-
synonymous substitutions and frame-shifting indels, and perfectly conserved 'leaky' TGA-C stop codon context
provide strong evidence that readthrough also occurs in these other insects.
Figure 7. Estimated extent of readthrough in other species. (A) Estimated extent of readthrough in C. elegans
using comparative evidence. Frequency (y-axis) of PhyloCSF scores per codon (x-axis) for downstream ORFs in
three frames, analogous to Figure 4A. No significant excess is found in frame 0, suggesting lack of abundant
readthrough in C. elegans. Scores of five readthrough candidates found by manual curation are indicated with
stars. (B) Estimated extent of readthrough in 5 insect species and humans using single-species sequence-
composition evidence quantified by Z curve scores. Only positive scores are shown for downstream ORFs in three
frames to detect excess in frame 0 associated with abundant readthrough (RT). For each species, line graphs show
the distribution of positive Z curve scores, and bar charts show the total number of positive scores. Lower bound
(dotted boxes) for the excess in frame 0 is estimated as the difference between the lower error bar for frame 0 and
the higher error bar maximum between frames 1 and 2. Number of readthrough genes (solid boxes) is estimated as
the difference between the center of the error bars for frame 0 and the center of the error bars for frames 1 and 2
(subtracting an estimate for the number of recent nonsense mutations). This matches our estimates for D.
melanogaster, and predicts more than 70 readthrough genes in mosquito and silkworm, but no abundant
readthrough in bees, beetles, and humans.
Abundant stop codon readthrough in fly and metazoa
References Amrani N, Ganesan R, Kervestin S, Mangus DA, Ghosh S, Jacobson A. 2004. A faux 3'-UTR promotes
aberrant termination and triggers nonsense-mediated mRNA decay. Nature 432: 112-8. Aphasizhev R. 2007. RNA Editing. Mol Biol 41: 227-239. Beier H, Grimm M. 2001. Misreading of termination codons in eukaryotes by natural nonsense
suppressor tRNAs. Nucleic Acids Res 29: 4767-82. Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. 2010. Recode-2: new design,
new search tools, and many more genes. Nucleic Acids Res 38: D69-74. Bergstrom DE, Merli CA, Cygan JA, Shelby R, Blackman RK. 1995. Regulatory autonomy and
molecular characterization of the Drosophila out at first gene. Genetics 139: 1331-46. Bertram G, Innes S, Minella O, Richardson J, Stansfield I. 2001. Endless possibilities: translation
termination and stop codon recognition. Microbiol 147: 255-69. Bienz M, Kubli E. 1981. Wild-type tRNA-Tyr-G reads the TMV RNA stop codon but Q base-modified
tRNA-Tyr-Q does not. Nature 294: 188-190. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K,
Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708-15.
Bonetti B, Fu L, Moon J, Bedwell DM. 1995. The efficiency of translation termination is determined by a synergistic interplay between upstream and downstream sequences in Saccharomyces cerevisiae. J Mol Biol 251: 334-45.
Buettner C, Harney JW, Berry MJ. 1999. The Caenorhabditis elegans homologue of thioredoxin reductase contains a selenocysteine insertion sequence (SECIS) element that differs from mammalian SECIS elements but directs selenocysteine incorporation. J Biol Chem 274: 21598-602.
Chan PP, Lowe TM. 2009. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37: D93-7.
Chao A, Dierick H, Addy T, Bejsovec A. 2003. Mutations in Eukaryotic Release Factors 1 and 3 Act as General Nonsense Suppressors in Drosophila. Genetics 165: 601.
Chittum H, Lane W, Carlson B, Roller P, Lung F, Lee B, Hatfield D. 1998. Rabbit Beta-Globin Is Extended Beyond Its UGA Stop Codon by Multiple Suppressions and Translational Reading Gaps. Biochem 37: 10866-10870.
Doronina VA, Brown JD. 2006. Non-canonical decoding events at stop codons in eukaryotes. Molekuliarnaia biologiia 40: 731-41.
Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, et al. 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203-18.
Efron B, Tibshirani R, Storey J, Tusher V. 2001. Empirical Bayes Analysis of a Microarray Experiment. J Am Stat Assoc 96: 1151-1160.
Ernst J, Bar-Joseph Z. 2006. STEM: a tool for the analysis of short time series gene expression data. BMC bioinformatics 7: 191.
Feng YX, Copeland TD, Oroszlan S, Rein A, Levin JG. 1990. Identification of amino acids inserted during suppression of UAA and UGA termination codons at the gag-pol junction of Moloney murine leukemia virus. Proc Natl Acad Sci U S A 87: 8860-3.
Finn R, Mistry J, Tate J, Coggill P, Heger A, Pollington J, Gavin O, Gunasekaran P, Ceric G, Forslund K, et al. 2010. The Pfam protein families database. Nucleic Acids Res 38: D211.
Abundant stop codon readthrough in fly and metazoa
Firoozan M, Grant C, Duarte J, Tuite M. 1991. Quantitation of Readthrough of Termination Codons in Yeast using a Novel Gene Fusion Assay. Yeast 7: 173-183.
Fujita M, Mihara H, Goto S, Esaki N, Kanehisa M. 2007. Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach. BMC Bioinformatics 8: 225.
Gao F, Zhang CT. 2004. Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 20: 673-81.
Graveley B. 2010. DOI 10.1038/nature09715. Nature. Hansen KD, Lareau L, Blanchette M, Green R, Meng Q, Rehwinkel J, Gallusser F, Izaurralde E, Rio D,
Dudoit S, et al. 2009. Genome-wide identification of alternative splice forms down-regulated by nonsense-mediated mRNA decay in Drosophila. PLoS Genet 5: e1000525.
Hirosawa-Takamori M, Ossipov D, Novoselov S, Turanov A, Zhang Y, Gladyshev V, Krol A, Vorbruggen G, Jackle H. 2009. A novel stem loop control element-dependent UGA read-through system without translational selenocysteine incorporation in Drosophila. The FASEB Journal 23: 107-13.
Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P. 1994. Fast Folding and Comparison of RNA Secondary Structures. Monatshefte fur Chemie.
Jostock T. 2010. CELL SURFACE DISPLAY OF POLYPEPTIDE ISOFORMS BY STOP CODON READTHROUGH. U.S. Patent WO/2010/022961.
Keeling KM, Bedwell DM. 2010. Recoding Therapies for Genetic Diseases. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 123-146. Springer, New York, NY.
Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D. 2002. The human genome browser at UCSC. Genome Res 12: 996-1006.
Kertesz M, Wan Y, Mazor E, Rinn J, Nutter R, Chang H, Segal E. 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature 467: 103-7.
Klagges BR, Heimbeck G, Godenschwege TA, Hofbauer A, Pflugfelder GO, Reifegerste R, Reisch D, Schaupp M, Buchner S, Buchner E. 1996. Invertebrate synapsins: a single gene codes for several isoforms in Drosophila. J neurosci 16: 3154-65.
Kobayashi T, Funakoshi Y, Hoshino SI, Katada T. 2004. The GTP-binding release factor eRF3 as a key mediator coupling translation termination to mRNA decay. J Biol Chem 279: 45693-700.
Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigó R, Gladyshev VN. 2003. Characterization of mammalian selenoproteomes. Science 300: 1439-43.
Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. 2009. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 37: D755-61.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-9.
Lin JP, Aker M, Sitney KC, Mortimer RK. 1986. First position wobble in codon-anticodon pairing: amber suppression by a yeast glutamine tRNA. Gene 49: 383-8.
Lin MF, Carlson JW, Crosby MA, Matthews BB, Yu C, Park S, Wan KH, Schroeder AJ, Gramates LS, St Pierre SE, et al. 2007. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Res 17: 1823-36.
Lin MF, Deoras AN, Rasmussen MD, Kellis M. 2008. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 4: e1000067.
Lin M, Jungreis I, Kellis M. 2010. PhyloCSF: a comparative genomics method for distinguishing protein-coding and non-coding regions. Nature Precedings. doi: 10101/npre.2010.4784.1.
Abundant stop codon readthrough in fly and metazoa
Loevenich SN, Brunner E, King NL, Deutsch EW, Stein SE, FlyBase Consortium, Aebersold R, Hafen E. 2009. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation. BMC bioinformatics 10: 59.
Markham NR, Zuker M. 2005. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res 33: W577-81.
Martin-Romero FJ, Kryukov GV, Lobanov AV, Carlson BA, Lee BJ, Gladyshev VN, Hatfield DL. 2001. Selenium metabolism in Drosophila: selenoproteins, selenoprotein mRNA expression, fertility, and mortality. J Biol Chem 276: 29798-804.
McCaughan KK, Brown CM, Dalphin ME, Berry MJ, Tate WP. 1995. Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc Natl Acad Sci U S A 92: 5431-5.
Namy O, Duchateau-Nguyen G, Hatin I, Denmat S, Termier M, Rousset J. 2003. Identification of stop codon readthrough genes in Saccharomyces cerevisiae. Nucleic Acids Res 31: 2289-2296.
Namy O, Duchateau-Nguyen G, Rousset TrotPscmcliScJ. 2002. Translational readthrough of the PDE2 stop codon modulates cAMP levels in Saccharomyces cerevisiae. Mol Microbiol 43: 641-52.
Namy O, Hatin I, Rousset JP. 2001. Impact of the six nucleotides downstream of the stop codon on translation termination. EMBO Rep 2: 787-93.
Namy O, Rousset JP. 2010. Specification of Standard Amino Acids by Stop Codons. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 79-100. Springer, New York, NY.
Poole ES, Major LL, Mannering SA, Tate WP. 1998. Translational termination in Escherichia coli: three bases following the stop codon crosslink to release factor 2 and affect the decoding efficiency of UGA-containing signals. Nucleic Acids Res 26: 954-60.
Pure GA, Robinson GW, Naumovski L, Friedberg EC. 1985. Partial suppression of an ochre mutation in Saccharomyces cerevisiae by multicopy plasmids containing a normal yeast tRNAGln gene. J Mol Biol 183: 31-42.
Reese MG, Eeckman FH, Kulp D, Haussler D. 1997. Improved splice site detection in Genie. J Comput Biol 4: 311-23.
Robinson DN, Cooley L. 1997. Examination of the function of two kelch proteins generated by stop codon suppression. Development 124: 1405-17.
Samson ML, Lisbin MJ, White K. 1995. Two distinct temperature-sensitive alleles at the elav locus of Drosophila are suppressed nonsense mutations of the same tryptophan codon. Genetics 141: 1101-11.
Samuels M, Schedl P, Cline T. 1991. The complex set of late transcripts from the Drosophila sex determination gene sex-lethal encodes multiple related polypeptides. Mol Cell Biol 11: 3584-602.
Sato M, Umeki H, Saito R, Kanai A, Tomita M. 2003. Computational analysis of stop codon readthrough in D.melanogaster. Bioinformatics 19: 1371-1380.
Schmitz A, Famulok M. 2007. Ignore the nonsense. Nature 447: 42-3. Shabalina SA, Ogurtsov AY, Spiridonov NA. 2006. A periodic pattern of mRNA secondary structure
created by the genetic code. Nucleic Acids Res 34: 2428-37. Stapleton M, Carlson J, Brokstein P, Yu C, Champe M, George R, Guarin H, Kronmiller B, Pacleb J,
Park S, et al. 2002. A Drosophila full-length cDNA resource. Genome biol 3: RESEARCH0080. Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S,
Deoras AN, et al. 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450: 219-232.
Steneberg P, Samakovlis C. 2001. A novel stop codon readthrough mechanism produces functional Headcase protein in Drosophila trachea. EMBO Rep 2: 593-597.
Abundant stop codon readthrough in fly and metazoa
Taskov K, Chapple C, Kryukov GV, Castellano S, Lobanov AV, Korotkov KV, Guigó R, Gladyshev VN. 2005. Nematode selenoproteome: the use of the selenocysteine insertion system to decode one codon in an animal genome? Nucleic Acids Res 33: 2227-38.
True H, Lindquist S. 2000. A yeast prion provides a mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.
Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. 2009. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 37: D555-9.
von der Haar T, Tuite MF. 2007. Regulated translational bypass of stop codons in yeast. Trends Microbiol 15: 78-86.
Washburn T, O'Tousa JE. 1992. Nonsense suppression of the major rhodopsin gene of Drosophila. Genetics 130: 585-595.
Weiss WA, Friedberg EC. 1986. Normal yeast tRNA(CAGGln) can suppress amber codons and is encoded by an essential gene. J Mol Biol 192: 725-35.
Wilkie GS, Dickson KS, Gray NK. 2003. Regulation of mRNA translation by 5'- and 3'-UTR-binding factors. Trends Biochem Sci 28: 182-188.
Williams I, Richardson J, Starkey A, Stansfield I. 2004. Genome-wide prediction of stop codon readthrough during translation in the yeast Saccharomyces cerevisiae. Nucleic Acids Res 32: 6605-6616.
Wills NM. 2010. Translational Bypassing -- Peptidyl-tRNA Re-pairing at Non-overlapping Sites. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 365-381. Springer, New York, NY.
Yeo G, Burge C. 2004. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11: 377-94.
Yildiz O, Kearney H, Kramer BC, Sekelsky JJ. 2004. Mutational Analysis of the Drosophila DNA Repair and Recombination Gene mei-9. Genetics 167: 263.
Zuker M, Stiegler P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9: 133-48.
A
B
C
Figure 1: Evolutionary signatures of typical and readthrough stop codons.
Annotated Stop Codon
A
B
!!"
#$"
$$"
%&"
&'"
&("
#" $"
)"
*"
%"
'" $"
%"
&" )"
*'$"
*)"
&&"
&&"
&%"
#" !"
$"
#"
%" %"
&" &"
&" %"
!"
+#$$"
*%%'"
&(%!"
*$!"
&%+"
'$" %+"
&!"
$"
'" *"
%" &"
&" ("
*"
(,"
%(,"
$(,"
#(,"
!(,"
&((,"
("
%("
$("
#("
!("
&(("
&%("
&$("
&#("
&!("
%(("
%%("
%$("
%#("
%!("
-*(("!"#$%&'(&)(*"#'+$",-.+(
/0$&'1(23!(40'5.6(7,'($&1&'+8(
./012345"6"("789&$$+$:"
;1<=>8?@A="BCD1?8?@28E"789$##:"
F=?G</>2HI/"3?8GJG?<=E"789%!*:"
750 Second ORFs with positive score
!
PhyloCSF, p-value, Local FDR ==> 339 Chance
!
RNA-seq and splice predictors ==> 40 Alternative Splice
!
Score to first ATG, prior annotations ==> 42 Dicistronic or overlapping gene
!
Stop only in D mel or closely related ==> 17 Recent nonsense
!
Score on opposite strand ==> 5 Antisense
!
RNA-seq ==> 0 A-to-I editing (3 found, all low score)
!
Search for SECIS ==> 0 Selenocysteine
!
Visual check of alignment ==> 24 Alignment errors, etc.
!
283 Readthrough Candidates
Figure 2 : Manual curation distinguishes 283 readthrough candidates
B
A C
Figure 3: Non-comparative evidence of protein-translation
(continued)
A
C
D
B
E !"#$%
!&$$'())*#+%
,(-./+*)0
12%
&)/3(
!"#$%
4#5#+
"167%
7+8-#5#+
!"# 9:7 94: $%&'('() $%&*+
,- 97: 49: ./0123'() ./(*4
#" 977 99: ./0123'() ./(*4
747 5671)'() 567*5
447 8%691:9;2( 8%9*<
,- 97: 797 86%:7'() 86%*=
#" 977 797 86%:7'() 86%*=
1(5;0*)0/"-.
.>?92'%'(&?'(?
!71?@:A:(?
9:7'B:(
C:(D@2(:('@2/?
92'%'(&?'(?-%A?
@:A:(?9:7'B:(
9:7
70*+#%/-*5%*+)('"(5
!"#
Frame 0 Frame 1 Frame 2
Interpretation
Readthrough !
Recent nonsense !
A -> I editing !
Selenocysteine !
Alternative splicing ! ! !
Dicistronic ! ! !
Cryptic promotor ! ! !
Hopping ! ! !
Antisense ! ! !
Chance ! ! !
Frame shift ! !
CDS overlaps stop ! !# with PhyloCSF > 0 662 219 225
!"#$%&'%()*+*,%
-./%()"010)2+*,%
Figure 4: Evidence of read-through mechanism
A
B
Figure 5: Functional and length enrichments
A
B
C
Figure 6: Examples of read-through in other species
A
B
Figure 7: Estimated extent of readthrough in other species
Species with excess in frame 0 (abundant RT) Species with no excess (little or no RT)