Evidence of abundant stop codon readthrough in Drosophila and other metazoa

Irwin Jungreis1,2, Michael F. Lin1,2, Clara S. Chan3, Manolis Kellis1,2,4

1MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA; 2Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA; 3MIT Biology Department, Cambridge, Massachusetts 02139, USA.

4Corresponding author. E-mail [email protected]; fax 617-452-5034

Running Title: Abundant stop codon readthrough in fly and metazoa

Keywords: Translational read-through; termination codon suppression; comparative genomics; regulation; evolution; UTRs; editing; recoding; dicistronic; selenocysteine; SECIS.

Abstract: While translational stop codon readthrough is often utilized by viral genomes, it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting coding regions, and new comparative information from additional species. We report an expanded set of 283 readthrough candidates, including 16 double readthroughs, that were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report mass spectrometry evidence of translation for at least 9 readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem loops, providing clues about their regulation and potential mechanisms of readthrough. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in mosquito and silkworm, and several readthrough genes in bee, beetle, nematode, and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized.
Supplementary Material:

Supplementary Figures S1-S6

Supplementary Text S1-S12

Supplementary File DmelReadthroughCandidates.csv containing list of candidate readthrough transcripts with information about the readthrough region and indication of double readthrough, readthrough in other insects, and peptide matches.

Introduction:

While the central dogma of molecular biology, DNA makes RNA makes protein, has elegantly

described the basic information transfer of almost every living cell, we are continuing to discover

fascinating ways nature has found to regulate these processes. From chromatin regulation to

transcription initiation and elongation, splicing, export, degradation, translation initiation and

elongation, and post-translational modifications, each layer of regulation has been optimized for specific

situations and conditions.

Translational stop codon readthrough, also known as stop codon suppression, provides a compelling

regulatory mechanism for exposing additional C-terminal domains of a protein at a lower abundance,

which in contrast to alternative splicing can be regulated at the translational level. Viruses use this

mechanism not only to increase functional versatility in a compact genome but also to control the ratio

of two protein isoforms (Namy and Rousset 2010; Beier and Grimm 2001). In eukaryotes the ratio

between readthrough and non-readthrough proteins can vary based on tissue (Robinson and Cooley

1997) and development stage (Samson et al. 1995), and can be controlled for many genes

simultaneously by regulation of release factor proteins, which can itself be post-translational (von der

Haar and Tuite 2007). Readthrough interacts with many cellular processes: it is enhanced by factors that

reduce translational elongation and fidelity; it can also affect mRNA abundance by slowing down

transcript deadenylation, reducing nonsense-mediated decay, and triggering no-go decay or non-stop

decay (von der Haar and Tuite 2007). In yeast, epigenetic control of readthrough via a prion in response

to stress conditions has been proposed as an evolutionary catalyst, enabling the adaptation of new

domains that are translated at a very low rate in normal growth conditions but at a high rate in periods of

stress when they might provide an advantage (True and Lindquist 2000). Lastly, better understanding of

readthrough regulation could provide potential new therapeutic avenues for patients with nonsense

mutations (Keeling and Bedwell 2010).

Mechanistically, readthrough occurs when a ribosome fails to terminate translation upon encountering a

stop codon, instead inserting an amino acid and continuing to translate in the same reading frame, so that

a fraction of the resulting proteins include the downstream peptide (Namy and Rousset 2010; Doronina

and Brown 2006). The tRNA that inserts the amino acid can be a cognate of the stop codon (a stop

suppressor tRNA). Alternatively, a selenocysteine tRNA can insert a selenocysteine amino acid for


UGA stop codons, if a selenocysteine insertion sequence (SECIS element) is present in the

3'-untranslated region (UTR). Lastly, for certain "leaky" stop codon contexts (more frequently subject to

readthrough), a near-cognate tRNA can insert its amino acid instead (Poole et al. 1998; Bonetti et al.

1995), which can result in a readthrough level of more than 5% (Namy et al. 2001).

In eukaryotic genomes, readthrough has been thought to play only a minor role outside selenocysteine

incorporation, experimentally observed for only six wild-type genes in three species. In D. melanogaster

readthrough has been observed for syn (Klagges et al. 1996), kel (Robinson and Cooley 1997), and hdc

(Steneberg and Samakovlis 2001), two nonsense alleles of elav (Samson et al. 1995), and a nonsense

allele of wg (Chao et al. 2003); two additional candidates, Sxl and oaf, have been proposed based on

long ORFs downstream of the stop codon (Samuels et al. 1991; Bergstrom et al. 1995). Readthrough

has also been observed for two genes in S. cerevisiae (Namy et al. 2002; Namy et al. 2003) and for the

rabbit beta-globin gene (Chittum et al. 1998). According to the Recode-2 database (Bekaert et al. 2010),

the only other known cases of readthrough in eukaryotes are in transposable elements that could be

endogenous retroviruses.

The story changed dramatically with the sequencing of 12 Drosophila genomes (Drosophila 12

Genomes Consortium et al. 2007; Stark et al. 2007), which enabled a search for readthrough genes

through the evolutionary lens of comparative genomics analysis. While evolutionary signatures

associated with protein-coding selection usually showed sharp boundaries coinciding with the exact

boundaries of protein-coding genes, we found a striking 149 readthrough candidates for which the

protein-coding constraint extends past the stop codon until the next in-frame stop codon (Figure 1) (Lin

et al. 2007). The evolutionary evidence suggested that the downstream region was specifically selected

for its translated polypeptide sequence and biochemical properties. These candidates are unlikely to be

selenocysteine genes because only three selenocysteine genes have been found in a directed search

(Martin-Romero et al. 2001). Furthermore, no cognate stop suppressor tRNA genes have been found in

the Drosophila genome (Chan and Lowe 2009). At the time, we postulated that perhaps

Adenosine-to-Inosine (A-to-I) editing was responsible for the observed signature, by converting stop codons into

tryptophan codons, since readthrough candidates were enriched for nervous system genes where editing

is most active. However, we acknowledged that alternative explanations were possible, such as

precisely-positioned alternative splicing, that would also explain the observed signatures without

readthrough.


For the three years since our initial findings, the phenomenon has remained a mystery, with no resolution

about what underlying mechanism may enable translation of our candidate regions, no direct evidence of

translation, and no evidence of how extensive readthrough is across the animal kingdom. To address

these questions, we exploit the vast new transcriptional evidence provided by modENCODE datasets

(Graveley 2010) and new phylogenetic methods for detecting protein-coding selection (Lin et al. 2010).

We show that translational readthrough is the only plausible explanation for the observed signatures, and

use a mass spectrometry database to obtain experimental evidence of downstream translation for some

of our candidates. We also investigate additional genomic properties of readthrough candidates that

yield insights into their mechanisms of function. Lastly, we apply several genomic techniques to search

for readthrough in related insect species, and more broadly across the animal kingdom. The result is an

expanded picture, with a manually-curated list nearly doubling the number of candidates in Drosophila,

many new insights into their mechanism and evolution, and a dramatically expanded list of species with

translational readthrough, including human.

Results:

Comparative evidence and list of candidates

PhyloCSF provides a high resolution metric for coding regions

We first sought to revise and expand our previous comparative genomics approach to discovering

candidate readthrough genes (Lin et al. 2007; Lin et al. 2008) using PhyloCSF, an improved version of

the method we previously applied to identify short, conserved coding regions (Lin et al. 2010). Our new

metric uses the topology and branch lengths of the phylogenetic tree, weighing all possible ancestral

codons and sequences of substitutions. It computes a log-likelihood ratio of the observed aligned codons

occurring in a coding versus a non-coding region, with a positive score indicating that a sequence of

substitutions is more likely to occur in a coding region. The increased evidence used enables

significantly higher resolution at detecting constraint in small regions given a sufficient number of

substitutions. For example, the sequence of substitutions in the second codon in Figure 1B (headed by

AGG) is 1,145 times more likely to occur in a coding region than in a non-coding region.
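
To give a feel for the scale of such a score, the following minimal sketch converts a likelihood ratio into a log-likelihood ratio. It is not PhyloCSF itself: the function name and the choice of decibans (10 × log10 of the ratio) as the default unit are assumptions made for illustration, and only the ratio of 1,145 is taken from the text.

```python
import math

def log_likelihood_ratio(ratio, unit="decibans"):
    """Convert P(alignment | coding) / P(alignment | non-coding) to a log score.

    The unit names are illustrative assumptions, not part of the source text.
    """
    if ratio <= 0:
        raise ValueError("likelihood ratio must be positive")
    if unit == "decibans":
        return 10.0 * math.log10(ratio)
    if unit == "nats":
        return math.log(ratio)
    raise ValueError("unit must be 'decibans' or 'nats'")

# Ratio quoted in the text for the codon headed by AGG in Figure 1B.
score = log_likelihood_ratio(1145)
```

A ratio above 1 gives a positive score (more coding-like), a ratio of exactly 1 gives zero, and a ratio below 1 gives a negative score.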

For each D. melanogaster annotated protein-coding transcript we computed the PhyloCSF score of the

region from the annotated stop codon ('first stop codon') until the next in-frame stop codon ('second stop

codon'), which we refer to as the 'second ORF', or, for readthrough candidates, as the 'readthrough


region'. We refer to the codon position aligned to one of these stop codons in each species as the first or

second 'stop locus', respectively.
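
The definition of the second ORF can be sketched in a few lines. This is an illustration only, under stated assumptions: the function name, the 0-based coordinate convention, and the choice to exclude both stop codons from the returned codons are hypothetical and are not taken from the authors' pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def second_orf(transcript, first_stop_start):
    """Return (codons, second_stop) for the region after the annotated stop.

    transcript: DNA sequence of the transcript (string).
    first_stop_start: 0-based index of the first base of the annotated stop.
    codons: in-frame codons strictly between the two stop codons.
    second_stop: the next in-frame stop codon, or None if the transcript
    ends before one is reached.
    """
    seq = transcript.upper()
    first_stop = seq[first_stop_start:first_stop_start + 3]
    if first_stop not in STOP_CODONS:
        raise ValueError("no stop codon at the given position")
    codons = []
    # Walk codon by codon in the frame of the annotated stop.
    for i in range(first_stop_start + 3, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            return codons, codon
        codons.append(codon)
    return codons, None

# Toy transcript (hypothetical): ATG GCC TGA CAG CAG TAA GGG
codons, second_stop = second_orf("ATGGCCTGACAGCAGTAAGGG", 6)
```

In the toy example the second ORF is two codons long and ends at a TAA second stop codon.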

Manual curation distinguishes readthrough candidates from alternative

mechanisms

There are 750 transcripts with positive second-ORF PhyloCSF score. We excluded 339 with low scores

as potentially due to chance (see Methods) and then examined the remaining ones both through

computational tests and manual curation, eliminating those for which there is a plausible explanation for

the observed protein-coding selection other than translational stop codon readthrough (Figure 2A).

We used deep next-generation transcript sequencing data (RNA-seq) to detect alternative splicing events

or RNA editing that could potentially explain the observed protein-coding selection in the absence of

readthrough. The modENCODE project has provided 2.3 billion reads across 30 stages of development

which recover 93% of known splice junctions across all genes we were evaluating. We used these data,

supplemented with EST and mRNA evidence reported by the UCSC genome browser and with two

genomic splice prediction algorithms, to look for possible 3'-splice sites shortly downstream of the first

stop codon. This revealed likely splice sites in only 40 of the remaining 411 candidates, suggesting that

splicing is not a likely explanation for the observed widespread phenomenon, and leaving 371

candidates that are not explained by alternative splicing. We also used RNA-seq data to rule out RNA

editing (no edited stops were found among our candidates, discussed below).

We next excluded 42 readthrough regions that overlap possible second cistrons or other genes in the

same frame by examining existing annotations and by looking for readthrough regions that contain a

low-scoring region followed by an ATG codon that is in turn followed by a high-scoring region. Finally,

we excluded 17 recent nonsense mutations in D. melanogaster, 5 cases where the positive PhyloCSF

score could be due to overlap with coding regions on the opposite strand ('antisense'), and 24 for other

reasons such as incorrect alignment. Additional details of manual curation are provided in

Supplementary Text S1.

We used an existing algorithm, SECISearch (Kryukov et al. 2003) to rule out selenocysteine insertion,

as no SECIS elements were found. As further evidence, we note that D. willistoni lacks the

selenocysteine incorporation machinery, and provides a more specific characteristic evolutionary


signature found at all known selenocysteine insertion sites, with TGA selenocysteine codons aligned in

all other species, and sense codons for cysteine (or arginine) in D. willistoni (Drosophila 12 Genomes

Consortium et al. 2007). In contrast, for almost all readthrough candidates with TGA first stop codons,

D. willistoni retains the TGA stop, and shows continued protein-coding signatures in the second ORF.

We were left with 283 transcripts (Supplementary File DmelReadthrough2010.csv) that we believe are

examples of conserved, functional, translational stop codon readthrough, henceforth referred to as the

'readthrough candidates'. The readthrough regions vary in length from 4 to 788 codons, with a mean of 67

codons. While the longest second ORFs are more likely to be readthrough regions, only 51 (25%) of 202

in-frame second ORFs longer than 100 codons are readthrough candidates (Figure 2B), whereas 232

(1.5%) of 15031 shorter in-frame second ORFs are readthrough candidates, emphasizing the importance

of using comparative evidence and our additional metrics. We have applied similar curation to the

subsequent open reading frame ('third ORF') of each readthrough candidate, and found 16 transcripts

that appear to have double readthrough (Figure 1C). We did not find any examples of triple readthrough.
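
The counts reported in this subsection can be checked with a simple tally. The sketch below only restates numbers given in the text (750 starting transcripts and the successive exclusions); the step labels and the function name are illustrative, and the selenocysteine and editing checks are omitted because they removed no candidates.

```python
# Exclusion steps and counts as reported in the text.
CURATION_STEPS = [
    ("low score, possibly due to chance", 339),
    ("likely 3'-splice site downstream of first stop", 40),
    ("possible second cistron or overlapping gene", 42),
    ("recent nonsense mutation in D. melanogaster", 17),
    ("antisense coding overlap", 5),
    ("other reasons such as incorrect alignment", 24),
]

def tally(start, steps):
    """Return a list of (reason, remaining) after each exclusion step."""
    remaining = start
    history = []
    for reason, excluded in steps:
        remaining -= excluded
        history.append((reason, remaining))
    return history

history = tally(750, CURATION_STEPS)
final_candidates = history[-1][1]
```

The intermediate values reproduce the 411 and 371 candidates quoted above, and the final value matches the 283 readthrough candidates.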

Single-species evidence of protein translation

Several additional lines of genomic evidence, none relying on comparative evidence, indicate that the predicted

readthrough regions are protein-coding.

Nucleotide k-mer composition matches protein-coding regions

First, we found that the nucleotide k-mer composition of readthrough regions matched that of

protein-coding regions. We quantified protein-coding-like compositional properties using the Z curve score

(Gao and Zhang 2004) which provides a single-species test for protein-coding regions using mono-, di-,

and tri-nucleotide frequencies. Although most individual readthrough regions are too short to reliably

call as protein-coding using the Z curve score alone, we found that it provided very strong group

discrimination between coding regions (immediately upstream of stop codons) and non-coding regions

(immediately downstream of stop codons) for non-readthrough genes (Figure 3A, top two panels). For

readthrough candidates, the distribution of scores upstream of the first stop was indistinguishable from

that of coding regions for non-readthrough genes, and the distribution downstream of the second stop

was indistinguishable from that of non-coding regions, as expected. Strikingly, the distribution of scores

for the readthrough portion of readthrough candidates (Figure 3A, bottom panel) matched that of coding


regions. We conclude that single-species sequence composition strongly supports the comparative

evidence that the candidate readthrough regions are protein-coding.

Periodic secondary structure matches protein-coding regions

As a second line of evidence, we found that the mRNA secondary structures of the readthrough regions

exhibit a statistical periodicity property that distinguishes coding from non-coding regions. In coding

regions of humans, the fraction of genes for which the base at a given position is paired in the predicted

least energy mRNA secondary structure varies periodically, with bases in the third codon position more

likely to be paired than bases in the other two positions, and this 3-periodicity does not continue past the

stop codon into the 3'-UTR (Shabalina et al. 2006), a result recently reproduced experimentally in yeast

(Kertesz et al. 2010). We reproduced the result for predicted structure in D. melanogaster and found that

among our readthrough candidates the 3-periodicity continues beyond the annotated stop codon and

abruptly ends after the second stop codon, consistent with the intervening region being protein-coding

(Figure 3B and Supplementary Figure S1).

Mass spectrometry evidence of protein translation

To obtain direct evidence that readthrough regions are translated, we searched a mass spectrometry

database for sequenced Drosophila peptides that match within a readthrough region. We used the

Drosophila PeptideAtlas (Loevenich et al. 2009), which contains 75,753 distinct peptides of length 4 to

55 amino acids that have been analyzed using mass spectrometry and sequenced by matching to

annotated coding regions, and an additional 971 peptides that have been sequenced by comparing to the

whole genome in all six reading frames.

We found 14 distinct peptides that match within one of our readthrough regions (Figure 3C,

Supplementary File DmelReadthrough2010.csv). We used BLAST to verify that these peptides do not

match elsewhere within the D. melanogaster genome. Since most peptides in the database were

sequenced by comparison to annotated coding regions, it is not surprising that there were no matches in

most of our readthrough regions.

These sequenced peptides provide experimental evidence of translation within the readthrough regions

of nine of our readthrough candidates, only two of which, syn and kel, had been previously

experimentally observed. We found strong evidence that downstream translation in these cases is not


due to alternative splicing, based on the lack of any splice variants from RNA-seq data across at least

139 and typically several thousand reads (Supplementary Text S11). We also provide evidence that they

are not due to dicistronic translation, by specifically evaluating the PhyloCSF protein-coding selection

score between the stop codon and the first in-frame ATG, which we find to be positive (in two cases, no

ATG was present, ruling out dicistronic translation altogether).

We used the peptide matches to estimate the readthrough rate for syn, the only gene with more than one

sequenced peptide in its readthrough region. There are 13 observed peptides in syn's 444-codon

readthrough region (six of them distinct), while there were 71 in the 538-codon coding region before the

stop codon. Under the simplifying assumption that the number of peptide matches is proportional to the

peptide length and to the expression level, we can estimate the rate of readthrough for syn to be 22%.

This agrees with the 20-25% readthrough efficiency of this gene determined using antibodies (Klagges

et al. 1996).
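
The arithmetic behind this estimate can be reproduced as follows. This is a minimal sketch of one reading of the simplifying assumption, namely that peptide matches per codon scale with expression, so the readthrough rate is the ratio of peptide densities in the two regions. The function and parameter names are illustrative.

```python
def readthrough_rate(peptides_rt, codons_rt, peptides_cds, codons_cds):
    """Ratio of peptide density in the readthrough region to that in the
    coding region upstream of the first stop codon."""
    density_rt = peptides_rt / codons_rt
    density_cds = peptides_cds / codons_cds
    return density_rt / density_cds

# Values for syn as given in the text: 13 peptides in 444 codons of
# readthrough region versus 71 peptides in 538 codons of coding region.
rate = readthrough_rate(13, 444, 71, 538)
percent = round(rate * 100)
```

With the values from the text the ratio is about 0.22, consistent with the 22% quoted above.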

Protein Domain evidence of protein translation

To obtain further evidence of translation, we searched for homology between the readthrough regions

and known protein domains. As such homology is more likely to be found in protein-coding regions

than non-coding ones, this strategy was used in an earlier attempt to identify readthrough genes

computationally (Sato et al. 2003), and provided further support for our candidates here. Using Search

Pfam (Finn et al. 2010) we found 13 significant matches to Pfam protein domain families within the

readthrough regions, compared to only 5 in non-readthrough controls with similar second ORF length (p

= 0.045). For example, the readthrough region of CG14669 shows a match to a RAS family GTPase

domain in the curated Pfam-A database (e-value = 6.9 × 10^-21), suggesting roles in vesicle formation,

motility and fusion.

Discerning readthrough mechanism

Several lines of evidence indicate that the observed protein-coding translation is the result of stop codon

readthrough rather than alternative mechanisms.

Reading frame offset bias confirms readthrough mechanism

First, comparing the PhyloCSF scores of regions downstream of the stop codon in different frames

provided evidence that the observed phenomenon is not due to splicing, dicistronic translation,


ribosomal bypassing ("hopping") (Wills 2010), or chance, and also provided an independent estimate of

the number of readthrough genes.

We found that many more in-frame second ORFs (662) have positive PhyloCSF score than do the ORFs

starting 1 or 2 bases after the annotated stop codon (219 and 225, respectively) (Figure 4A). These

counts exclude transcripts for which the region overlaps a known splice variant or other annotated

coding exon, since that could account for any protein-coding signature found.

The strong excess for in-frame ORFs (frame 0, as expected for readthrough) argues for over 400

readthrough genes (Figure 4B). On one hand, unannotated splice variants, unannotated second

independent cistrons, overlapping genes, antisense genes, ribosomal hopping, and regions with high

score by chance would all be equally likely in any of the 3 frames (and in fact there is a bias away from

frame 0 for many of these, Supplementary Text S2). On the other hand, our manual curation determined

that nonsense mutations, A-to-I editing, and selenocysteine insertion, which would explain a positive

PhyloCSF score specifically in frame 0, account for less than 4% of the observed excess. Finally,

programmed frame shifting would lead to a positive score in frame 1 or frame 2, but not frame 0. Thus,

only translational stop codon readthrough could explain the excess of high scoring downstream regions

in frame 0.
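
The frame-offset argument can be turned into a rough count. The sketch below assumes that the average of the two off-frame counts approximates the frame-independent background; the exact estimator behind Figure 4B is not given in this section, so this is an illustration rather than the authors' calculation.

```python
def frame0_excess(frame0, frame1, frame2):
    """Excess of positive-scoring ORFs in frame 0 over the background,
    taking the mean of frames 1 and 2 as the background (an assumption)."""
    background = (frame1 + frame2) / 2
    return frame0 - background

# Counts of positive-scoring ORFs reported in the text for frames 0, 1, 2.
excess = frame0_excess(662, 219, 225)
```

Under this assumption the excess is 440, in line with the statement that the data argue for over 400 readthrough genes.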

RNA-seq evidence rules out stop codon editing

We evaluated whether translation beyond the stop codon is a result of Adenosine-to-Inosine RNA

editing (Aphasizhev 2007), which would convert stop codons to UGG tryptophan codons, as we had

originally hypothesized (Lin et al. 2007). We analyzed deep RNA-seq data from the modENCODE

project (Graveley 2010) for A-to-G single-nucleotide differences between the DNA sequence and the

mRNA reads, which provide a median coverage of 1500 reads in the regions evaluated.

Among all annotated stop codons we found only three that are edited to UGG: DopEcR, qvr, and

CG8481. The first results in a previously-reported (Stapleton et al. 2002) two-codon extension with

positive PhyloCSF score, but below our significance threshold. The other two were also not in our

candidate list, as their second ORFs have negative PhyloCSF scores and are unlikely to correspond to

functionally-selected translation (Supplementary Text S6).


Outside these three stop codons, showing 193, 708 and 601 edited reads respectively, and corresponding

to 51%, 41% and 13% of all reads at those loci, we found no evidence of editing. At every other stop

codon there were fewer than 4 A-to-G mismatched reads, or such mismatched reads accounted for fewer

than 4% of all reads at that locus, and we believe these are due to sequencing errors. Other (non-A-to-I)

mismatches affecting stop codons also appear to be due to sequencing or mapping errors

(Supplementary Text S7). We conclude that downstream translation of our readthrough candidates is not

due to RNA editing.
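
The two thresholds mentioned above can be expressed as a simple filter. This sketch assumes that a stop codon is flagged as possibly edited only when it has at least 4 A-to-G mismatched reads and those reads make up at least 4% of coverage, which is the complement of the criteria stated in the text. The function name and the read counts in the examples are hypothetical.

```python
def looks_edited(a_to_g_reads, total_reads, min_reads=4, min_fraction=0.04):
    """Flag a stop codon as a possible A-to-I editing site.

    Returns True only if both the absolute number of A-to-G mismatched
    reads and their fraction of all reads reach the thresholds.
    """
    if total_reads <= 0:
        return False
    fraction = a_to_g_reads / total_reads
    return a_to_g_reads >= min_reads and fraction >= min_fraction

# Hypothetical examples, not data from the paper.
few_reads = looks_edited(3, 20)         # high fraction, too few reads
low_fraction = looks_edited(50, 1500)   # many reads, about 3.3% of coverage
clear_edit = looks_edited(600, 1500)    # many reads, 40% of coverage
```

Requiring both conditions keeps isolated sequencing errors at low-coverage loci and low-frequency errors at high-coverage loci from being called as editing.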

Stop codon context supports translational leakage

Translation termination efficiency is known to be affected by several signals including the stop codon

itself, the 3 nucleotide positions downstream, the final two amino acids in the nascent peptide, and the

tRNA in the ribosomal P-site (Bertram et al. 2001). Of these signals, the strongest determinants consist

of the 4-nucleotide sequence that includes the stop codon and one downstream position, which we will

refer to as the stop codon ‘context’. In both yeast and mammals, the contexts most efficient for

termination are the most common in the genome, while leaky contexts are the rarest, especially

among highly expressed genes (Bonetti et al. 1995; McCaughan et al. 1995). Thus, the frequency of

different contexts may be an indication of their efficiency at translation termination.

We found a striking inverse correlation between the usage of 4-base stop codon contexts among

readthrough candidates and non-readthrough transcripts (Figure 4C). For example, TGA-C, the least

common context among non-readthrough transcripts (3.1%) is by far the most common among

readthrough candidates (32.2%). In fact, genes containing a TGA-C stop context are nearly 10-fold more

likely to be readthrough candidates (16.4% vs. 1.9%), suggesting that perhaps TGA-C is a leaky stop

codon context enabling readthrough. Indeed, we found that non-readthrough genes are more likely to

contain an additional in-frame stop codon after a TGA-C context than after other contexts (8.4% vs.

5.5% in the second in-frame codon after the stop), and this could be a result of selection to limit the

number of extraneous amino acids added after accidental readthrough (Supplementary Text S3).

The relative frequency of various contexts suggests that the stop codon and the subsequent base

contribute independently to the potentially leaky context. For example, the frequency of TGA-C is not

significantly different from the product of independent frequencies of TGA and C. Indeed, TGA is the

most frequent stop codon among readthrough candidates and least frequent among non-readthrough


transcripts when coupled with any subsequent downstream base. Similarly, C is the most frequent

downstream base in readthrough and least frequent in non-readthrough, coupled with any stop codon.

Our results suggest the most leaky stop codons are TGA > TAG > TAA and the most leaky downstream

position is C > T > G > A. These results are consistent with TGA being the leakiest of stop codons in

natural stop suppression in yeast (Firoozan et al. 1991), and stop codons followed by C showing

preferential readthrough in eRF mutants in Drosophila (Chao et al. 2003).
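
The comparison between the frequency of a 4-base context and the product of its stop codon and downstream-base frequencies can be sketched as below. The function names and the toy list of contexts are hypothetical; the real analysis would use the contexts of all readthrough candidates or all non-readthrough transcripts.

```python
from collections import Counter

def context_frequencies(contexts):
    """Relative frequency of each 4-base context (stop codon + next base)."""
    counts = Counter(c.upper() for c in contexts)
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}

def independence_ratio(contexts, stop, base):
    """Observed frequency of stop+base divided by the product of the
    marginal frequencies of the stop codon and the downstream base.
    A value near 1 is what independent contributions would give."""
    ctx = [c.upper() for c in contexts]
    total = len(ctx)
    joint = sum(c == stop + base for c in ctx) / total
    p_stop = sum(c[:3] == stop for c in ctx) / total
    p_base = sum(c[3] == base for c in ctx) / total
    expected = p_stop * p_base
    return joint / expected if expected else float("nan")

# Toy contexts for illustration only, not data from the paper.
toy = ["TGAC", "TGAC", "TGAA", "TAAC", "TAGG", "TAAA", "TGAC", "TAAG"]
freqs = context_frequencies(toy)
ratio = independence_ratio(toy, "TGA", "C")
```

In the toy data TGA-C is over-represented relative to independence (ratio 1.5), whereas the text reports that in the real data the observed TGA-C frequency does not differ significantly from the product of the marginals.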

The unusual stop codon contexts of the readthrough candidates provide further evidence that

downstream translation requires identification of the stop codon, which occurs at the ribosome, and

cannot be due to alternatives such as splicing (Supplementary Figure S3).

Conservation pattern suggests stop codons encode functional amino acids

For non-readthrough transcripts, the three stop codons appear to have little functional difference besides

the predicted readthrough rates, and substitutions between them are frequent. However, the situation is

strikingly different for readthrough candidates. While different readthrough genes utilize all three stop

codons (182 use TGA, 73 TAG and 28 TAA), any given readthrough candidate rarely showed

substitutions between the three stop codons at the readthrough locus (Figure 4D). Overall, 83% of

readthrough stop codons show perfect conservation across the 12 Drosophila species, compared to 13%

for all genes. If we consider only transcripts that contain stop codons in all species, the difference

remains striking, with 97% of readthrough stops perfectly conserved, compared to 39% for all genes.

This suggests that the readthrough stop codons are not functionally equivalent and may encode

functional amino acids.

Indeed, experiments in mammals and yeast have shown that specific amino acids can be inserted when a

stop codon is read through, with glutamine incorporated for UAA or UAG and arginine, cysteine, or

tryptophan for UGA (Feng et al. 1990; Lin et al. 1986; Pure et al. 1985; Weiss and Friedberg 1986;

Chittum et al. 1998) (Supplementary Text S10). Two alternative mechanisms for stop codon translation

are consistent with these observations, both translating stop codons into functional amino acids using a

single codon-anticodon mismatch. The first mechanism would use a G-U pair in the first codon

position, which has been demonstrated as a stop-codon suppression mechanism (Lin et al. 1986), and

would result in UGA translation to Arginine and UAA and UAG translation to Glutamine (Figure 4E).

Alternatively, non-canonical pairing in the third codon position would lead to Cysteine or Tryptophan


translation for UGA and Tyrosine translation for both UAA and UAG. It is also possible that both

mechanisms are at play.
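
The amino acids predicted by the two mechanisms can be derived from the standard genetic code. The sketch below makes two modelling assumptions that are not spelled out in the text: a G-U pair at the first codon position is modelled as the stop codon being read by a tRNA whose cognate codon has C at that position, and a third-position mismatch is modelled as reading any sense codon that shares the first two bases. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def first_position_gu_reading(stop_codon):
    """Amino acid inserted if the first base U is read as if it were C."""
    return GENETIC_CODE["C" + stop_codon[1:]]

def third_position_readings(stop_codon):
    """Amino acids of sense codons sharing the first two bases."""
    readings = {GENETIC_CODE[stop_codon[:2] + base] for base in BASES}
    return sorted(readings - {"*"})

first = {stop: first_position_gu_reading(stop) for stop in ("UGA", "UAA", "UAG")}
third = {stop: third_position_readings(stop) for stop in ("UGA", "UAA", "UAG")}
```

Using one-letter codes, the first mechanism gives R (arginine) for UGA and Q (glutamine) for UAA and UAG, and the second gives C or W (cysteine or tryptophan) for UGA and Y (tyrosine) for UAA and UAG, matching the amino acids listed above.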

The few substitutions that have occurred in readthrough positions convert between UAA and UAG stop

codons, consistent with their translation into identical amino acids, while UGA would in fact lead to a

different translation. However, the specific tRNA used would likely affect the rate of readthrough, providing

a potential explanation for the observed low rate of substitution between UAA and UAG stop codons.

Thus, we conclude that both the amino acid translation and the tRNA-dependent rate of readthrough

likely contribute to the observed conservation patterns of readthrough stop codons.

We ruled out two alternative explanations for the observed conservation. Although there is experimental

evidence that at certain Drosophila loci readthrough is dependent on the choice of stop codon, in all

known cases where UAA or UAG is read through UGA can also be read through (Supplementary Text

S12) so this would not explain the high conservation of TAA and TAG stop codons. It is also unlikely

that conservation of readthrough stop codons is due to mRNA secondary structure because the second or

third base of the stop codon is predicted to be paired for only about half of the readthrough candidates.

Regulation and functional enrichments

Several conserved stem loops could enhance readthrough

To determine if mRNA secondary structures participate in the readthrough mechanism, we looked for

conserved stem loops in the readthrough regions of some of our candidates. We first studied the

conservation of an 80 nucleotide stem loop immediately downstream of the stop codon in hdc that is

known to trigger readthrough, possibly through interference with the ribosome or release factors, even if

the TAA stop codon is changed to a TAG or TGA (Steneberg and Samakovlis 2001). We found that the

energy of this stem loop is preferentially conserved among the 12 Drosophila species by numerous

conservative substitutions (p = 0.05), suggesting functional selection of the secondary structure. At least

two other readthrough candidates, CG3638 and Mctp, have somewhat similar stem loops soon after the

stop codon that are also significantly conserved (p = 0.016 and p = 0.0006, respectively).

However, many of the other readthrough candidates do not have conserved stem loops immediately

following the readthrough stop codon, suggesting stable RNA structures regulate only some readthrough

genes.


Long UTRs and enriched 8-mer suggest expanded readthrough regulation

We observed that readthrough candidates are considerably longer on average than other transcripts

(Figure 5A), suggesting they may be subject to diverse forms of post-transcriptional regulation. The

average overall pre-mRNA length (including both introns and exons) of a readthrough candidate is

19,087 nt, as compared to 5,805 nt for non-readthrough transcripts. Readthrough candidates are significantly larger in

every gene component, including total exon length, coding length, total intron length, 5'-UTR length,

and adjusted 3'-UTR length (after subtracting the readthrough region). They also have more exons, and

their genes have more splice variants than non-readthrough genes. All four components that make up the

primary transcript (5'-UTR, coding sequence, 3'-UTR, and introns) make independent contributions to

the excess length, as controlling for any three of these components still results in greater lengths for the

fourth component among the readthrough candidates (Supplementary Text S4).

The longer 5'- and 3'-UTRs in the readthrough candidates could contain target sites for factors mediating

translational regulation, which often bind in UTRs (Wilkie et al. 2003). Furthermore, the long 3'-UTRs

of the readthrough candidates could be directly involved in the readthrough mechanism by physically

separating the stop codon from the poly(A) tail. Indeed, termination efficiency can be affected by

interaction between eRF3, a component of the termination complex, and the poly(A)-binding protein

(Amrani et al. 2004; Kobayashi et al. 2004), and thus by separating the two, long 3'-UTRs might

decrease termination efficiency.

To find potential binding sites of RNA or protein cofactors that may participate in the readthrough

mechanism, we searched for common sequence motifs in the readthrough regions of the readthrough

candidates (Supplementary Text S8). The most significantly enriched n-mer is the 8-mer CAGCAGCA,

which is found in 64 readthrough regions (22.6%) but only 4.2% of similar-length coding regions

before the stop codon of non-readthrough candidates and only 1.4% of regions after the stop (Figure

S5A). These 8-mers are distributed more or less uniformly through the readthrough regions, but the

enrichment stops at the second stop codon. The readthrough candidates are also enriched for

CAGCAGCA in same-sized regions before the first stop codon, appearing in 43 of them (15.19%), of

which 22 include the 8-mer after the stop as well. The enrichment is concentrated in the last 10% of the

coding portion of these transcripts, around 250 nt. Overall, this 8-mer appears within a few hundred nt

on one side or the other of the stop codon of about 30% of readthrough candidates.
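
The comparison of motif frequencies described above can be sketched as follows. This is a minimal illustration, not the procedure of Supplementary Text S8: the function name `fraction_with_motif` and the toy sequences are hypothetical, and the sketch only counts regions that contain the 8-mer at least once.

```python
MOTIF = "CAGCAGCA"

def fraction_with_motif(regions, motif=MOTIF):
    """Return the fraction of sequences that contain the motif at least once."""
    if not regions:
        return 0.0
    hits = sum(1 for seq in regions if motif in seq.upper())
    return hits / len(regions)

# Toy sequences invented for illustration only (not data from the study).
toy_readthrough = ["ATGCAGCAGCATT", "GGCAGCAGCAGC", "ATATATAT", "CCGGTTAA"]
toy_background = ["ATATATAT", "CCGGTTAA", "GGGCCCAA", "TTCAGCAGCA"]

print(fraction_with_motif(toy_readthrough))
print(fraction_with_motif(toy_background))
```

Applying the same function to readthrough regions and to length-matched control regions gives the two fractions being compared.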


Readthrough candidates, and other long transcripts, also showed higher than average content in their

coding regions for C and G nucleotides. However, readthrough candidates are still enriched for these

nucleotides even after correcting for gene length, and, conversely, readthrough candidates are still

enriched for length even after correcting for nucleotide frequencies. After correcting for nucleotide

frequencies, readthrough candidates are enriched for the di-nucleotide CA and 5 tri-nucleotides, and

depleted for GA and 5 tri-nucleotides (Supplementary Figure S4). Moreover, the 8-mer enrichment for

CAGCAGCA remains highly significant even compared to transcripts with similar frequencies of the

three tri-nucleotides within it.

Since CAGCAGCA often codes for QQQ, we wondered whether its enrichment could be related to

polyQ repeats, and found that readthrough regions are also highly enriched for long polyQ repeats

(Figure S5B). The CAGCAGCA motif is highly enriched in all 3 frames but slightly more so in frame 0,

so its effect is probably at the nucleotide level, with, perhaps, a small additional effect at the protein

level.

Positions likely relevant to readthrough mechanism pinpointed by conservation and base frequencies

In order to obtain clues as to which base positions might play a role in the readthrough mechanism, we

analyzed evolutionary conservation across the 12 Drosophila species at base positions before the stop

codon. The first, second, third, and fifth positions before the stop codon are significantly more

conserved among readthrough candidates than among non-readthrough transcripts (restricting to

transcripts with perfectly conserved stop codons to remove biases arising from the high conservation of

the stop codon itself). A similar analysis of the positions after the stop codon would not be meaningful

because this region is coding in the readthrough candidates but not in the background transcripts.

Some positions near the stop codon also exhibit unusual base frequencies among the readthrough

candidates. Among positions upstream of the stop codon, position -10 shows high G content, -5 low A,

and -4 high A (where the stop codon is at positions -3, -2, and -1). Among the first six downstream

positions, +1 showed high C/low A, +2 high A, +5 low T/high G, and +4 showed the least difference in

base frequencies (where +1 is the position immediately downstream of the stop codon). These

differences are significant even when comparing to non-readthroughs with similar overall base


composition, or when comparing the positions after the stop to typical coding regions. Indeed, base

frequencies in the 6 positions downstream of a stop codon are known to correlate with experimentally

determined levels of readthrough (in yeast), and the 4th position after the stop has the least value in

predicting readthrough level (Williams et al. 2004). Our findings are thus consistent with previous

experimental observations, and provide a global view of the positions and nucleotides likely involved in

readthrough regulation.
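
The position numbering used above (stop codon at -3 to -1, first downstream base at +1, no position 0) can be made concrete with a short sketch. The helper names `position_to_index` and `base_frequencies`, and the two toy contexts, are hypothetical and serve only to illustrate how per-position base frequencies might be tabulated.

```python
from collections import Counter

def position_to_index(stop_index, position):
    """Map a position relative to the stop codon to a 0-based sequence index.

    stop_index is the index of the first base of the stop codon. The stop
    codon occupies positions -3, -2, -1, and +1 is the first base after it.
    """
    if position == 0:
        raise ValueError("there is no position 0 in this numbering")
    if position < 0:
        return stop_index + 3 + position
    return stop_index + 2 + position

def base_frequencies(contexts, position):
    """Base frequencies at one position over (sequence, stop_index) pairs."""
    counts = Counter()
    for seq, stop_index in contexts:
        i = position_to_index(stop_index, position)
        if 0 <= i < len(seq):  # skip transcripts too short to cover the position
            counts[seq[i].upper()] += 1
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}

# Toy contexts invented for illustration: stop codons TGA and TAA at index 6.
toy_contexts = [("GCCAAATGACAG", 6), ("TTTGAATAACTG", 6)]
print(base_frequencies(toy_contexts, 1))   # frequencies at position +1
print(base_frequencies(toy_contexts, -4))  # frequencies at position -4
```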

Readthroughs are enriched for some functional roles

To study potential functional roles for readthrough proteins, we studied the enrichment of our candidates

for gene ontology (GO) functional terms. We found that readthrough candidates are significantly

enriched for 110 GO categories (p < 0.05, Bonferroni corrected), including nervous system

development, as we had previously noted (Lin et al. 2007). However, genes in enriched GO categories

tend to be longer than average (Figure 5B), and among genes of similar length, the only enrichments that

remain are for transcription factor activity, DNA binding, and transcription regulator activity (p = 0.0016,

0.0019, and 0.016, respectively, with Bonferroni correction for the number of GO categories tested). After

also correcting for UTR lengths, no significant GO enrichments remain.

Thus, while these functional enrichments provide potential clues into the neuronal and regulatory

functions of readthrough proteins, their interpretation should be taken with the understanding of the

observed length biases. The longer lengths of enriched GO categories could be an indication that both

neuronal and regulatory genes are overall highly regulated, and readthrough might be only one means of

several regulatory mechanisms. Alternatively, neuronal genes and readthrough genes could both be long

for reasons unrelated to readthrough, and the observed enrichments could be an artifact of this

confounding factor.

Evolution of readthrough across animal species

Readthrough candidates evolved as extensions, not truncations

We next studied the evolutionary history of readthrough loci to understand the origins of readthrough

genes. As the readthrough candidates were defined based on the locations of consecutive in-frame stop

codons in D. melanogaster, we used the conservation patterns of the first and the second stop codon loci

across the 12 Drosophila species spanning 60 million years of evolution, to distinguish between two


ancestry possibilities. (i) Extension scenario: If the first stop codon was deeply conserved but the second

stop codon was seemingly more recently evolved, we reasoned the readthrough transcript could have

evolved as a recent extension of an ancestrally shorter gene into a formerly untranslated region. (ii)

Truncation scenario: Alternatively, if the second stop codon was deeply conserved and the first stop

codon had recently replaced an ancestral sense codon, we reasoned that the ancestral transcript likely

underwent a nonsense mutation at the first stop locus, with the readthrough mechanism partially

rescuing the ancestral, longer version of the protein.

We found that for more than 90% of readthrough candidates, the readthrough state appears to be

ancestral to the divergence of the 12 species, with ancestral stop codons at both stop loci. Of the

remaining cases, several appeared to have evolved through the extension scenario, even though such

transcripts would not usually be found by our method because when the second ORF is non-coding in a

subset of the species we would expect lower PhyloCSF scores. In contrast, no cases were found that

matched the truncation scenario, even though we would expect to find any such transcripts by our

method because the second ORF would be coding in all 12 species. Lastly, for 9 cases which were not

part of our readthrough candidate list as they were attributed to recent nonsense mutations, we could not

unambiguously argue presence of readthrough as they occurred since the more recent D.

melanogaster/D. simulans split and may have retained protein-coding evolutionary signatures solely due

to lack of sufficient divergence to accumulate additional substitutions (Supplementary Text S8).

Thus, it seems likely that most or all of the readthrough candidates evolved by extension rather than

truncation. Extension-by-readthrough provides a compelling mechanism for evolving new domains that

extend existing proteins, as the extension can evolve modularly without affecting the independent

functionality of the shorter protein. In contrast, in a truncation-with-readthrough scenario, mutations that

enhance the truncated protein would also affect the longer protein, providing little opportunity to

optimize the short protein in a modular way, without potentially damaging the previously-adapted longer

protein.

Striking examples of readthrough in humans, nematodes, and other insects

We used PhyloCSF to look for stop codon readthrough in other sets of closely related species whose

genomes have been sequenced and aligned.


Using mammalian alignments, we found four non-selenocysteine readthrough candidates in the human

genome: OPRK1, OPRL1, SACM1L, and BRI3BP (Figure 6A). Although rabbit beta-globin is a known

readthrough gene (Chittum et al. 1998), it was not found by our method as it is not conserved, even in

rodents (Supplementary Figure S6).

Using a 6-way nematode alignment we found five C. elegans readthrough candidates, namely shk-1,

F38E11.5, F38E11.6, K07C11.4, and C18B2.6 (Figure 6B). There is one known case of selenocysteine

insertion in C. elegans (Buettner et al. 1999) and a search for SECIS elements and homology to

selenocysteine genes in other species found no others (Taskov et al. 2005), so we consider it unlikely

that these 5 readthrough candidates are selenocysteine genes. Surprisingly, F38E11.5 and F38E11.6 are

neighboring genes on opposite strands; we have no explanation for this coincidence.

Although several readthrough genes have been identified in Saccharomyces cerevisiae (Namy et al.

2002; Namy et al. 2003), our method did not find any conserved readthrough genes in Saccharomyces

or Candida.

In order to find readthrough candidates in A. gambiae (mosquito), A. mellifera (honey bee), and T.

castaneum (red flour beetle), we instead looked for a preponderance of synonymous substitutions

specifically in these other species within the orthologs of our readthrough regions, using a 15-way

alignment with the 12 Drosophila species. We found that 17 of our candidates appear to be readthrough

in mosquitoes, 7 of which appear to be readthrough in beetles, and 4 of those appear to be readthrough

in bees as well (Figure 6C; Supplementary File DmelReadthrough2010.csv).

Estimation of readthrough abundance across animal species

Beyond these compelling candidates, we sought to estimate the abundance of readthrough across

sequenced animal genomes.

To estimate the overall number of readthrough genes in C. elegans we again compared the PhyloCSF

scores of post-stop regions in 3 frames, but unlike our finding for Drosophila there was no excess in

frame 0, suggesting that despite the five examples we found, stop codon readthrough is not very

abundant in C. elegans (Figure 7A).


To recognize other species with potentially abundant readthrough even in absence of comparative

information, we created a 3-frame comparison test that requires an annotated genome sequence but no

alignment. We first determined the number of second ORFs in each frame with positive Z curve scores,

indicating that they are likely protein coding (excluding very short regions and regions that overlap the

coding portion of another annotated transcript). Since recent nonsense mutations could contribute to the

number of positive scores in frame 0, we subtracted an estimate for the number of such mutations. A

large excess of positive scores for frame 0 in the remainder would suggest abundant readthrough

transcripts, though it would not tell us which ones they are.

We applied this test to four additional insect species, and estimated that A. gambiae (mosquito) and B.

mori (silkworm) have approximately 70 readthrough transcripts, while A. mellifera (bee) and T.

castaneum (beetle) do not have an abundance of readthrough transcripts (Figure 7B). Additional

comparative genomics resources will be needed to resolve individual cases and also investigate an

observed difference between frames 1 and 2 in silkworm.

We also applied the test to 8 other eukaryotic species, including one plant, two fungi, humans, and four

other animals (Supplementary Text S9), but did not find a statistically significant excess in frame 0 in

any of these species, suggesting that they do not have an abundance of readthrough transcripts.

In summary, while we have found several striking examples of conserved readthrough transcripts in

many animals spanning large evolutionary distances, the abundant readthrough that we discovered in

Drosophila exists only in some related insect species including mosquito and silkworm, but does not

appear to extend elsewhere in the tree of life.

By examining the stop codons of the readthrough transcripts in the various species we observed another

distinguishing characteristic of the species with abundant readthrough. All of the readthrough stop

codons we found in humans, nematodes, bees, and beetles are TGA, and 13 out of 16 have context

TGA-C. The readthrough stop codon in the rabbit beta-globin gene, the only non-Drosophila

readthrough previously discovered in animals, is also TGA. In contrast, the readthroughs we found

common to fruit fly and mosquito include all 3 stop codons, suggesting that Diptera are distinguished

among animals not only by the large number of readthrough transcripts but also by readthrough of non-

TGA stop codons. The TAA readthroughs are particularly notable because there are no others known in


any other eukaryotic organism, and a search for conserved readthrough stop codons in prokaryotes using

homology and dN/dS analysis did not find any TAA readthroughs either (Fujita et al. 2007).

Discussion:

By eliminating all other known mechanisms, we have left translational stop codon readthrough as the

only plausible explanation for the evolutionary signatures in 283 D. melanogaster protein-coding genes.

For three of these genes with very long readthrough regions (hdc, kel, and syn), the extended form of the

protein had been experimentally observed before we began our investigations, but the comparative

genomic method enabled us to go much further than these three genes, finding abundant readthrough

and also indicating that the amino acid sequence downstream of the stop codon is not only expressed but

serves a conserved function, even for very short peptides. These genomic methods indicate that such

functional stop codon readthrough is considerably more prevalent among Metazoa than previously

recognized, affecting many genes in mosquitoes and silkworms and several in other insects, nematodes,

and mammals, including humans.

Double readthrough, context variety, and 3'-UTR length variability suggest readthrough regulation is complex

Readthrough in Drosophila is considerably more complex than one would expect under a simple model

in which each ribosome reads through a stop codon according to some probability, up to perhaps 5%,

that depends only on the stop codon context. The stop codons that are affected, the readthrough rate,

and readthrough regulation all suggest regulatory complexity beyond this simple model.

As noted above, termination efficiency is affected by 3'-UTR length as well as context, but even that is

not enough to identify the readthrough stop codons. While the majority of our readthrough candidates

have rare stop codon context and unusually long 3'-UTRs, there are also many that do not. For example,

AlCR2, one of our double readthrough candidates, has both a frequent and presumably less leaky context

(TAA-G) and a short 3'-UTR (227 bp). Hence, readthrough targeting must depend on other unidentified

characteristics as well.

The readthrough rate in Drosophila exceeds the 5% level found in context studies in yeast (Namy et al.

2001). Readthrough rates as high as 20%, 50%, and 90% have been observed in Drosophila for syn, kel,

and elav nonsense alleles, respectively (Klagges et al. 1996; Robinson and Cooley 1997; Samson et al.


1995). The existence of 16 double readthroughs provides further evidence of high readthrough rates, as

independent readthrough of both stop codons at the 5% level would result in translation of the region

after the second stop codon only 0.25% of the time (0.05 × 0.05 = 0.0025).
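
The arithmetic behind this argument is a product of per-stop readthrough rates. The sketch below assumes the two readthrough events are independent; the function name `joint_readthrough_rate` is hypothetical.

```python
def joint_readthrough_rate(rates):
    """Probability that one ribosome reads through every stop codon in turn,
    assuming the readthrough events are independent."""
    product = 1.0
    for rate in rates:
        product *= rate
    return product

# Two stop codons, each read through at the 5% level discussed in the text.
print(joint_readthrough_rate([0.05, 0.05]))
```

Substituting higher per-stop rates, such as those reported for syn or kel, shows how quickly the joint rate rises once readthrough at each stop is efficient.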

Finally, we note that readthrough is known to be subject to tissue-specific regulation (kel),

developmental stage-specific regulation (elav nonsense alleles), and temperature-specific regulation

(Chao et al. 2003), and post-transcriptional modification of a tRNA anticodon offers one possible

mechanism for regulating readthrough of UAG stop codons (Bienz and Kubli 1981).

Readthrough machinery is distinct from selenocysteine insertion

The lack of selenocysteine insertion machinery in D. willistoni shows that the machinery for

readthrough is distinct from that for selenocysteine insertion. Like the other 11 species, D. willistoni

shows the signature of translational readthrough, with conservation of both first and second stop codons

for readthrough candidates, and protein-coding selection in readthrough regions. Moreover, readthrough

in kel has been experimentally observed in D. willistoni, confirming experimentally that the readthrough

mechanism is functional. However, the genes implicated in selenoprotein synthesis are absent in D.

willistoni (Drosophila 12 Genomes Consortium et al. 2007), and thus could not be required for the

observed readthrough.

Furthermore, this shows that the conserved readthrough of our candidates is independent of yet another

Drosophila mechanism that relies on a SECIS element and proteins involved in selenocysteine insertion

to read through TGA stop codons without inserting selenocysteine (Hirosawa-Takamori et al. 2009).

This conclusion is further supported by the lack of SECIS elements in our readthrough candidates and

the fact that many of them have TAA and TAG stop codons.

Readthrough could be a competitor to NMD for rescuing premature stop codons

Conserved readthrough in Drosophila could be one manifestation of a more general stop codon

readthrough mechanism whose primary function is to partially rescue nonsense mutations until a sense

mutation at the same locus restores full function to the gene. Under this scenario, the readthrough

candidates are genes whose stop codon was originally misidentified as premature by this mechanism,

but for which the downstream peptide evolved to offer a fitness advantage.


Such nonsense rescue has been observed at several naturally occurring and artificially introduced

premature stop codons in Drosophila (Samson et al. 1995; Yildiz et al. 2004; Washburn and O'Tousa

1992; Chao et al. 2003). Furthermore, suppressor tRNAs, whose mutated anticodons are cognate to a

stop codon, can rescue nonsense mutations in strains of E. coli, yeast, and C. elegans, but artificially

introduced suppressor tRNAs do not work well in Drosophila, suggesting there could be an alternative

mechanism for rescuing nonsense mutations (Washburn and O'Tousa 1992).

A rescue origin would predict that the conditions that trigger readthrough would be ones that typically

distinguish premature from normal stop codons. Two of the characteristics we have observed in

readthrough candidates, unusual stop codon context and long 3'-UTRs, are in fact shared with premature

stop codons: premature stop codons can have an arbitrary context and would not favor contexts that

maximize termination efficiency, and of course a transcript with a premature stop will usually appear to

have a long 3'-UTR.

The system for identifying which stop codons to read through might share some components with

nonsense-mediated decay (NMD), another pathway that identifies premature stop codons, in order to

degrade their transcripts. How NMD recognizes premature stop codons in Drosophila is not well

understood, but there is evidence that NMD is more likely to be triggered if the 3'-UTR length is greater

than around 742 nt (Hansen et al. 2009). The average 3'-UTR length for readthrough transcripts is 1079

nt compared to 381 nt for non-readthrough transcripts, so perhaps readthrough and NMD share this part of the

recognition mechanism.

This raises the question of how readthrough transcripts avoid NMD, as they must for the downstream

amino acid sequence to be functionally conserved. NMD prevents formation of potentially toxic

truncated proteins but does not restore production of the full length protein, so a cell would benefit from

distinguishing mRNAs with premature stops that do not code for toxic proteins and invoking

readthrough on these rather than NMD, though we do not know how a cell could make this distinction.

Understanding of Diptera readthrough could have medical applications

One approach to treating genetic diseases caused by nonsense mutations has been to develop small

molecules that induce stop codon readthrough (Keeling and Bedwell 2010; Schmitz and Famulok 2007).


Ideally such a drug would trigger readthrough and disable the NMD pathway for a specific nonsense

mutation, while allowing translation to terminate normally at other stop loci.

Readthrough also has methodological applications, such as selecting cells expressing a protein at a

desired level by fusing the protein with a transmembrane anchor separated by a leaky stop codon

(Jostock 2010).

The prevalence of conserved readthrough in Drosophila offers a rich opportunity to enhance our

understanding of the underlying molecular mechanisms, which could lead to advances in these

applications. Furthermore, abundant readthrough in mosquitoes distinguishes that vector from most

other species and offers a possible target for further research.

Methods:

Transcripts

We obtained the D. melanogaster genome and annotations from flybase.org (Tweedie et al. 2009),

version 5.13, release FB2008-10, excluding mitochondrial DNA, trans-spliced transcripts, and

transcripts whose final CDS does not end in a stop codon. We used the Drosophila tree and divergences

from Stark et al. (2007). We used the nematode genome and annotations from http://www.wormbase.org,

release WS190, 16-May-2008. The 15-way dm3 insect alignments, the 29-way hg19 alignments, and the

6-way ce6 nematode MULTIZ alignments (Blanchette et al. 2004) were obtained from the UCSC

genome browser (Kent et al. 2002; Kuhn et al. 2009). Genomes and annotations for other species came

from the UCSC genome browser, except that the following species came from the specified web sites:

B. mori, silkdb.org/silkdb; C. albicans, candidagenome.org; G. max, phytozome.net; T. castaneum 3.0,

beetlebase.org.

Z curve score

The Z curve score was computed as described by Gao and Zhang (2004) and trained using all coding exons

and similar sized intergenic regions as positive and negative examples. We offset the score so that the

minimum average error threshold on the training examples was at 0. For the 3-frames comparison test

for abundant readthrough we computed the linear discriminant using a prior probability for coding of

0.02, to account for the expectation that most second ORFs are non-coding, and excluded second ORFs

less than 10 codons because the Z curve score has high variance for such short regions. To estimate the


number of recent nonsense mutations for the 3-frames comparison test, we used an upper estimate for

the number (n=17) of such mutations in D. melanogaster found during manual curation of the

readthrough candidates list, and then multiplied by the ratio of the number of annotated transcripts in the

species being tested to the number in D. melanogaster. The error bars in the charts show the square root

of the number of transcripts with positive score in that frame, which is a good approximation to the

standard deviation if we assume all transcripts are equally likely to have a positive score.
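
The bookkeeping of the 3-frames comparison test can be sketched as follows. The scaling of the nonsense-mutation estimate (n = 17) and the square-root error bars follow the description above. Using the mean of frames 1 and 2 as the background for frame 0 is an assumption of this sketch, and the function name `three_frame_summary` and the counts are invented for illustration.

```python
import math

def three_frame_summary(positive_counts, n_transcripts, n_transcripts_dmel,
                        nonsense_dmel=17):
    """Summarize the 3-frame comparison for one species.

    positive_counts maps frame (0, 1, 2) to the number of second ORFs with a
    positive Z curve score. Returns (adjusted_frame0, error_bars, excess).
    """
    # Scale the D. melanogaster nonsense-mutation estimate by transcript count.
    expected_nonsense = nonsense_dmel * n_transcripts / n_transcripts_dmel
    adjusted_frame0 = positive_counts[0] - expected_nonsense
    # Error bar per frame: square root of the count, as in the charts.
    error_bars = {frame: math.sqrt(n) for frame, n in positive_counts.items()}
    # Assumption: the mean of frames 1 and 2 serves as the background level.
    background = (positive_counts[1] + positive_counts[2]) / 2
    return adjusted_frame0, error_bars, adjusted_frame0 - background

# Invented counts for a hypothetical species, for illustration only.
adjusted, bars, excess = three_frame_summary({0: 150, 1: 60, 2: 64}, 12000, 15000)
print(round(adjusted, 1), round(excess, 1))
```

An excess that is large relative to the error bars would point to abundant readthrough, without identifying the individual transcripts.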

Using PhyloCSF to find protein-coding second ORFs in D. melanogaster

We computed the PhyloCSF score for the second ORF of each transcript, excluding the final stop codon.

For transcripts with no annotated 3'-UTR, or if the second ORF extended beyond the end of the

annotated 3'-UTR, we defined the second ORF as continuing beyond the end of the annotated transcript

without splicing. Among transcripts with identical second ORFs we considered only one.

To obtain a p-value for a non-coding region of length L codons to have a PhyloCSF score above some

threshold, we approximated the score distribution with a normal distribution of mean $C_1 L$ and standard

deviation $C_2 L^{\mathrm{EX}}$, where the coefficients $C_1 = 11.55$, $C_2 = 10.81$, and $\mathrm{EX} = 0.7184$ were obtained

empirically from second ORFs of all transcripts, excluding very high scores as likely coming from

coding regions. For multiple hypothesis correction we computed the Local FDR (Efron et al. 2001).
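
The normal approximation described above translates into a short upper-tail calculation. In this sketch the function name `noncoding_p_value` is hypothetical, the default coefficients are copied as printed in the text, and they are exposed as arguments so that a different empirical fit can be substituted. The Local FDR step is not shown.

```python
import math

def noncoding_p_value(score, length_codons, c1=11.55, c2=10.81, ex=0.7184):
    """Upper-tail probability that a non-coding region of the given length
    scores above `score`, under a normal approximation with mean c1 * L and
    standard deviation c2 * L**ex."""
    mean = c1 * length_codons
    sd = c2 * length_codons ** ex
    z = (score - mean) / sd
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))

# A score equal to the mean has an upper-tail probability of one half.
print(noncoding_p_value(11.55 * 10, 10))
```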

We excluded any transcript whose second ORF has PhyloCSF score (in decibans) less than 16,

indicating it is less than 40 times as likely to occur in a protein-coding region as in a non-coding one.

We included as protein-coding any of the remaining transcripts for which the p-value is less than 0.001;

each of these second ORFs has a Local FDR less than around 0.33. We individually examined each of

the approximately 100 remaining transcripts whose second ORFs have PhyloCSF score >= 16 and p-

value >= 0.001, and considered it protein-coding or not based on various factors, including the presence

of frame shifting insertions and deletions (indels).
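
The three-way triage described above, and the conversion from decibans to a likelihood ratio, can be sketched as follows. The function names and return labels are hypothetical. The thresholds are those stated in the text, and the manual review step itself is not modeled.

```python
def likelihood_ratio(decibans):
    """Convert a score in decibans (10 * log10 of a ratio) to the ratio."""
    return 10 ** (decibans / 10)

def triage_second_orf(score_decibans, p_value, min_score=16.0, max_p=0.001):
    """Classify a second ORF by PhyloCSF score and p-value."""
    if score_decibans < min_score:
        return "excluded"
    if p_value < max_p:
        return "protein-coding"
    return "manual review"

# 16 decibans corresponds to a likelihood ratio of roughly 40.
print(likelihood_ratio(16))
print(triage_second_orf(25.0, 0.0002))
print(triage_second_orf(18.0, 0.01))
```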

RNA-seq

RNA-seq data was obtained from the October 2009 data freeze on modencode.org (Graveley 2010). We

used development timecourse data from submissions 2245-2259, 2262-2263, 2265-2267, and 2388-

2397, consisting of 76-base reads from poly-A RNA extracted from 30 development stages of D.

melanogaster, mapped using Tophat, and processed using samtools (Li et al. 2009). When looking for


reads overlapping a particular locus, we excluded reads for which the locus was within 7 nt of the end of

the read, to avoid splice-mapping errors. When examining candidates for RNA editing we checked for

erroneous mapping to paralogs using BLASTN from blast.ncbi.nlm.nih.gov.
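
The read-end filter can be illustrated with a small predicate. This sketch makes two assumptions that the text does not spell out: the margin is applied at both ends of the read, and "within 7 nt" is taken to mean that fewer than 7 read bases lie between the locus and that end. The function name `read_supports_locus` and the coordinate convention (0-based, half-open read interval) are hypothetical.

```python
def read_supports_locus(read_start, read_end, locus, margin=7):
    """Return True if the read covers the locus with at least `margin`
    read bases on each side of it.

    The read occupies the half-open interval [read_start, read_end).
    """
    bases_left = locus - read_start
    bases_right = (read_end - 1) - locus
    return bases_left >= margin and bases_right >= margin

# A 76-base read starting at coordinate 100.
print(read_supports_locus(100, 176, 106))  # too close to the read start
print(read_supports_locus(100, 176, 140))  # well inside the read
```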

Splice prediction

We did not consider a transcript to be readthrough if any modENCODE RNA-seq reads or any EST or

cDNA in the UCSC genome browser support a 3'-splice site early in the second ORF. We also used two

splice prediction algorithms to look for additional splice sites. We used the web interface on fruitfly.org

for splice site prediction using neural networks (Reese et al. 1997) and a maximum entropy method

(Yeo and Burge 2004). If a candidate 3'-splice site has a neural network score above around 0.7 or a

maximum entropy score above around 6 we considered the transcript to be readthrough or not based on

a combination of factors including these scores and the PhyloCSF score of the region before the

potential splice site.

SECIS elements

We searched for SECIS elements in the 3'-UTRs of D. melanogaster readthrough candidates using

SECISearch (Kryukov et al. 2003). If the 3'-UTR was not annotated, we searched the 1000 nt

downstream of the stop codon. If the annotated 3'-UTR was less than 1000 nt long, we extended it

without splicing to 1000 nt.

Base pairing

To calculate frequency of mRNA base pairing we used RNAfold (Zuker and Stiegler 1981) from the

Vienna RNA Package version 1.8.2 (Hofacker et al. 1994) with the default parameters to calculate the

minimum free energy secondary structure of the 403-base region of each transcript centered at the stop

codon (extending without splicing if there was no annotated 3'-UTR or it was less than 200 nt long).

Each distinct region was counted only once even if it is contained in multiple transcripts. To decrease

the variance arising from the relatively small number of readthrough candidates, codons before the stop

have been averaged with the previous four codons, and codons after the stop have been averaged with

the next four codons.
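
Two steps of this procedure lend themselves to a short sketch: extracting the 403-base window (200 bases on each side of the 3-base stop codon) and tabulating how often a given position is paired in a set of predicted structures. The sketch assumes structures are supplied as dot-bracket strings, the notation RNAfold prints. It does not call RNAfold or apply the codon averaging, and the function names are hypothetical.

```python
def stop_centered_window(seq, stop_index, flank=200):
    """Return the (2 * flank + 3)-base window centered on the stop codon,
    or None if the sequence does not extend far enough on either side."""
    start = stop_index - flank
    end = stop_index + 3 + flank
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]

def paired_fraction(structures, index):
    """Fraction of dot-bracket structures in which position `index` is paired."""
    if not structures:
        return 0.0
    paired = sum(1 for s in structures if s[index] in "()")
    return paired / len(structures)

# Toy transcript: 200 bases, a TAA stop codon, then 200 more bases.
toy_seq = "A" * 200 + "TAA" + "C" * 200
window = stop_centered_window(toy_seq, 200)
print(len(window), window[200:203])
```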

Length, composition, and enrichment statistics


In computing length and composition statistics for readthrough candidates and non-readthrough

transcripts we eliminated duplicates (transcripts with identical second ORFs), leaving 15211 transcripts.

For computing UTR lengths we included only transcripts that have annotated UTRs.

To adjust some statistic for one or more other statistics (for example, computing average 5'-UTR length

of transcripts with similar length and 3'-UTR length as readthrough candidates) we divided transcripts

into bins and weighted the non-readthrough transcripts so as to make the total weight in each bin equal

to the fraction of readthrough candidates in that bin. Bin size was 300 when adjusting for one statistic.

When adjusting for two statistics we used bins of size 750 for the first statistic and a bin of size 150 for

the second statistic for each bin of the first statistic. For three statistics we used bins of size 750, 150,

and 15, and averaged the results over all permutations to make the result order independent.
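
The single-statistic case of this bin weighting can be sketched as follows. The function name `bin_matched_mean` is hypothetical, as is the choice to skip bins that contain readthrough candidates but no non-readthrough transcripts and to renormalize over the remaining bins. The two- and three-statistic cases with nested bins are not shown.

```python
from collections import Counter, defaultdict

def bin_matched_mean(target_cov, background, bin_size=300):
    """Weighted mean of a statistic over background transcripts, with the
    total weight in each covariate bin equal to the fraction of target
    (readthrough) transcripts falling in that bin.

    target_cov: covariate values of the readthrough candidates.
    background: (covariate, statistic) pairs for non-readthrough transcripts.
    """
    target_bins = Counter(int(c // bin_size) for c in target_cov)
    by_bin = defaultdict(list)
    for cov, stat in background:
        by_bin[int(cov // bin_size)].append(stat)

    weighted_sum = 0.0
    total_weight = 0.0
    for b, n_target in target_bins.items():
        stats = by_bin.get(b)
        if not stats:
            continue  # assumption: bins with no background transcripts are skipped
        weight = n_target / len(target_cov)
        weighted_sum += weight * sum(stats) / len(stats)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else float("nan")

# Invented values: covariate could be 3'-UTR length, statistic 5'-UTR length.
toy_targets = [100, 200, 400]
toy_background = [(50, 10), (250, 20), (350, 40), (1000, 500)]
print(bin_matched_mean(toy_targets, toy_background))
```

In this toy case the background transcript with covariate 1000 carries no weight because no target falls in its bin.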

GO classifications were obtained from http://www.sb.cs.cmu.edu/stem/ (Ernst and Bar-Joseph 2006).

Stem loop conservation

We tested for conservation of stem loops as follows. For every possible substitution within the stem

loop, we calculated how much that hypothetical substitution would perturb the energy of the most stable

secondary structure. We considered a stem loop conserved if the substitutions that actually occurred

within the stem loop in a parsimonious reconstruction of the 12 genomes perturbed the energy

significantly less than an equal number of random substitutions. For the stem loop investigation, RNA

folding was done using Quikfold on the DINAMelt Server, http://dinamelt.bioinfo.rpi.edu/quikfold.php

(Markham and Zuker 2005).
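
One way to turn this comparison into a p-value is a Monte Carlo test, sketched below. This is an illustration under stated assumptions rather than the exact procedure used: the energy perturbations are supplied as precomputed numbers (the folding itself is not performed here), perturbations of several substitutions are summed, and random substitutions are drawn without replacement. The function name `conservation_p_value` and the toy values are hypothetical.

```python
import random

def conservation_p_value(observed, all_perturbations, n_samples=10000, seed=0):
    """Estimate how often an equal number of random substitutions perturbs
    the stem loop energy no more than the observed substitutions did.

    observed: perturbations of the substitutions that actually occurred.
    all_perturbations: perturbation of every possible single substitution.
    """
    rng = random.Random(seed)
    k = len(observed)
    observed_total = sum(observed)
    hits = 0
    for _ in range(n_samples):
        draw = rng.sample(all_perturbations, k)
        if sum(draw) <= observed_total:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_samples + 1)

# Invented perturbations: 10 mild substitutions and 90 disruptive ones.
toy_all = [0.1] * 10 + [5.0] * 90
print(conservation_p_value([0.1, 0.1, 0.1], toy_all))
```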

Homology to known protein domains

We searched for homology to known protein domain families using Search Pfam from

http://pfam.sanger.ac.uk/search?tab=searchProteinBlock#tabview=tab1 (Finn et al. 2010). We excluded

regions less than 10 codons long, and regions that overlapped a known CDS, both for readthroughs and

controls. We also excluded control regions with positive PhyloCSF, since these could be actual

readthroughs excluded from our readthrough list by overly conservative curation. We considered a

match to be significant if e-value < 0.05 / (number of regions tested).
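The significance cutoff amounts to a Bonferroni-style correction. A minimal sketch, with a hypothetical helper name and made-up e-values:

```python
def significant_matches(e_values, alpha=0.05):
    """Indices of regions whose e-value is below alpha / (number of regions tested)."""
    cutoff = alpha / len(e_values)
    return [i for i, e in enumerate(e_values) if e < cutoff]

# Five regions tested, so the cutoff is 0.05 / 5 = 0.01.
e_values = [1e-6, 0.004, 0.02, 0.3, 1e-3]
hits = significant_matches(e_values)
print(hits)
```

Dividing by the number of regions tested keeps the expected number of chance matches across the whole set below 0.05.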



Acknowledgements

We would like to thank William M. Gelbart for suggesting this line of investigation during the 12-flies

project, the members of the modENCODE project for early release of data, Erich Brunner for help

analyzing mass spectrometry data, Jade Vinson for use of his splice prediction code, and Matt

Rasmussen, Loyal Goff, Jason Ernst, Rogerio Candeias, Alex Lancaster, Pouya Kheradpour, Michal

Rabani, Stefan Washietl, Grigoriy Kryukov, Jessica Wu, Rintaro Saito, and Peter Everett for helpful

discussions and suggestions. This work was supported by the National Institutes of Health (U54

HG00455-01), NSF CAREER 0644282, and a Sloan Foundation award.



Figure Legends

Figure 1. Evolutionary signatures of typical and readthrough stop codons. Alignments of 12 Drosophila

species and inferred parsimonious common ancestor surrounding the annotated stop codons of three genes. The

color coding of substitutions and insertions/deletions (indels) relative to common ancestor is a simplification for

visualization purposes, as the actual PhyloCSF score sums over all possible ancestral sequences and weighs every

codon substitution by its probability. Insertions in other species relative to D. melanogaster are not shown. (A)

Alignment of a typical gene (bw) shows abundant synonymous substitutions upstream of the stop codon, and

many non-conservative substitutions, frame-shifting indels, and in-frame stop codons downstream of the stop

codon. The stop codon position shows several substitutions between different stop codons. (B) Alignment of a

readthrough candidate gene, CG17319 (one of 283 cases). The region between the annotated stop codon and the

next in-frame stop codon shows mostly synonymous substitutions, and lacks frame-shifting indels, while the

region downstream of the second stop shows substitutions and indels typical of non-coding regions downstream

of typical stop codons. The first stop codon position is perfectly conserved, while the second stop codon position

shows substitutions between different stop codons. (C) Alignment of a double readthrough candidate, Glu-RIB

(one of 16 cases). Both the second ORF and the third ORF show protein-coding signatures, indicating that both

stop codons are likely readthrough. Both readthrough stop codon positions show no substitutions.
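The "second ORF" referred to throughout is the region between the annotated stop codon and the next in-frame stop codon. A minimal sketch of how such a region could be located in a single transcript sequence (the function name and toy sequence are hypothetical; the actual analysis works on multi-species alignments):

```python
STOPS = {"TAA", "TAG", "TGA"}

def second_orf(seq, stop_start):
    """Given a transcript sequence and the 0-based index of the first base of
    the annotated stop codon, return (start, end) of the region between that
    stop and the next in-frame stop codon (end exclusive), or None if no
    in-frame stop follows."""
    if seq[stop_start:stop_start + 3] not in STOPS:
        raise ValueError("no stop codon at stop_start")
    start = stop_start + 3
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return start, i
    return None

# Toy transcript: ATG GCC | TGA | CAA GGT | TAA | GG
toy = "ATGGCC" + "TGA" + "CAAGGT" + "TAA" + "GG"
region = second_orf(toy, 6)
print(region, toy[region[0]:region[1]])
```

Calling the function again with the index of the second stop would give the third ORF of a double readthrough candidate.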

Figure 2. Manual curation distinguishes 283 readthrough candidates. (A) Steps of filtering method used to

eliminate transcripts with other plausible explanations for the observed second-ORF protein-coding selection,

leading to the final list of 283 unambiguous readthrough candidates. (B) Fraction of in-frame second ORFs of

given length (x axis) that are readthrough candidates (red), that have positive PhyloCSF scores with alternative

plausible explanations (blue), or that have negative PhyloCSF score (green). Counts for each category indicated

for each interval. Even among second ORFs greater than 100 amino-acids, only a quarter are readthrough

candidates, highlighting the need for comparative evidence to distinguish them. Overall, 2% of second ORFs

tested are readthrough candidates.

Figure 3. Non-comparative evidence of protein translation. (A) Most readthrough regions show codon

composition typical of protein-coding regions, measured as positive Z curve scores (x-axis). Top panel: Coding

regions before the first stop show positive scores for both readthrough (squares) and non-readthrough (crosses)

typical of protein-coding regions. Middle panel: Non-coding regions after the second stop for readthrough

candidates (squares) and after the first stop for typical transcripts (crosses) show negative scores typical of

non-coding regions. Bottom panel: Score distribution for readthrough regions matches that of protein-coding regions,

providing single-species evidence that readthrough regions are protein-coding. Evaluated regions before the first

stop, after the second stop for readthrough, and after the first stop for non-readthrough selected to match length

distribution of readthrough regions. (B) Periodic base-pairing frequency in readthrough regions (red) matches that



of known coding regions (blue) but is different from that of UTRs (green). Fraction of transcripts for which a

given nucleotide is paired in predicted RNA secondary structures (y-axis) at each position relative to a stop codon

(x-axis). Third codon positions (purple) are paired more frequently than first or second positions, and stop codons

(positions -3, -2, and -1) show decreased pairing, as previously observed computationally in humans and

experimentally in yeast (top panel). Transition from periodic to non-periodic pairing happens at second stop

codon for readthrough candidates (bottom panel). Signal is averaged over four codon positions (see Methods),

with raw data shown in Supplementary Figure S1. Error bars show Standard Error of Mean (SEM). (C) Example

of readthrough region (gish) supported by a 22 amino acid peptide match (red rectangle) to mass spectrometry

Drosophila PeptideAtlas (one of nine cases). With no ATG codon between the stop codon and the peptide, and no

observed alternative splicing events despite thousands of RNA-Seq reads overlapping this region, readthrough

seems the only plausible explanation for translation of this peptide.

Figure 4. Evidence of read-through mechanism. (A,B) Excess of high-scoring regions in-frame (frame 0)

compared to out of frame (frame 1, frame 2) suggests readthrough as the likely mechanism and provides an

estimate of readthrough count. (A) PhyloCSF score per-codon (x-axis) of the regions starting 0, 1, or 2 bases after

the stop codon (red, green, purple, respectively) and continuing until the next stop codon in that frame. All

annotated protein-coding transcripts of D. melanogaster are included except for duplicate regions, regions

overlapping the coding portion of another annotated transcript, and regions with no alignment. Frame 0 shows an

excess of over 400 predicted protein-coding transcripts compared to the other reading frames suggesting

widespread readthrough. (B) Possible mechanisms associated with protein-coding function downstream of D.

melanogaster stop codons (rows) and associated reading frame offsets where corresponding protein-coding

function is expected (columns). Unannotated alternative splice variants, unannotated second cistrons, and random

fluctuations would lead to an even distribution among the three frames, while frame-shift events and overlapping

coding regions would bias away from frame 0. A bias for in-frame protein-coding selection is expected only for

stop codon readthrough, recent nonsense mutations, A-to-I editing, and selenocysteine, the latter three together

accounting for only 17 cases. This leaves readthrough as the only plausible explanation for an excess of

approximately 420 frame 0 regions with positive PhyloCSF scores. (C) Usage of stop codon context (stop codon

and subsequent base) provides additional evidence of readthrough mechanism. The 4-base contexts are sorted in

order of decreasing frequency among the 14,928 non-readthrough stop codons (red), with less frequent stop

codons (top, e.g. TGA-C) experimentally associated with translational leakage in other species and most frequent

associated with efficient termination (bottom, e.g. TAA-A). Context frequencies for readthrough candidates (blue)

are opposite of non-readthrough genes, suggesting a preference for leaky context, with one third using TGA-C

and almost none using TAA-A. (D-E) Stop codon conservation patterns in readthrough candidates suggest

functional amino-acid translation by tRNA wobble pairing. (D) Increased stop codon conservation in readthrough

candidates (blue) suggests the three stop codons are not functionally equivalent. Only about 1/3 of all annotated



D. melanogaster stop codons have aligned stops in all 12 species, and only about 1/3 of those are perfectly

conserved, i.e., have the same stop codon in all 12 species. In contrast, 83% of candidate readthrough stop codons

have an aligned stop in all 12 species, and 97% of those are perfectly conserved. While all three stop codons are

involved in readthrough of different genes, individual genes rarely switch between different amino acids, with no

substitutions observed between TGA and other stops, and only 7 substitutions between TAA and TAG (not

shown). (E) Proposed amino acids inserted at stop codon position and likely wobble mechanism of incorporation.

For each stop codon are listed near-cognate tRNA anti-codons and the type of wobble resulting in incorporation.

GU pairing in the 1st codon position has previously been proposed in yeast and AA, AC, and GA pairing have all

been experimentally observed in the 3rd codon position. In both scenarios, UAG and UAA are synonymous with

each other but not with UGA, consistent with observed substitution frequencies at readthrough stop codon

positions.
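The stop codon context comparison in panel C rests on tallying 4-base contexts (stop codon plus the following base) separately for readthrough candidates and other transcripts. A minimal sketch of such a tally, using a hypothetical function name and toy sequences rather than the actual annotation:

```python
from collections import Counter

def context_frequencies(transcripts):
    """transcripts: iterable of (sequence, stop_start) pairs, where stop_start
    is the 0-based index of the first base of the stop codon. Returns a Counter
    of 4-base contexts written as stop codon, hyphen, following base."""
    counts = Counter()
    for seq, stop_start in transcripts:
        context = seq[stop_start:stop_start + 4]
        if len(context) == 4:  # skip transcripts that end at the stop codon
            counts[context[:3] + "-" + context[3]] += 1
    return counts

toy_transcripts = [("ATGTGACAA", 3), ("ATGTAAAGG", 3), ("ATGTGACTT", 3)]
counts = context_frequencies(toy_transcripts)
# Contexts in order of decreasing frequency, as in the figure.
print(counts.most_common())
```

Running the tally on the two groups and comparing the resulting frequencies gives the kind of contrast described in the legend.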

Figure 5. Functional and length enrichments. (A) Readthrough candidates as a group (blue) show increased

length (in nucleotides) and splicing complexity (number of transcript variants) compared to other transcripts (red).

Overall Length: sum of all introns, coding exons, and UTRs. Coding Length: length of first ORF. Adjusted 3'-

UTR Length excludes second ORF for readthrough candidates. Splice Variants: number of annotated transcripts

of the same gene. Numbers show actual size and bars denote size relative to non-readthrough genes. Error bars

show standard error of mean (SEM) for readthrough candidates. Enrichments persist even after correcting for total

intron length (green), 5'-UTR length, 3'-UTR length, or coding exon length (not shown). The greater lengths may

be due to increased need for regulatory signals in each part of the transcript, or may be directly involved in

readthrough regulation, for example due to increased spacing between the stop codon and poly(A) signals that

have been shown to interact with ribosomal release factors. (B) Length distribution biases may confound GO

enrichment analysis. Distribution of total gene length (log scale) for readthrough candidates (red), non-

readthrough transcripts (blue), and genes with the GO category "neurogenesis" (green), one of the most enriched

among readthrough candidates. It is possible that neurogenesis genes show increased regulation of many types,

only one of which is readthrough. Alternatively, readthrough genes and neurogenesis genes could both have

greater lengths for independent reasons, leading to artifactual functional enrichments.

Figure 6. Examples of readthrough in other species. (A) Alignment across 29 mammals for predicted

readthrough in human gene SACM1L, one of four mammalian candidates. (B) Alignment across 5 worm species

for predicted readthrough in C. elegans gene C18B2.6, one of five nematode candidates. Stop codon context in all

five is TGA-C, and is perfectly conserved among Caenorhabditis species. P. pacificus is not aligned in any of the

five readthrough regions. (C) Alignment across 12 Drosophila and 3 other insect species, A. gambiae (mosquito),

A. mellifera (honey bee), and T. castaneum (red flour beetle), for the predicted readthrough region of the D.

melanogaster slo gene, one of 17 readthrough candidates conserved in mosquitoes, and one of 4 conserved across



all 15 aligned insects. Although PhyloCSF cannot tell us whether the region is protein-coding in a particular

subset of species, the large number of synonymous substitutions specifically in the other 3 insects, lack of non-

synonymous substitutions and frame-shifting indels, and perfectly conserved 'leaky' TGA-C stop codon context

provide strong evidence that readthrough also occurs in these other insects.

Figure 7. Estimated extent of readthrough in other species. (A) Estimated extent of readthrough in C. elegans

using comparative evidence. Frequency (y-axis) of PhyloCSF scores per codon (x-axis) for downstream ORFs in

three frames, analogous to Figure 4A. No significant excess is found in frame 0, suggesting lack of abundant

readthrough in C. elegans. Scores of five readthrough candidates found by manual curation are indicated with

stars. (B) Estimated extent of readthrough in 5 insect species and humans using single-species sequence-

composition evidence quantified by Z curve scores. Only positive scores are shown for downstream ORFs in three

frames to detect excess in frame 0 associated with abundant readthrough (RT). For each species, line graphs show

the distribution of positive Z curve scores, and bar charts show the total number of positive scores. Lower bound

(dotted boxes) for the excess in frame 0 is estimated as the difference between the lower error bar for frame 0 and

the higher of the upper error bars for frames 1 and 2. Number of readthrough genes (solid boxes) is estimated as

the difference between the center of the error bars for frame 0 and the center of the error bars for frames 1 and 2

(subtracting an estimate for the number of recent nonsense mutations). This matches our estimates for D.

melanogaster, and predicts more than 70 readthrough genes in mosquito and silkworm, but no abundant

readthrough in bees, beetles, and humans.
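The two estimates described in panel B can be written out as a short calculation. This sketch is one reading of the legend, not the authors' code: it assumes each frame's count is summarized as a center and a symmetric half-width, and that "the center of the error bars for frames 1 and 2" means the mean of the two centers. The numbers are invented for illustration.

```python
def frame0_excess(f0, f1, f2, recent_nonsense=0):
    """Each of f0, f1, f2 is a (center, half_width) pair for the number of
    positive-scoring downstream ORFs in that frame.
    Returns (lower_bound, estimate) for the excess in frame 0."""
    (c0, h0), (c1, h1), (c2, h2) = f0, f1, f2
    # Lower bound: bottom of the frame 0 bar minus the higher top of frames 1 and 2.
    lower_bound = (c0 - h0) - max(c1 + h1, c2 + h2)
    # Estimate: frame 0 center minus the mean out-of-frame center (an assumption),
    # minus the estimated number of recent nonsense mutations.
    estimate = c0 - (c1 + c2) / 2 - recent_nonsense
    return lower_bound, estimate

lower_bound, estimate = frame0_excess((500, 20), (300, 15), (310, 18), recent_nonsense=10)
print(lower_bound, estimate)
```

A lower bound at or below zero would correspond to the "no abundant readthrough" outcome reported for bees, beetles, and humans.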


References

Amrani N, Ganesan R, Kervestin S, Mangus DA, Ghosh S, Jacobson A. 2004. A faux 3'-UTR promotes aberrant termination and triggers nonsense-mediated mRNA decay. Nature 432: 112-8.

Aphasizhev R. 2007. RNA Editing. Mol Biol 41: 227-239.

Beier H, Grimm M. 2001. Misreading of termination codons in eukaryotes by natural nonsense suppressor tRNAs. Nucleic Acids Res 29: 4767-82.

Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. 2010. Recode-2: new design, new search tools, and many more genes. Nucleic Acids Res 38: D69-74.

Bergstrom DE, Merli CA, Cygan JA, Shelby R, Blackman RK. 1995. Regulatory autonomy and molecular characterization of the Drosophila out at first gene. Genetics 139: 1331-46.

Bertram G, Innes S, Minella O, Richardson J, Stansfield I. 2001. Endless possibilities: translation termination and stop codon recognition. Microbiol 147: 255-69.

Bienz M, Kubli E. 1981. Wild-type tRNA-Tyr-G reads the TMV RNA stop codon but Q base-modified tRNA-Tyr-Q does not. Nature 294: 188-190.

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708-15.

Bonetti B, Fu L, Moon J, Bedwell DM. 1995. The efficiency of translation termination is determined by a synergistic interplay between upstream and downstream sequences in Saccharomyces cerevisiae. J Mol Biol 251: 334-45.

Buettner C, Harney JW, Berry MJ. 1999. The Caenorhabditis elegans homologue of thioredoxin reductase contains a selenocysteine insertion sequence (SECIS) element that differs from mammalian SECIS elements but directs selenocysteine incorporation. J Biol Chem 274: 21598-602.

Chan PP, Lowe TM. 2009. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37: D93-7.

Chao A, Dierick H, Addy T, Bejsovec A. 2003. Mutations in Eukaryotic Release Factors 1 and 3 Act as General Nonsense Suppressors in Drosophila. Genetics 165: 601.

Chittum H, Lane W, Carlson B, Roller P, Lung F, Lee B, Hatfield D. 1998. Rabbit Beta-Globin Is Extended Beyond Its UGA Stop Codon by Multiple Suppressions and Translational Reading Gaps. Biochem 37: 10866-10870.

Doronina VA, Brown JD. 2006. Non-canonical decoding events at stop codons in eukaryotes. Molekuliarnaia biologiia 40: 731-41.

Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, et al. 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450: 203-18.

Efron B, Tibshirani R, Storey J, Tusher V. 2001. Empirical Bayes Analysis of a Microarray Experiment. J Am Stat Assoc 96: 1151-1160.

Ernst J, Bar-Joseph Z. 2006. STEM: a tool for the analysis of short time series gene expression data. BMC bioinformatics 7: 191.

Feng YX, Copeland TD, Oroszlan S, Rein A, Levin JG. 1990. Identification of amino acids inserted during suppression of UAA and UGA termination codons at the gag-pol junction of Moloney murine leukemia virus. Proc Natl Acad Sci U S A 87: 8860-3.

Finn R, Mistry J, Tate J, Coggill P, Heger A, Pollington J, Gavin O, Gunasekaran P, Ceric G, Forslund K, et al. 2010. The Pfam protein families database. Nucleic Acids Res 38: D211.


Firoozan M, Grant C, Duarte J, Tuite M. 1991. Quantitation of Readthrough of Termination Codons in Yeast using a Novel Gene Fusion Assay. Yeast 7: 173-183.

Fujita M, Mihara H, Goto S, Esaki N, Kanehisa M. 2007. Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach. BMC Bioinformatics 8: 225.

Gao F, Zhang CT. 2004. Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 20: 673-81.

Graveley B. 2010. DOI 10.1038/nature09715. Nature.

Hansen KD, Lareau L, Blanchette M, Green R, Meng Q, Rehwinkel J, Gallusser F, Izaurralde E, Rio D, Dudoit S, et al. 2009. Genome-wide identification of alternative splice forms down-regulated by nonsense-mediated mRNA decay in Drosophila. PLoS Genet 5: e1000525.

Hirosawa-Takamori M, Ossipov D, Novoselov S, Turanov A, Zhang Y, Gladyshev V, Krol A, Vorbruggen G, Jackle H. 2009. A novel stem loop control element-dependent UGA read-through system without translational selenocysteine incorporation in Drosophila. The FASEB Journal 23: 107-13.

Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P. 1994. Fast Folding and Comparison of RNA Secondary Structures. Monatshefte fur Chemie.

Jostock T. 2010. CELL SURFACE DISPLAY OF POLYPEPTIDE ISOFORMS BY STOP CODON READTHROUGH. U.S. Patent WO/2010/022961.

Keeling KM, Bedwell DM. 2010. Recoding Therapies for Genetic Diseases. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 123-146. Springer, New York, NY.

Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D. 2002. The human genome browser at UCSC. Genome Res 12: 996-1006.

Kertesz M, Wan Y, Mazor E, Rinn J, Nutter R, Chang H, Segal E. 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature 467: 103-7.

Klagges BR, Heimbeck G, Godenschwege TA, Hofbauer A, Pflugfelder GO, Reifegerste R, Reisch D, Schaupp M, Buchner S, Buchner E. 1996. Invertebrate synapsins: a single gene codes for several isoforms in Drosophila. J neurosci 16: 3154-65.

Kobayashi T, Funakoshi Y, Hoshino SI, Katada T. 2004. The GTP-binding release factor eRF3 as a key mediator coupling translation termination to mRNA decay. J Biol Chem 279: 45693-700.

Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, Guigó R, Gladyshev VN. 2003. Characterization of mammalian selenoproteomes. Science 300: 1439-43.

Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. 2009. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 37: D755-61.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-9.

Lin JP, Aker M, Sitney KC, Mortimer RK. 1986. First position wobble in codon-anticodon pairing: amber suppression by a yeast glutamine tRNA. Gene 49: 383-8.

Lin MF, Carlson JW, Crosby MA, Matthews BB, Yu C, Park S, Wan KH, Schroeder AJ, Gramates LS, St Pierre SE, et al. 2007. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Res 17: 1823-36.

Lin MF, Deoras AN, Rasmussen MD, Kellis M. 2008. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 4: e1000067.

Lin M, Jungreis I, Kellis M. 2010. PhyloCSF: a comparative genomics method for distinguishing protein-coding and non-coding regions. Nature Precedings. doi: 10101/npre.2010.4784.1.


Loevenich SN, Brunner E, King NL, Deutsch EW, Stein SE, FlyBase Consortium, Aebersold R, Hafen E. 2009. The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation. BMC bioinformatics 10: 59.

Markham NR, Zuker M. 2005. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res 33: W577-81.

Martin-Romero FJ, Kryukov GV, Lobanov AV, Carlson BA, Lee BJ, Gladyshev VN, Hatfield DL. 2001. Selenium metabolism in Drosophila: selenoproteins, selenoprotein mRNA expression, fertility, and mortality. J Biol Chem 276: 29798-804.

McCaughan KK, Brown CM, Dalphin ME, Berry MJ, Tate WP. 1995. Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc Natl Acad Sci U S A 92: 5431-5.

Namy O, Duchateau-Nguyen G, Hatin I, Denmat S, Termier M, Rousset J. 2003. Identification of stop codon readthrough genes in Saccharomyces cerevisiae. Nucleic Acids Res 31: 2289-2296.

Namy O, Duchateau-Nguyen G, Rousset JP. 2002. Translational readthrough of the PDE2 stop codon modulates cAMP levels in Saccharomyces cerevisiae. Mol Microbiol 43: 641-52.

Namy O, Hatin I, Rousset JP. 2001. Impact of the six nucleotides downstream of the stop codon on translation termination. EMBO Rep 2: 787-93.

Namy O, Rousset JP. 2010. Specification of Standard Amino Acids by Stop Codons. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 79-100. Springer, New York, NY.

Poole ES, Major LL, Mannering SA, Tate WP. 1998. Translational termination in Escherichia coli: three bases following the stop codon crosslink to release factor 2 and affect the decoding efficiency of UGA-containing signals. Nucleic Acids Res 26: 954-60.

Pure GA, Robinson GW, Naumovski L, Friedberg EC. 1985. Partial suppression of an ochre mutation in Saccharomyces cerevisiae by multicopy plasmids containing a normal yeast tRNAGln gene. J Mol Biol 183: 31-42.

Reese MG, Eeckman FH, Kulp D, Haussler D. 1997. Improved splice site detection in Genie. J Comput Biol 4: 311-23.

Robinson DN, Cooley L. 1997. Examination of the function of two kelch proteins generated by stop codon suppression. Development 124: 1405-17.

Samson ML, Lisbin MJ, White K. 1995. Two distinct temperature-sensitive alleles at the elav locus of Drosophila are suppressed nonsense mutations of the same tryptophan codon. Genetics 141: 1101-11.

Samuels M, Schedl P, Cline T. 1991. The complex set of late transcripts from the Drosophila sex determination gene sex-lethal encodes multiple related polypeptides. Mol Cell Biol 11: 3584-602.

Sato M, Umeki H, Saito R, Kanai A, Tomita M. 2003. Computational analysis of stop codon readthrough in D.melanogaster. Bioinformatics 19: 1371-1380.

Schmitz A, Famulok M. 2007. Ignore the nonsense. Nature 447: 42-3.

Shabalina SA, Ogurtsov AY, Spiridonov NA. 2006. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res 34: 2428-37.

Stapleton M, Carlson J, Brokstein P, Yu C, Champe M, George R, Guarin H, Kronmiller B, Pacleb J, Park S, et al. 2002. A Drosophila full-length cDNA resource. Genome biol 3: RESEARCH0080.

Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras AN, et al. 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450: 219-232.

Steneberg P, Samakovlis C. 2001. A novel stop codon readthrough mechanism produces functional Headcase protein in Drosophila trachea. EMBO Rep 2: 593-597.


Taskov K, Chapple C, Kryukov GV, Castellano S, Lobanov AV, Korotkov KV, Guigó R, Gladyshev VN. 2005. Nematode selenoproteome: the use of the selenocysteine insertion system to decode one codon in an animal genome? Nucleic Acids Res 33: 2227-38.

True H, Lindquist S. 2000. A yeast prion provides a mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.

Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, et al. 2009. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 37: D555-9.

von der Haar T, Tuite MF. 2007. Regulated translational bypass of stop codons in yeast. Trends Microbiol 15: 78-86.

Washburn T, O'Tousa JE. 1992. Nonsense suppression of the major rhodopsin gene of Drosophila. Genetics 130: 585-595.

Weiss WA, Friedberg EC. 1986. Normal yeast tRNA(CAGGln) can suppress amber codons and is encoded by an essential gene. J Mol Biol 192: 725-35.

Wilkie GS, Dickson KS, Gray NK. 2003. Regulation of mRNA translation by 5'- and 3'-UTR-binding factors. Trends Biochem Sci 28: 182-188.

Williams I, Richardson J, Starkey A, Stansfield I. 2004. Genome-wide prediction of stop codon readthrough during translation in the yeast Saccharomyces cerevisiae. Nucleic Acids Res 32: 6605-6616.

Wills NM. 2010. Translational Bypassing -- Peptidyl-tRNA Re-pairing at Non-overlapping Sites. In Recoding: Expansion of Decoding Rules Enriches Gene Expression (ed. JF Atkins and RF Gesteland), pp. 365-381. Springer, New York, NY.

Yeo G, Burge C. 2004. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11: 377-94.

Yildiz O, Kearney H, Kramer BC, Sekelsky JJ. 2004. Mutational Analysis of the Drosophila DNA Repair and Recombination Gene mei-9. Genetics 167: 263.

Zuker M, Stiegler P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9: 133-48.

Figure 1: Evolutionary signatures of typical and readthrough stop codons. Panels A, B, C; label: Annotated Stop Codon.

(A) Filtering steps. Each line gives the evidence used ==> the number and category of second ORFs excluded.

750 Second ORFs with positive score
    PhyloCSF, p-value, Local FDR ==> 339 Chance
    RNA-seq and splice predictors ==> 40 Alternative Splice
    Score to first ATG, prior annotations ==> 42 Dicistronic or overlapping gene
    Stop only in D mel or closely related ==> 17 Recent nonsense
    Score on opposite strand ==> 5 Antisense
    RNA-seq ==> 0 A-to-I editing (3 found, all low score)
    Search for SECIS ==> 0 Selenocysteine
    Visual check of alignment ==> 24 Alignment errors, etc.
283 Readthrough Candidates

Figure 2: Manual curation distinguishes 283 readthrough candidates
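As a quick consistency check, the exclusion counts listed for the filtering steps can be subtracted from the 750 starting second ORFs. The dictionary below simply transcribes those counts; the variable names are illustrative.

```python
start = 750  # second ORFs with positive score

excluded = {
    "Chance": 339,
    "Alternative splice": 40,
    "Dicistronic or overlapping gene": 42,
    "Recent nonsense": 17,
    "Antisense": 5,
    "A-to-I editing": 0,
    "Selenocysteine": 0,
    "Alignment errors, etc.": 24,
}

remaining = start - sum(excluded.values())
print(remaining)
```

The remainder matches the 283 readthrough candidates reported in the figure.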

Figure 3: Non-comparative evidence of protein translation. Panels A, B, C.

(B) Reading frames in which protein-coding function is expected under each interpretation.

Interpretation          Frame 0   Frame 1   Frame 2
Readthrough                ✓
Recent nonsense            ✓
A-to-I editing             ✓
Selenocysteine             ✓
Alternative splicing       ✓         ✓         ✓
Dicistronic                ✓         ✓         ✓
Cryptic promoter           ✓         ✓         ✓
Hopping                    ✓         ✓         ✓
Antisense                  ✓         ✓         ✓
Chance                     ✓         ✓         ✓
Frame shift                          ✓         ✓
CDS overlaps stop                    ✓         ✓
# with PhyloCSF > 0       662       219       225

Figure 4: Evidence of read-through mechanism. Panels A, B, C, D, E.
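The counts of regions with PhyloCSF > 0 given for panel B (662, 219, and 225 in frames 0, 1, and 2) allow a rough reconstruction of the frame 0 excess quoted in the legend. The calculation below is one plausible route to that figure, not necessarily the authors' exact procedure: using the mean of the two out-of-frame counts as the background is an assumption.

```python
frame_counts = {0: 662, 1: 219, 2: 225}  # regions with PhyloCSF > 0, per frame
other_in_frame = 17  # recent nonsense, A-to-I editing, and selenocysteine combined

# Assumed background: average of the two out-of-frame counts.
background = (frame_counts[1] + frame_counts[2]) / 2
excess = frame_counts[0] - background
readthrough_excess = excess - other_in_frame
print(background, excess, readthrough_excess)
```

Under this assumption the excess comes to 440 regions, or 423 after removing the 17 cases attributed to other in-frame mechanisms, in line with the legend's "over 400" and "approximately 420".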

Figure 5: Functional and length enrichments. Panels A, B.

Figure 6: Examples of read-through in other species. Panels A, B, C.

Figure 7: Estimated extent of readthrough in other species. Panels A, B. Panel titles: Species with excess in frame 0 (abundant RT); Species with no excess (little or no RT).
