Logos for Ribosome Binding Sites and First Introns in the Human Genome

Authors: Bjorn Sponberg, Ganapathy Varma, Ramya Seetharaman (Masters Students, Bioinformatics Program, Stockholm University)

Abstract

The handling and analysis of biological sequences remains one of the prime tasks in the field of bioinformatics. Logos are a useful and easy-to-use tool in genome analysis, as they provide an instant graphic representation of evolutionarily conserved elements in sequence datasets. Sequence motifs are short patterns in biological sequences that are biologically active. Motifs are therefore evolutionarily conserved in sequences that are active in the same biological process.

The Kozak sequence is located in the start codon context of mRNA and is known to play an important role in recognition by ribosomes. The Kozak sequence is therefore crucial for regulating translation.

The first intron of a gene differs from the other introns in that it has typical features such as a generally longer sequence and unique sequence motifs. If unique motifs are detected, they can aid tremendously in gene prediction.

Four DNA datasets from the human genome were collected: one dataset representing the start codon sequence context in mRNAs with a suitable control dataset, and one dataset of first intron sequences with a suitable control dataset. For creating logos, the appropriate sequences (start codon sequence context and first introns respectively) were mined from Ensembl 60 using BioMart and RSAT. Python scripts were then executed to extract the sequence segments from the datasets that were later used as input for creating logos. The logo building tool, WebLogo, was called from Python through the Motif module (see Methods) and was used to construct the logos.

Characteristic patterns of ribosome binding sites were observed in WebLogo. The ribosome binding site logo showed a conserved ATG start codon along with the characteristic purines at position -3 and guanine at position +4. In the first intron dataset, WebLogo displayed characteristic intron motifs. In addition, a motif was detected at the 5' end which was suspected to be unique to first introns.

Keywords: Bioinformatics, Genome Analysis, Motif, Logo, Pattern, Ribosome Binding Site, mRNA, Ribosome, Translation, Coding Sequence, Gene, Kozak Sequence, Open Reading Frame, Codon, Exon, Intron, First Intron, Splice Sites, Python, BioMart, RSAT, Ensembl 60, WebLogo.

Introduction

Much of the non-coding sequence in DNA was earlier termed “junk DNA”. However, in the last two decades many important discoveries have been made in the field, which make the term misleading.

In genes, non-coding elements in the proximal regions close to the transcription start site (TSS) are known to be important for regulation of translation. The non-coding first introns in eukaryotic genes contain motifs that are believed to regulate gene transcription [1].

The central dogma in cell biology describes the events that take place from gene transcription to translation. Gene transcription is a key biological process that produces a blueprint of the sequence information stored in a gene. The blueprint of the gene comes in the form of a physically independent unspliced transcript called pre-mRNA. After the removal of the non-coding sequences within the gene's blueprint (called introns), the coding sequences (called exons) are joined together to produce a messenger RNA nucleotide string (mRNA).

The resulting mRNA is ready to be transported to ribosomes in the cytosol for translation into a new sequence form: a string of amino acids, which is called a protein (Figure 1). The events just described are really one biological process that transfers information from genes to functional proteins, and this is the core of the central dogma [2].

In the world of biological sequences (including nucleotide sequences), regulatory motifs are very important. Regulatory motifs communicate between the biological sequences they are a part of and the outside world. Usually this means communicating with motif-binding biological components (usually proteins or RNAs) that are present in the sequence's nearby surroundings.

Together, a motif-binding component and a regulatory sequence motif trigger a biological process when they find each other. Since regulatory sequence motifs are a very important element of an evolutionarily conserved biological process, they are themselves conserved.

However, protein-binding motifs are often only semi-conserved, not fully conserved base by base as is the case with, for example, restriction enzyme DNA binding sites. Restriction enzymes do not need dynamic regulation due to their all-or-none activity (they destroy invading DNA elements) [3]. Most motif-binding components, in contrast, trigger and regulate a certain biological process. Consequently, a motif type that regulates needs to show some kind of variability that discriminates between different output levels of the biological process it is meant to regulate.

In addition, evolutionary robustness could also play a role in the variable nature of sequence motifs, since the more mutations a motif can tolerate without harming the host, the more biologically fit the host becomes.

Figure 1. Illustration of the central dogma, including the splicing event of the pre-mRNA [4].

Start codon context

As with most molecular biology discoveries, the discovery of an mRNA motif involved in translation was made in prokaryotes, in the 1970s. This first ribosome binding site motif was called the Shine-Dalgarno sequence. Variation in this motif among different genes was believed to affect ribosome binding affinity. How strongly the ribosome bound to the mRNA was assumed to be important to the translation output (protein production).

Later, it was proposed that the secondary structure of the 5' end of the mRNA strand, and not only the motif type, was also important to ribosome binding affinity [4, 5].

After this, the same kind of mRNA motif (a ribosome binding motif) was found in eukaryotes as well. However, in eukaryotes the motif is referred to as the Kozak sequence (Figure 2).

Figure 2. A WebLogo example of a Kozak sequence that surrounds the start codon (ATG) [6].

The literature often refers to the “AUG nucleotide context” to describe how well the Kozak sequence promotes mRNA translation. The most crucial elements of the AUG context are believed to be the -3 position and the +4 position (see Figure 2). For the start codon to be in an optimal context, the -3 position must be a purine base (A or G) and the +4 position must be a guanine base (G). If not, the start codon is in a suboptimal context. mRNAs with a start codon in a suboptimal context are relatively abundant [7].
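
The rule above (a purine at -3 and a guanine at +4) can be written as a small check. The following is a minimal sketch, not code from the study; the function name classify_start_context and the example sequences are made up for illustration, and the sequence is assumed to be given in DNA letters with the 0-based index of the A in ATG.

```python
PURINES = {"A", "G"}

def classify_start_context(context, atg_index):
    """Classify the context of the ATG that starts at atg_index (0-based).

    Hypothetical helper: returns "optimal" when position -3 is a purine
    and position +4 is G, otherwise "suboptimal".
    """
    context = context.upper()
    if context[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(context):
        raise ValueError("need positions -3 and +4 around the ATG")
    minus3 = context[atg_index - 3]  # position -3 (there is no position 0)
    plus4 = context[atg_index + 3]   # position +4, the base right after ATG
    if minus3 in PURINES and plus4 == "G":
        return "optimal"
    return "suboptimal"

# Made-up example contexts with the A of ATG at index 6.
print(classify_start_context("GCCACCATGGC", 6))
print(classify_start_context("GCCTCCATGGC", 6))
```

Both conditions must hold for the context to count as optimal, which mirrors the wording of the paragraph above.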

Figure 3. mRNA structure (right) is also important for ribosome binding activity, not only motif type [8].

First introns

Splicing of eukaryotic mRNA was discovered in the late 1970s, and the intensive focus on exon-intron structure in eukaryotic genes was born. Before splicing, the mRNA exists in a form called pre-mRNA (Figure 4 below). It was later found that an RNA-protein complex called the spliceosome is responsible for the removal of introns and the ligation of the exons flanking each intron. The spliceosome is a protein complex that also contains small nuclear RNAs (snRNAs). The function of the snRNAs is probably to recognize motifs at the intron/exon boundaries or within the intron. This motif recognition is believed to help the spliceosome cut correctly at the 5' and 3' splice points (GT and AG respectively). Thus, conserved motifs that are complementary to these spliceosomal snRNAs should be found in the intron or in the intron/exon boundary regions.

Figure 4. Pre-mRNA. The first intron (Intron 1) is generally longer than the other introns [9].

These conserved sequence elements include the 5' splice site, the branch point sequence, the polypyrimidine tract and the 3' splice site (Figure 5 below) [10]. The branch point is expected to be located 18-40 nt upstream of the polypyrimidine tract at the 3' end [11].

Figure 5. The four main conserved motifs in introns: GU, branch point (A), polypyrimidine tract and AG [12].

On average, first introns are known to be longer than other introns. The extra length of first introns may stem from intron-mediated enhancer or silencer elements (motifs). However, exactly how first introns regulate gene transcription is still not known [12].

Methods

Start codon context: Downloading sequence data (see also Figure 6 below)

RSAT (http://rsat.ulb.ac.be/rsat/retrieve-seq_form.cgi) was used to download human CDS sequences.

1. The human gene sequences were downloaded as CDS. The flanking region relative to the start codon was set to -10 and +5 (+200 and +215 for the control dataset).

2. All genes were downloaded by marking “all” in the Genes section.

3. Overlap with neighboring genes was prevented, imprecise positions were admitted, and repeats were not masked.

4. The output was shown on the server, and the results were copied and pasted into a gedit page. The dataset was truncated because the full set of genes was too large for WebLogo to handle. It was truncated until WebLogo accepted it, at 27168 sequences (see also Results).

Figure 6. Screenshot from the parameter settings in RSAT [13].

The script

We used a Python script to find the ATG start codon within the 15-base-long dataset sequences.

We used the re module by importing it with the command: import re

The motif was created in the prog variable: prog = re.compile('.{5}.ATG..')

The motif included six bases upstream of the A in ATG, and two downstream of the G in ATG.

These motifs were sent to the WebLogo functions imported from the Motif module. The module was imported with the command:

from Bio.AlignAce.Motif import Motif

See also the Python script in the bin directory named:

prog_ATG_context.py
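
As an illustration of the extraction step, the following minimal sketch applies the pattern quoted above to a list of sequences. It is not the contents of prog_ATG_context.py; the function name extract_contexts and the example records are made up, and the sketch uses only the standard library and stops before the logo step.

```python
import re

# Pattern quoted in the text: six bases, ATG, two bases (11 characters).
prog = re.compile('.{5}.ATG..')

def extract_contexts(sequences):
    """Return the first 11-base ATG context found in each sequence, if any.

    Hypothetical helper: sequences without a matching ATG are skipped.
    """
    contexts = []
    for seq in sequences:
        match = prog.search(seq.upper())  # leftmost match in the sequence
        if match:
            contexts.append(match.group(0))
    return contexts

# Made-up 15-base records (positions -10 to +5), not taken from the dataset.
example = ["ccgagccaccATGGC", "ttttttttttttttt"]
print(extract_contexts(example))
```

Note that re.search returns the leftmost match, so if a 15-base record contains an additional ATG upstream of the annotated start codon with enough flanking bases, that earlier ATG would be the one reported. Every extracted context has the same length, which is what a logo tool needs as input.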

First introns

Downloading sequence data (see also Figure 7 below):

•RSAT was used to obtain data from Ensembl for the Homo sapiens dataset.

•The Ensembl gene ID list was pasted into the window and the 'First intron' option was selected.

•The result, which contained the complete sequences of all the first introns of the genes in the specified list, was saved as a text file.

•This data file was cleaned by removing small errors such as warning messages and very short sequences, after which the data was ready for use in the Python script.

Figure 7. Screenshot from the parameter settings in RSAT [14].
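
The cleaning step in the last bullet above could look roughly like the sketch below. Everything specific here is an assumption for illustration and not taken from the study: the input is assumed to be FASTA, warning lines are assumed to start with ';', and the cutoff min_length=80 is an arbitrary value chosen so that a 20-base slice and a 60-base slice would not overlap.

```python
def clean_fasta(lines, min_length=80):
    """Drop warning/comment lines and records shorter than min_length.

    Assumptions (not from the paper): FASTA input, warnings start with
    ';', and min_length is an illustrative cutoff.
    Returns a list of (header, sequence) tuples.
    """
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far if it is long enough.
        seq = "".join(chunks).upper()
        if header is not None and len(seq) >= min_length:
            records.append((header, seq))

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue                     # blank line or warning/comment
        if line.startswith(">"):
            flush()                      # finish the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    flush()                              # finish the last record
    return records

# Made-up input with one warning line and one too-short record.
demo = ["; WARNING: skipped", ">g1", "acgt", "ACGT", ">g2", "AC"]
print(clean_fasta(demo, min_length=5))
```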

The script

Since the dataset was already extracted as first introns in RSAT, it was not necessary to discriminate first introns from exons programmatically. However, it was necessary to cut slices from both ends and join them to make one WebLogo representation of both ends. The script cut 20-base-long slices from both ends and joined them before sending them to WebLogo to display the conservation pattern.

See also the script in the bin directory named:

prog_first_intron.py
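
The slicing and joining described above can be sketched as follows. This is not the contents of prog_first_intron.py; the function name join_intron_ends and the example intron are made up for illustration.

```python
def join_intron_ends(intron, five_prime=20, three_prime=20):
    """Join the first five_prime bases and the last three_prime bases.

    Hypothetical helper: introns too short to give two non-overlapping
    slices are rejected.
    """
    if five_prime <= 0 or three_prime <= 0:
        raise ValueError("slice lengths must be positive")
    if len(intron) < five_prime + three_prime:
        raise ValueError("intron shorter than the two slices combined")
    return intron[:five_prime] + intron[-three_prime:]

# Made-up intron: GT at the 5' end, AG at the 3' end.
intron = "GT" + "A" * 30 + "C" * 18 + "AG"
print(join_intron_ends(intron))          # 20 + 20 bases
print(join_intron_ends(intron, 20, 30))  # longer 3' slice
```

Because every joined string has the same length, the 5' and 3' ends stay aligned column by column across the dataset. Changing three_prime to 60 corresponds to the longer 3' slice used later to look for the branch point.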

System development

Personal notebooks, file sharing etc. were shared at code.google

Results

Start codon context

Logos for 744 and 27168 random human genes respectively are shown in the two figures below (Figures 8 and 9). The logos flank the start codon at -6 upstream and +1 downstream, as set in the Python script (see Methods).

Sequence counting begins at the A in the start codon ATG. A is annotated as +1 (position 7 in the WebLogo figures). Position 6 in the start codon context WebLogo figures is annotated as -1, and so forth.
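
The numbering convention above, with A at +1 in logo column 7 and no position 0, can be expressed as a pair of small conversion functions. This is an illustrative sketch; the function names are made up, and a_column=7 is taken from the paragraph above.

```python
def logo_column_to_position(column, a_column=7):
    """Map a 1-based logo column to a position relative to the A of ATG."""
    offset = column - a_column
    # Columns at or after the A count +1, +2, ...; columns before it
    # count -1, -2, ... because position 0 does not exist.
    return offset + 1 if offset >= 0 else offset

def position_to_logo_column(position, a_column=7):
    """Inverse mapping from a relative position to a 1-based logo column."""
    if position == 0:
        raise ValueError("there is no position 0")
    return a_column + position - 1 if position > 0 else a_column + position

print(logo_column_to_position(4))    # column of the -3 purine
print(position_to_logo_column(4))    # column of the +4 guanine
```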

The most crucial elements for ribosome recognition in the ATG sequence context are present in the result (see also Introduction). At position -3, A and G (the purines) are the most conserved. At position +4, G is the most conserved base.
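
Conservation at a logo position is commonly quantified as information content in bits, which for DNA is 2 minus the Shannon entropy of the base frequencies in that column. The following is a minimal sketch of that calculation, assuming equal-length sequences. It is not the WebLogo implementation, and it omits any small-sample correction a logo tool may apply; the function name column_information is made up.

```python
import math
from collections import Counter

def column_information(sequences, alphabet="ACGT"):
    """Return the information content in bits for each alignment column."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    bits = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            bits.append(0.0)   # no usable bases in this column
            continue
        entropy = 0.0
        for b in alphabet:
            if counts[b]:
                p = counts[b] / total
                entropy -= p * math.log2(p)
        # Maximum is log2(4) = 2 bits for a fully conserved DNA column.
        bits.append(math.log2(len(alphabet)) - entropy)
    return bits

# Made-up three-column example: conserved, half purine, uniform.
print(column_information(["AAA", "AAC", "AGG", "AGT"]))
```

In this view, a column dominated by A and G in roughly equal amounts, as at position -3, carries about 1 bit, while a fully conserved column such as the A of ATG carries 2 bits.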

Figure 8. Human CDS flanked -6 and +1 displayed in WebLogo (based on 744 sequences).

The WebLogo for the 27168 random human gene dataset is shown in Figure 9. The same ATG context can be seen as in the 744 sequence dataset logo in Figure 8.

Figure 9. Human CDS flanked -6 and +1 displayed in WebLogo (based on 27168 sequences).

A control dataset was used to test the script. A WebLogo for 4822 random human genes, with ATG flanked downstream at +200 and +215 in the same CDS dataset, was created (Figure 10).

Figure 10. Human CDS +200 and +215 (based on 4822 sequences).

In this case all human CDS were searched for ATG in the specified region (downstream +200 to +215). Any ATG triplet found here would most probably not function as a start codon, and such triplets would be fewer in number. The number of ATG motifs found in this region was 4822, relative to 27168 in the -6 and +1 flanking dataset (Figure 9 above). As expected, no conserved bases could be seen in this ATG context (Figure 10).

First introns

When cutting 20 nt downstream of the 5' end and 20 nt upstream of the 3' end of the first introns, only the two splice points and the polypyrimidine tract were observed, in accordance with the literature. In addition to these known features, a 4-base-long AG-rich region immediately downstream of the 5' splice point was observed (Figure 11).

Figure 11. A weblogo display of a dataset based on 1082 first intron sequences.

Since the branch point motif was expected to be located ca. 18-40 nt upstream of the polypyrimidine tract, the 3' end slice length was increased to 60 nt. However, no branch point was observed (Figure 12).

Figure 12. This logo shows an increased 60 base slice from the 3' end for the first intron dataset.

The control dataset was obtained from a first exon dataset. No intron-conserved elements like those in Figures 11 and 12 above should be observed in it. This control confirmed that the results above were not a matter of random luck relative to the information found in the literature.

Figure 13. Weblogo based on a 1247 exon sequence control dataset.

Discussion

Start codon context

From the literature it is known that the most crucial conserved bases in the ATG context are a purine at position -3 (position 4 in the WebLogo result) and a guanine at position +4 (position 10 in WebLogo) [7]. Figures 8 and 9 in Results show that both of these elements were conserved in the input dataset. The -3 position is almost solely dominated by the purines. Position +4 also has some adenine present, but guanine dominates, as expected.

In addition, the results suggest that cytosine is the most dominant base in position -1, ahead of guanine and adenine respectively. Cytosine also shows a slight dominance at position -2. These two extra features can be seen in the 744 sequence dataset as well as in the 27168 sequence dataset and are probably biologically relevant to the ATG context.

However, unique to the 27168 sequence dataset is a cytosine that dominates over guanine and adenine at position -4 upstream of ATG. Since this dataset is over 30 times larger than the 744 sequence dataset, this feature is probably more reliable. That is, this slight cytosine dominance over guanine and adenine in position -4 is probably a better reflection of the biological truth.

The control dataset was extracted from position +200 to +215 downstream of the start codon in the same CDS dataset. Any ATGs found in this region would not be expected to have a context similar to that of the ATGs found in the -10 to +5 region. Figure 10 confirms this: it shows no conserved elements in the ATG context at all.

Since ATG is not a common triplet in this downstream region [6], only 4822 sequences were obtained from the CDS dataset that generated 27168 in the -6 and +1 region. This also reflects its non-conserved function in this region of the CDS.

First introns

The 5' GU and 3' AG splice points are the only two universally conserved elements in the intron. These two motifs were fully conserved in the resulting WebLogo in Figure 11 (see Results).

The polypyrimidine tract on the 3' end slice was also a distinct pattern that emerged in the WebLogo result.

What the result did lack was the branch point A and its surrounding elements. In yeast, the branch point context is very well conserved: UACUAAC (the last A is the branch point). This sequence is complementary to an internal region of the U2 snRNA. Mutation experiments in the branch point region (18 to 40 bases upstream of the polypyrimidine tract) have shown this sequence to be critical for splicing in yeast. However, the branch point sequence context is known not to be as conserved in higher eukaryotes as in yeast. This could very well be the reason why the branch point motif does not show up in the WebLogo in Figures 11 and 12. Further, it is reported that the polypyrimidine tract assists in binding the U2 snRNA correctly to the branch point in higher eukaryotes. This biological process can indirectly wipe out the branch point motif as a conserved motif in the human genome [16]. The branch point motif would then not be necessary for correct U2 snRNA binding in higher eukaryotes.
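
One way to see why a weakly conserved branch point would not appear in a logo is to look for the yeast consensus in each intron individually. The sketch below, which was not part of the study, scans the last 60 bases of an intron for the best match to TACTAAC (the DNA form of UACUAAC) and reports where it lies. The function name, the 60-base window and the example intron are assumptions for illustration.

```python
def best_branch_point_match(intron, consensus="TACTAAC", window=60):
    """Find the best match to the consensus near the 3' end of an intron.

    Hypothetical helper: returns (distance, mismatches, candidate), where
    distance counts bases from the start of the candidate to the intron's
    3' end, or None if the region is shorter than the consensus.
    """
    region = intron[-window:].upper()
    best = None
    for start in range(len(region) - len(consensus) + 1):
        candidate = region[start:start + len(consensus)]
        mismatches = sum(a != b for a, b in zip(candidate, consensus))
        if best is None or mismatches < best[1]:
            best = (len(region) - start, mismatches, candidate)
    return best

# Made-up intron with a perfect consensus 29 bases from the 3' end.
intron = "GT" + "C" * 40 + "TACTAAC" + "T" * 20 + "AG"
print(best_branch_point_match(intron))
```

If the best match sits at a different distance from the 3' end in each intron, the motif is spread over many logo columns, so even a reasonably conserved sequence would not stand out in a fixed-position alignment such as the one used for Figures 11 and 12.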

The four conserved bases seen in Figure 11 (see Results) downstream of the 5' splice point were not expected. Could these conserved elements be unique to first introns? Yes, they could. It is known that the 5' end GT motif is encompassed within a larger, less conserved motif [16]. This is probably what is seen in Figure 11. That is, within this GT context region lies information unique to the intron in question, e.g. silencers and/or enhancers for each respective intron/exon boundary region.

Conclusion

Regulatory motifs in biological sequences are known to be semi-conserved. Their variable base conservation makes them hard to detect without informatics methods. WebLogo is an application that can display conserved regions in sequence datasets.

A dataset of 27168 sequences flanking the start codon (ATG) at -6 and +1 from the human genome was displayed in WebLogo. The result showed that the expected conservation of purines in the -3 position and a guanine in the +4 position did emerge.

A dataset of 1082 first intron sequences from the human genome was displayed in WebLogo. WebLogo displayed two types of result. In one case the 5' end was cut 20 bases downstream and the 3' end was cut 60 bases upstream, in order to look for a conserved branch point motif that is known to exist in lower eukaryotes. The result showed that the branch point was not detected in the human first intron dataset. In the second case the 5' and 3' ends were both cut by 20 bases, downstream and upstream respectively. The result showed that the two splice points at the 5' and 3' ends (GT and AG respectively) were fully conserved. The polypyrimidine tract elements in the 3' end region were also conserved.

The failure to find the branch point in the human first intron dataset could be because the U2 snRNA, which is complementary to the branch point motif in yeast, binds to the branch point differently in humans. In humans, the binding of the U2 snRNA to the branch point (A) is mediated by proteins already bound to the polypyrimidine tract, which was found in the human dataset. This observation could explain an evolutionary process that did not need a conserved branch point motif. Thus, the lack of this motif in the human genome could be explained.

In the human first intron dataset WebLogo also revealed a four-base-long conserved region, rich in purines (A and G). These conserved elements could serve a local, intron-specific regulatory function. The region could serve either to enhance or to silence splicing, which would later in the central dogma pathway affect the rate of protein production.

REFERENCES

1. Biémont C. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics. 2010 Dec;186(4):1085-93.

2. Thieffry D, Sarkar S. Forty years under the central dogma. Trends Biochem Sci. 1998 Aug;23(8):312-6.

3. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23.

4. http://compbio.pbworks.com/w/page/16252897/Introduction-and-Basic-Molecular-Biology

5. Smit MH, van Duin J. Secondary structure of the ribosome binding site determines translational efficiency: a quantitative analysis. Proc. Natl. Acad. Sci. USA. 1990;87:7668-7672.

6. Harhay et al. Kozak consensus sequence surrounding bovine start methionine using WebLogo [29]. BMC Genomics 2005 6:166 doi:10.1186/1471-2164-6-166

7. Kochetov AV. AUG codons at the beginning of protein coding sequences are frequent in eukaryotic mRNAs with a suboptimal start codon context. Bioinformatics. 2005 Apr 1;21(7):837-40. Epub 2004 Nov 5.

8. http://www.ask.com/wiki/Primary_structure?qsrc=3044

9. http://library.thinkquest.org/C006188/basics/pictures/introns.gi

10. http://en.wikipedia.org/wiki/Spliceosome

11. http://www.nature.com/nrg/journal/v5/n5/full/nrg1327.html

12. Pagani F, Baralle FE. Genomic variants in exons and introns: identifying the splicing spoilers. Nat Rev Genet. 2004 May;5(5):389-96.

13. http://rsat.ulb.ac.be/rsat/retrieve-seq_form.cgi

14. http://rsat.ulb.ac.be/rsat/

15. http://oregonstate.edu/instruction/bb492/lectures/EuTranscriptIII.html

16. Douglas L. Black. Mechanisms of alternative pre-messenger RNA splicing. Annual Review of Biochemistry. Vol. 72: 291-336.

17. Marketa Zvelebil and Jeremy O. Baum. Understanding Bioinformatics.

18. A. Malcolm Campbell and Laurie J. Heyer. Discovering Genomics, Proteomics and Bioinformatics.

19. Jeremy J. Ramsden. Bioinformatics: An Introduction.

20. Jonathan Pevsner Bioinformatics and Functional Genomics
