Finding Eukaryotic Open Reading Frames


Introduction
• The open reading frame (ORF) in prokaryotic DNA (test your application)
• The structure of the eukaryotic gene
• Finding genes in eukaryotes
– ORFs and problems with ORFs
– First exon and first intron
– Distinguishing introns from exons (splice sites)
– Proximity of promoters (mentioned)
– Base pair patterns
– Homology with existing sequences


ORF in prokaryotes (pal gene, E. coli)

Adapted from Understanding Bioinformatics, 9.3


Using your assignment code
• Open the file: ORF pal gene.fasta
• Find all open reading frames. (This time you must modify your code to translate each codon; copy from convertor_hashtable.txt.)
• Compare the result to the file: pal protein sequence.fasta.
– Visually inspect the files.
• What conclusion can you draw?
– In which reading frame is the true ORF?
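As a hedged illustration of this exercise, the sketch below scans all six reading frames for ATG…stop ORFs and translates each one. The function names, the compact codon table (a stand-in for convertor_hashtable.txt, whose contents are not shown here) and the short demo sequence are assumptions made for illustration; this is not the assignment's own code.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# Assumption: this table stands in for convertor_hashtable.txt.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(dna, min_codons=1):
    """Return (strand, frame, start, end, protein) for each ATG...stop ORF.

    start and end are 1-based positions on the scanned strand; end includes
    the stop codon. ORFs still open at the end of the sequence are ignored.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    protein = translate(seq[start:i])
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame + 1, start + 1, i + 3, protein))
                    start = None
    return orfs


# Hypothetical demo sequence (not the pal gene).
for orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(orf)
```

The translated ORFs from each frame can then be compared by eye with the known protein sequence to see which frame holds the true ORF.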


Structure of a eukaryotic gene

The “basic” transcription/translation of a eukaryotic gene

An ORF in a eukaryote is a region of the DNA which “could be” a coding sequence (CDS) of a gene.

It has a start codon (ATG) and a stop codon, one of three (TAA, TAG, TGA).

The diagram shows the DNA sequence of a eukaryotic gene, including promoter, UTR…, and eukaryotic expression, showing exons/introns… (adapted from Zhang 2002).


Structure of eukaryotic CDS

Exon 1 of the ALDH10 gene contains a 5’ UTR.


ORF in Eukaryotes
• In comparison to prokaryotic DNA, eukaryotic DNA differs as follows:
• Gene density is much lower; genes are further apart, and gene density can vary significantly between chromosomes (~1.5% of human DNA is CDS).
• The mRNA is monocistronic (one promoter per gene; N.B. prokaryotic genes are generally organised in operons). Moreover, a DNA sequence is transcribed into one mRNA sequence [this may not be true of viral DNA].
• The “ORF” in the DNA sequence contains exons (translated sequences, CDS) and introns (non-CDS).
• The introns are spliced out to leave only exons.


ORF in Eukaryotes
• Some impacts of these differences:
• You can no longer reliably translate an ORF into an amino acid (AA) sequence to obtain the “true” protein sequence.
• The DNA sequence of an intron is like any non-coding region of the DNA: the bases are just bases and should not be read as elements of a “codon”. So in an intron sequence:
– ATG does not represent a start codon
– TAA/TAG/TGA do not represent stop codons.
• Determining the true ORF in DNA sequences becomes more complex because of “false positive” start and stop codons in introns; thus many attempts to find genes/ORFs are now done via mRNA (not pre-mRNA).
• The sizes of introns and exons need not be multiples of three. The impact of this on DNA ORF analysis is to “shift” the DNA reading frames.


Figure 9.2a: the CDK10 gene

Note that the ATG is shown in a red box (it is 12 nucleotides into the first exon). This will not affect the ORF, but it will mean an incorrect gene annotation: why?

Only the first exon and first intron are shown in full; for the rest, only partial intron sequences are shown.

The full sequences can be found in the PISSRLE DNA sequence.


Frame shifts using the ORF technique
• When a pre-mRNA is spliced into mRNA, all the exons have to be in one reading frame.
• However:
– the splice sites need not occur at codon boundaries;
– introns need not be multiples of three in size.
• What is the net effect of this when trying to “predict” the translation of proteins from DNA sequences?
– It can affect the translation of an exon;
– It can affect the starting residue of the following exon;
– It can mean the “translation” of one or more exons is carried out in the incorrect reading frame.
• The effect on manual translation depends on the starting position of the exon (correct reading frame), the length of the exon and the length of the intron.
– Refer to chapter 9 of Understanding Bioinformatics.
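The dependence on exon and intron lengths can be sketched as simple arithmetic. A base at mRNA position m sits at genomic position m plus the total length of the introns before it, so an exon stays in the genomic reading frame of the first exon only when that cumulative intron length is a multiple of three. The function name and the lengths below are hypothetical, chosen only to illustrate the rule.

```python
def exon_frame_offsets(exon_lengths, intron_lengths):
    """Return one (genomic_start, mrna_start, shift) tuple per exon.

    Starts are 0-based. shift is the cumulative intron length before the
    exon, modulo 3: 0 means the exon is read in the same genomic frame as
    the first exon, 1 or 2 means a naive genomic translation is shifted.
    """
    rows = []
    genomic = 0  # position in the unspliced (genomic) sequence
    mrna = 0     # position in the spliced sequence
    for k, exon_len in enumerate(exon_lengths):
        rows.append((genomic, mrna, (genomic - mrna) % 3))
        genomic += exon_len
        mrna += exon_len
        if k < len(intron_lengths):
            genomic += intron_lengths[k]  # introns exist only in the genomic sequence
    return rows


# Hypothetical lengths: three exons separated by introns of 7 bp and 5 bp.
for row in exon_frame_offsets([12, 10, 8], [7, 5]):
    print(row)
```

In this made-up case the second exon is shifted by one base (7 mod 3 = 1), while the third exon is back in frame because the two introns together add up to 12 bp.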


Predictive translation effect: exon/intron length (Example 1)
• Consider the following:
• We have an mRNA CDS of 60 bp in length (start…stop).
• Let us assume that intron 1:
– occurs at the end of codon 3 (position 9)
– is 30 bp in length.
• Intron 2:
– occurs at the end of codon 10 (position 30)
– is 45 bp in length.
• What is the effect on the translation of Exon A and Exon B?

[Diagram: DNA strand from ATG to TAA: exon (9 bp), intron (30 bp), exon (21 bp), intron (45 bp), exon (30 bp); labels: Exon A, Exon B]


Predictive translation effect: exon/intron length (Example 2)
• Consider the following:
• We have an mRNA CDS of 60 bp in length (start…stop).
• Let us assume that intron 1:
– occurs at the end of codon 3 (position 9)
– is 30 bp in length.
• Intron 2:
– occurs after position 29 (the 2nd bp of codon 10)
– is 45 bp in length.
• What is the effect on the translation of Exon A and Exon B?

[Diagram: DNA strand from ATG to TAA: exon (9 bp), intron (30 bp), exon (20 bp), intron (45 bp), exon (31 bp); labels: Exon A, Exon B]


Predictive translation effect: exon/intron length (Example 3)
• Consider the following:
• We have an mRNA CDS of 60 bp in length (start…stop).
• Let us assume that intron 1:
– occurs at the end of codon 3 (position 9)
– is 30 bp in length.
• Intron 2:
– occurs at position 30 (the end of codon 10)
– is 43 bp in length.
• What is the effect on the translation of Exon A and Exon B?

[Diagram: DNA strand from ATG to TAA: exon (9 bp), intron (30 bp), exon (21 bp), intron (43 bp), exon (30 bp); labels: Exon A, Exon B]


Effect of Translation
• Example 1: no effect; all lengths are multiples of 3.
• Example 2: the last residue of exon 2 is incorrect. The residues for exon 3 are correct (but its first complete codon starts at bp 2 of the exon).
• Example 3: the last exon is in a different reading frame.
• Refer to Incorrect_translation_examples.rar
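These three outcomes can be reproduced with a small simulation. The slides give only lengths, so the 60 bp CDS below and the intron filler (runs of G, which read as glycine and contain no stop codons) are assumptions made purely for illustration; they are not the contents of Incorrect_translation_examples.rar. The sketch inserts the introns into the CDS and translates the unspliced DNA in frame 1.

```python
# Standard genetic code from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Hypothetical 60 bp CDS (ATG ... TAA); it encodes no glycine, so every G in
# a translation below comes from the intron filler.
CDS = ("ATGGCTAAA" "GAACTGTTTATCCATCCGAAT" "TCTACCGTTTGGTATAACCAGGATTGCTAA")


def translate(seq):
    """Translate every complete codon from position 0 (stops shown as *)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def build_gene(cds, exon_lengths, intron_lengths, filler="G"):
    """Cut cds into exons of the given lengths and insert filler introns."""
    assert sum(exon_lengths) == len(cds)
    parts, pos = [], 0
    for k, exon_len in enumerate(exon_lengths):
        parts.append(cds[pos:pos + exon_len])
        pos += exon_len
        if k < len(intron_lengths):
            parts.append(filler * intron_lengths[k])
    return "".join(parts)


EXAMPLES = {
    "Example 1": ([9, 21, 30], [30, 45]),
    "Example 2": ([9, 20, 31], [30, 45]),
    "Example 3": ([9, 21, 30], [30, 43]),
}

print("spliced  :", translate(CDS))
for name, (exons, introns) in EXAMPLES.items():
    print(name, ":", translate(build_gene(CDS, exons, introns)))
```

Under these assumptions, Example 1 gives the true residues once the glycine runs are ignored, Example 2 reads K where the spliced mRNA encodes N at the end of exon 2, and Example 3 reads the last exon in a shifted frame and meets a premature stop codon.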


Predicting exons: ALDH10 gene

The diagram shows exon 1 and exon 2 of the ALDH10 gene.

The correct coding sequence is shown in upper case; the second ATG is the actual start codon.

The sequences can be found in the sample sequence files.

What is the length of each exon (CDS)?

Consider what may happen if you applied a translation to each of the reading frames. Exon 1 is at positions 1352-1762; exon 2 is at 2169-2400. The position of the actual ATG is 1610.

Figure 9.7: Understanding Bioinformatics


Finding exons and coding regions
• To avoid the previous issues, it is imperative to identify splice sites.
• Identify start and stop signals (refer to Zhang 2002; Chasin 2007):
– Initial exon (start codon and 5’ splice site)
– Internal exon (3’ and 5’ splice sites)
– Terminal exon (3’ splice site and stop codon)
• Identify splice junctions:
– the 5’ splice junction is in general GT
– the 3’ splice junction is in general AG.
• Refer to exon 1 and exon 2 of the ALDH10 gene in the previous slide:
– Exon 1 is at positions 1352-1762; exon 2 is at 2169-2400.
– The position of the actual ATG (translation initiation site) is 1610.
– Positions 1352-1610 of exon 1 are the 5’ UTR, up to the translation initiation site.
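The exon coordinates quoted above allow a quick worked calculation. The sketch below assumes the positions are 1-based and inclusive and that 1610 is the first base of the ATG; both are assumptions, since the slide does not state its coordinate convention. The variable and function names are illustrative only.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


exon1 = (1352, 1762)
exon2 = (2169, 2400)
atg = 1610  # assumed to be the first base of the start codon

lengths = {
    "exon 1": span_length(*exon1),
    "exon 2": span_length(*exon2),
    "5' UTR in exon 1": atg - exon1[0],
    "CDS part of exon 1": span_length(atg, exon1[1]),
    "intron 1": exon2[0] - exon1[1] - 1,
}

for name, length in lengths.items():
    # length % 3 shows whether the segment is a whole number of codons.
    print(f"{name}: {length} bp (remainder mod 3 = {length % 3})")
```

Under these assumptions intron 1 is 406 bp, which leaves a remainder of 1 when divided by 3, so a naive translation that reads straight through from exon 1 would meet exon 2 in a shifted reading frame.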


Splice site prediction
• While GT and AG are the general 5’ and 3’ splice signals, such dinucleotides are very common throughout the sequence, so predictions based on them alone give a high rate of false positives (Understanding Bioinformatics, p. 392).

Figure 9.10, Understanding Bioinformatics: spliceview… are prediction programs.
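To see why GT and AG alone give so many false positives, the following sketch lists every GT and AG dinucleotide in a sequence and pairs them into candidate introns. The function names and the 26 bp sequence are hypothetical, and this naive scan is not how prediction programs such as those in the figure work; it only shows how quickly candidates accumulate.

```python
def candidate_splice_sites(dna):
    """Return 0-based positions of every GT (donor) and AG (acceptor)."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_len=4):
    """Pair each GT with every downstream AG.

    Each candidate is (start, end) with end exclusive, so the candidate
    intron is dna[start:end], beginning with GT and ending with AG.
    """
    donors, acceptors = candidate_splice_sites(dna)
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


# Hypothetical sequence, only 26 bp long.
seq = "ATGGCCGTAAGTTTCAGGTGAAGTAA"
donors, acceptors = candidate_splice_sites(seq)
print("GT positions:", donors)
print("AG positions:", acceptors)
print("candidate introns:", len(candidate_introns(seq)))
```

Even this short sequence yields several GT and AG positions and more candidate introns than it could possibly contain, which is why additional evidence is needed.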


Proximity of promoters
• A true CDS (ORF) will have a promoter region nearby:
– Promoters in prokaryotes have well-defined base pair sequences (motifs) upstream of the CDS (true ORF):
  • The Pribnow box: TATAAT at about position -10
  • TTGACA at about position -35
  • An AT-rich region before this box.
– Eukaryotic promoters are more complex; there is more than one…
  • TATA box
  • CAAT box
  • GC-rich regions
• Conversely, the presence of an ORF indicates that there should be a promoter close by. (Bioinformatics 1 will cover promoter prediction in greater detail in the next lecture.)
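A first check for the prokaryotic motifs could be a simple scan of the upstream sequence that tolerates a few mismatches. In the sketch below the -35 consensus is taken as TTGACA, and the function names, the mismatch threshold and the made-up upstream sequence are assumptions; real promoter prediction uses more elaborate models.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(dna, motif, max_mismatches=1):
    """Return (position, matched_text, mismatches) for each approximate hit.

    position is 0-based within dna.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


# Hypothetical upstream region: a -35-like box, a 17 bp spacer, a -10-like box.
upstream = "CC" + "TTGACA" + "GCTAGCTAGCTAGCTAG" + "TATAAT" + "GGGCC"

print(find_motif(upstream, "TTGACA", max_mismatches=0))
print(find_motif(upstream, "TATAAT", max_mismatches=0))
```

Allowing mismatches raises sensitivity but also the number of chance hits, so the spacing between the two boxes is usually checked as well.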


BP sequences in exons/introns

• The DNA sequence of a gene’s CDS contains a different ratio of bases from the non-CDS part of a gene or non-genic DNA. (The student is expected to research this.)

• So the ratio of bases to each other, and the specific base sequences, differ between exons, introns and other non-coding DNA. (Remember: in non-CDS regions there are no codons.)

• If the student requires more detailed supplementary material, it can be found in Zhang 2002 and the other references at the end of chapters 9 and 10 of Understanding Bioinformatics.
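As a starting point for that research, the sketch below computes two simple statistics often compared between candidate coding and non-coding regions: overall GC content, and GC content at each of the three codon positions. The function names and the toy sequence are assumptions; the slides do not prescribe a particular statistic.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_by_codon_position(seq):
    """GC fraction at codon positions 1, 2 and 3, reading from base 1.

    In a coding region read in the correct frame the three values often
    differ; in non-coding DNA there are no codons, so no position-specific
    pattern is expected.
    """
    seq = seq.upper()
    return tuple(gc_content(seq[p::3]) for p in range(3))


# Toy sequence: GCA GCT GCA (three alanine codons).
toy = "GCAGCTGCA"
print(gc_content(toy))
print(gc_by_codon_position(toy))
```

Running the same statistics over a sliding window along a genomic sequence is one simple way to look for regions whose composition resembles known coding sequence.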


Homology in coding regions
• The CDS sequences of genes are generally highly conserved. Hypothesise why this is the case.
• As with prokaryotic DNA, the CDS sequence is highly conserved, so database searches can help to determine exons and thus the ORF.
• A possible exon region can be extracted and submitted to a search for similar sequences (BLAST search) to see what it may reveal.
• If there is highly probable similarity to existing exons, then it is likely to be a true exon.
• An exon can also be translated and the translated sequence submitted to a search for homologs. (The Swiss-Prot BLAST search engine should be used, as it contains experimentally determined AA sequences.)


Alternative splicing
• The diagram shows the four main types of alternative splicing.
• It clearly indicates that the pre-mRNA is not the same as the mRNA (so direct translation from the DNA is fraught with danger).
• Homology analysis and the use of expressed sequence tags (derived from mRNA produced by genes at different times and in different tissue types) can help determine the different splice variants.
• Can you think of any issues that may arise when using ORFs if there is alternative splicing?
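One such issue can be sketched with exon skipping: the same gene can yield more than one mRNA, and skipping an exon whose length is not a multiple of three shifts the reading frame of everything downstream. The three tiny exons and the function names below are hypothetical and chosen only to make the effect visible.

```python
# Standard genetic code from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def splice(exons, keep):
    """Join the exons whose indices are listed in keep, in order."""
    return "".join(exons[i] for i in keep)


def translate_until_stop(mrna):
    """Translate from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical exons; the middle exon is 5 bp, not a multiple of three.
exons = ["ATGGCT", "AAAGA", "ACTGTAA"]

full_length = splice(exons, [0, 1, 2])
exon_skipped = splice(exons, [0, 2])

print("all exons   :", translate_until_stop(full_length))
print("exon 2 skipped:", translate_until_stop(exon_skipped))
```

With these made-up exons the full-length transcript ends at its stop codon, while the exon-skipped transcript changes residues after the junction and no longer reads the original stop codon in frame, so a single ORF prediction from the DNA cannot describe both products.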


References
• Baxevanis, A.D. 2005. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Wiley; Chapter 5. [book is in the library]
• Klug, W.A. et al. 2010. Concepts of Genetics. Pearson Education; pp. 596-597.
• Zhang, M.Q. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3: 698-709.
• Chasin, L.A. 2007. Searching for splicing motifs. Adv. Exp. Med. Biol. 623: 85-106.
• Zvelebil, M. Understanding Bioinformatics; Chapter 9. [book is in the library]