
Gene architecture and

sequence annotation

Week 2

Last week:

1) How to search genomic databases such as NCBI and Ensembl

2) How to obtain sequence files

Sequence of the

Cystic Fibrosis

Gene: CFTR

This week we will learn to identify genetic

architecture within sequence files

This week we will learn the differences

between the two types of Nucleic Acid

Sequences

1) Genomic—the sequence of nucleotides

on a chromosome

2) Expressed sequences—the sequence

of nucleotides in mRNA/cDNA

DNA → RNA → protein

Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).

The expression of genomic

information

http://www.bioinfbook.org

[Figure: DNA → RNA → protein → phenotype. Labels: genome, transcriptome, proteome; genomic DNA databases; cDNA, ESTs, UniGene; protein sequence databases.]



Learning Objectives:

Understand sequence differences between genomic

and expressed sequences

Use programs to determine the correct open reading

frame (ORF) of an expressed sequence

Annotate sequence files

Genomic DNA is one source

of nucleic acid sequence

Strachan, T. & Read, A.P. Human Molecular Genetics. (New York; Wiley-Liss, 1999).

Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).

The chemical properties of DNA are

important for sequence analysis


DNA is composed of two anti-parallel strands

5’ is the beginning of the sequence and 3’ is

the end of the sequence

DNA sequence is always written with 5’ at the

left side and 3’ at the right side














Strand 1: 5’ GAT…

Strand 2: 5’ AGT…


DNA has strict base pairing rules that determine

the sequence of the complementary strand

DNA → RNA → protein


Transcription is the process of making

RNA from a DNA template



During transcription, an RNA molecule is

synthesized from genomic DNA


RNA polymerase adds bases to the 3’ end

of the growing RNA molecule

Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).

The rules of complementary base pairing are

followed for RNA transcription

During RNA transcription, uracil (U) is incorporated instead of thymine (T). Uracil base pairs with adenine.

In bioinformatics we ignore this difference: every U is written as T.

Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).



The template strand is anti-parallel to the growing mRNA molecule

Template strand = antisense strand

Non-template strand = sense strand. This strand has the same sequence as the mRNA molecule
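The relationship between the two DNA strands and the mRNA can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function names and the short example sequence are made up, and every sequence is a plain string written 5’ to 3’.

```python
# Minimal sketch: how the template strand, sense strand and mRNA relate.
# All sequences are plain strings written 5' to 3'.

DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite DNA strand, also written 5' to 3'."""
    return "".join(DNA_PAIRS[base] for base in reversed(dna.upper()))


def transcribe(template_strand):
    """Build the mRNA copied from a template (antisense) strand.

    The mRNA is anti-parallel and complementary to the template,
    with U in place of T.
    """
    return reverse_complement(template_strand).replace("T", "U")


def as_sequence_file(mrna):
    """Write an RNA sequence the way sequence files do: T instead of U."""
    return mrna.upper().replace("U", "T")


sense = "ATGGTCTTA"                     # hypothetical non-template strand
template = reverse_complement(sense)    # the antisense strand
mrna = transcribe(template)

print(template)
print(mrna)
print(as_sequence_file(mrna) == sense)  # mRNA (written with T) matches the sense strand
```

Writing the mRNA with T instead of U gives back the sense strand, which is why sequence files can use the sense strand to stand for the transcript.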

Genes can be found on both strands of a chromosome

[Figure: forward strand and reverse strand, each with its 5’ end marked]

The original RNA molecule undergoes

processing that changes the sequence

Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).

The original RNA molecule is processed


Exons are segments

of DNA that are found

in mature mRNA



Introns are segments of DNA whose transcribed sequence is removed through splicing. They are not found in mature mRNA



The sequence in red is

the coding sequence

(often abbreviated

CDS)




In the mRNA the exons are joined together

as one continuous sequence
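Joining exons can be mimicked with simple string slicing. The sketch below is an illustration only: the toy gene, its exon coordinates and the function name `splice` are invented for the example, and coordinates are assumed to be 1-based and inclusive.

```python
# Minimal sketch: join exons to mimic splicing.
# Coordinates are 1-based and inclusive (an assumption for this example).


def splice(genomic, exons):
    """Concatenate the exon segments of a genomic sequence, in order."""
    return "".join(genomic[start - 1:end] for start, end in sorted(exons))


# Hypothetical toy gene: exons in upper case, introns in lower case.
genomic = "ATGGTC" + "gtaagtttcag" + "TTACGC" + "gtgagcacag" + "TGGTAA"
exons = [(1, 6), (18, 23), (34, 39)]

mature = splice(genomic, exons)
print(mature)
```

The lower-case intron letters disappear from the result, leaving one continuous exonic sequence.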


Translation is the process by which an

mRNA molecule is used to make a protein

+1 is the first translated nucleotide (usually the A of the ATG start codon; ATG = Methionine)


The red indicates all the sequence within

the mRNA that will be used during

translation to code for protein

The sequences within an mRNA that do

not directly code for protein are called

Untranslated Regions

5’ UTR: UnTranslated Region before the start codon; does not code for protein

3’ UTR: UnTranslated Region after the stop codon; does not code for protein
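If the positions of the coding sequence are known, the three parts of an mRNA can be separated by slicing. This is a minimal sketch under stated assumptions: the transcript is invented, `split_mrna` is a hypothetical helper, and the CDS positions are 1-based and inclusive, with the stop codon counted as part of the CDS.

```python
# Minimal sketch: split an mRNA (written with T) into 5' UTR, CDS and 3' UTR.
# cds_start and cds_end are 1-based, inclusive positions (an assumption here).

STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_mrna(mrna, cds_start, cds_end):
    """Return (5' UTR, CDS, 3' UTR) for the given CDS positions."""
    five_utr = mrna[:cds_start - 1]
    cds = mrna[cds_start - 1:cds_end]
    three_utr = mrna[cds_end:]
    return five_utr, cds, three_utr


# Hypothetical transcript: 6 nt of 5' UTR, a 12 nt CDS, 8 nt of 3' UTR.
mrna = "GGCACC" + "ATGGTCTTATAA" + "GCTTTAAA"
five_utr, cds, three_utr = split_mrna(mrna, 7, 18)

print(five_utr, cds, three_utr)
```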

mRNA is converted to cDNA using reverse

transcription

Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).

Because it is cDNA, not mRNA, that is sequenced, we use T, not U, in sequence files

Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).

How do we identify introns/exons in our

sequence files?

We will use KRAS as an example

The KRAS gene produces 4 transcripts

(splice variants)

Transcript

Table

This is the transcript diagram for this gene

region

The Transcript Diagram shows the organization

of the transcripts generated from the gene locus

Use the link under the “Transcript ID” column to identify the exons and introns in a specific transcript

The exon/intron map for a specific transcript

The lines are intronic sequence



Bars are exonic sequence: filled bars

mean coding sequence and unfilled bars

are UTR sequence


The number of introns is always the number of exons minus 1: a transcript with 5 exons has 4 introns
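The exons-minus-one rule follows from introns being the gaps between neighbouring exons. The sketch below derives intron intervals from exon intervals; the coordinates are invented for illustration (they are not real KRAS positions) and are assumed to be 1-based and inclusive.

```python
# Minimal sketch: introns are the gaps between consecutive exons.


def introns_from_exons(exons):
    """Return the intron intervals lying between consecutive exons."""
    ordered = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


# Hypothetical exon coordinates for a 5-exon transcript.
exons = [(1, 120), (301, 450), (900, 1010), (1500, 1620), (2000, 2400)]
introns = introns_from_exons(exons)

print(len(exons), "exons,", len(introns), "introns")
print(introns)
```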

The RefSeq link will direct you to the NCBI

nucleotide record for that gene

NCBI nucleotide record

NCBI nucleotide record continued

NCBI nucleotide record also contains the

sequence


Every nucleotide within the sequence has

an exact position

Each nucleotide has a number associated

with its position
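Position numbering can be reproduced with a short script. The sketch below prints a sequence in a layout similar to the sequence section of a nucleotide record, assuming 60 bases per line in blocks of 10, with each line led by the position of its first base. The helper names are hypothetical and the example sequence is arbitrary.

```python
# Minimal sketch: number a sequence, 60 bases per line in blocks of 10.
# Positions are 1-based: the first nucleotide is position 1.


def numbered_lines(seq, width=60, block=10):
    """Return display lines, each starting with the position of its first base."""
    lines = []
    for start in range(0, len(seq), width):
        chunk = seq[start:start + width].lower()
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return lines


def base_at(seq, position):
    """Return the nucleotide at a 1-based position."""
    return seq[position - 1]


seq = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA" * 2  # arbitrary 78 nt example
lines = numbered_lines(seq)

for line in lines:
    print(line)
print(base_at(seq, 4))
```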

NCBI nucleotide contains the annotation of

the sequence

The numbers refer to nucleotide positions

Viewing features within the

sequence file

Once you select a sequence feature, the

nucleotide sequence of the feature

becomes highlighted

CDS stands for coding sequence. Selecting the CDS feature also shows the translation of the nucleotide sequence into amino acid sequence

DNA → RNA → protein


The genetic code


The genetic code is based on three

nucleotides “coding” for one amino acid
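The triplet code can be expressed as a lookup table from codons to one-letter amino acids. The sketch below builds the standard genetic code and translates a coding sequence codon by codon. The function name `translate` and the choice to stop at the first stop codon are choices made for this example.

```python
# Minimal sketch: translate a coding sequence with the standard genetic code.
from itertools import product

BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate("ATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA")
print(protein)
```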

Korf, I., Yandell, M. & Bedell, J. BLAST: An Essential Guide to the Basic Local Alignment Search Tool (Sebastopol; O’Reilly, 2003).

Codons

Amino acid

An Open Reading Frame (ORF)

begins with ATG and ends with TAA,

TAG or TGA

Korf, I., Yandell, M. & Bedell, J. BLAST: An Essential Guide to the Basic Local Alignment Search Tool (Sebastopol; O’Reilly, 2003).

To find the coding sequence you must

identify the start and stop codons within the

sequence

Which start codon is right?


The correct ORF is the longest translated

sequence

Any sequence has 6 possible

reading frames

Two strands of DNA

Triplet code (three

nucleotides in a codon)


5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’

5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1

5’ C GCA TGG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTT AA 3’ FRAME +2

5’ CG CAT GGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TTA A 3’ FRAME +3

Generating the reverse complement sequence

The next three reading frames are based on the reverse complement sequence

3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence

5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
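The two steps, complementing each base and then reversing the result, can be sketched in Python. The function names are chosen for this example; the input is the example sequence from these slides.

```python
# Minimal sketch: complement and reverse complement of a DNA sequence.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-pair partner of each nucleotide, in the same left-to-right order.

    Read left to right, the result runs 3' to 5'.
    """
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """The complementary strand rewritten 5' to 3'."""
    return complement(seq)[::-1]


seq = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"

print("3'", complement(seq), "5'")
print("5'", reverse_complement(seq), "3'")
```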




The 6 possible reading frames




5’ TTA AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC CAT GCG 3’ FRAME -1

5’ T TAA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACC ATG CG 3’ FRAME -2

5’ TT AAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CCA TGC G 3’ FRAME -3
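The six frames can be generated by splitting each strand into codons at offsets 0, 1 and 2. The sketch below is illustrative only: the helper names are hypothetical, and leftover bases at either end that do not form a complete codon are dropped.

```python
# Minimal sketch: split a sequence into codons for all six reading frames.


def codons(seq, offset):
    """Split seq into complete codons, starting at a 0-based offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame names (+1, +2, +3, -1, -2, -3) to their lists of codons."""
    forward = seq.upper()
    reverse = forward.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(forward, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


frames = six_frames("CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA")

for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, " ".join(frames[name]))
```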

The correct reading frame will

have the largest ORF

5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1

       M   V   L   R   W   S   S   H   G   S   V   Ter  (amino acids)

ATG (M) is the start codon

TAA, TAG or TGA are the three stop codons; they do not code for an amino acid

An ORF always begins with ATG

An ORF always ends with a stop codon
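The start and stop rules above are enough to sketch a simple ORF search over all six frames. This is a simplified illustration, not the algorithm used by the ORF-finder program: an ORF is taken to run from the first ATG after the previous stop codon to the next in-frame stop codon, and the function names are invented.

```python
# Minimal sketch: find ATG...stop ORFs in all six frames and pick the longest.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq):
    """Return (frame, orf_sequence) pairs; each ORF includes its stop codon."""
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(offset, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orfs.append((f"{strand}{offset + 1}", strand_seq[start:i + 3]))
                    start = None
    return orfs


orfs = find_orfs("CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA")
longest = max(orfs, key=lambda orf: len(orf[1]))

for frame, orf in orfs:
    print(frame, orf)
print("longest:", longest)
```

On the example sequence, the search finds one complete ORF in frame +1 and a shorter one in frame -1, so the longest-ORF rule selects frame +1.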

Using the ORF-finder

program to identify ORFs

http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Or Google “ORF-finder”


Using ORF-finder


Results from ORF-finder

There are 6 possible reading

frames

For our purposes, the largest

ORF is the correct one

Selecting an ORF gives you

the translation

ORFs begin with a start codon

and end with a stop codon

ORF-finder results match with

NCBI nucleotide

Some sequences found in the genomic DNA

are removed from the mRNA

Introns are the

sequences that

are removed

The mature mRNA

sequence contains only

exonic sequence

An mRNA sequence includes 5’UTR,

ORF, 3’UTR

5’ UTR: untranslated region before the start codon; does not code for protein

3’ UTR: untranslated region after the stop codon; does not code for protein

Coding sequence

(red)

There are 6 possible reading frames in a

nucleic acid sequence

The correct ORF is usually the largest

ORFs start with ATG and end with a stop

codon

Worksheet