Gene architecture and
sequence annotation
Week 2
Last week:
1) How to search genomic databases
such as NCBI and Ensembl
2) How to obtain sequence files
Sequence of the
Cystic Fibrosis
Gene: CFTR
This week we will learn to identify genetic
architecture within sequence files
This week we will learn the differences
between the two types of Nucleic Acid
Sequences
1) Genomic—the sequence of nucleotides
on a chromosome
2) Expressed sequences—the sequence
of nucleotides in mRNA/cDNA
DNA → RNA → protein
Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
The expression of genomic
information
DNA → RNA → protein
genome → transcriptome → proteome
Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
DNA → RNA → protein → phenotype
DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
Learning Objectives:
Understand sequence differences between genomic
and expressed sequences
Use programs to determine the correct open reading
frame (ORF) of an expressed sequence
Annotate sequence files
Genomic DNA is one source
of nucleic acid sequence
Strachan, T. & Read, A.P. Human Molecular Genetics. (New York; Wiley-Liss, 1999).
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
The chemical properties of DNA are
important for sequence analysis
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
DNA is composed of two anti-parallel strands
5’ is the beginning of the sequence and 3’ is
the end of the sequence
DNA sequence is always written with 5’ at the
left side and 3’ at the right side
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
Strand 1: 5’ GAT…
Strand 2: 5’ AGT…
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
DNA has strict base pairing rules that determine
the sequence of the complementary strand
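The pairing rules (A with T, G with C) can be sketched in a few lines of Python. This is a minimal illustration, not part of any course tool; the name `complement` and the example sequence are made up for this sketch.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())

top = "GATTACA"            # made-up example, written 5' -> 3'
bottom = complement(top)   # pairs with `top`, so it reads 3' -> 5'
print("5' " + top + " 3'")
print("3' " + bottom + " 5'")
```

Because the strands are anti-parallel, the complement printed here runs 3' to 5'; it has to be reversed before it is written in the usual 5' to 3' direction.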
DNA → RNA → protein
Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
Transcription is the process of making
RNA from a DNA template
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
During transcription an RNA molecule is
synthesized from genomic DNA
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
RNA polymerase adds bases to the 3’ end
of the growing RNA molecule
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
The rules of complementary base pairing are
followed for RNA transcription
During RNA transcription, uracil (U) is
incorporated instead of thymine (T).
Uracil base pairs with adenine (A).
In bioinformatics we ignore this difference:
in sequence files every U is written as T.
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
Template strand = antisense strand
The template strand is anti-parallel to the
growing mRNA molecule
Cooper, G.M. The Cell: A Molecular Approach (Sunderland; Sinauer Associates, 2000).
Template strand = antisense strand
Non-template strand = sense strand
The template strand is anti-parallel to the
growing mRNA molecule
The non-template (sense) strand has the same
sequence as the mRNA molecule
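A small sketch of why the sense strand matches the mRNA: pairing each base of the template strand with its RNA partner rebuilds the sense sequence, with U in place of T. The function name `transcribe` and both sequences are made up for this example.

```python
# RNA pairing rules against a DNA template: A->U, T->A, G->C, C->G.
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5: str) -> str:
    """Build the mRNA (5' -> 3') from a template strand written 3' -> 5'."""
    return "".join(RNA_PAIRS[base] for base in template_3to5)

sense = "ATGGTCTTA"      # non-template strand, 5' -> 3' (made-up)
template = "TACCAGAAT"   # template strand, written 3' -> 5'

mrna = transcribe(template)
print(mrna)                              # the RNA copy, with U
print(mrna.replace("U", "T") == sense)   # same letters as the sense strand
```

Writing the template 3' to 5' is a convenience of this sketch so that the bases line up position by position with the mRNA.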
Genes can be found on both
strands of a chromosome
Forward strand
Reverse strand
The original RNA molecule undergoes
processing that changes the sequence
Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
The original RNA molecule is processed
Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
Exons are segments
of DNA that are found
in mature mRNA
Introns are segments
of DNA that are
removed through
splicing. They are
not found in mRNA
The sequence in red is
the coding sequence
(often abbreviated
CDS)
In the mRNA the exons are joined together
as one continuous sequence
Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
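Splicing can be pictured as cutting out the exon segments and joining them end to end. The sketch below uses a toy pre-mRNA (exons in upper case, introns in lower case) and hypothetical exon coordinates; `splice` is a name invented here, not a library function.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments; coordinates are 1-based and inclusive."""
    return "".join(pre_mrna[start - 1:end] for start, end in exons)

# Toy pre-mRNA: three exons (upper case) separated by two introns (lower case).
pre_mrna = "ATGGCC" + "gtaagtttcag" + "GGATTC" + "gtgagtcctag" + "TGGTAA"
exons = [(1, 6), (18, 23), (35, 40)]   # hypothetical exon positions

mature = splice(pre_mrna, exons)
print(mature)
```

The result contains only exonic sequence, as one continuous string.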
Translation is the process by which an
mRNA molecule is used to make a protein
+1 is the first translated
nucleotide, usually the A of the
start codon ATG
(ATG = methionine)
The red indicates all the sequence within
the mRNA that will be used during
translation to code for protein
The sequences within an mRNA that do
not directly code for protein are called
Untranslated Regions
5’ UTR-
UnTranslated Region
before start codon—
does not code for
protein
3’ UTR-
UnTranslated Region
after stop codon—does
not code for protein
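The three parts of an mRNA can be separated with a simple scan. This sketch makes two loud assumptions: the first ATG is the true start codon, and the coding sequence ends at the first in-frame TAA, TAG or TGA. The function name `split_mrna` and the sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def split_mrna(cdna: str) -> tuple[str, str, str]:
    """Return (5' UTR, coding sequence, 3' UTR).

    Assumes the first ATG is the start codon (a simplification).
    """
    start = cdna.find("ATG")
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon from the start until an in-frame stop codon.
    for i in range(start, len(cdna) - 2, 3):
        if cdna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return cdna[:start], cdna[start:end], cdna[end:]
    raise ValueError("no in-frame stop codon found")

five_utr, cds, three_utr = split_mrna("GGCACCATGGTCTTATAAGCTTTT")
print("5' UTR:", five_utr)
print("CDS:   ", cds)
print("3' UTR:", three_utr)
```

The coding sequence here includes the stop codon, as the CDS feature in a sequence record usually does.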
mRNA is converted to cDNA using reverse
transcription
Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).
Because it is cDNA, not mRNA, that is
sequenced, we use T, not U, in sequence
files
Alberts, B. et al. Molecular Biology of the Cell (New York; Garland, 1994).
How do we identify introns/exons in our
sequence files?
We will use KRAS as an example
The KRAS gene produces 4 transcripts
(splice variants)
Transcript
Table
This is the transcript diagram for this gene
region
The Transcript Diagram shows the organization
of the transcripts generated from the gene locus
Use the link under the “Transcript ID” column
to identify the exons and introns in a specific
transcript
The exon/intron map for a specific transcript
The lines are intronic sequence
Bars are exonic sequence: filled bars
mean coding sequence and unfilled bars
are UTR sequence
The number of introns is always the number of exons minus 1.
5 exons means 4 introns
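The exons-minus-one rule follows from introns being the gaps between neighbouring exons. A minimal sketch, using hypothetical exon coordinates (not the real KRAS positions) and an invented helper name `introns_from_exons`:

```python
def introns_from_exons(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the gaps between consecutive exons.

    Exons are (start, end) pairs, 1-based and inclusive, sorted by position.
    """
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

# Hypothetical coordinates for a 5-exon transcript.
exons = [(1, 120), (501, 620), (1301, 1480), (2001, 2160), (3001, 3500)]
introns = introns_from_exons(exons)

print(len(exons), "exons,", len(introns), "introns")
print(introns)
```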
The RefSeq link will direct you to the NCBI
nucleotide record for that gene
NCBI nucleotide record
NCBI nucleotide record continued
NCBI nucleotide record also contains the
sequence
Every nucleotide within the sequence has
an exact position
Each nucleotide has a number associated
with its position
NCBI nucleotide contains the annotation of
the sequence
The numbers refer to nucleotide positions
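Feature positions in a sequence record count from 1 and include both end points, while Python slices count from 0 and exclude the end. The sketch below converts between the two for a simple `start..end` location; it does not handle compound locations, and the function name and sequence are made up.

```python
def feature_sequence(seq: str, location: str) -> str:
    """Extract a simple 'start..end' feature (1-based, inclusive positions)."""
    start, end = (int(part) for part in location.split(".."))
    return seq[start - 1:end]   # shift the start by one for Python slicing

seq = "GGCACCATGGTCTTATAAGCTTTT"   # made-up 24-nucleotide record

print(seq[0])                           # nucleotide at position 1
print(feature_sequence(seq, "7..18"))   # a feature annotated as 7..18
```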
Viewing features within the
sequence file
Once you select a sequence feature, the
nucleotide sequence of the feature
becomes highlighted
CDS stands for coding sequence and this
will also show you the translation of the
nucleotide sequence into amino acid
sequence
DNA → RNA → protein
Bioinformatics and Functional Genomics, 2nd Edition. http://www.bioinfbook.org (2014).
The genetic code
The genetic code is based on three
nucleotides “coding” for one amino acid
Korf, Y., Yandell, M. & Bedell, J. BLAST: an essential Guide to the Basic Local Alignment Search Tool (Sebastopol;
O’Reilly, 2003).
Codons
Amino acid
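The codon-to-amino-acid lookup can be written as a dictionary. This sketch builds the standard genetic code from a compact 64-letter string (codons ordered T, C, A, G at each position; `*` marks a stop codon) and translates a made-up coding sequence. The names `CODON_TABLE` and `translate` are choices made for this example.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)

def translate(cds: str) -> str:
    """Translate a coding sequence three nucleotides at a time."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

print(CODON_TABLE["ATG"])          # the start codon codes for methionine
print(translate("ATGGTCTTATAA"))   # made-up CDS: ATG GTC TTA TAA
```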
An Open Reading Frame (ORF)
begins with ATG and ends with TAA,
TAG or TGA
Korf, Y., Yandell, M. & Bedell, J. BLAST: an essential Guide to the Basic Local Alignment Search Tool (Sebastopol;
O’Reilly, 2003).
To find the coding sequence you must
identify the start and stop codons within the
sequence
Which start codon is right?
The correct ORF is usually the longest
translated sequence
Any sequence has 6 possible
reading frames
Two strands of DNA
Triplet code (three
nucleotides in a codon)
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1
5’ C GCA TGG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTT AA 3’ FRAME +2
5’ CG CAT GGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TTA A 3’ FRAME +3
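The three forward frames differ only in how many leading nucleotides are skipped before grouping into triplets. A minimal sketch using the slide's sequence; `reading_frame` is a name invented here, and incomplete codons at either end are simply dropped.

```python
def reading_frame(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons after skipping `offset` bases (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

seq = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"

for offset in range(3):
    print(f"FRAME +{offset + 1}:", " ".join(reading_frame(seq, offset)))
```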
The next three reading frames are based
on the reverse complement sequence
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence
5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
Generating the reverse complement
sequence
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence
5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
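The two steps shown above (complement each base, then reverse the order) can be sketched with Python's built-in string translation. The function name `reverse_complement` is chosen for this example.

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' -> 3'."""
    return seq.translate(COMPLEMENT)[::-1]

seq = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"
print("Complement:        ", seq.translate(COMPLEMENT))
print("Reverse complement:", reverse_complement(seq))
```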
The 6 possible reading frames
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
3’ GCGTACCAGAATGCGACCTCGAGAGTACCTAGCCAAATT 5’ Complement Sequence
5’ TTAAACCGATCCATGAGAGCTCCAGCGTAAGACCATGCG 3’ Reverse Complement
5’ TTA AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC CAT GCG 3’ FRAME -1
5’ T TAA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACC ATG CG 3’ FRAME -2
5’ TT AAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CCA TGC G 3’ FRAME -3
The correct reading frame will
have the largest ORF
5’ M V L R W S S H G S V Ter 3’
5’ CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA 3’
5’ CGC ATG GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT TAA 3’ FRAME +1
ATG (M) is the start codon
TAA, TAG or TGA are the three stop codons—they do
not code for an amino acid
(amino acids)
Always
begins with
ATG
Always ends
with a stop
codon
Using the ORF-finder
program to identify ORFs
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Or Google “ORF-finder”
Using ORF-finder
Results from ORF-finder
There are 6 possible reading
frames
For our purposes, the largest
ORF is the correct one
Selecting an ORF gives you
the translation
ORFs begin with a start codon
and end with a stop codon
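The idea behind a six-frame ORF search can be sketched as follows. This is only an illustration of the principle, not how the ORF-finder program is implemented; all function names are made up, and an ORF is taken to run from an ATG to the next in-frame stop codon.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_frame(seq: str, offset: int):
    """Yield each ATG-to-stop stretch (as nucleotides) in one reading frame."""
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    start = None
    for idx, codon in enumerate(codons):
        if codon == "ATG" and start is None:
            start = idx                       # open an ORF at the first ATG
        elif CODON_TABLE[codon] == "*" and start is not None:
            yield "".join(codons[start:idx + 1])
            start = None                      # close it at the stop codon

def six_frame_orfs(seq: str) -> list[tuple[str, str]]:
    """Search frames +1..+3 on seq and -1..-3 on its reverse complement."""
    reverse = seq.translate(COMPLEMENT)[::-1]
    found = []
    for sign, strand in (("+", seq), ("-", reverse)):
        for offset in range(3):
            for orf in orfs_in_frame(strand, offset):
                found.append((f"{sign}{offset + 1}", orf))
    return found

seq = "CGCATGGTCTTACGCTGGAGCTCTCATGGATCGGTTTAA"
orfs = six_frame_orfs(seq)
frame, longest = max(orfs, key=lambda item: len(item[1]))
protein = "".join(CODON_TABLE[longest[i:i + 3]] for i in range(0, len(longest), 3))

print(orfs)
print("Longest ORF is in frame", frame, "and translates to", protein.rstrip("*"))
```

For the example sequence, complete ORFs are found in frames +1 and -1, and the longer one is in frame +1.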
ORF-finder results match with
NCBI nucleotide
Some sequences found in the genomic DNA
are removed from the mRNA
Introns are the
sequences that
are removed
The mature mRNA
sequence contains only
exonic sequence
An mRNA sequence includes 5’UTR,
ORF, 3’UTR
5’ UTR-
Untranslated region
before start codon—
does not code for
protein
3’ UTR-
Untranslated
region after stop
codon—does not
code for protein
Coding sequence
(red)
There are 6 possible reading frames in a
nucleic acid sequence
The correct ORF is usually the largest
ORFs start with ATG and end with a stop
codon
Worksheet