oliver-he
8/10/2019 Gene Calling
http://slidepdf.com/reader/full/gene-calling 1/59
Bioinformatics:
Mining a Mountain of Data
Where are the putative genes?
Brenner & Crick: A codon is composed of three nucleotides
and the starting point of each gene establishes a reading frame.
Fig. 8.5
Central Dogma of Molecular Genetics
DNA → RNA → Protein
Yanofsky: Genes & proteins are colinear
Sequence Annotation & Functional Genomics
[Flowchart of a genome sequencing project. Box labels: Genome; Random Pieces; Shotgun Genomic Libraries; 6-8X Sequencing Coverage; Overlaps in Small Pieces to Form Contigs; Join Large Pieces into Sequenced Genome; Genetic/Physical Map; Subgenomic Mega-fragments; Subgenomic Libraries; 1X Sequencing Coverage; Annotation of Contig Ends; Gap Closure; Annotation; Functional Genomics]
Annotation Pipeline
The Need for Humans
[Figure: annotation map with a 0 kb, 10 kb, 20 kb scale]
Genome                           # ORFs   % with functional prediction   % without prediction but with similarity
E. coli K12 DH10B                4126     81.9                           14.7
Agrobacterium tumefaciens C58    5402     63.3                           34.0
Vibrio cholerae O1 el tor        3835     57.2                           38.7
Staphylococcus aureus COL        2652     63.5                           32.3
Planctomyces limnophilus         4304     36.2                           61.6
What about transmembrane domains, conserved
small domains (e.g., PFAMs), etc.?
Statistical approaches
Genetic Code
• Triplet
• Nonoverlapping
• Comma-less
• Redundant
6 frame translation of sequence
ATGCTTTGCTTGGAT
|||||||||||||||
TACGAAACGAACCTA
frame 1 ATG CTT TGC TTG GAT
frame 2 TGC TTT GCT TGG ATT
frame 3 GCT TTG CTT GGA TTC
frame -1 ATC CAA GCA AAG CAT
frame -2 TCC AAG CAA AGC ATG
frame -3 CCA AGC AAA GCA TGC
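The six-frame split above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch emits complete codons only, so frames 2, 3, -2 and -3 come out one codon shorter than on the slide, which borrows bases from beyond the 15 nt shown.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (uppercase ACGT assumed)."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frame_codons(seq):
    """Map frame number (1, 2, 3, -1, -2, -3) to its list of complete codons."""
    frames = {}
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time, skipping a ragged tail.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[strand * (offset + 1)] = codons
    return frames

frames = six_frame_codons("ATGCTTTGCTTGGAT")
for name in (1, 2, 3, -1, -2, -3):
    print(f"frame {name:+d}", " ".join(frames[name]))
```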
Open Reading Frames (ORFs)
Detect potential coding regions by looking at ORFs
A genome of length n comprises about n/3 codons in a single reading frame (about 2n codons once all six frames are counted)
Stop codons break the genome into segments between consecutive stop codons
The subsegments of these that start from the start codon (ATG) are ORFs
ORFs in different frames may overlap
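A minimal sketch of that scan, assuming the standard stop codons TAA, TAG and TGA and ATG as the only start codon. The function name and the `min_codons` parameter are illustrative, not taken from any tool; only the forward strand is scanned, and the reverse strand can be covered by running the same function on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG after the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None                   # wait for the next ATG
    return orfs

print(find_orfs("CCATGAAATAGCC"))
```

Raising `min_codons` turns this into the length-threshold filter discussed on the next slide.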
Long vs. Short ORFs
A long open reading frame may be a gene
At random, we should expect one stop codon every (64/3) ~= 21 codons
However, genes are usually much longer than this
A basic approach is to scan for ORFs whose length exceeds a certain threshold
This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
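The arithmetic behind the length test can be made explicit. The sketch below assumes all 64 codons are equally likely and independent, which real genomes only approximate; the function name is hypothetical.

```python
def prob_no_stop(k, stop_fraction=3 / 64):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - stop_fraction) ** k

# Expected spacing between stop codons under the uniform assumption.
expected_gap = 64 / 3          # about 21 codons, as stated on the slide

# A random stop-free run of 100 codons is already rare (under 1%).
print(round(expected_gap, 1), prob_no_stop(100))
```

This is why a long ORF is evidence for a gene, and also why the test has little power for short genes: a run of 20 stop-free codons occurs by chance far too often to be informative.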
Testing ORFs: Codon Usage
Create a 64-element hash table and count the frequencies of codons in an ORF
Amino acids typically have more than one codon, but in nature certain codons are used more often than others (codon usage)
Uneven use of the codons may characterize a real gene
This compensates for pitfalls of the ORF length test
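The counting step might look like the following sketch, which uses Python's `collections.Counter` as the hash table. The function names are illustrative, and the ORF is assumed to be given in frame, starting at its first codon.

```python
from collections import Counter

def codon_counts(orf):
    """Count each complete in-frame codon of an ORF (at most 64 distinct keys)."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return Counter(codons)

def codon_frequencies(orf):
    """Relative frequency of each codon observed in the ORF."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

print(codon_frequencies("ATGCTGCTGTAA"))
```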
Codon Usage in Human Genome
Codon usage in Mouse
AA codon /1000 frac
Leu CTG 39.95 0.40
Leu CTA 7.89 0.08
Leu CTT 12.97 0.13
Leu CTC 20.04 0.20
Ala GCG 6.72 0.10
Ala GCA 15.80 0.23
Ala GCT 20.12 0.29
Ala GCC 26.51 0.38
Gln CAG 34.18 0.75
Gln CAA 11.51 0.25
Codon Usage and Likelihood Ratio
An ORF is more “believable” than another if it has more “likely” codons
Do sliding window calculations to find ORFs that have the “likely” codon usage
Allows for higher precision in identifying true ORFs; much better than merely testing for length.
However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio
Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
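One way to turn "more likely codons" into a number is a log-likelihood ratio of a coding codon table against a background model. The sketch below is an assumption-laden illustration: the background is taken as uniform (1/64 per codon), unseen codons get a small floor frequency, the function names are hypothetical, and the two-entry table in the usage lines simply reuses the per-thousand CTG and CTA values from the mouse table above.

```python
import math

def log_likelihood_ratio(codons, coding_freq, background=1 / 64, floor=1e-4):
    """Sum of log(P(codon | coding) / P(codon | background)) over the codons.

    coding_freq maps codon -> frequency among all codons in known genes.
    Codons missing from the table are given the (assumed) floor frequency.
    """
    return sum(
        math.log(max(coding_freq.get(c, 0.0), floor) / background) for c in codons
    )

def sliding_scores(seq, coding_freq, window_codons=30, frame=0):
    """Score every window of window_codons consecutive codons in one frame."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [
        log_likelihood_ratio(codons[k:k + window_codons], coding_freq)
        for k in range(len(codons) - window_codons + 1)
    ]

# Two entries from the mouse table, converted from per-thousand to fractions.
mouse = {"CTG": 39.95 / 1000, "CTA": 7.89 / 1000}
print(sliding_scores("CTG" * 5, mouse, window_codons=3))
```

Positive scores favor the coding model. With short windows the sum has few terms, which is the reason short exons give unreliable peaks.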
TestCode Statistics
Define a window size of no less than 200 bp and slide the window down the sequence 3 bases at a time.
In each window:
Calculate for each base {A, T, G, C}: $\max(n_{3k+1}, n_{3k+2}, n_{3k}) / \min(n_{3k+1}, n_{3k+2}, n_{3k})$, where $n_{3k+j}$ is the number of times that base occurs at window positions of the form 3k+j
Use these values to obtain a probability from a lookup table (which was previously defined and determined experimentally with known coding and noncoding sequences)
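The per-window ratio can be sketched as below. This covers only the max/min position statistic as written on the slide; the probability lookup table is not reproduced here, and returning infinity when a base never occurs at one of the three positions is an assumption of this sketch rather than part of the slide. The function name is hypothetical.

```python
def position_asymmetry(window):
    """For each base, max/min of its counts at the three codon positions."""
    ratios = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for pos, b in enumerate(window):
            if b == base:
                counts[pos % 3] += 1       # which of the three positions
        lowest = min(counts)
        # Assumption: report infinity instead of dividing by zero.
        ratios[base] = max(counts) / lowest if lowest else float("inf")
    return ratios

print(position_asymmetry("AGAAGAAAG"))
```

A sequence with no periodicity gives ratios near 1 for every base, while coding sequence tends to show a clear preference for some positions.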
TestCode Statistics
Probabilities can be classified as indicative of “coding” or “noncoding” regions, or “no opinion” when it is unclear what level of randomization tolerance a sequence carries
The resulting sequence of probabilities can be plotted
TestCode sample output
Gene Prediction:
Similarity Based Approaches
Outline
• The idea of similarity-based approach to gene prediction
• Exon Chaining Problem
• Spliced Alignment Problem
• Gene prediction tools
Using Known Genes to Predict New Genes
• Some genomes may be very well-studied, with
many genes having been experimentally
verified.
• Closely-related organisms may have similar
genes
• Unknown genes in one species may be
compared to genes in some closely-related
species
Similarity-Based Approach to Gene Prediction
• Genes in different organisms are similar
• The similarity-based approach uses known
genes in one genome to predict (unknown)
genes in another genome
• Problem: Given a known gene and an
unannotated genome sequence, find a set
of substrings of the genomic sequence
whose concatenation best fits the gene
Comparing Genes in Two Genomes
• Small islands of similarity corresponding to similarities between exons
Reverse Translation
• Given a known protein, find a gene in the genome which codes for it
• One might infer the coding DNA of the given protein by reversing the translation process
– Inexact: amino acids map to > 1 codon
– This problem is essentially reduced to an alignment problem
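To see how inexact reverse translation is, one can count the DNA sequences that encode a given protein. The sketch below assumes the standard genetic code (61 sense codons) and one-letter amino acid codes; the table and function names are illustrative.

```python
# Number of codons per amino acid in the standard genetic code.
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}

def count_encodings(protein):
    """Number of distinct coding DNA sequences for a protein (stop excluded)."""
    total = 1
    for residue in protein:
        total *= DEGENERACY[residue]
    return total

print(count_encodings("MLSR"))
```

The count grows exponentially with protein length, so enumerating candidate DNA sequences is hopeless and aligning the protein against the genome is the practical route.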
Comparing Genomic DNA Against mRNA
[Figure: a portion of the genome, with exon1, intron1, exon2, intron2 and exon3 marked, compared against the mRNA (codon sequence)]
Using Similarities to Find the Exon Structure
• The known frog gene is aligned to different locations in the human genome
• Find the “best” path to reveal the exon structure of the human gene
[Figure axes: Human Genome; Frog Gene (known)]
Finding Local Alignments
Use local alignments to find all islands of similarity
[Figure axes: Human Genome; Frog Genes (known)]
Chaining Local Alignments
• Find substrings that match a given gene sequence (candidate exons)
• Define a candidate exon as (l, r, w): left, right, and weight, where the weight is the score of the local alignment
• Look for a maximum chain of substrings
– Chain: a set of non-overlapping nonadjacent intervals.
Exon Chaining Problem
• Locate the beginning and end of each interval (2n points)
• Find the “best” path
[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) drawn over the endpoints 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32]
Exon Chaining Problem: Formulation
• Exon Chaining Problem: Given a set of putative exons, find a maximum set of non-overlapping putative exons
• Input: a set of weighted intervals (putative exons)
• Output: A maximum chain of intervals from this set
Exon Chaining Problem: Graph Representation
• This problem can be solved with dynamic programming in O(n) time, once the 2n interval endpoints are in sorted order.
Exon Chaining Algorithm
ExonChaining(G, n) // Graph, number of intervals
for i ← 0 to 2n
    s_i ← 0
for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
        j ← index of vertex for left end of the interval I
        w ← weight of the interval I
        s_i ← max {s_j + w, s_(i-1)}
    else
        s_i ← s_(i-1)
return s_2n
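The pseudocode above translates almost line for line into Python. In this sketch the graph is replaced by a sorted list of endpoint events, the function name and the example intervals are hypothetical, and intervals that share an endpoint are treated as overlapping (one reading of "nonadjacent").

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (left, right, weight) tuples with left < right.
    """
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))    # 0 = left end, sorts before a right end
        events.append((right, 1, idx))   # 1 = right end
    events.sort()

    best = 0               # plays the role of s_(i-1) in the pseudocode
    score_at_left = {}     # s_j recorded when each interval's left end is passed
    for _pos, kind, idx in events:
        if kind == 0:
            score_at_left[idx] = best
        else:
            best = max(best, score_at_left[idx] + intervals[idx][2])
    return best

# Made-up candidate exons: (left, right, weight).
print(exon_chaining([(0, 3, 5), (2, 6, 4), (4, 8, 7), (9, 12, 3)]))
```

The sort makes this O(n log n) overall; the scan itself is the linear pass of the pseudocode. Recovering the chain itself, not just its score, would need a back-pointer per right end.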
Exon Chaining: Deficiencies
– Poor definition of the putative exon endpoints
– Optimal chain of intervals may not correspond to any valid alignment
• First interval may correspond to a suffix, whereas second interval may correspond to a prefix
• Combination of such intervals is not a valid alignment
Infeasible Chains
Red local similarities form two non-overlapping intervals but do not form a valid global alignment
[Figure axes: Human Genome; Frog Genes (known)]
Using Blueprint
Assembling Putative Exons
Still Assembling Putative Exons
Spliced Alignment
• Mikhail Gelfand and colleagues proposed a spliced alignment approach that uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.
– Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).
– This set is further filtered in such a way as to retain all true exons, along with some false ones.
Spliced Alignment Problem: Formulation
• Goal: Find a chain of blocks in a genomic
sequence that best fits a target sequence
• Input: Genomic sequence G, target sequence T, and a set of candidate exons B.
• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.
Γ* - concatenation of all exons from chain Γ
Lewis Carroll Example
Spliced Alignment: Idea
• Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T:
• S(i,j)
• But what is “i-prefix” of G?
• There may be a few i-prefixes of G depending on
which block B we are in.
• Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T under the assumption that the alignment uses the block B at position i: S(i,j,B)
Spliced Alignment Recurrence
If i is not the starting vertex of block B:
$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - \text{indel penalty} \\ S(i, j-1, B) - \text{indel penalty} \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$
If i is the starting vertex of block B:
$$S(i, j, B) = \max \begin{cases} S(i, j-1, B) - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\text{end}(B'), j, B') - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\text{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$
Spliced Alignment Solution
• After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:
$$\max_{\text{all blocks } B} S(\text{end}(B), \text{length}(T), B)$$
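The recurrence and the final maximum can be sketched compactly by processing blocks in order of their end position and keeping, for each block, only its last row S(end(B), ·, B). Everything named here is an assumption of the sketch: blocks are 0-based inclusive (start, end) index pairs into the genome, δ is a simple match/mismatch score, a chain may begin with any block (the empty chain is treated as a predecessor), and a block B' precedes B when end(B') < start(B).

```python
def spliced_alignment(genome, target, blocks, match=1, mismatch=-1, indel=1):
    """Score of the best global alignment between a chain of blocks and target."""
    m = len(target)
    last_row = {}  # (start, end) -> [S(end, j, B) for j in 0..m]
    for start, end in sorted(blocks, key=lambda b: b[1]):
        # Entry row: best score of any chain ending before this block,
        # or of the empty chain (target prefix aligned against gaps only).
        row = [-indel * j for j in range(m + 1)]
        for (_ps, pe), prev in last_row.items():
            if pe < start:
                row = [max(a, b) for a, b in zip(row, prev)]
        # Ordinary global-alignment rows for the bases inside the block.
        for i in range(start, end + 1):
            new = [row[0] - indel]
            for j in range(1, m + 1):
                delta = match if genome[i] == target[j - 1] else mismatch
                new.append(max(row[j] - indel,        # gap in target
                               new[j - 1] - indel,    # gap in genome
                               row[j - 1] + delta))   # match or mismatch
            row = new
        last_row[(start, end)] = row
    return max(r[m] for r in last_row.values())

# Toy example: blocks "ACG" and "GCA" together spell the target.
print(spliced_alignment("ACGTTTTGCA", "ACGGCA", [(0, 2), (3, 6), (7, 9)]))
```

Because blocks are visited by increasing end position, every block that can precede the current one has already been scored when its entry row is built.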
Spliced Alignment: Complications
• Considering multiple i-prefixes slows the computation down. Running time:
$O(mn^2 |B|)$
where m is the target length, n is the genomic sequence length and |B| is the number of blocks.
• A mosaic effect: short exons are easily combined to fit any target protein
Exon Chaining vs Spliced Alignment
• In Spliced Alignment, every path spells out the string obtained by concatenating the labels of its edges. The weight of the path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence
– Defines weight of entire path in graph, but not the weights for individual edges.
• Exon Chaining assumes the positions and weights of exons are pre-defined
Gene Prediction: Aligning Genome vs. Genome
• Align entire human and mouse genomes
• Predict genes in both sequences
simultaneously as chains of aligned blocks (exons)
• This approach does not assume any annotation of either human or mouse genes.
Gene Prediction Tools
• GENSCAN/GenomeScan
• TwinScan
• Glimmer
• GeneMark
The GENSCAN Algorithm
• Algorithm is based on a probabilistic model of gene structure similar to Hidden Markov Models (HMMs).
• GENSCAN uses a training set to estimate the HMM parameters, then returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm).
– Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc.)
– Covers cases where input sequence contains no gene, partial gene, complete gene, multiple genes.
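To illustrate the Viterbi step only, here is a toy two-state HMM in Python. It is not GENSCAN's model, which is far richer (explicit exon and intron length distributions, splice signals, both strands); the states, transition values and emission values below are invented for the example.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation string (log-space Viterbi)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            col[s] = scores[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][symbol])
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical parameters: "C" (coding, GC-rich) and "N" (noncoding, AT-rich).
STATES = ("C", "N")
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {"C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
        "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}

print("".join(viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)))
```

In a real gene finder the parameters would be estimated from a training set of annotated genes, as the slide describes.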
GENSCAN Limitations
– Does not use similarity search to predict genes.
– Does not address alternative splicing.
– Could combine two exons from consecutive genes together
GenomeScan
• Incorporates similarity information into GENSCAN: predicts gene structure which corresponds to maximum probability conditional on similarity information
• Algorithm is a combination of two sources of information
– Probabilistic models of exons-introns
– Sequence similarity information
TwinScan
• Aligns two sequences and marks each base as gap (-), mismatch (:), match (|), resulting in a new alphabet of 12 letters: Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.
• Run Viterbi algorithm using emissions e_k(b) where b ∊ {A-, A:, A|, …, T|}.
http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf
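Building the 12-letter sequence from a pairwise alignment might look like the sketch below. It assumes the alignment has already been projected onto the target sequence, so the target has no gap characters and the informant string has the same length, with "-" where nothing aligns. The function name is hypothetical and this is not TwinScan's own code.

```python
def conservation_symbols(target, informant):
    """Pair each target base with '-', ':' or '|' from an aligned informant."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if q == "-":
            mark = "-"      # gap: no informant base aligned here
        elif q == t:
            mark = "|"      # match
        else:
            mark = ":"      # mismatch
        symbols.append(t + mark)
    return symbols

print(conservation_symbols("ACGT", "AC-A"))
```

The resulting symbols are what the HMM emits in place of plain bases, so each state needs 12 emission probabilities instead of 4.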
TwinScan (cont’d)
• The emission probabilities are estimated from human/mouse gene pairs.
– Ex. e_I(x|) < e_E(x|) since matches are favored in exons, and e_I(x-) > e_E(x-) since gaps (as well as mismatches) are favored in introns.
– Compensates for dominant occurrence of poly-A region in introns
Glimmer
• Gene Locator and Interpolated Markov ModelER
• Finds genes in bacterial DNA
• Uses interpolated Markov Models
The Glimmer Algorithm
• Made of 2 programs
– BuildIMM
• Takes sequences as input and outputs the Interpolated Markov Models (IMMs)
– Glimmer
• Takes IMMs and outputs all candidate genes
• Automatically resolves overlapping genes by choosing one, hence limited
• Marks “suspected to truly overlap” genes for closer inspection by user
GeneMark
• Based on non-stationary Markov chain models
• Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence