
Gene Calling


Page 1: Gene Calling

8/10/2019 Gene Calling

http://slidepdf.com/reader/full/gene-calling 1/59


Bioinformatics:

Mining a Mountain of Data 

Where are the putative genes?



Brenner & Crick: A codon is composed of three nucleotides

and the starting point of each gene establishes a reading frame.

Fig. 8.5


Central Dogma of Molecular Genetics

DNA → RNA → Protein

Yanofsky: Genes & proteins are colinear 


Sequence Annotation & Functional Genomics

[Flowchart. Box labels: Genome; Random Pieces; Shotgun Genomic Libraries; 6-8X Sequencing Coverage; Overlaps in Small Pieces to Form Contigs; Join Large Pieces into Sequenced Genome; Genetic/Physical Map; Subgenomic Mega-fragments; Subgenomic Libraries; 1X Sequencing Coverage; Annotation of Contig Ends; Gap Closure; Annotation; Functional Genomics]


Annotation Pipeline: The Need for Humans

[Figure: map with scale marks at 0 kb, 10 kb, and 20 kb]

Genome                           # ORFs   % with functional   % without prediction
                                          prediction          but with similarity
E. coli K12 DH10B                4126     81.9                14.7
Agrobacterium tumefaciens C58    5402     63.3                34.0
Vibrio cholerae O1 el tor        3835     57.2                38.7
Staphylococcus aureus COL        2652     63.5                32.3
Planctomyces limnophilus         4304     36.2                61.6

What about transmembrane domains, conserved

small domains (e.g., PFAMs), etc.?


Statistical approaches


Genetic Code

• Triplet

• Nonoverlapping

• Comma-less

• Redundant


6 frame translation of sequence

ATGCTTTGCTTGGAT

|||||||||||||||

TACGAAACGAACCTA

frame 1 ATG CTT TGC TTG GAT

frame 2 TGC TTT GCT TGG ATT

frame 3 GCT TTG CTT GGA TTC

frame -1 ATC CAA GCA AAG CAT

frame -2 TCC AAG CAA AGC ATG

frame -3 CCA AGC AAA GCA TGC
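
The six-frame split shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and only complete codons are kept, so frames 2 and 3 of the 15-base example yield four codons each (the slide shows a fifth codon that draws on bases beyond the displayed sequence).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def codons(seq, offset):
    """Split seq into complete codons, starting at a 0-based offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame labels (+1..+3 and -1..-3) to lists of complete codons."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)   # forward strand
        frames[f"-{offset + 1}"] = codons(rc, offset)    # reverse strand
    return frames
```

Calling `six_frames("ATGCTTTGCTTGGAT")` returns a dictionary with six entries, one per frame, which could then be translated or scanned for stop codons.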


Open Reading Frames (ORFs) 

Detect potential coding regions by looking at ORFs

A genome of length n comprises about n/3 codons in any one reading frame; over all six frames there are about 6 × (n/3) = 2n codons

Stop codons break the genome into segments between consecutive stop codons

The subsegments of these that start from the start codon (ATG) are ORFs

ORFs in different frames may overlap
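
A minimal sketch of this ORF definition, assuming the standard stop codons TAA, TAG, and TGA and scanning only the forward strand. The function name, the `min_codons` parameter, and the coordinate convention are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) tuples for ORFs on the forward strand.

    start and end are 0-based, end is exclusive, and the span includes the
    stop codon. An ORF begins at the first ATG after the previous stop codon
    in the same frame. min_codons counts codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first start codon in this segment
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None                   # a stop codon closes the segment
    return orfs
```

To cover all six frames, the same function could be run on the reverse complement of the sequence.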


Long vs. Short ORFs

Long open reading frames may be genes

At random, we should expect one stop codon every 64/3 ≈ 21 codons

However, genes are usually much longer than this

A basic approach is to scan for ORFs whose length exceeds a certain threshold

This is naïve because some genes (e.g., some neural and immune system genes) are relatively short
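
The arithmetic behind the length test can be checked with a short sketch. It assumes, as the slide does implicitly, that all 64 codons are equally likely and independent; real genomes deviate from this, so the numbers are only a rough guide. The function names are illustrative.

```python
def expected_codons_between_stops(n_stop=3, n_codons=64):
    """Average spacing of stop codons in random sequence: 64/3, about 21."""
    return n_codons / n_stop


def prob_random_orf_at_least(k, n_stop=3, n_codons=64):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - n_stop / n_codons) ** k
```

For example, `prob_random_orf_at_least(100)` is below one percent under this model, which is why a long stop-free run is treated as evidence of a gene.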


Testing ORFs: Codon Usage 

Create a 64-element hash table and count the frequencies of codons in an ORF

Amino acids typically have more than one codon, but in nature certain codons are used more often than others (codon usage)

Uneven use of the codons may characterize a real gene

This compensates for pitfalls of the ORF length test
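
The codon hash table described above might look like this in Python, using `collections.Counter` as the table. The function names are illustrative, and the ORF is assumed to start in frame.

```python
from collections import Counter


def codon_counts(orf):
    """Count the in-frame codons of an ORF (a trailing partial codon is ignored)."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


def codon_frequencies(orf):
    """Convert codon counts to relative frequencies within the ORF."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Only codons that actually occur get an entry, so the table holds at most 64 keys.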


Codon Usage in Human Genome 


Codon usage in Mouse

AA    codon   /1000   frac
Leu   CTG     39.95   0.40
Leu   CTA      7.89   0.08
Leu   CTT     12.97   0.13
Leu   CTC     20.04   0.20
Ala   GCG      6.72   0.10
Ala   GCA     15.80   0.23
Ala   GCT     20.12   0.29
Ala   GCC     26.51   0.38
Gln   CAG     34.18   0.75
Gln   CAA     11.51   0.25


Codon Usage and Likelihood Ratio 

An ORF is more “believable” than another if it has more “likely” codons

Do sliding window calculations to find ORFs that have the “likely” codon usage

Allows for higher precision in identifying true ORFs; much better than merely testing for length

However, the average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio

Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
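
One way to sketch the sliding-window likelihood ratio is shown below. The slides do not spell out the formula, so this example assumes a log ratio of codon probabilities under a "coding" table versus a "background" table. Both tables, the window size, and the function names are hypothetical inputs, and every codon in the window must have a nonzero entry in both tables.

```python
import math


def log_likelihood_ratio(window, coding_freq, background_freq):
    """Sum of log(P(codon | coding) / P(codon | background)) over in-frame codons."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        score += math.log(coding_freq[codon] / background_freq[codon])
    return score


def sliding_scores(seq, coding_freq, background_freq, window=120, step=3):
    """Score each window of `window` bases, moving `step` bases at a time."""
    return [
        (start, log_likelihood_ratio(seq[start:start + window],
                                     coding_freq, background_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Positive scores mark windows whose codon usage looks more like the coding table. A short window contains few codons, which is the reliability problem the slide raises for 130-nucleotide exons.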


TestCode Statistics 

Define a window of at least 200 bp and slide it along the sequence 3 bases at a time.

In each window:

Calculate for each base in {A, T, G, C}: $\max(n_{3k+1}, n_{3k+2}, n_{3k}) / \min(n_{3k+1}, n_{3k+2}, n_{3k})$, where $n_{3k+1}$, $n_{3k+2}$, and $n_{3k}$ count the occurrences of that base at each of the three codon positions in the window

Use these values to obtain a probability from a lookup table (which was previously defined and determined experimentally with known coding and noncoding sequences)
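
The per-base ratio for one window can be sketched as follows. It covers only the max/min step from the slide; the lookup table that turns these values into probabilities is not reproduced here. The zero-minimum guard is an assumption of this sketch, since the slide does not say how that case is handled.

```python
def position_asymmetry(window):
    """For each base, return max/min of its counts at the three codon positions."""
    ratios = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1          # position within the codon
        # Assumption: treat a zero minimum as 1 to avoid division by zero.
        ratios[base] = max(counts) / max(min(counts), 1)
    return ratios
```

A base spread evenly over the three positions scores 1.0, while a base concentrated at one position scores higher.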


TestCode Statistics 

Probabilities can be classified as indicative of “coding” or “noncoding” regions, or “no opinion” when it is unclear what level of randomization tolerance a sequence carries

The resulting sequence of probabilities can be plotted


Test code sample output


Gene Prediction:

Similarity Based Approaches


Outline

• The idea of the similarity-based approach to gene prediction

• Exon Chaining Problem

• Spliced Alignment Problem

• Gene prediction tools


Using Known Genes to Predict New Genes

• Some genomes may be very well-studied, with

many genes having been experimentally

verified.

• Closely-related organisms may have similar

genes

• Unknown genes in one species may be

compared to genes in some closely-related

species


Similarity-Based Approach to Gene Prediction

• Genes in different organisms are similar

• The similarity-based approach uses known

genes in one genome to predict (unknown)

genes in another genome

• Problem: Given a known gene and an

unannotated genome sequence, find a set

of substrings of the genomic sequence

whose concatenation best fits the gene


Comparing Genes in Two Genomes 

• Small islands of similarity corresponding to similarities between exons


Reverse Translation

• Given a known protein, find a gene in the genome which codes for it

• One might infer the coding DNA of the given protein by reversing the translation process

 – Inexact: amino acids map to > 1 codon

 – This problem is essentially reduced to an alignment problem
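
The size of the ambiguity in reverse translation can be made concrete with a short sketch. It assumes the standard genetic code, written in the usual TCAG ordering; the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' marks stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to the protein."""
    total = 1
    for aa in protein:
        total *= len(codons_for(aa))
    return total
```

Because the count multiplies across residues, enumerating every candidate DNA sequence quickly becomes impractical, which is one reason to treat the task as an alignment problem instead.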


Comparing Genomic DNA Against mRNA

[Figure: a portion of the genome compared against the mRNA (codon sequence); exon1, exon2, and exon3 are separated by intron1 and intron2]


Using Similarities to Find the Exon Structure

• The known frog gene is aligned to different locations in the human genome

• Find the “best” path to reveal the exon structure of the human gene

[Figure: axes labeled Human Genome and Frog Gene (known)]


Finding Local Alignments

Use local alignments to find all islands of similarity

[Figure: axes labeled Human Genome and Frog Genes (known)]


Chaining Local Alignments

• Find substrings that match a given gene sequence

(candidate exons)

• Define a candidate exon as a triple (l, r, w): left end, right end, and weight, where the weight is the score of the local alignment

• Look for a maximum chain of substrings

 – Chain: a set of non-overlapping, nonadjacent intervals.


Exon Chaining Problem

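• Locate the beginning and end of each interval (2n points)

• Find the “best” path

[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) drawn above a coordinate axis with endpoints 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32]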


Exon Chaining Problem: Formulation

• Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons

• Input: a set of weighted intervals (putative exons)

• Output: a chain of intervals from this set with maximum total weight



Exon Chaining Problem: Graph Representation

• This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted.


Exon Chaining Algorithm

ExonChaining(G, n)  // graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
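
The pseudocode above can be turned into a short Python sketch. It is an illustration rather than the lecture's implementation: intervals are given as hypothetical `(left, right, weight)` tuples with `left < right`, the endpoints are sorted first, and an interval may only follow a chain that ends strictly before its left end, so intervals that touch at a shared coordinate are not chained.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of non-overlapping intervals."""
    # Sort the distinct endpoints and give each one an index (the "vertices").
    points = sorted({p for left, right, _ in intervals for p in (left, right)})
    index = {p: i for i, p in enumerate(points)}

    # For each right-end vertex, list the (left-end vertex, weight) pairs.
    ending_at = {}
    for left, right, weight in intervals:
        ending_at.setdefault(index[right], []).append((index[left], weight))

    s = [0] * len(points)          # s[i]: best chain weight using points[0..i]
    for i in range(len(points)):
        best = s[i - 1] if i > 0 else 0
        for j, weight in ending_at.get(i, []):
            # Best chain that ends strictly before this interval's left end.
            before_left = s[j - 1] if j > 0 else 0
            best = max(best, before_left + weight)
        s[i] = best
    return s[-1] if s else 0
```

The sort makes this sketch O(n log n) overall; the dynamic programming pass itself visits each endpoint and each interval once.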


Exon Chaining: Deficiencies

 – Poor definition of the putative exon endpoints

 – Optimal chain of intervals may not correspond to any valid alignment

• First interval may correspond to a suffix, whereas second interval may correspond to a prefix

• Combination of such intervals is not a valid alignment


Infeasible Chains

Red local similarities form two non-overlapping intervals but do not form a valid global alignment

[Figure: axes labeled Human Genome and Frog Genes (known)]


Using Blueprint


 Assembling Putative Exons


Still Assembling Putative Exons


Spliced Alignment

• Mikhail Gelfand and colleagues proposed a spliced alignment approach that uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

 – Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

 – This set is further filtered in such a way that it attempts to retain all true exons, along with some false ones.


Spliced Alignment Problem: Formulation

• Goal: Find a chain of blocks in a genomic

sequence that best fits a target sequence

• Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.

• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ* denotes the concatenation of all exons from chain Γ.


Lewis Carroll Example


Spliced Alignment: Idea

• Compute the best alignment between the i-prefix of genomic sequence G and the j-prefix of target T:

• S(i, j)

• But what is the “i-prefix” of G?

• There may be a few i-prefixes of G, depending on which block B we are in.


• Compute the best alignment between the i-prefix of genomic sequence G and the j-prefix of target T under the assumption that the alignment uses the block B at position i: S(i, j, B)


Spliced Alignment Recurrence

If i is not the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - \text{indel penalty} \\ S(i, j-1, B) - \text{indel penalty} \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$

If i is the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i, j-1, B) - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$


Spliced Alignment Solution

• After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:

$$\max_{\text{all blocks } B} S(\mathrm{end}(B), \mathrm{length}(T), B)$$
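
A compact Python sketch of the spliced alignment recurrence follows, under several assumptions that are not in the slides: blocks are hypothetical `(start, end)` pairs of inclusive 0-based positions in G, δ is a simple match/mismatch score, and a chain may begin with any block (target characters before the first block are charged as indels). Only the last row of each block is stored, since later blocks need only S(end(B'), j, B').

```python
def spliced_alignment_score(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and any chain of blocks from G."""
    m = len(T)
    last_row = {}   # (start, end) -> [S(end(B), j, B) for j in 0..m]
    for start, end in sorted(blocks, key=lambda b: b[1]):
        # Scores on entering this block: either no block was used yet
        # (T[:j] aligned against nothing) or some preceding block ended.
        prev = [-indel * j for j in range(m + 1)]
        for (_, other_end), row in last_row.items():
            if other_end < start:                      # B' precedes B
                prev = [max(p, r) for p, r in zip(prev, row)]
        # Inside the block the recurrence is ordinary global alignment.
        for i in range(start, end + 1):
            cur = [prev[0] - indel]
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                cur.append(max(cur[j - 1] - indel,     # gap in G
                               prev[j] - indel,        # gap in T
                               prev[j - 1] + delta))   # match or mismatch
            prev = cur
        last_row[(start, end)] = prev
    return max((row[m] for row in last_row.values()), default=-indel * m)
```

With `G = "AAATGCCCTTAGGG"`, `T = "ATGTTAG"`, and blocks `(2, 4)`, `(5, 7)`, `(8, 11)`, the best chain skips the middle block and joins `ATG` with `TTAG`.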


Spliced Alignment: Complications 

• Considering multiple i-prefixes slows the algorithm down. Running time:

$O(mn^2 |B|)$

where m is the target length, n is the genomic sequence length, and |B| is the number of blocks.

• A mosaic effect: short exons are easily combined to fit any target protein


Exon Chaining vs Spliced Alignment

• In Spliced Alignment, every path spells out a string obtained by concatenating the labels of its edges. The weight of the path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence

 – Defines the weight of the entire path in the graph, but not the weights of individual edges.

• Exon Chaining assumes the positions and weights of exons are pre-defined


Gene Prediction: Aligning Genome vs. Genome

• Align entire human and mouse genomes

• Predict genes in both sequences simultaneously as chains of aligned blocks (exons)

• This approach does not assume any annotation of either human or mouse genes.


Gene Prediction Tools

• GENSCAN/GenomeScan

• TwinScan

• Glimmer

• GeneMark


The GENSCAN Algorithm

• Algorithm is based on a probabilistic model of gene structure similar to Hidden Markov Models (HMMs).

• GENSCAN uses a training set to estimate the HMM parameters, then the algorithm returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm).

 – Biological input: codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc.)

 – Covers cases where the input sequence contains no gene, a partial gene, a complete gene, or multiple genes.
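
The Viterbi step can be illustrated on a toy HMM. The sketch below is not GENSCAN's model: the two states and all probabilities in the usage note are invented for illustration, and the function simply returns the most probable state path for an observed sequence, working in log space.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (non-empty)."""
    # scores[t][s]: best log probability of any path ending in state s at time t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []                                   # back-pointers for traceback
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda p: scores[-1][p] + math.log(trans_p[p][s]),
            )
            row[s] = (scores[-1][best_prev]
                      + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][symbol]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

As a usage example with made-up parameters, take states `"E"` (emits G and C with probability 0.4 each, A and T with 0.1 each) and `"I"` (the reverse), a 0.9 probability of staying in the same state, and equal start probabilities. The sequence `GCGCGCATATAT` is then labeled as six `E` positions followed by six `I` positions.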


GENSCAN Limitations

 – Does not use similarity search to predict genes.

 – Does not address alternative splicing.

 – Could combine two exons from consecutive genes together


GenomeScan

• Incorporates similarity information into GENSCAN: predicts the gene structure that corresponds to maximum probability conditional on similarity information

• Algorithm is a combination of two sources of information

 – Probabilistic models of exons-introns

 – Sequence similarity information


TwinScan

• Aligns two sequences and marks each base as gap (-), mismatch (:), or match (|), resulting in a new alphabet of 12 letters: Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.

• Run the Viterbi algorithm using emissions e_k(b), where b ∈ {A-, A:, A|, …, T|}.

http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf
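
Building the 12-letter composite sequence from a pairwise alignment might look like the sketch below. It follows the marking scheme on this slide and is not TwinScan's actual code: the two inputs are assumed to be equal-length aligned strings with `-` for gaps, and columns where the target itself has a gap are skipped, which is a choice made for this example.

```python
def composite_symbols(target, informant):
    """Pair each target base with '|' (match), ':' (mismatch), or '-' (gap)."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue      # assumption: skip columns with a gap in the target
        if q == "-":
            mark = "-"
        elif t == q:
            mark = "|"
        else:
            mark = ":"
        symbols.append(t + mark)
    return symbols
```

The resulting list can serve as the observation sequence for an HMM whose emission tables are indexed by these two-character symbols.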


TwinScan (cont’d) 

• The emission probabilities are estimated from human/mouse gene pairs.

 – Ex. e_I(x|) < e_E(x|) since matches are favored in exons, and e_I(x-) > e_E(x-) since gaps (as well as mismatches) are favored in introns.

 – Compensates for the dominant occurrence of poly-A regions in introns


Glimmer

• Gene Locator and Interpolated Markov ModelER

• Finds genes in bacterial DNA

• Uses interpolated Markov Models


The Glimmer Algorithm

• Made of 2 programs

 – BuildIMM

   • Takes sequences as input and outputs the Interpolated Markov Models (IMMs)

 – Glimmer

   • Takes IMMs and outputs all candidate genes

   • Automatically resolves overlapping genes by choosing one, hence limited

   • Marks “suspected to truly overlap” genes for closer inspection by user


GeneMark

• Based on non-stationary Markov chain models

• Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence