Gene Calling

8/10/2019 Gene Calling




Bioinformatics:

Mining a Mountain of Data

Where are the putative genes?




Brenner & Crick: A codon is composed of three nucleotides

and the starting point of each gene establishes a reading frame.

Fig. 8.5



Central Dogma of Molecular Genetics

DNA → RNA → Protein

Yanofsky: Genes & proteins are colinear



[Figure: whole-genome shotgun sequencing workflow. Labels: Genome; Random Pieces; Shotgun Genomic Libraries; 6-8X Sequencing Coverage; Overlaps in Small Pieces to Form Contigs; Join Large Pieces into Sequenced Genome; Genetic/Physical Map; Subgenomic Mega-fragments; Subgenomic Libraries; 1X Sequencing Coverage; Annotation of Contig Ends; Gap Closure; Annotation; Functional Genomics.]

Sequence Annotation & Functional Genomics



Annotation Pipeline

The Need for Humans

[Figure: scale bar, 0 kb to 20 kb.]

Genome                           # ORFs   % with functional   % without prediction
                                          prediction          but with similarity
E. coli K12 DH10B                4126     81.9                14.7
Agrobacterium tumefaciens C58    5402     63.3                34.0
Vibrio cholerae O1 el tor        3835     57.2                38.7
Staphylococcus aureus COL        2652     63.5                32.3
Planctomyces limnophilus         4304     36.2                61.6

What about transmembrane domains, conserved

small domains (e.g., PFAMs), etc.?



Statistical approaches



Genetic Code

• Triplet

• Nonoverlapping

• Comma-less

• Redundant



6 frame translation of sequence

ATGCTTTGCTTGGAT

|||||||||||||||

TACGAAACGAACCTA

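frame 1 ATG CTT TGC TTG GAT

frame 2 TGC TTT GCT TGG ATT

frame 3 GCT TTG CTT GGA TTC

frame -1 ATC CAA GCA AAG CAT

frame -2 TCC AAG CAA AGC ATG

frame -3 CCA AGC AAA GCA TGC

(The final codon in frames 2, 3, -2, and -3 runs past the 15 bases shown, into flanking sequence.)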
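The six reading frames above can be reproduced with a short Python sketch. The function names (`reverse_complement`, `codons`, `six_frames`) are illustrative, not from the slides, and the sketch keeps only complete codons, so frames 2, 3, -2, and -3 end one codon earlier than the slide, which borrows flanking bases.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into complete codons, starting at the given offset (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map frame number (1, 2, 3, -1, -2, -3) to its list of codons."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = codons(seq, offset)      # forward strand
        frames[-(offset + 1)] = codons(rc, offset)    # reverse strand
    return frames

frames = six_frames("ATGCTTTGCTTGGAT")
for name in (1, 2, 3, -1, -2, -3):
    print(name, " ".join(frames[name]))
```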



Open Reading Frames (ORFs)

Detect potential coding regions by looking at ORFs

A genome of length n contains about n/3 codons in each reading frame, or roughly 2n codons across all six frames

Stop codons break the genome into segments between consecutive stop codons

The subsegments of these that start from the start codon (ATG) are ORFs

ORFs in different frames may overlap
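A minimal sketch of this ORF definition, assuming the standard stop codons TAA, TAG, and TGA and scanning only the three frames of the strand given. The name `find_orfs` and the `min_codons` parameter are illustrative choices, not from the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) tuples for ORFs on the given strand.

    start is the 0-based index of the ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG after the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATAGC"))
```

To cover the other three frames, the same function can be applied to the reverse complement of the sequence.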



Long vs. Short ORFs

Long open reading frames may be genes

At random, we should expect one stop codon every 64/3 ≈ 21 codons

However, genes are usually much longer than this

A basic approach is to scan for ORFs whose length exceeds a certain threshold

This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
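The arithmetic behind the length test can be checked with a short calculation. It assumes a uniform random sequence in which each of the 64 codons is equally likely and 3 of them are stops, which real genomes only approximate. The helper name `prob_no_stop` is illustrative.

```python
def prob_no_stop(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (61 / 64) ** k

# Expected spacing between stop codons under the uniform model.
expected_gap = 64 / 3
print(round(expected_gap, 1))

# Chance that a random stretch of 100 codons stays open.
print(prob_no_stop(100))
```

Under this model an open stretch of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as gene candidates.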



Testing ORFs: Codon Usage

Create a 64-element hash table and count the frequencies of codons in an ORF

Amino acids typically have more than one codon, but in nature certain codons are used more often (codon usage)

Uneven use of the codons may characterize a real gene

This compensates for pitfalls of the ORF length test
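The 64-element table can be sketched as a Python dictionary keyed by codon. The name `codon_counts` is illustrative; triplets containing characters other than A, C, G, T are skipped, which is an assumption of this sketch.

```python
from itertools import product

# All 64 possible codons over the DNA alphabet.
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codon_counts(orf):
    """Count in-frame codons of an ORF in a 64-entry table."""
    counts = dict.fromkeys(ALL_CODONS, 0)
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:          # skip triplets with ambiguous bases
            counts[codon] += 1
    return counts

counts = codon_counts("ATGCTGCTGTAA")
print(counts["CTG"])
```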



Codon Usage in Human Genome



Codon usage in Mouse

AA    codon   /1000   frac
Leu   CTG     39.95   0.40
Leu   CTA      7.89   0.08
Leu   CTT     12.97   0.13
Leu   CTC     20.04   0.20
Ala   GCG      6.72   0.10
Ala   GCA     15.80   0.23
Ala   GCT     20.12   0.29
Ala   GCC     26.51   0.38
Gln   CAG     34.18   0.75
Gln   CAA     11.51   0.25



Codon Usage and Likelihood Ratio

An ORF is more “believable” than another if it has more “likely” codons

Do sliding window calculations to find ORFs that have the “likely” codon usage

Allows for higher precision in identifying true ORFs; much better than merely testing for length.

However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio

Further improvement: in-frame hexamer count

(frequencies of pairs of consecutive codons)
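One way to turn codon usage into a score is a log-likelihood ratio summed over a sliding window of codons. The sketch below uses the per-1000 mouse frequencies from the table above as the coding model. The uniform 1/64 background and the choice to give zero weight to codons missing from the table are assumptions made for illustration, and the function names are hypothetical.

```python
import math

# Coding-model frequencies per 1000 codons, taken from the mouse table.
CODING_PER_1000 = {
    "CTG": 39.95, "CTA": 7.89, "CTT": 12.97, "CTC": 20.04,
    "GCG": 6.72, "GCA": 15.80, "GCT": 20.12, "GCC": 26.51,
    "CAG": 34.18, "CAA": 11.51,
}
BACKGROUND = 1 / 64   # assumed uniform noncoding model

def log_likelihood_ratio(codons):
    """Sum of log(P_coding / P_background) over the codons in the table."""
    total = 0.0
    for codon in codons:
        if codon in CODING_PER_1000:   # codons outside the table add nothing
            p_coding = CODING_PER_1000[codon] / 1000
            total += math.log(p_coding / BACKGROUND)
    return total

def window_scores(codons, size):
    """Score every window of `size` consecutive codons."""
    return [log_likelihood_ratio(codons[i:i + size])
            for i in range(len(codons) - size + 1)]

print(window_scores(["CTG", "CAG", "CTA", "GCG", "GCC"], 3))
```

Windows rich in frequently used codons such as CTG and CAG score above zero, while windows dominated by rare codons such as CTA and GCG score below zero.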



TestCode Statistics

Define a window size no less than 200 bp, and slide the window down the sequence 3 bases at a time.

In each window:

Calculate for each base {A, T, G, C} the ratio $\max(n_{3k+1}, n_{3k+2}, n_{3k}) / \min(n_{3k+1}, n_{3k+2}, n_{3k})$, where $n_{3k+r}$ is the number of times the base occurs at window positions of the form 3k + r

Use these values to obtain a probability from a lookup table (which was previously defined and determined experimentally with known coding and noncoding sequences)
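The per-window ratio can be sketched as follows. This covers only the max/min step from the slide, not the lookup table. Returning infinity when a base is missing from one codon position, and 1.0 when the base does not occur at all, are assumptions of this sketch; the function names are illustrative.

```python
import math

def sliding_windows(seq, size=200, step=3):
    """Yield windows of `size` bases, advancing `step` bases each time."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]

def position_asymmetry(window):
    """For each base, return max/min of its counts at the three codon positions."""
    result = {}
    for base in "ATGC":
        counts = [0, 0, 0]
        for pos, b in enumerate(window):
            if b == base:
                counts[pos % 3] += 1
        if max(counts) == 0:
            result[base] = 1.0          # base absent: no asymmetry (assumption)
        elif min(counts) == 0:
            result[base] = math.inf     # avoid division by zero (assumption)
        else:
            result[base] = max(counts) / min(counts)
    return result

print(position_asymmetry("AAAAAT"))
```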



TestCode Statistics

Probabilities can be classified as indicative of “coding” or “noncoding” regions, or “no opinion” when it is unclear what level of randomization tolerance a sequence carries

The resulting sequence of probabilities can be plotted



TestCode sample output



Gene Prediction:

Similarity Based Approaches



Outline

• The idea of the similarity-based approach to gene prediction

• Exon Chaining Problem

• Spliced Alignment Problem

• Gene prediction tools



Using Known Genes to Predict New Genes

• Some genomes may be very well-studied, with

many genes having been experimentally

verified.

• Closely-related organisms may have similar

genes

• Unknown genes in one species may be

compared to genes in some closely-related

species



Similarity-Based Approach to Gene Prediction

• Genes in different organisms are similar

• The similarity-based approach uses known

genes in one genome to predict (unknown)

genes in another genome

• Problem: Given a known gene and an

unannotated genome sequence, find a set

of substrings of the genomic sequence

whose concatenation best fits the gene



Comparing Genes in Two Genomes

• Small islands of similarity corresponding to similarities between exons



Reverse Translation

• Given a known protein, find a gene in the genome which codes for it

• One might infer the coding DNA of the given protein by reversing the translation process

– Inexact: amino acids map to > 1 codon

– This problem is essentially reduced to an alignment problem





Comparing Genomic DNA Against mRNA

[Figure: a portion of the genome (exon1, intron1, exon2, intron2, exon3) compared against the mRNA (codon sequence).]



Using Similarities to Find the Exon Structure

• The known frog gene is aligned to different locations in the human genome

• Find the “best” path to reveal the exon structure of the human gene

[Figure: alignment plot. Axes: Human Genome; Frog Gene (known).]



Finding Local Alignments

Use local alignments to find all islands of similarity

[Figure: local alignment plot. Axes: Human Genome; Frog Genes (known).]



Chaining Local Alignments

• Find substrings that match a given gene sequence

(candidate exons)

• Define a candidate exon as a triple (l, r, w): left end, right end, and weight (the score of the local alignment)

• Look for a maximum chain of substrings

– Chain: a set of non-overlapping, nonadjacent intervals.



Exon Chaining Problem

• Locate the beginning and end of each interval (2n points)

• Find the “best” path

[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) over the endpoint coordinates 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32.]



Exon Chaining Problem: Formulation

• Exon Chaining Problem: Given a set of putative exons, find a maximum set of non-overlapping putative exons

• Input: a set of weighted intervals (putative exons)

• Output: A maximum chain of intervals from this set





Exon Chaining Problem: Graph Representation

• This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted.



Exon Chaining Algorithm

ExonChaining(G, n)   // graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
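The pseudocode above can be sketched in Python as follows. The function name `exon_chaining` and the `(left, right, weight)` tuple layout are illustrative. Sorting left endpoints before right endpoints at the same coordinate is an assumption made here so that intervals sharing an endpoint are not chained together; the example intervals are invented and are not the ones in the figure.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (left, right, weight) tuples.
    """
    # Build the 2n endpoints; flag 0 = left end, 1 = right end.
    endpoints = []
    for idx, (left, right, _weight) in enumerate(intervals):
        endpoints.append((left, 0, idx))
        endpoints.append((right, 1, idx))
    endpoints.sort()                       # ties: left ends come first

    s = [0] * (len(endpoints) + 1)         # s[0] = 0 is the empty chain
    left_pos = {}                          # interval index -> vertex index of its left end
    for i, (_coord, is_right, idx) in enumerate(endpoints, start=1):
        if is_right:
            j = left_pos[idx]
            w = intervals[idx][2]
            s[i] = max(s[j] + w, s[i - 1])  # take the interval or skip it
        else:
            left_pos[idx] = i
            s[i] = s[i - 1]
    return s[-1]

print(exon_chaining([(1, 5, 5), (2, 3, 3), (4, 8, 6), (9, 10, 1)]))
```

The sort dominates the running time; the dynamic programming pass itself is linear in the number of endpoints.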



Exon Chaining: Deficiencies

– Poor definition of the putative exon endpoints

– Optimal chain of intervals may not correspond to any valid alignment

• First interval may correspond to a suffix, whereas second interval may correspond to a prefix

• Combination of such intervals is not a valid alignment



Infeasible Chains

Red local similarities form two non-overlapping intervals but do not form a valid global alignment

[Figure: local similarity plot. Axes: Human Genome; Frog Genes (known).]





Using Blueprint



Assembling Putative Exons



Still Assembling Putative Exons



Spliced Alignment

• Mikhail Gelfand and colleagues proposed a spliced alignment approach that uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

– Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

– This set is then filtered in a way that attempts to retain all true exons, even if some false ones remain.



Spliced Alignment Problem: Formulation

• Goal: Find a chain of blocks in a genomic

sequence that best fits a target sequence

• Input: Genomic sequence G, target sequence T, and a set of candidate exons B.

• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ* is the concatenation of all exons from chain Γ



Lewis Carroll Example



Spliced Alignment: Idea

• Compute the best alignment between the i-prefix of genomic sequence G and the j-prefix of target T:

• S(i,j)

• But what is the “i-prefix” of G?

• There may be a few i-prefixes of G depending on

which block B we are in.




• Compute the best alignment between the i-prefix of genomic sequence G and the j-prefix of target T under the assumption that the alignment uses the block B at position i: S(i,j,B)



Spliced Alignment Recurrence

If i is not the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i-1, j, B) - \text{indel penalty} \\ S(i, j-1, B) - \text{indel penalty} \\ S(i-1, j-1, B) + \delta(g_i, t_j) \end{cases}$$

If i is the starting vertex of block B:

$$S(i, j, B) = \max \begin{cases} S(i, j-1, B) - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\ \max_{\text{all blocks } B' \text{ preceding block } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j) \end{cases}$$
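A compact sketch of this recurrence in Python follows. It keeps only the last row of each finished block rather than the full three-dimensional table. Several details are assumptions not stated on the slide: blocks are inclusive 0-based `(start, end)` index pairs into G, a block precedes another when it ends strictly before the other starts, scoring is +1 match, -1 mismatch, 1 per indel, and the first block of a chain starts from an empty chain that charges one indel per unaligned target letter.

```python
def spliced_alignment(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and a chain of blocks from G."""
    m = len(T)

    def delta(a, b):
        return match if a == b else mismatch

    finished = []                 # (end, last DP row) of processed blocks
    best = float("-inf")
    for start, end in sorted(blocks, key=lambda b: b[1]):
        # P[j]: best score of anything that may precede this block, against T[:j].
        # The empty chain (assumed base case) costs one indel per target letter.
        P = [-j * indel for j in range(m + 1)]
        for prev_end, prev_row in finished:
            if prev_end < start:
                P = [max(p, q) for p, q in zip(P, prev_row)]

        prev = P                  # the block's first row recurs on P
        for i in range(start, end + 1):
            row = [prev[0] - indel]
            for j in range(1, m + 1):
                row.append(max(prev[j] - indel,                     # g_i against a gap
                               row[j - 1] - indel,                  # t_j against a gap
                               prev[j - 1] + delta(G[i], T[j - 1])))
            prev = row
        finished.append((end, prev))
        best = max(best, prev[m])  # S(end(B), length(T), B)
    return best

G = "xxACGTxxxxTTGAxx"
print(spliced_alignment(G, "ACGTTTGA", [(2, 5), (6, 9), (10, 13)]))
```

In the example, chaining the blocks `ACGT` and `TTGA` reproduces the target exactly, while the decoy block `xxxx` is skipped.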



Spliced Alignment Solution

• After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:

$$\max_{\text{all blocks } B} S(\mathrm{end}(B), \mathrm{length}(T), B)$$



Spliced Alignment: Complications

• Considering multiple i-prefixes slows the algorithm down. Running time:

$O(mn^2|B|)$

where m is the target length, n is the genomic sequence length and |B| is the number of blocks.

• A mosaic effect: short exons are easily combined to fit any target protein



Exon Chaining vs Spliced Alignment

• In Spliced Alignment, every path spells out the string obtained by concatenating the labels of its edges. The weight of the path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence

– Defines the weight of the entire path in the graph, but not the weights of individual edges.

• Exon Chaining assumes the positions and weights of exons are pre-defined



Gene Prediction: Aligning Genome vs. Genome

• Align entire human and mouse genomes

• Predict genes in both sequences simultaneously as chains of aligned blocks (exons)

• This approach does not assume any annotation of either human or mouse genes.



Gene Prediction Tools

• GENSCAN/GenomeScan

• TwinScan

• Glimmer

• GeneMark



The GENSCAN Algorithm

• The algorithm is based on a probabilistic model of gene structure similar to Hidden Markov Models (HMMs).

• GENSCAN uses a training set to estimate the HMM parameters, then returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (the Viterbi algorithm).

– Biological input: codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc.)

– Covers cases where the input sequence contains no gene, a partial gene, a complete gene, or multiple genes.
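To illustrate the Viterbi step only, here is a toy two-state HMM (N for noncoding, C for coding) decoded in log space. The states, all probabilities, and the names `viterbi`, `TOY_START`, `TOY_TRANS`, and `TOY_EMIT` are invented for this example; GENSCAN's actual model has many more states and explicit length distributions, and its parameters come from training data.

```python
import math

# Invented toy parameters: the coding state favors G/C, the noncoding state A/T.
TOY_STATES = ["N", "C"]
TOY_START = {"N": 0.5, "C": 0.5}
TOY_TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
TOY_EMIT = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
            "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed sequence."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []                                   # back-pointers, one dict per step
    for x in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            r = max(states, key=lambda q: scores[q] + math.log(trans_p[q][s]))
            new_scores[s] = scores[r] + math.log(trans_p[r][s]) + math.log(emit_p[s][x])
            pointers[s] = r
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AAAAGGGG", TOY_STATES, TOY_START, TOY_TRANS, TOY_EMIT))
```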



GENSCAN Limitations

– Does not use similarity search to predict genes.

– Does not address alternative splicing.

– Could combine two exons from consecutive genes together



GenomeScan

• Incorporates similarity information into GENSCAN: predicts the gene structure which corresponds to the maximum probability conditional on similarity information

• Algorithm is a combination of two sources of information

– Probabilistic models of exons-introns

– Sequence similarity information



TwinScan

• Aligns two sequences and marks each base as gap (-), mismatch (:), or match (|), resulting in a new alphabet of 12 letters: Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.

• Run the Viterbi algorithm using emissions $e_k(b)$ where b ∈ {A-, A:, A|, …, T|}.

http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf
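The 12-letter alphabet can be built from a pairwise alignment with a few lines of Python. This sketch assumes both sequences are already aligned to the same length with `-` for gaps, and it skips columns where the target itself has a gap; both are assumptions of the example, and the name `conservation_symbols` is illustrative.

```python
def conservation_symbols(target, informant):
    """Annotate each target base with '-', ':' or '|' from an aligned informant."""
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue              # no target base in this column (assumption: skip)
        if q == "-":
            mark = "-"            # gap in the informant
        elif q == t:
            mark = "|"            # match
        else:
            mark = ":"            # mismatch
        symbols.append(t + mark)
    return symbols

print(conservation_symbols("ACGT", "AC-A"))
```

The resulting symbol sequence is what the HMM would observe in place of the plain bases.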



TwinScan (cont’d)

• The emission probabilities are estimated from human/mouse gene pairs.

– Ex. $e_I(x|) < e_E(x|)$ since matches are favored in exons, and $e_I(x-) > e_E(x-)$ since gaps (as well as mismatches) are favored in introns.

– Compensates for the dominant occurrence of poly-A regions in introns



Glimmer

• Gene Locator and Interpolated Markov ModelER

• Finds genes in bacterial DNA

• Uses interpolated Markov Models



The Glimmer Algorithm

• Made of 2 programs

– BuildIMM

• Takes sequences as input and outputs the Interpolated Markov Models (IMMs)

– Glimmer

• Takes IMMs and outputs all candidate genes

• Automatically resolves overlapping genes by choosing one, hence limited

• Marks “suspected to truly overlap” genes for closer inspection by user



GeneMark

• Based on non-stationary Markov chain models

• Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence
