A FAST PRUNING ALGORITHM FOR OPTIMAL SEQUENCE ALIGNMENT: Linear Space Bounded Dynamic Programming

Page 2: A fast Pruning Algorithm for optimal Sequence Alignment

OVERVIEW

1. An introduction to alignments
2. Dynamic Programming
3. Other approaches to optimal alignment calculation
4. A* algorithm
5. LBD and boundaries
6. Results
7. Outlook on coming improvements

Page 3: A fast Pruning Algorithm for optimal Sequence Alignment

ALIGNMENTS

“the holy grail of Bioinformatics” – Dan Gusfield

• sequencing
• function of genes and proteins
• structure of proteins
• evolutionary trees

Sequencing gel

Page 4: A fast Pruning Algorithm for optimal Sequence Alignment

MATHEMATICAL FORMALIZATION

Given k sequences s_1, ..., s_k over an alphabet Σ and k sequences as_1, ..., as_k over the extended alphabet Σ′ = Σ ∪ {-}, the set A = {as_1, as_2, ..., as_k} is a sequence alignment when all of the following three conditions are fulfilled:

1. All sequences in A have the same length.

2. Removing the gap symbols from each as_i yields the original sequence s_i.

3. No column consists only of gap symbols.

AGGTCGAGAC_ GACGC_ G

AGGTCGAGACGACGCG
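The three conditions of the formal definition can be checked mechanically. The following is a minimal sketch, assuming sequences are plain Python strings and the gap symbol is `-`; the function name `is_alignment` is illustrative and not part of the original slides.

```python
GAP = "-"  # assumed gap symbol; the slide examples also use "_"


def is_alignment(originals, aligned):
    """Return True if `aligned` is a valid alignment of `originals`."""
    if not aligned or len(originals) != len(aligned):
        return False
    # Condition 1: all aligned sequences have the same length.
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        return False
    # Condition 2: removing the gaps gives back the original sequences.
    if any(row.replace(GAP, "") != s for s, row in zip(originals, aligned)):
        return False
    # Condition 3: no column consists only of gap symbols.
    return all(any(row[c] != GAP for row in aligned) for c in range(length))


print(is_alignment(["AGCG", "ACGG"], ["AGCG-", "A-CGG"]))    # True
print(is_alignment(["AGCG", "ACGG"], ["AGCG--", "A-CGG-"]))  # False: last column is all gaps
```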

Page 5: A fast Pruning Algorithm for optimal Sequence Alignment

DYNAMIC PROGRAMMING

Algorithm for finding the optimal sequence alignment: the Needleman–Wunsch algorithm

Score matrix for ACGG (rows) against AGCG (columns):

          A    G    C    G
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
C   -2    0    0    1    0
G   -3   -1    1    0    2
G   -4   -2    0    0    1

Scoring scheme (match 1, mismatch -1, gap -1):

     A    C    G    T    _
A    1   -1   -1   -1   -1
C   -1    1   -1   -1   -1
G   -1   -1    1   -1   -1
T   -1   -1   -1    1   -1
_   -1   -1   -1   -1   -1

Two optimal alignments, each with score 1:

AGCG_      AGC_G
A_CGG      A_CGG
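The matrix fill and the backtracking step can be sketched as follows. This is a minimal illustration that assumes the linear scoring from the slide (match 1, mismatch -1, gap -1) and uses `_` as the gap symbol; the function names are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the full (len(a)+1) x (len(b)+1) score matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix of `a` against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # a prefix of `b` against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score


def traceback(a, b, score, match=1, mismatch=-1, gap=-1):
    """Walk back from the bottom-right cell to recover one optimal alignment."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


matrix = needleman_wunsch("ACGG", "AGCG")
for row in matrix:
    print(row)
print(traceback("ACGG", "AGCG", matrix))
```

With ACGG as rows and AGCG as columns, the printed rows reproduce the score matrix of the slide, and the bottom-right entry is the optimal score 1. Which of the tied optimal alignments the traceback returns depends on the order in which the three moves are tested.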

Page 6: A fast Pruning Algorithm for optimal Sequence Alignment

DYNAMIC PROGRAMMING: Analysis of the Algorithm

• Runtime: O(n*n) (filling a quadratic matrix)
• Space consumption: O(n*n) (storing the n * n entries of the quadratic matrix)

Comparison of the genomes of
• Yeast, Saccharomyces cerevisiae (20 * 10^6 bp)
• Fruit fly, Drosophila melanogaster (130 * 10^6 bp)

Space consumption: 20 * 10^6 * 130 * 10^6 = 26 * 10^14 matrix entries. At 4 bytes per integer this is 104 * 10^14 bytes, roughly 10^7 gigabytes.

Drosophila melanogaster

Saccharomyces cerevisiae
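The memory estimate for the full matrix can be reproduced with a few lines of arithmetic. This sketch uses the genome sizes quoted on the slide and assumes 4 bytes per matrix entry and 10^9 bytes per gigabyte.

```python
yeast_bp = 20 * 10**6        # Saccharomyces cerevisiae, as quoted on the slide
fly_bp = 130 * 10**6         # Drosophila melanogaster, as quoted on the slide
bytes_per_entry = 4          # assumption: one 32-bit integer per cell

entries = yeast_bp * fly_bp                 # 26 * 10^14 cells
total_bytes = entries * bytes_per_entry     # 104 * 10^14 bytes
gigabytes = total_bytes // 10**9            # 10_400_000 GB

print(entries, total_bytes, gigabytes)
```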

Page 7: A fast Pruning Algorithm for optimal Sequence Alignment

HIRSCHBERG’S DIVIDE & CONQUER

Main idea: only the row above is necessary to compute the one below it.

Problem: backtracking is no longer possible.

Algorithm:
• Divide s1 into s1a and s1b
• Align s1a with s2 and s1b with s2
• Search for the largest transition (maximum sum) between the two resulting rows
• Recurse on the two subproblems

Extra cell computations, but space requirements are reduced to O(n^(d-1)) for d sequences, i.e. O(n) for two sequences.

Figure: s1a and s1b each aligned against s2
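The divide & conquer idea can be sketched for two sequences as follows. This is a minimal illustration, not the presenter's implementation: it assumes the linear scoring from the earlier slide (match 1, mismatch -1, gap -1), uses `_` as the gap symbol, and computes the second half backwards by reversing both strings. All names are illustrative.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring, taken from the scoring scheme slide


def last_row(a, b):
    """Last row of the score matrix for a vs. b, keeping only two rows in memory."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + sub, prev[j] + GAP, cur[j - 1] + GAP)
        prev = cur
    return prev


def hirschberg(a, b):
    """Return one optimal alignment (top, bottom) of a and b."""
    if not a:
        return "_" * len(b), b
    if not b:
        return a, "_" * len(a)
    if len(a) == 1:
        # Base case: pair the single character with a matching position if one
        # exists, otherwise with the first character. This is optimal because a
        # mismatch costs no more than two gaps under the assumed scoring.
        k = b.index(a) if a in b else 0
        return "_" * k + a + "_" * (len(b) - k - 1), b
    mid = len(a) // 2
    left = last_row(a[:mid], b)                  # s1a against prefixes of s2
    right = last_row(a[mid:][::-1], b[::-1])     # s1b against suffixes of s2
    n = len(b)
    # Largest transition: the split of s2 that maximizes the sum of both halves.
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    top1, bot1 = hirschberg(a[:mid], b[:split])
    top2, bot2 = hirschberg(a[mid:], b[split:])
    return top1 + top2, bot1 + bot2


print(hirschberg("ACGG", "AGCG"))
```

Each level of the recursion stores only a constant number of rows of length len(b) + 1, which is where the linear space bound for two sequences comes from. The price is that cells are recomputed across recursion levels.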

Page 8: A fast Pruning Algorithm for optimal Sequence Alignment

A*-ALGORITHM

A classic graph algorithm to find the shortest distance between two locations

Page 9: A fast Pruning Algorithm for optimal Sequence Alignment

A*-ALGORITHM

Mathematical formalization
• Scoring function f*(n) = g*(n) + h*(n), where g* gives the cost of the best path to node n found so far and the heuristic h* gives an optimistic estimate of the cost of a path from node n to a goal node
• h* must stay optimistic: it may never overestimate the remaining cost (when minimizing) or underestimate the remaining score (when maximizing)
• Open list (priority queue) and closed list (avoids cycles)
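The interplay of g*, h*, the open list and the closed list can be illustrated on a small shortest-path problem. This is a minimal sketch under stated assumptions: the graph is a 4-connected grid with unit step cost (0 = free, 1 = wall), and the heuristic is the straight-line distance to the goal, which never overestimates the remaining cost on such a grid. The function name `a_star` and the grid encoding are illustrative.

```python
import heapq
import math


def a_star(grid, start, goal):
    """Length of the shortest path from start to goal, or None if unreachable."""
    def h(node):
        return math.dist(node, goal)          # optimistic straight-line estimate

    open_list = [(h(start), 0, start)]        # priority queue ordered by f = g + h
    best_g = {start: 0}                       # best known cost to each node
    closed = set()                            # nodes already expanded
    while open_list:
        f, g, node = heapq.heappop(open_list)
        if node == goal:
            return g
        if node in closed:
            continue                          # stale queue entry
        closed.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), math.inf):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(open_list, (ng + h((nr, nc)), ng, (nr, nc)))
    return None


grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))  # 6
```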

Page 10: A fast Pruning Algorithm for optimal Sequence Alignment

A*-ALGORITHM

Application
• The shortest path problem: use the coordinate frame as the heuristic (the shortest connection between two points is a straight line)
• Alignments

Problems
• Closed and open lists can easily become large
• Not applicable to our problem in the basic version

Extensions
• Do not store the closed list
• Do not insert non-promising children into the open list

Page 11: A fast Pruning Algorithm for optimal Sequence Alignment

BOUNDED DYNAMIC PROGRAMMING

Main idea: Combine the low overhead of dynamic programming with the pruning capabilities of A*

Algorithm (1)
• Only prune where promising
• Compute the matrix (anti-)diagonalwise and always check for pruning at the end of the diagonal, i.e. compare the current upper bound with the lowest lower bound
• Good upper and lower bounds are necessary

pruned matrix

Diagonal wise computation & pruning
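A simplified version of the pruning idea can be sketched as follows. The assumptions are significant and should be read as such: the sketch keeps the full matrix for clarity (so it is not linear space), it tests every cell instead of only the ends of a diagonal, it uses linear gap costs (match 1, mismatch -1, gap -1), and it takes a single global lower bound as a parameter. A cell is dropped when its score plus an optimistic estimate for the remaining characters falls below that lower bound. All names are illustrative.

```python
NEG_INF = float("-inf")


def remaining_upper_bound(rem_a, rem_b, match=1, gap=-1):
    """Optimistic score for the rest: every pairable character matches."""
    return min(rem_a, rem_b) * match + abs(rem_a - rem_b) * gap


def bounded_dp(a, b, lower_bound, match=1, mismatch=-1, gap=-1):
    """Return (optimal score, number of cells kept) with bound-based pruning."""
    n, m = len(a), len(b)
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # NEG_INF marks pruned cells
    kept = 0
    for d in range(n + m + 1):                            # antidiagonal i + j = d
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0 and j == 0:
                best = 0
            else:
                best = NEG_INF
                if i > 0 and j > 0:
                    sub = match if a[i - 1] == b[j - 1] else mismatch
                    best = max(best, score[i - 1][j - 1] + sub)
                if i > 0:
                    best = max(best, score[i - 1][j] + gap)
                if j > 0:
                    best = max(best, score[i][j - 1] + gap)
            if best == NEG_INF:
                continue                                  # all predecessors were pruned
            # Keep the cell only if it can still reach the lower bound.
            if best + remaining_upper_bound(n - i, m - j, match, gap) >= lower_bound:
                score[i][j] = best
                kept += 1
    return score[n][m], kept


print(bounded_dp("ACGG", "AGCG", lower_bound=0))
print(bounded_dp("ACGG", "AGCG", lower_bound=NEG_INF))    # no pruning: all 25 cells kept
```

The result is only meaningful when `lower_bound` does not exceed the true optimal score, for example when it is the score of some actual alignment. In that case every cell on an optimal path passes the test, so the final score is unchanged while fewer cells are kept.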

Page 12: A fast Pruning Algorithm for optimal Sequence Alignment

UPPER AND LOWER BOUNDS

Lower bounds
• Diagonal alignment: align the sequences directly without any gaps
• Greedy headlight search
    • Result of several local alignments
    • Always search the frontier for the largest value
    • Use this as a fulcrum for the next local alignment step
    • Only use diagonals for computing, as no backtracking is needed
    • The size of the local alignment drastically influences the time consumption

Greedy headlight search
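Of the two lower bounds, the diagonal alignment is simple enough to sketch directly. The version below assumes the linear scoring used earlier (match 1, mismatch -1, gap -1) and, as a further assumption not stated on the slide, pads the overhang of the longer sequence with gaps at the end. The function name is illustrative.

```python
def diagonal_lower_bound(a, b, match=1, mismatch=-1, gap=-1):
    """Score of aligning a and b position by position, without inner gaps."""
    # zip stops at the shorter sequence, so only the overlapping part is paired.
    paired = sum(match if x == y else mismatch for x, y in zip(a, b))
    # Assumption: the leftover characters of the longer sequence face gaps.
    overhang = abs(len(a) - len(b))
    return paired + overhang * gap


print(diagonal_lower_bound("ACGG", "AGCG"))       # 0
print(diagonal_lower_bound("AGTGC", "AGTCGAAG"))  # -2
```

Because this is the score of a real alignment, the optimal score can never be lower, which is what makes it usable as a lower bound.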

Page 13: A fast Pruning Algorithm for optimal Sequence Alignment

UPPER AND LOWER BOUNDS

Upper bound: simply assume that the remaining characters are aligned perfectly

A G T G C
A G T C G A A G

Upper bound: 5 - 3 = 2 (at most 5 matches, at least 3 gaps)
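The upper bound in this example follows from the sequence lengths alone. A minimal sketch, assuming a match scores 1 and a gap scores -1 as in the scoring scheme shown earlier (the function name is illustrative):

```python
def upper_bound(len_a, len_b, match=1, gap=-1):
    """Best conceivable score: the shorter sequence matches completely,
    and the length difference has to be filled with gaps."""
    return min(len_a, len_b) * match + abs(len_a - len_b) * gap


print(upper_bound(len("AGTGC"), len("AGTCGAAG")))  # 2
```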

Page 14: A fast Pruning Algorithm for optimal Sequence Alignment

LINEAR SPACE: LBD-ALIGN

Algorithm (2)
• Use Hirschberg’s divide & conquer algorithm
• Shaded areas show the two created subproblems

Diagonalwise matrix computation

Divide & Conquer step

Page 15: A fast Pruning Algorithm for optimal Sequence Alignment

RESULTS

Chart: log(time in seconds) against sequence length (3000 to 230000), by method

Page 16: A fast Pruning Algorithm for optimal Sequence Alignment

RESULTS

Changes in pruning
• Stricter penalization leads to more pruning
• Using different lower bounds: the greedy estimate gives far better bounds and therefore more pruning than the diagonal alignment
• Affine gap costs greatly reduce pruning, as do sequences with a large difference in length

Dissimilar sequences (lengths)

Different shaded areas denote different lower bounds

Normal and affine gap costs

Page 17: A fast Pruning Algorithm for optimal Sequence Alignment

EXTENSION & FUTURE WORK

LBD-Align has limited usage due to high fluctuation in pruning (affine gap costs, lower bounds, different sequence lengths)

Use as a second-order sequence tool
• Sort out dissimilar sequences with highly heuristic tools like BLAST
• Best available optimal sequence alignment tool for similar sequences

Page 18: A fast Pruning Algorithm for optimal Sequence Alignment

SUMMARY

Alignments are still a current topic in bioinformatics because there is still room for improvements

Page 19: A fast Pruning Algorithm for optimal Sequence Alignment

THANK YOU FOR YOUR ATTENTION