3
com PRIMER What is (dynamic programming? Sean R Eddy Sequence alignment methods often use something called a 'dynamic programming' algorithm. What is dynamic programming and how does it work? Dynamic programming algorithms are a good place to start understanding what's really going on inside computational hiology software. The heart of many well-known pro- grams is a dynamic programming algorithm, or a fast approximation of one, including sequence database search programs like BLAST and FASTA, multiple sequence align- ment programs like CLUSTALW, profile search programs like HMMER, gene finding programs like GENSCAN and even RNA- folding programs like MFOLD and phyloge- netic inference programs like PHYLIP. Don't expect much enlightenment from the etymology of the term 'dynamic program- ming,' though. Dynamic programming was formalized in the early 1950s by mathemati- cian Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impres- sive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning^— thus 'dynamic' and 'programming' (note, nothing particularly to do with computer programming). Bellman especially liked 'dynamic' hecause "it's impossible to use the word dynamic in a pejorative sense"; he fig- ured dynamic programming was "something not even a Congressman could object to"'. The best way to understand how dynamic programming works is to see an example. Conveniently, optima! sequence alignment provides an example that is hoth simple and biologically relevant. Sean R. Eddy is at Howard Hughes Medical Institute & Department of Genetics, Washington University School of Medicine, 4444 Eorest Park Blvd., Box 8510, Saint Louis, Missouri 63108, USA. e-mail: [email protected] Dynamic programming matrix: (sequence y) 3 4 5 6 7 C T C G T 0 -6 -12 -18 -24 -30 -36 1-6 5 -1 -7 -13 -19 -25 1-12 >-1 3 -3 -9 -15 -21 --18 1-7 --3 8 2 -4 -10 --24 1-13 -2 2 6 7 1 --30 --19 1-8 3 0 4 5 --36 >-25 -14 -3 1 -2 2 «-42 • -31 --20 - -9 1-5 6 0 --48 --37 --26 --15 -4 1 0 11 i 0 1 T 2 T 3C M - 6 A Optimum alignment scores 11 T - - T C A T A T G C T C G T A +5 _6 - ^ +5 -1-5 -2 -1-5 -t-5 The biological problem: pairwise sequence alignment We have two DNA or protein sequences, and we want to infer if they are homologous or not. To do this, we will calculate a score that reflects how similar the two sequences are (that is, how likely they are to he derived from a common ancestor). Because sequences differ not just by substitution, but also hy insertion and deletion, we want to optimally align the two sequences to maximize their similarity. Why do we need a fancy algorithm? Can't we just score all possihle alignments and pick the hest one? This isn't practical, because there are about 2'^n2nN different align- ments for two sequences of length N; for two sequences of length 300, that's about 10^''^ different alignments. Let's .set up the problem with some nota- tion. Call the two sequences x and y. They are of length M and N residues, respectively. The I* residue in x is Xj\ the/*" residue of y is yj. We Figure 1 The filled dynamic programming matrix for two DNA sequences, x = TTCATA and y = TGCTCGTA, for a scoring system of +5 for a match, -2 for a mismatch and -6 for each insertion or deletion. The cells in the optimum path are shown in red. Arrowheads are 'traceback pointers,' indicating which of the three cases were optimal for reaching each cell. (Some cells can be reached by two or three different optimal paths of equal score: whenever two or more cases are equally optimal, dynamic programming implementations usually choose one case arbitrarily. In this example, though, the optimal path is unique,) need some parameters for how to score align- ments: we'll use a scoring matrix o(a, h) for aligning two residues a,b to each other (e.g., a 4 x 4 matrix for scoring any pair of aligned DNA nucleotides, or simply a match and a mismatch score), and a gap penalty y for every time we introduce a gap character. A dynamic programming algorithm con- sists of four parts: a recursive definition of the optimal score; a dynamic programming matrix for rememhering optimal scores of subproblems; a hottom-up approach of filling the matrix by solving the smallest subprob- lems first; and a traceback of the matrix to recover the structure of the optimal solution that gave the optimal score. For pairwise alignment, those steps are the following: Recursive definition of the optimal alignment score There are only three ways the alignment can possibly end: (i) residues Xjy and >'fj are NATURE BIOTECHNOLOGY VOLUME 22 NUMBER 7 JULY 2004 909

What is (dynamic programming? - LMSE...programming). Bellman especially liked 'dynamic' hecause "it's impossible to use the word dynamic in a pejorative sense"; he fig-ured dynamic

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: What is (dynamic programming? - LMSE...programming). Bellman especially liked 'dynamic' hecause "it's impossible to use the word dynamic in a pejorative sense"; he fig-ured dynamic

comPRIMER

What is (dynamic programming?Sean R Eddy

Sequence alignment methods often use something called a 'dynamic programming' algorithm. What is dynamicprogramming and how does it work?

Dynamic programming algorithms are agood place to start understanding what'sreally going on inside computational hiologysoftware. The heart of many well-known pro-grams is a dynamic programming algorithm,or a fast approximation of one, includingsequence database search programs likeBLAST and FASTA, multiple sequence align-ment programs like CLUSTALW, profilesearch programs like HMMER, gene findingprograms like GENSCAN and even RNA-folding programs like MFOLD and phyloge-netic inference programs like PHYLIP.

Don't expect much enlightenment fromthe etymology of the term 'dynamic program-ming,' though. Dynamic programming wasformalized in the early 1950s by mathemati-cian Richard Bellman, who was working atRAND Corporation on optimal decisionprocesses. He wanted to concoct an impres-sive name that would shield his work from USSecretary of Defense Charles Wilson, a manknown to be hostile to mathematics research.His work involved time series and planning^—thus 'dynamic' and 'programming' (note,nothing particularly to do with computerprogramming). Bellman especially liked'dynamic' hecause "it's impossible to use theword dynamic in a pejorative sense"; he fig-ured dynamic programming was "somethingnot even a Congressman could object to"'.

The best way to understand how dynamicprogramming works is to see an example.Conveniently, optima! sequence alignmentprovides an example that is hoth simple andbiologically relevant.

Sean R. Eddy is at Howard Hughes MedicalInstitute & Department of Genetics,Washington University School of Medicine,4444 Eorest Park Blvd., Box 8510, Saint Louis,Missouri 63108, USA.e-mail: [email protected]

Dynamic programming matrix:

(sequence y)3 4 5 6 7C T C G T

0

-6

-12

-18

-24

-30

-36

1-6

5

-1

-7

-13

-19

-25

1-12

>-1

3

- 3

- 9

-15

-21

--18

1 - 7 •

- - 3

8

2

- 4

- 1 0

--24

1-13

- 2

• 2

6

7

1

--30

--19

1-8

3

• 0

4

5

- -36

>-25

• - 1 4

• - 3

1

• -2

2

«-42

• -31

--20

- -9

1-5

6

0

--48

--37

--26

--15

-4

1 0

11

i 0

1 T

2 T

3 C

M - 6 A

Optimum alignment scores 11

T - - T C A T AT G C T C G T A

+5 _6 -^ +5 -1-5 -2 -1-5 -t-5

The biological problem: pairwisesequence alignmentWe have two DNA or protein sequences, andwe want to infer if they are homologous ornot. To do this, we will calculate a score thatreflects how similar the two sequences are(that is, how likely they are to he derived froma common ancestor). Because sequences differnot just by substitution, but also hy insertionand deletion, we want to optimally align thetwo sequences to maximize their similarity.

Why do we need a fancy algorithm? Can'twe just score all possihle alignments and pickthe hest one? This isn't practical, becausethere are about 2'^n2nN different align-ments for two sequences of length N; for twosequences of length 300, that's about 10 ''different alignments.

Let's .set up the problem with some nota-tion. Call the two sequences x and y. They areof length M and N residues, respectively. TheI* residue in x is Xj\ the/*" residue of y is yj. We

Figure 1 The filled dynamicprogramming matrix for two DNAsequences, x = TTCATA andy = TGCTCGTA, for a scoringsystem of +5 for a match, - 2for a mismatch and -6 for eachinsertion or deletion. The cells inthe optimum path are shown inred. Arrowheads are 'tracebackpointers,' indicating which ofthe three cases were optimalfor reaching each cell. (Somecells can be reached by two orthree different optimal pathsof equal score: whenever twoor more cases are equallyoptimal, dynamic programmingimplementations usually chooseone case arbitrarily. In thisexample, though, the optimalpath is unique,)

need some parameters for how to score align-ments: we'll use a scoring matrix o(a, h) foraligning two residues a,b to each other (e.g., a4 x 4 matrix for scoring any pair of alignedDNA nucleotides, or simply a match and amismatch score), and a gap penalty y for everytime we introduce a gap character.

A dynamic programming algorithm con-sists of four parts: a recursive definition of theoptimal score; a dynamic programmingmatrix for rememhering optimal scores ofsubproblems; a hottom-up approach of fillingthe matrix by solving the smallest subprob-lems first; and a traceback of the matrix torecover the structure of the optimal solutionthat gave the optimal score. For pairwisealignment, those steps are the following:

Recursive definition of the optimalalignment scoreThere are only three ways the alignment canpossibly end: (i) residues Xjy and >'fj are

NATURE BIOTECHNOLOGY VOLUME 22 NUMBER 7 JULY 2004 9 0 9

Page 2: What is (dynamic programming? - LMSE...programming). Bellman especially liked 'dynamic' hecause "it's impossible to use the word dynamic in a pejorative sense"; he fig-ured dynamic

P R I M E R

aligned to each other; (ii) residue x^, isaligned to a gap character, and y^ appearedsomewhere earlier in the alignment; or (iii)residue y ^ is aligned to a gap character and x^appeared earlier in the alignment. The opti-mal alignment will be the highest scoring ofihese three cases.

Crucially, our scoring system allows us todefine the score of these three cases recur-sively, in terms of optimal alignments of thepreceding subsequences. Let S[i, j) be thescore of the optimum alignment of sequenceprefix A"|.. .Xi to prefix y].. .y:. The score for case(I) above is the score s(.V/ >'y) for aligning jc ^to /,v plus the score S{M - \,N - I) for anoptimal alignment of everything else up tothis point. Case (ii) is the gap penalty y plusthe score S(M - 1, H)\ case (iii) is the gappenalty 7 plus the score S(M, N - I).

This works because the problem breaksinto independently optimizable pieces, as thescoring system is strictly local to one alignedcolumn at a time. That is, for instance, theoptimal alignment of AJ. . .JCy _ j to y|.. .y ?_ j isunaffected by adding on the aligned residuepair Xf^.j, y^, and likewise, the score s[Xf^, y^)we add on is independent of the previousoptimal alignment.

So, to calculate the score of the three cases,we will need to know three more alignmentscores for three smaller problems:

S(M-1,N-1), S(M-1,N), SiM.N-l).

And to calculate those, we need the solu-tions for nine still smaller problems:

S{M-2,N-1), S(M-2,JV-I). S(M-l,N-2),S(M-2,N-l), S(M-2,N). S(M-1,N-1),S(M- l . N - 2), S{,M- 1, N-1), S(M, N- 2).

and so on, until we reach tiny alignment sub-problems with obvious solutions (the scoreS(0,0) for aligning nothing to nothing iszero).

Thus, we can write a general recursive defi-nition of ail our partial optimal alignmentscores S{i,j):

S{i,j) = max

The dynamic programming matrixThe problem with a purely recursive align-ment algorithm may already be obvious, ifyou looked carefully at that list of nine smallersubproblenis we'd be solving in the secondround of the top-down recursion. Some sub-problems are already occurring more tlian

once, and this wastage gets exponentiallyworse as we move deeper into the recursion.Clearly, the sensible thing to do is to somehowkeep track of which subproblems we arealready working on. This is the key differencebetween dynamic programming and simplerecursion; a dynamic programming algo-rithm memorizes the solutions of optimalsubproblems in an organized, tabular form (adynamic programming matrix), so that eachsubproblem is solved just once.

For the pairwise sequence alignment algo-rithm, the optimal scores S{i,;) arc tabulatedin a two-dimensional tiiatrix, with / runningfrom 0...M and j running from 0...N, as shownin Figure 1. As we calculate solutions to sub-problems S(i, j), their optimal alignmentscores are stored in the appropriate {i,j) cell ofthe matrix.

A bottom-up calculation to get theoptimal scoreOnce the dynamic matrix programmingmatrix S(i,j) is laid out—either on a napkinor in your computer's memory—it is easy tofill it in in a 'bottom-up' way, from smallestproblems to progressively bigger problems.We know the boundary conditions in the left-most column and topmost row (S(0, 0) = 0;S(i, 0) = 7); S{O,j) = yj): for example, the opti-mum alignment of the first i residues ofsequence x to nothing in sequence y has onlyone possible solution, which is to align to gapcharacters and pay T gap penalties. Once we'veinitialized the top row and left column, wecan fill in the rest of the matrix by using therecursive definition oiSii,j) to calculate anycell where we already know the values we needfor the three adjoining cells to the upper left( i - l , ; - l ) , above ((-1,;) and to the left li,j~\).There are several different ways we can dothis; one is to iterate two ne.sted loop.s, / = I toM and J - 1 to N, so we're filling in the matrixleft to right, top to bottom.

A traceback to get the optimal alignmentOnce we're done filling in the matrix, thescore of the optimal alignment of the com-plete sequences is the last score we calculate,S{M, N). We still don't know the optimalalignment itself, though. This, we recover by arecursive 'traceback' of the matrix. We start incell M,N, determine which of the three caseswe used to get here (e.g., by repeating thesame three calculations), record that choice aspart of the alignment, and then follow theappropriate path for that case back into theprevious cell on the optimum path. We keep

doing that, one cell in the optimal path at atime, until we reach cell (0,0), at which poinithe optimal alignment is fully reconstructed.

Fine. But what do I really need to know?Dynamic programming is gtiarimlced togive you a mathematically optimal (highestscoring) solution. Whether that correspondsto the biologically correct alignment is aproblem for your scoring system, not for thealgorithm.

Similarly, the dynamic programming algo-rithm will happily align unrelated sequences.(The two sequences in Figure I might lookwell-aligned; but in fact, they are unrelated,randomly generated DNA sequences!) Thequestion of when a score is statistically signif-icant is also a separate problem, requiringclever statistical theory.

Dynamic programming is surprisinglycomputationally demanding. You can see thatfilling in the matrix takes time proportionalto MN. Alignment of two 200-mers will takefour times as long as two 100-mers. This iswhy there is so much research devoted tofinding good, fast approximations to dynamicprogramtning alignment, like the venerableworkhorses BLAST and FASTA, and newerprograms like BIAT and FLASH.

Only certain scoring systems are amenableto dynamic programming. The scoring sys*tem has to allow the optimal solution to bebroken up into independent parts, or else itcan't be dealt with recursively. The reason thatprograms use simple alignment scoring sys-tems is that we're striking a reasonable com-promise between biological realism andefficient computation.

Note: Supplementary information is available on theNature Biotechnology website.

1. Bellman, R.E. Eye of the Hurricane: An Autobio-graphy. (World Scientific. Singapore, 1984).

Further studyTo study a working example, you can download a small,bare-bones C implementation ot this algorithm (seeSupplementary Note). I used this C program to generateFigure 1 .

naturebiotechnologyWondering how some othermathematical technique really works?Send suggestions for future primers [email protected]. _1

910 VOLUME 22 NUMBER 7 |ULY 2004 NATURE BIOTECHNOLOGY

Page 3: What is (dynamic programming? - LMSE...programming). Bellman especially liked 'dynamic' hecause "it's impossible to use the word dynamic in a pejorative sense"; he fig-ured dynamic