Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter


Dynamic Programming

Presenters:

Michal Karpinski

Eric Hoffstetter


Background

• “Dynamic programming” originates with Richard Bellman (early 1950s), in his work on multistage decision process problems.
  – While at RAND Corp., he wanted his work to appear practical (“real work”) rather than theoretical. To shield himself from scrutiny, Bellman chose the word “programming,” which implies fruitful, deliberate effort, and embellished it with “dynamic.” As he put it, “it’s impossible to use dynamic in a pejorative sense.”

• Applications:
  – String alignment and related string problems
  – Pattern recognition:
    • Image matching / image recognition (2D & 3D)
    • Speech recognition (Viterbi algorithm)
  – Manufacturing: find the fastest way through a factory
  – Matrix-chain multiplication: order the multiplications to minimize cost
  – Optimal binary search tree: minimize the number of nodes visited during search
    • Language translator: most common words near the root of the tree


Used to solve problems exhibiting:

– Overlapping Subproblems: “they occur as a subproblem of different problems”

– Optimal Substructure: “An optimal solution to the problem contains within it optimal solutions to subproblems.”

– Subproblem Independence: “the solution to one subproblem does not affect the solution to another subproblem, i.e., they do not share resources”
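To make "overlapping subproblems" concrete, the following sketch counts how often a plain recursive Fibonacci function revisits each subproblem. Fibonacci is used only as an illustration, and `naive_fib` and the `calls` counter are hypothetical names chosen for this example.

```python
from collections import Counter

def naive_fib(n: int, calls: Counter) -> int:
    """Plain recursion; records every subproblem it is asked to solve."""
    calls[n] += 1
    if n < 2:
        return n
    return naive_fib(n - 1, calls) + naive_fib(n - 2, calls)

calls = Counter()
naive_fib(10, calls)
total_calls = sum(calls.values())   # how many times any subproblem was solved
distinct_subproblems = len(calls)   # how many different subproblems exist
print(total_calls, distinct_subproblems)
```

The gap between the two numbers is the work that remembering subproblem solutions avoids.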


Tops Down and Bottoms Up

– Top-down: the problem is broken down into subproblems, which are then solved using memoization to remember the solutions to subproblems already solved.

Top down (plain recursion):

function fib(n)
    if n = 0 return 0
    if n = 1 return 1
    else return fib(n - 1) + fib(n - 2)

Top down with memoization (not memorization):

var m := map(0 → 0, 1 → 1)

function fib(n)
    if map m does not contain key n
        m[n] := fib(n - 1) + fib(n - 2)
    return m[n]

– Bottom-up: all subproblems are solved in advance, smallest first, to build solutions to larger problems.

function fib(n)
    if n = 0 return 0
    var previousFib := 0, currentFib := 1
    repeat n - 1 times
        var newFib := previousFib + currentFib
        previousFib := currentFib
        currentFib := newFib
    return currentFib
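The two strategies above can be sketched in Python as follows. This is one possible rendering, not the presenters' code: the function names are made up for the example, and the top-down version uses the standard-library `functools.lru_cache` as its memo table instead of an explicit map.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_top_down(n: int) -> int:
    """Top-down: recurse, caching each result so it is computed only once."""
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

def fib_bottom_up(n: int) -> int:
    """Bottom-up: start from fib(0) and fib(1), keeping only two values."""
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous

print(fib_top_down(10), fib_bottom_up(10))
```

The bottom-up version needs no table at all here because each value depends only on the two before it.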


Biological Sequence Matching Problems 1

• DNA
  – Two strands
  – Four-letter alphabet (four bases)
  – Base pairing rules
  – Strands are directional and, within a gene, only one strand is transcribed

• RNA
  – Functional in its own right, or an intermediate step in protein manufacturing
  – Four-letter alphabet

• Proteins
  – 20-letter alphabet


Biological Sequence Matching Problems 2

• Applications
  – Identify strains of viruses and bacteria
  – Identify genes (hair, skin, eye color, height) and the genetic basis for diseases (lethal conditions, susceptibility to cancer, etc.)
  – Identify evolutionary relationships

• Dynamic programming is the basis of BLAST (Basic Local Alignment Search Tool), among the top 3 most cited papers in recent bioscience history (it was #1 in the 1990s)


Sequence Alignment Algorithm 1

Given two strings x = x1x2...xM, y = y1y2...yN,

AGGCGGATC
TAGCATCTAC

find the alignment with maximum score

F = (# matches) · m - (# mismatches) · s - (# gaps) · d

-AGGCGGATC---
TAG-C--ATCTAC
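As a minimal sketch of the scoring formula, the function below scores an alignment that has already been written out with "-" for gaps. The name `score_alignment` and the example parameter values (m = 1, s = 0.5, d = 1) are assumptions made for illustration.

```python
def score_alignment(top: str, bottom: str, m: float, s: float, d: float) -> float:
    """F = (# matches) * m - (# mismatches) * s - (# gaps) * d."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1            # one symbol is aligned to a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

print(score_alignment("-AGGCGGATC---", "TAG-C--ATCTAC", m=1, s=0.5, d=1))
```

Scoring a given alignment is cheap; the hard part, addressed by the rest of the deck, is finding the alignment that maximizes this score.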


Sequence Alignment Algorithm 2

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

There are more than 2^N possible alignments.
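The size of the search space can be checked with a short sketch. It assumes that an alignment is any sequence of columns of three kinds (symbol over symbol, symbol over gap, gap over symbol), so two alignments that differ only in the order of adjacent opposite gaps are counted separately. The function name `count_alignments` is hypothetical.

```python
def count_alignments(len_x: int, len_y: int) -> int:
    """Count alignments of two strings with the given lengths."""
    # counts[i][j] = number of ways to align the first i symbols of x
    # with the first j symbols of y; an empty prefix aligns in one way.
    counts = [[1] * (len_y + 1) for _ in range(len_x + 1)]
    for i in range(1, len_x + 1):
        for j in range(1, len_y + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # last column: symbol/symbol
                            + counts[i - 1][j]     # last column: x symbol/gap
                            + counts[i][j - 1])    # last column: gap/y symbol
    return counts[len_x][len_y]

for n in (1, 2, 3, 10, 20):
    print(n, count_alignments(n, n), 2 ** n)
```

Under this counting convention the number of alignments grows much faster than 2^N, which is why exhaustive enumeration is not an option.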


Sequence Alignment Algorithm 3

Note: the score of aligning x1……xM with y1……yN is additive.

Say that x1…xi xi+1…xM
aligns to y1…yj yj+1…yN

Add the two scores:

F(x1…xM, y1…yN) = F(x1…xi, y1…yj) + F(xi+1…xM, yj+1…yN)


Sequence Alignment Algorithm 4

• Original problem
  – Align x1…xM to y1…yN

• Divide into a finite number of subproblems (non-overlapping for efficiency)
  – Align x1…xi to y1…yj

• Subdivide the subproblem and construct the solution from smaller subproblems

• Classic problem type for dynamic programming

Let F(i, j) = optimal score of aligning
  x1……xi
  y1……yj

• F is the “matrix” or “table” or “program.” Hence the term “dynamic programming.”


Sequence Alignment Algorithm 5

Three cases:

1. xi aligns to yj (diagonal move)

   x1……xi-1 xi
   y1……yj-1 yj

   F(i, j) = F(i - 1, j - 1) + m, if xi = yj
   F(i, j) = F(i - 1, j - 1) - s, if not

2. xi aligns to a gap (horizontal move)

   x1……xi-1 xi
   y1……yj   -

   F(i, j) = F(i - 1, j) - d

3. yj aligns to a gap (vertical move)

   x1……xi   -
   y1……yj-1 yj

   F(i, j) = F(i, j - 1) - d

F = (# matches) · m - (# mismatches) · s - (# gaps) · d

F(i, j) is calculated with the scoring function s(xi, yj) or the gap function g.


Sequence Alignment Algorithm 6

How do we choose the case for each matrix position?

Assume that the subproblems are solved:

F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal

Therefore,

F(i, j) = max { F(i - 1, j - 1) + s(xi, yj),
                F(i - 1, j) - d,
                F(i, j - 1) - d }

where s(xi, yj) = m if xi = yj, and -s if not.


Sequence Alignment Algorithm 7

Set d = 1, m = 1, s = 0.5

F(i, j) = max { F(i - 1, j - 1) + s(xi, yj),
                F(i - 1, j) - 1,
                F(i, j - 1) - 1 }

where s(xi, yj) = 1 if xi = yj, and -0.5 if not.

          A     T     G
     0   -1    -2    -3
A   -1    1     0    -1
T   -2    0     2     1
C   -3   -1     1     1.5
G   -4   -2     0     2

A T - G
A T C G
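The table on this slide can be reproduced with a few lines of Python. This is a sketch under the slide's parameters (match score 1, mismatch score -0.5, gap penalty 1); the function name `fill_table` and the choice to index F as F[i][j] with i running over x are assumptions of the example.

```python
def fill_table(x, y, m=1.0, s=0.5, d=1.0):
    """Fill F for global alignment; F[i][j] scores x[:i] against y[:j]."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x1..xi aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y1..yj aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligns to yj
                          F[i - 1][j] - d,         # xi aligns to a gap
                          F[i][j - 1] - d)         # yj aligns to a gap
    return F

F = fill_table("ATG", "ATCG")
# Print with x across the top and y down the side, as on the slide.
for j in range(len("ATCG") + 1):
    print([F[i][j] for i in range(len("ATG") + 1)])
```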


Needleman-Wunsch Algorithm 1: Finds Global Optimal Alignment

1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = -j·d
   c. F(i, 0) = -i·d

2. Main Iteration: filling in partial alignments
   a. For each i = 1……M
        For each j = 1……N

          F(i, j) = max { F(i - 1, j - 1) + s(xi, yj)   [case 1]
                          F(i - 1, j) - d               [case 2]
                          F(i, j - 1) - d               [case 3] }

          Ptr(i, j) = diagonal,   if [case 1]
                      horizontal, if [case 2]
                      vertical,   if [case 3]

3. Termination

   F(M, N) is the optimal score, and the optimal alignment can be traced back from Ptr(M, N)
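The three steps above can be sketched as a single Python function. This is an illustrative implementation rather than the presenters' code: the pointer labels "diag", "left" and "up", the default parameters (m = 1, s = 0.5, d = 1), and the tie-breaking rule (the first maximal case wins, so the diagonal is preferred) are all assumptions of the sketch.

```python
def needleman_wunsch(x, y, m=1.0, s=0.5, d=1.0):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "left"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "up"
    # 2. Main iteration: take the best of the three cases and remember which.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j - 1] + sub, "diag"),  # case 1
                          (F[i - 1][j] - d, "left"),        # case 2: xi to gap
                          (F[i][j - 1] - d, "up")]          # case 3: yj to gap
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # 3. Termination: follow the pointers back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "left":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("ATG", "ATCG"))
```

When several alignments share the optimal score, a different tie-breaking rule may report a different but equally good alignment.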


Needleman-Wunsch Algorithm 2

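Initialization
  F(0, 0) = 0
  F(0, j) = -j·d
  F(i, 0) = -i·d

F(i, j) = max { (1) F(i - 1, j - 1) + s(xi, yj),
                (2) F(i - 1, j) - d,
                (3) F(i, j - 1) - d }

Ptr(i, j) = (1), (2), or (3), whichever case gives the maximum

Score matrix F:

          A     T     G
     0   -1    -2    -3
A   -1    1     0    -1
T   -2    0     2     1
C   -3   -1     1     1.5
G   -4   -2     0     2

Pointers along the traceback path (↖ = case 1, diagonal; ↑ = case 3, vertical):

     A   T   G
A    ↖
T        ↖
C        ↑
G            ↖

Alignment:

A T - G
A T C G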


Smith-Waterman Algorithm 1: Finds local optimal alignment(s)

Ignore poorly aligned regions

1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = 0
   c. F(i, 0) = 0

2. Main Iteration: filling in partial alignments
   a. For each i = 1……M
        For each j = 1……N

          F(i, j) = max { 0
                          F(i - 1, j - 1) + s(xi, yj)   [case 1]
                          F(i - 1, j) - d               [case 2]
                          F(i, j - 1) - d               [case 3] }

          Ptr(i, j) = diagonal,   if [case 1]
                      horizontal, if [case 2]
                      vertical,   if [case 3]

3. Termination

   The optimal score is the largest F(i, j) anywhere in the matrix; trace back from that cell via Ptr until a cell with F = 0 is reached to recover the optimal local alignment.


Smith-Waterman Algorithm 2

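Initialization
  F(0, 0) = 0
  F(0, j) = 0
  F(i, 0) = 0

F(i, j) = max { 0,
                (1) F(i - 1, j - 1) + s(xi, yj),
                (2) F(i - 1, j) - d,
                (3) F(i, j - 1) - d }

Ptr(i, j) = (1), (2), or (3), whichever case gives the maximum

Score matrix F (entries that would be negative are floored at 0):

         A     T     G
    0    0     0     0
A   0    1     0     0
T   0    0     2     1
C   0    0     1     1.5
G   0    0     0     2

Pointers along the traceback path (↖ = case 1, diagonal; ↑ = case 3, vertical):

     A   T   G
A    ↖
T        ↖
C        ↑
G            ↖

Alignment:

A T - G
A T C G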
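A minimal Smith-Waterman sketch in Python follows. It assumes the same scoring as the global example (m = 1, s = 0.5, d = 1), starts the traceback at the first cell that holds the largest score, and stops as soon as it reaches a cell whose value came from the 0 option. The names and the tie-breaking rule are choices made for this example.

```python
def smith_waterman(x, y, m=1.0, s=0.5, d=1.0):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]   # None marks "start here"
    best, best_pos = 0.0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [(0.0, None),                      # start a new alignment
                          (F[i - 1][j - 1] + sub, "diag"),  # case 1
                          (F[i - 1][j] - d, "left"),        # case 2: xi to gap
                          (F[i][j - 1] - d, "up")]          # case 3: yj to gap
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until the alignment's starting cell.
    ax, ay = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "left":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("CCATGTT", "GGATGAA"))
```

For ATG against ATCG this sketch also finds a best score of 2, but because the two-symbol alignment AT/AT ties with the gapped alignment shown on the slide, which of the two is reported depends on the tie-breaking rule.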


Smith-Waterman Algorithm 3

A G G C T A T C A C C T
G G C G A C C T A C

      A  G  G  C  T  A  T  C  A  C  C  T
   0  0  0  0  0  0  0  0  0  0  0  0  0
G  0  0  1  1  0  0  0  0  0  0  0  0  0
G  0  0  1  2  1  0  0  0  0  0  0  0  0
C  0  0  0  1  3  2  1  0  1  0  1  1  0
G  0  0  1  1  2  2  1  0  0  0  0  0  0
A  0  1  0  0  1  1  3  2  1  1  0  0  0
C  0  0  0  0  1  0  2  2  3  2  2  1  0
C  0  0  0  0  1  0  1  1  3  2  3  3  2
T  0  0  0  0  0  2  1  2  2  2  2  2  4
A  0  1  0  0  0  1  3  2  1  3  2  1  3
C  0  0  0  0  1  0  2  2  3  2  4  3  2


Smith-Waterman Algorithm 4

      A  G  G  C  T  A  T  C  A  C  C  T
   0  0  0  0  0  0  0  0  0  0  0  0  0
G  0  0  1  1  0  0  0  0  0  0  0  0  0
G  0  0  1  2  1  0  0  0  0  0  0  0  0
C  0  0  0  1  3  2  1  0  1  0  1  1  0
G  0  0  1  1  2  2  1  0  0  0  0  0  0
A  0  1  0  0  1  1  3  2  1  1  0  0  0
C  0  0  0  0  1  0  2  2  3  2  2  1  0
C  0  0  0  0  1  0  1  1  3  2  3  3  2
T  0  0  0  0  0  2  1  2  2  2  2  2  4
A  0  1  0  0  0  1  3  2  1  3  2  1  3
C  0  0  0  0  1  0  2  2  3  2  4  3  2

A G G C T A T C A C C T - -
- G G C - - - G A C C T A C


Overlap Detection 1

• When searching for matches of a short string in database of long strings, we don’t want to penalize overhangs

[Figure: two dynamic programming matrices, each with x1…xM along the top and y1…yN down the side, with diagrams of how strings x and y overlap]


Overlap Detection 2

[Figure: two dynamic programming matrices, each with x1…xM along the top and y1…yN down the side, with diagrams of how strings x and y overlap]

F(i, 0) = max { F(i - 1, 0),
                F(i - 1, N) - T }

F(i, j) = max { F(i - 1, j - 1) + s(xi, yj),
                F(i - 1, j) - d,
                F(i, j - 1) - d }


Overlap Detection 3

Needleman-Wunsch with Overlap Detection

            A     T     G     G     T     A     T     A     G     G     T     T     A     A
      0     0     0     0     0     0     0   0.5     1     1     1     1     1     3     3
G    -1  -0.5  -0.5     1     1     0  -0.5  -0.5     0     2     2     1   0.5     2   2.5
G    -2  -1.5    -1   0.5     2     1     0    -1    -1     1     3     2     1     1   1.5
T    -3  -2.5  -0.5  -0.5     1     3     2     1     0     0     2     4     3     2     1
T    -4  -3.5  -1.5    -1     0     2   2.5     3     2     1     1     3     5     4     3

Smith-Waterman with Overlap Detection

      A  T  G  G  T  A  T  A  G  G  T  T  A  A
   0  0  0  0  0  0  0  0  1  1  1  1  1  3  3
G  0  0  0  1  1  0  0  0  0  2  2  1  0  2  2
G  0  0  0  1  2  1  0  0  0  1  3  2  1  1  1
T  0  0  1  0  1  3  2  1  0  0  2  4  3  2  1
T  0  0  1  0  0  2  2  3  2  1  1  3  5  4  3

F(i, 0) = max { F(i - 1, 0),
                F(i - 1, N) - T }

F(i, j) = max { 0,
                F(i - 1, j - 1) + s(xi, yj),
                F(i - 1, j) - d,
                F(i, j - 1) - d }

A T G G T A T A G G T T A A
G G T T   G G T T

[Figure: dynamic programming matrix with x1…xM along the top and y1…yN down the side, with a diagram of how strings x and y overlap]


Bounded Dynamic Programming

Initialization:

  F(i, 0), F(0, j) undefined for i, j > k

Iteration:

  For i = 1…M
    For j = max(1, i - k)…min(N, i + k)

      F(i, j) = max { F(i - 1, j - 1) + s(xi, yj),
                      F(i, j - 1) - d, if j > i - k(N),
                      F(i - 1, j) - d, if j < i + k(N) }

Termination: same

[Figure: dynamic programming matrix with x1…xM along the top and y1…yN down the side, with a diagonal band of width k(N)]
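The banded iteration can be sketched as follows. To keep the example short it still allocates the full (M+1) by (N+1) table and marks cells outside the band as minus infinity, which enforces the two "if" conditions implicitly; an implementation aiming for the storage savings would keep only the band. The function name, the fixed band half-width `k`, and the default scores (m = 1, s = 0.5, d = 1) are assumptions.

```python
import math

def banded_alignment_score(x, y, k, m=1.0, s=0.5, d=1.0):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: cell (M, N) lies outside it")
    NEG = -math.inf                       # marks cells outside the band
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0.0
    for i in range(1, min(M, k) + 1):     # F(i, 0) is undefined for i > k
        F[i][0] = -i * d
    for j in range(1, min(N, k) + 1):     # F(0, j) is undefined for j > k
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # Neighbours outside the band are -inf, so they never win the max.
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    return F[M][N]

print(banded_alignment_score("ATG", "ATCG", k=1))
```

The inner loop visits at most 2k + 1 cells per row, so the running time drops from O(MN) to roughly O(kM). The result is the true optimum only if an optimal alignment stays inside the band.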


Longest Common Subsequence 1

1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = 0
   c. F(i, 0) = 0

2. Main Iteration
   a. For each i = 1……M
        For each j = 1……N

          F(i, j) = max { F(i - 1, j - 1) + 1, if xi = yj        [case 1]
                          F(i - 1, j),         if not(xi = yj)   [case 2]
                          F(i, j - 1),         if not(xi = yj)   [case 3] }

          Ptr(i, j) = diagonal,   if [case 1]
                      horizontal, if [case 2]
                      vertical,   if [case 3]

3. Termination

   F(M, N) is the optimal score (the length of the longest common subsequence), and the optimal alignment can be traced back from Ptr(M, N)


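Longest Common Subsequence 2

Initialization
  F(0, 0) = 0
  F(0, j) = 0
  F(i, 0) = 0

F(i, j) = max { (1) F(i - 1, j - 1) + 1, if xi = yj,
                (2) F(i - 1, j),         if not(xi = yj),
                (3) F(i, j - 1),         if not(xi = yj) }

Ptr(i, j) = (1), (2), or (3), whichever case gives the maximum

Score matrix F:

       A  T  G
    0  0  0  0
A   0  1  1  1
T   0  1  2  2
C   0  1  2  2
G   0  1  2  3

Pointers along the traceback path (↖ = case 1, diagonal; ↑ = case 3, vertical):

     A   T   G
A    ↖
T        ↖
C        ↑
G            ↖

Alignment:

A T - G
A T C G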


Longest Common Subsequence 3

Cormen: error on page 353

Corrected (to obtain figure 15.6)

m = length[X]
n = length[Y]
for i = 1 to m
    do c[i,0] = 0
for j = 0 to n
    do c[0,j] = 0
for i = 1 to m
    for j = 1 to n
        if xi = yj then
            c[i,j] = c[i-1, j-1] + 1
            b[i,j] = “↖”
        else if c[i-1, j] > c[i, j-1] then
            c[i,j] = c[i-1, j]
            b[i,j] = “↑”
        else
            c[i,j] = c[i, j-1]
            b[i,j] = “←”
return c and b

Table c:

      B  D  C  A  B  A
   0  0  0  0  0  0  0
A  0  0  0  0  1  1  1
B  0  1  1  1  1  2  2
C  0  1  1  2  2  2  2
B  0  1  1  2  2  3  3
D  0  1  2  2  2  3  3
A  0  1  2  2  3  3  4
B  0  1  2  2  3  4  4

Diagonal (↖) entries of b marked on the slide:

     B   D   C   A   B   A
A
B    ↖
C            ↖
B                    ↖
D
A                        ↖
B
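The pseudocode above translates to Python roughly as follows, with a traceback added to recover one common subsequence. The function name `lcs` and the labels "diag", "up" and "left" are assumptions of this sketch; ties between the two non-matching cases go to "left", mirroring the strict ">" comparison in the pseudocode.

```python
def lcs(x: str, y: str):
    """Return (length, one longest common subsequence) of x and y."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]     # c[i][j]: LCS length of x[:i], y[:j]
    b = [[None] * (n + 1) for _ in range(m + 1)]  # b[i][j]: which case was used
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"
    # Trace back from (m, n); every "diag" step contributes one symbol.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            out.append(x[i - 1]); i -= 1; j -= 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return c[m][n], "".join(reversed(out))

print(lcs("ABCBDAB", "BDCABA"))
```

For X = ABCBDAB and Y = BDCABA the length is 4. Several subsequences of that length exist, and which one the traceback reports depends on how ties are broken.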


Performance

• Running time: O(mn) to fill the table, plus O(m + n) for output
• Storage: O(mn)
  – Possible to eliminate the backpointer matrix for some problems

• Improvements
  – Overlap detection
  – Partitioning: find local alignments to seed the global alignment
  – Bounded DP
  – Gap opening vs. gap extension
  – Biochemically significant scoring function


Sources

Altschul, S.F., et al. 1990. Basic Local Alignment Search Tool. Journal of Molecular Biology 215(3): 403-410.

Bellman, Richard. 1957. Dynamic Programming. Princeton University Press, Princeton.

Cormen et al. 2001. Introduction to Algorithms. MIT Press, Cambridge.

Dreyfus, Stuart. 2002. Richard Bellman on the birth of dynamic programming. Operations Research 50: 48-51.

Durbin et al. 1998. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, New York.

Gotoh, O. 1982. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162: 705-708.

Gusfield, Dan. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York.

Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

Preiss, B.R. Data Structures and Algorithms with Object-Oriented Design Patterns in C#.

Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147: 195-197.

Wikipedia


Sequence Alignment Algorithm X

Given two strings x = x1x2...xM, y = y1y2...yN,

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

find the alignment with maximum score

F = (# matches) · m - (# mismatches) · s - (# gaps) · d

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC