Dynamic Programming Algorithms and
Sequence Alignment
A T - G T T A T -
A T C G - T A - C
5 matches, 2 insertions, 2 deletions
1. Change Problem
2. Manhattan Tourist Problem
3. Longest Paths in Graphs
4. Sequence Alignment
5. Edit Distance
Outline
• Say we want to provide change totaling 97 cents.
• We could do this in a large number of ways, but the quickest way to do it would be:
• Three quarters = 75 cents
• Two dimes = 20 cents
• Two pennies = 2 cents
• Question 1: How do we know that this is quickest?
• Question 2: Can we generalize to arbitrary denominations?
The Change Problem
• Goal: Convert some amount of money M into given denominations, using the fewest possible number of coins.
• Input: An amount of money M, and an array of d denominations c = (c1, c2, …, cd), in decreasing order of value (c1 > c2 > … > cd).
• Output: A list of d integers i1, i2, …, id such that
c1i1 + c2i2 + … + cdid = M
and i1 + i2 + … + id is minimal.
The Change Problem: Formal Statement
• Given the denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?
• Only one coin is needed to make change for the values 1, 3, and 5.
• However, two coins are needed to make change for the values 2, 4, 6, 8, and 10.
• Lastly, three coins are needed to make change for 7 and 9.
Value            1  2  3  4  5  6  7  8  9  10
Min # of coins   1  2  1  2  1  2  3  2  3  2
The Change Problem: Another Example
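• In general, given the denominations c: c1, c2, …, cd, the recurrence relation is (MinNumCoins(M) denotes the minimum number of coins needed for the amount M):
$$\mathrm{MinNumCoins}(M) = \min_{1 \le k \le d,\; c_k \le M} \left( \mathrm{MinNumCoins}(M - c_k) + 1 \right), \qquad \mathrm{MinNumCoins}(0) = 0$$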
The Change Problem: Recurrence
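The recurrence can be turned into a direct recursive function. The sketch below is a minimal illustration, not the slides' own pseudocode; the name `recursive_change` and the use of a tuple for the denominations are assumptions made for this example. Each call tries every denomination that fits and keeps the best result plus one coin.

```python
def recursive_change(amount, coins):
    """Minimum number of coins for `amount`, by plain recursion.

    Hypothetical helper for illustration: it recomputes the same
    sub-amounts many times, so it is only practical for small amounts.
    """
    if amount == 0:
        return 0
    best = float("inf")
    for c in coins:
        if amount >= c:
            # Use one coin of value c, then solve the smaller amount.
            best = min(best, recursive_change(amount - c, coins) + 1)
    return best


# Denominations 1, 3, 5 for the values 1..10.
print([recursive_change(m, (1, 3, 5)) for m in range(1, 11)])
```

Because no results are stored, the running time grows exponentially with the amount.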
[Figure: the RecursiveChange tree for M = 77 and c = (1, 3, 7). The root 77 branches to 76, 74, and 70; every node branches again by subtracting 1, 3, or 7, so the same amounts (for example 70) appear in many branches.]
The RecursiveChange Tree
• RecursiveChange recalculates the optimal coin combination for a given amount of money repeatedly.
• M = 77, c = (1,3,7):
• The optimal coin combination for 70 cents is computed 9 times!
• The optimal coin combination for 50 cents is computed billions of times!
RecursiveChange: Inefficiencies
• Save results of each computation for all amounts from 0 to M.
– Look up an already computed value instead of recomputing it.
• Running time: O(M·d), where M is the amount of money and d is the number of denominations.
• Dynamic Programming.
RecursiveChange: Improvement
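A minimal sketch of this improvement, assuming a bottom-up table indexed by amount (the function name `dp_change` is chosen here for illustration): each entry is computed once from smaller entries, which gives the M·d running time.

```python
def dp_change(amount, coins):
    """Return a list `best` where best[m] is the minimum number of coins for m.

    Illustrative sketch: amounts are filled in increasing order, so every
    best[m - c] needed below has already been computed.
    """
    best = [0] + [float("inf")] * amount
    for m in range(1, amount + 1):
        for c in coins:
            if m >= c and best[m - c] + 1 < best[m]:
                best[m] = best[m - c] + 1
    return best


print(dp_change(9, (1, 3, 7)))
```

The outer loop runs M times and the inner loop d times, so no amount is ever solved twice.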
• For example, let us take c = (1, 3, 7), M = 9:
Amount           0  1  2  3  4  5  6  7  8  9
Min # of coins   0  1  2  1  2  3  2  1  2  3
DPChange: Example
DPChange builds up from easier problem instances to the desired one, avoiding repetition.
[Figure: a street grid showing the Hotel, the Station, and attractions marked by *.]
• Imagine that you are a tourist in Manhattan, whose streets are represented by the grid on the right.
• You are leaving town, and you want to see as many attractions (represented by *) as possible.
• Your time is limited: you only have time to travel east and south.
• What is the best path through town?
Additional Example: Manhattan Tourist Problem
• Goal: Find the longest path in a weighted grid.
• Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink.”
• Output: A longest path in G from “source” to “sink.”
Manhattan Tourist Problem (MTP): Formulation
• Our first try at solving the MTP will use a greedy algorithm.
• Main Idea: At each node (intersection), choose the edge (street) departing that node which has the greatest weight.
MTP Greedy Algorithm
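The greedy idea can be sketched as follows. The grid below is a small hypothetical example (not the grid from the slides), and the layout of the `down` and `right` weight lists is an assumption made for this sketch: `down[i][j]` is the weight of the edge from (i, j) to (i+1, j), and `right[i][j]` is the weight of the edge from (i, j) to (i, j+1).

```python
def greedy_path_score(down, right):
    """Follow the heaviest outgoing edge at every vertex (ties go east).

    Illustrative sketch only: the result is the score of one path and is
    not guaranteed to be the longest path.
    """
    n, m = len(down), len(right[0])  # n rows and m columns of edges
    i = j = score = 0
    while i < n or j < m:
        go_down = down[i][j] if i < n else None
        go_right = right[i][j] if j < m else None
        if go_right is None or (go_down is not None and go_down > go_right):
            score += go_down
            i += 1
        else:
            score += go_right
            j += 1
    return score


# Hypothetical 3 x 3 vertex grid.
down = [[1, 0, 2],
        [4, 3, 1]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
print(greedy_path_score(down, right))  # 8
```

On this hypothetical grid the greedy path scores 8, although a path of weight 11 exists (down, down, east, east is one of them), which is why greedy is only a first try.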
[Figure: a weighted grid with i (row) and j (column) coordinates 0 to 4 and labeled source and sink vertices. Successive builds trace the greedy path step by step; its running score is 0, 3, 5, 9, 13, 15, 19, 20, 23.]
MTP Greedy Algorithm: Example
[Figure: the same weighted grid. Successive builds fill in the dynamic programming score of each vertex, starting from 0 at the source; the final build ends with a score of 34.]
MTP DP Algorithm: Example
MTP: DP Implementation
w: weights of N to S edges
w: weights of W to E edges
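• The score si, j for a point (i,j) is given by the recurrence:
$$s_{i,j} = \max \{\, s_{i-1,j} + \text{weight of the edge from } (i-1,j) \text{ to } (i,j),\;\; s_{i,j-1} + \text{weight of the edge from } (i,j-1) \text{ to } (i,j) \,\}$$
• The running time is O(n·m) for an n by m grid.
• (n = # of rows, m = # of columns)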
MTP: Running Time with Dynamic Programming
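A minimal sketch of the dynamic programming solution, under the same assumed layout as a typical grid representation: `down[i][j]` is the weight of the N to S edge leaving (i, j) and `right[i][j]` the weight of the W to E edge leaving (i, j). The grid values are hypothetical and the function name is chosen for this example.

```python
def manhattan_tourist(down, right):
    """Length of a longest source-to-sink path in a weighted grid.

    Illustrative sketch: s[i][j] is the best score of any path from
    (0, 0) to (i, j) that moves only south or east.
    """
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only south moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only east moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Hypothetical 3 x 3 vertex grid.
down = [[1, 0, 2],
        [4, 3, 1]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
print(manhattan_tourist(down, right))  # 11
```

Every vertex is scored once from its two predecessors, which matches the n by m running time stated above.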
• We would like to compute the score for point v in an arbitrary graph.
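• Let Predecessors(v) be the set of vertices with edges leading into v. Then the recurrence is given by:
$$s_v = \max_{u \in \mathrm{Predecessors}(v)} \left( s_u + \text{weight of the edge from } u \text{ to } v \right)$$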
The running time for a graph with E edges is O(E), since each edge is evaluated once.
Recursion for an Arbitrary Graph
• Traversal – order of visiting vertices
• By the time the vertex x is analyzed, the values sy for all its predecessors y should already be computed.
• If the graph has a cycle, we will get stuck in the pattern of going over and over the same cycle.
• The Manhattan grid restricts movement to the east and south directions, which avoids this problem.
Recursion for an Arbitrary Graph: Problem
• Directed Acyclic Graph (DAG): A graph in which each edge is given an orientation, and which has no cycles.
– Edges in a DAG are represented with directed arrows.
http://commons.wikimedia.org/wiki/File:Directed_acyclic_graph.svg
Some Graph Theory Terminology
• Topological Ordering: A labeling of the vertices of a DAG (from 1 to n, say) such that every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label.
• In other words, if vertices are positioned on a line in an increasing order, then all edges go from left to right.
• Theorem: Every DAG has a topological ordering.
• What this means: Every DAG has a source node (1) and a sink node (n).
Some Graph Theory Terminology
• Goal: Find a longest path between two vertices in a weighted DAG.
• Input: A weighted DAG G with source and sink vertices.
• Output: A longest path in G from source to sink.
• Note: Now we know that we can apply a topological ordering to G, and then use dynamic programming to find the longest path in G.
Longest Path in a DAG: Formulation
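One way to sketch this in Python, assuming Python 3.9 or later so that the standard-library `graphlib.TopologicalSorter` is available. The edge-dictionary format, the function name, and the example graph are all hypothetical choices for this illustration; the sketch also assumes the sink is reachable from the source.

```python
from graphlib import TopologicalSorter


def longest_path(edges, source, sink):
    """Longest source-to-sink path in a weighted DAG.

    `edges` maps (u, v) pairs to weights (an assumed input format).
    Vertices are scored in topological order, so every predecessor of v
    is scored before v itself.
    """
    preds = {}
    for (u, v), weight in edges.items():
        preds.setdefault(v, []).append((u, weight))
        preds.setdefault(u, [])
    # TopologicalSorter expects a mapping: node -> iterable of predecessors.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()

    score = {source: 0}
    back = {}
    for v in order:
        for u, weight in preds[v]:
            # Only extend paths that actually start at the source.
            if u in score and score[u] + weight > score.get(v, float("-inf")):
                score[v] = score[u] + weight
                back[v] = u

    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return score[sink], path[::-1]


# Hypothetical DAG.
edges = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1, ("b", "d"): 2, ("c", "d"): 7}
print(longest_path(edges, "a", "d"))  # (11, ['a', 'b', 'c', 'd'])
```

Each edge is examined once after the ordering is computed, in line with the O(E) bound for the recurrence.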
Back to Biology: Sequence Alignment
• Original problem: Assign a similarity score to two DNA sequences
• Alignment matrix
v = ATGTTAT
w = ATCGTAC
A T - G T T A T -
A T C G - T A - C
5 matches, 2 insertions, 2 deletions
• Given two sequences, v = v1 v2…vm and w = w1 w2…wn, a common subsequence of v and w is a sequence of positions in v: 1 ≤ i1 < i2 < … < it ≤ m and a sequence of positions in w: 1 ≤ j1 < j2 < … < jt ≤ n such that, for every k from 1 to t, the ik-th letter of v is equal to the jk-th letter of w.
• Example: v = ATGCCAT, w = TCGGGCTATC. Then take:
• i1 = 2, i2 = 3, i3 = 6, i4 = 7
• j1 = 1, j2 = 3, j3 = 8, j4 = 9
– This gives us that the common subsequence is TGAT.
Common Subsequence
• Given two sequences v = v1 v2…vm and w = w1 w2…wn
the Longest Common Subsequence (LCS) of v and w is a sequence of positions in v: 1 ≤ i1 < i2 < … < iT ≤ m and a sequence of positions in w: 1 ≤ j1 < j2 < … < jT ≤ n such that, for every k from 1 to T, the ik-th letter of v is equal to the jk-th letter of w, and T is maximal.
• Example: v = ATGCCAT, w = TCGGGCTATC.
• TGCAT is a longer common subsequence than TGAT.
• Find the LCS of two sequences.
Longest Common Subsequence
[Figure: the edit graph for the LCS problem, with the letters of ATCTGATC labeling the columns (j = 0 to 8) and the letters of TGCATAC labeling the rows (i = 0 to 7). Diagonal edges of weight +1 mark matching letters.]
• Assign one sequence to the rows, and one to the columns.
• Every diagonal edge represents a match of elements.
• In a path from source to sink, the diagonal edges represent a common subsequence.
• Example common subsequence: TGAT
• LCS Problem: Find a path with the maximum number of diagonal edges.
Edit Graph for LCS Problem
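• Let vi = prefix of v of length i: v1 … vi
• and wj = prefix of w of length j: w1 … wj
• The length of LCS(vi,wj), written si,j, is computed by the following recurrence, where the third option applies only if the i-th letter of v equals the j-th letter of w:
$$s_{i,j} = \max \{\, s_{i-1,j},\;\; s_{i,j-1},\;\; s_{i-1,j-1} + 1 \,\}, \qquad s_{i,0} = s_{0,j} = 0$$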
Computing the LCS: Dynamic Programming
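A minimal sketch of the LCS computation with backtracking, assuming plain Python strings as input (the function name `lcs` is chosen for this example). The table is filled row by row, then one longest common subsequence is recovered by walking back from the sink.

```python
def lcs(v, w):
    """Return one longest common subsequence of strings v and w.

    Illustrative sketch: s[i][j] is the LCS length of the prefixes
    v[:i] and w[:j].
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    # Backtrack from (n, m): a diagonal step on a match emits a letter.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ATGTTAT", "ATCGTAC"))  # ATGTA
```

When several longest common subsequences exist, which one is returned depends on the tie-breaking order in the backtracking loop.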
• The Hamming Distance dH(v, w) between two DNA sequences v and w of the same length is equal to the number of places in which the two sequences differ.
• Example: Given v and w as follows, dH(v, w) = 8:
v: ATATATAT
w: TATATATA
• Yet these sequences are very similar!
• Hamming Distance is therefore not an ideal similarity score, because it ignores insertions and deletions.
Hamming Distance
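The Hamming distance is a position-by-position comparison, which a short sketch makes concrete (the function name is chosen for this example):

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```

Every position differs in this pair, even though shifting one sequence by a single nucleotide makes them nearly identical.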
• Edit distance: the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.
d(v,w) = MIN number of elementary operations to transform v → w
Edit Distance
• Shift w one nucleotide to the right, and see that w is obtained from v by one insertion and one deletion:
v: ATATATAT-
w: -TATATATA
• Hence the edit distance, d(v, w) = 2.
• Note: In order to provide this distance, we had to “fiddle” with the sequences. Hamming distance was easier to find.
Edit Distance: Example 1
• We can transform TGCATAT → ATCCGAT in 5 steps:
TGCATAT → (delete last T) → TGCATA
TGCATA → (delete last A) → TGCAT
TGCAT → (insert A at front) → ATGCAT
ATGCAT → (substitute C for G) → ATCCAT
ATCCAT → (insert G before last A) → ATCCGAT
• Note: This only allows us to conclude that the edit distance is at most 5.
Edit Distance: Example 2
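• Theorem: Given two sequences v and w of length m and n, the edit distance d(v,w), counted with insertions and deletions only, is given by d(v,w) = m + n – 2s(v,w), where s(v,w) is the length of the longest common subsequence of v and w. For example, v = ATATATAT and w = TATATATA have s(v,w) = 7, so d(v,w) = 8 + 8 – 14 = 2.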
Solving the LCS problem for v and w is equivalent to finding the edit distance between them!
Key Result
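The relationship can be checked numerically. The sketch below assumes that only insertions and deletions are counted (no substitutions), in which case every letter outside a longest common subsequence costs one operation, giving m + n − 2·s(v,w). Both function names are chosen for this example.

```python
def lcs_length(v, w):
    """Length of a longest common subsequence of v and w."""
    prev = [0] * (len(w) + 1)
    for a in v:
        cur = [0]
        for j, b in enumerate(w, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(v, w):
    """Minimum number of insertions and deletions turning v into w."""
    prev = list(range(len(w) + 1))          # build w[:j] from nothing
    for i, a in enumerate(v, start=1):
        cur = [i]                            # delete all of v[:i]
        for j, b in enumerate(w, start=1):
            best = min(prev[j] + 1, cur[j - 1] + 1)
            if a == b:
                best = min(best, prev[j - 1])  # a match costs nothing
            cur.append(best)
        prev = cur
    return prev[-1]


v, w = "ATATATAT", "TATATATA"
print(indel_distance(v, w), len(v) + len(w) - 2 * lcs_length(v, w))  # 2 2
```

The two values are computed independently and agree, which illustrates why solving LCS also yields this distance.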
• Every alignment corresponds to a path from source to sink.
• Horizontal and vertical edges correspond to indels (deletions and insertions).
• Diagonal edges correspond to matches and mismatches.
Return to the Edit Graph
• Find LCS in ATCGTAC, ATGTTAT.
• Match: +1
• Mismatches and indels: 0
• Score (0,j) = 0 and Score (i,0) = 0, since these vertices are reached by indels only.
• Score (1,1): three possibilities (indel: 0, indel: 0, match A/A: 1), so Score (1,1) = 1.
• Filling in the remaining scores row by row gives:
      ε  A  T  C  G  T  A  C
   ε  0  0  0  0  0  0  0  0
   A  0  1  1  1  1  1  1  1
   T  0  1  2  2  2  2  2  2
   G  0  1  2  2  3  3  3  3
   T  0  1  2  2  3  4  4  4
   T  0  1  2  2  3  4  4  4
   A  0  1  2  2  3  4  5  5
   T  0  1  2  2  3  4  5  5
• Optimal Alignment, LCS (traced back from the sink; the LCS is ATGTA, of length 5):
A T C G - T A - C
A T - G T T A T -
Alignment as a Path in the Edit Graph: Example
O(nm) to fill in the n x m dynamic programming matrix: the pseudocode consists of a nested “for” loop inside of another “for” loop.
LCS: Runtime
Simplest scoring schema: For some positive numbers μ and σ:
– Match Premium: +1
– Mismatch Penalty: –μ
– Indel Penalty: –σ
Alignment score = #matches – μ(#mismatches) – σ(#indels)
Choice of µ and σ depends on how we wish to penalize mismatches and indels.
From LCS to Alignment: Change the Scoring
The Global Alignment Problem
Input : Strings v and w and a scoring schema
Output : An alignment with maximum score
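Use DP to solve the Global Alignment Problem:
$$s_{i,j} = \max \{\, s_{i-1,j-1} + 1 \text{ if } v_i = w_j,\;\; s_{i-1,j-1} - \mu \text{ if } v_i \ne w_j,\;\; s_{i-1,j} - \sigma,\;\; s_{i,j-1} - \sigma \,\}$$
μ: mismatch penalty
σ: indel penalty
• Align ATCGTAC and ATGTTAT with μ = 1, σ = 0.5.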
Needleman and Wunsch Algorithm
ACTCG vs. ACAGTAG
Gap Penalty = -1, Match Score = +1, Mismatch Score = 0
         A   C   T   C   G
     0  -1  -2  -3  -4  -5
A   -1   1   0  -1  -2  -3
C   -2   0   2   1   0  -1
A   -3  -1   1   2   1   0
G   -4  -2   0   1   2   2
T   -5  -3  -1   1   1   2
A   -6  -4  -2   0   1   1
G   -7  -5  -3  -1   0   2
Resulting alignment (score 2):
A C A G T A G
A C – – T C G
Needleman and Wunsch Algorithm
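The table for this example can be reproduced with a short sketch of global alignment scoring. The function name and default parameters are chosen for this illustration; they encode the scores stated in the example (match +1, mismatch 0, gap −1).

```python
def needleman_wunsch(v, w, match=1, mismatch=0, gap=-1):
    """Fill the global alignment score table for strings v (rows) and w (columns).

    Illustrative sketch: returns the full table; the bottom-right entry
    is the score of an optimal global alignment.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leading gaps in w
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):          # leading gaps in v
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s


table = needleman_wunsch("ACAGTAG", "ACTCG")
for row in table:
    print(row)
print("score:", table[-1][-1])  # score: 2
```

An alignment itself can be recovered by backtracking from the bottom-right cell, following whichever of the three options produced each score.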
Scoring Matrices: Example
• Align AGTCA and CGTTGG with the scoring matrix:
       A     G     T     C     —
A      1   -0.8  -0.2  -2.3  -0.6
G    -0.8    1   -1.1  -0.7  -1.5
T    -0.2  -1.1    1   -0.5  -0.9
C    -2.3  -0.7  -0.5    1   -1
—    -0.6  -1.5  -0.9  -1    n/a
Sample Alignment:
A - G T C - A
- C G T T G G
Score: –0.6 – 1 + 1 + 1 – 0.5 – 1.5 – 0.8 = –2.4
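Scoring an alignment with such a matrix is a column-by-column sum. The sketch below stores the matrix from this example in a dictionary; the names `DELTA`, `delta`, and `alignment_score` are chosen for illustration, and `-` stands for the gap symbol.

```python
# Upper triangle of the symmetric scoring matrix from the example.
DELTA = {
    ("A", "A"): 1, ("A", "G"): -0.8, ("A", "T"): -0.2, ("A", "C"): -2.3, ("A", "-"): -0.6,
    ("G", "G"): 1, ("G", "T"): -1.1, ("G", "C"): -0.7, ("G", "-"): -1.5,
    ("T", "T"): 1, ("T", "C"): -0.5, ("T", "-"): -0.9,
    ("C", "C"): 1, ("C", "-"): -1.0,
}


def delta(a, b):
    """Look up the score of aligning symbols a and b (order does not matter)."""
    if (a, b) in DELTA:
        return DELTA[(a, b)]
    return DELTA[(b, a)]  # raises KeyError for gap against gap (n/a)


def alignment_score(top, bottom):
    """Sum the column scores of two equal-length alignment rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    return sum(delta(a, b) for a, b in zip(top, bottom))


print(round(alignment_score("A-GTC-A", "-CGTTGG"), 1))  # -2.4
```

The result is rounded for display because the column scores are floating-point values.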
How Do We Make a Scoring Matrix?
Scoring matrices are created based on biological evidence.
Alignments can be thought of as two sequences that differ due to mutations.
Some of these mutations have little effect on the protein’s function, so some penalties, δ(vi , wj), will be less harsh than others.
Amino Acid Scoring Matrix
A R N K
A 5 -2 -1 -1
R -2 7 -1 3
N -1 -1 7 0
K -1 3 0 6
R and K have a positive mismatch score. Both are positively charged amino acids, so this mismatch will not greatly change the function of the protein. Mismatch scores are positive for amino acid changes that tend to preserve the physicochemical properties of the original residue (identical polarity, similar behaviour).
Scoring Matrices: Amino Acid vs. DNA
Two commonly used amino acid substitution matrices:
1. PAM
2. BLOSUM
DNA substitution matrices:
• DNA is less conserved than protein sequences
• It is therefore less effective to compare sequences at the nucleotide level
PAM
PAM: Stands for Point Accepted Mutation
1 PAM = PAM1 = 1% average change of all amino acid positions.
• Note: This doesn’t mean that after 100 PAMs of evolution, every residue will have changed:
• Some residues may have mutated several times.
• Some residues may have returned to their original state.
• Some residues may not have changed at all.
PAMX
PAMx = PAM1^x (x iterations of PAM1)
– Example: PAM250 = PAM1^250
PAM250 is a widely used scoring matrix:
          Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
           A   R   N   D   C   Q   E   G   H   I   L   K  ...
Ala A     13   6   9   9   5   8   9  12   6   8   6   7  ...
Arg R      3  17   4   3   2   5   3   2   6   3   2   9
Asn N      4   4   6   7   2   5   6   4   6   3   2   5
Asp D      5   4   8  11   1   7  10   5   6   3   2   5
Cys C      2   1   1   1  52   1   1   2   2   2   1   1
Gln Q      3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W      0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y      1   1   2   1   3   1   1   1   3   2   2   1
Val V      7   4   4   4   4   4   4   4   5   4  15  10
BLOSUM
BLOSUM: Stands for Blocks Substitution Matrix
Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
• BLOSUM62 was created using sequences sharing no more than 62% identity.
C S T P … F Y W
C 9 -1 -1 3 … -2 -2 -2
S -1 4 1 -1 … -2 -2 -3
T -1 1 4 1 … -2 -2 -3
P 3 -1 1 7 … -4 -3 -4
… … … … … … … … …
F -2 -2 -2 -4 … 6 3 1
Y -2 -2 -2 -3 … 3 7 2
W -2 -3 -3 -4 … 1 2 11
http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
Local vs. Global Alignment: Example
• Global Alignment:
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
• Local Alignment (a better alignment for finding the conserved segment):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local Alignment: Why?
Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions.
Example: Homeobox genes (which regulate embryonic development) have a short region called the homeodomain that is highly conserved among species.
• Aligning the entire sequence (global alignment) may miss homeodomains.
• Search for an alignment which has a positive score locally
• (Alignment on substrings of the given sequences that has a positive score)
Local Alignment: Illustration
Global alignment
Compute a “mini” Global Alignment to get Local Alignment
The Local Alignment Problem
Goal: Find the best local alignment between two strings.
Input : Strings v and w as well as a scoring matrix δ
Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings of v and w.
Local Alignment: How to Solve?
Global Alignment Problem finds the longest path between vertices (0,0) and (n,m) in the edit graph.
Local Alignment Problem finds the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.
In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment.
Global alignment
Local alignment
The Problem with This Setup
• In the grid of size n x n there are ~n^2 vertices (i,j) that may serve as a source.
• For each such vertex computing alignments from (i,j) to (i’,j’) takes O(n^2) time.
• This gives an overall runtime of O(n^4), which is a bit too slow…can we do better?
Local Alignment Solution: Free Rides
• Add “free” edges to the edit graph.
• The dashed edges represent the “free rides” from (0, 0) to every other node.
• Each “free ride” is assigned an edge weight of 0.
• If we start at (0, 0) instead of (i, j) and maximize the longest path to (i’, j’), we will obtain the local alignment.
Smith-Waterman Local Alignment Algorithm
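• The largest value of si,j over the whole edit graph is the score of the best local alignment.
• The recurrence (the option 0 is the “free ride” from the source):
$$s_{i,j} = \max \{\, 0,\;\; s_{i-1,j-1} + \delta(v_i, w_j),\;\; s_{i-1,j} + \delta(v_i, -),\;\; s_{i,j-1} + \delta(-, w_j) \,\}$$
• Running time: O(n^2)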
Smith and Waterman Algorithm
A A C C T A T A G C T
0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 1 0 0
C 0 0 0 1 1 0 0 0 0 0 2 1
G 0 0 0 2 0 0 0 0 0 1 0 1
A 0 1 1 1 0 0 1 0 1 0 0 0
T 0 0 0 0 0 1 0 2 1 0 0 1
A 0 0 1 3 0 0 2 0 3 2 1 0
T 0 0 0 3 0 0 1 3 2 2 1 2
A 0 0 0 3 0 0 2 2 4 3 2 1
AACCTATAGCT, GCGATATA
Gap Penalty = -1, Match Score = +1, Mismatch Score = 0
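A minimal sketch of local alignment scoring with the free-ride option, using the scores stated for this example (match +1, mismatch 0, gap −1). The function name is chosen for illustration, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(v, w, match=1, mismatch=0, gap=-1):
    """Score of the best local alignment between v and w.

    Illustrative sketch: the extra option 0 in the max lets an alignment
    start at any vertex for free; the answer is the largest entry seen.
    Only two rows of the table are kept in memory.
    """
    best = 0
    prev = [0] * (len(w) + 1)
    for i in range(1, len(v) + 1):
        cur = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("GCGATATA", "AACCTATAGCT"))  # 4
```

The best local alignment here pairs the shared segment TATA, which gives the score 4 under these parameters.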