Inexact Matching
• General Problem
  – Input
    • Strings S and T
  – Questions
    • How distant is S from T?
    • How similar is S to T?
• Solution Technique
  – Dynamic programming with a cost/similarity/scoring matrix
Biological Motivation
• Read pages 210-214 in textbook
• “First Fact of Biological Sequence Analysis”
  – In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
  – In short: sequence similarity implies functional/structural similarity
  – Converse is NOT true: functional or structural similarity does not imply sequence similarity
  – Evolution reuses, builds upon, duplicates, and modifies “successful” structures
Measuring Distance of S and T
• Consider S and T
• We can transform S into T using the following four operations
  – insertion of a character into S
  – deletion of a character from S
  – substitution (replacement) of a character in S by another character (typically in T)
  – matching (no operation)
Example
• S = vintner
• T = writers
• vintner
• wintner (Replace v with w)
• wrintner (Insert r)
• writner (Delete first n)
• writer (Delete second n)
• writers (Insert s)
Example
• Edit Transcript (or just transcript):
  – a string that describes the transformation of one string into the other
  – one symbol per step: R (replace), I (insert), D (delete), M (match)
• Example
    R I M D M D M M I
    v   i n t n e r
    w r i   t   e r s
Edit Distance
• Edit distance of strings S and T
  – The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
  – Also called Levenshtein distance [299]; Levenshtein appears to have been the first to define this concept
• Optimal transcript
  – An edit transcript of S and T that has the minimum number of edit operations
  – Cooptimal transcripts: there may be more than one optimal transcript
Alignment
• A global alignment of strings S and T is obtained
  – by inserting spaces (dashes) into S and T
    • afterwards both strings should have the same number of characters (including dashes)
  – then placing the two strings one over the other, matching each character (or dash) in S with a unique character (or dash) in T
  – Note ALL positions in both S and T are involved
  – Later, we will consider local alignments
Alignments and Edit transcripts
• Example Alignment
  – v-intner-
  – wri-t-ers
• Alignments and edit transcripts are interrelated
  – edit transcript: emphasizes process
    • the specific mutational events
  – alignment: emphasizes product
    • the relationship between the two strings
  – Alignments are often easier to work with and visualize
    • they also generalize better to more than 2 strings
Edit Distance Problem
• Input
  – 2 strings S and T
• Task
  – Output edit distance of S and T
  – Output optimal edit transcript
  – Output optimal alignment
• Solution method
  – Dynamic Programming
Definition of D(i,j)
• Let D(i,j) be the edit distance of S[1..i] and T[1..j]
  – The edit distance of the first i characters of S with the first j characters of T
  – Let |S| = n, |T| = m
• D(n,m) = edit distance of S and T
• We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m
Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = i
  – For 0 <= j <= m, D(0,j) = j
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + 1
    • D(i,j-1) + 1
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = 0 if S(i) = T(j) and is 1 otherwise
What the various cases mean
• D(i,j) = min
  – D(i-1,j) + 1
    • Align S[1..i-1] with T[1..j] optimally
    • Match S(i) with a dash in T
  – D(i,j-1) + 1
    • Align S[1..i] with T[1..j-1] optimally
    • Match a dash in S with T(j)
  – D(i-1,j-1) + d(i,j)
    • Align S[1..i-1] with T[1..j-1] optimally
    • Match S(i) with T(j)
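The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not code from the textbook; the function name `edit_distance_table` is made up here, and the only subtlety is that Python strings are 0-indexed while the slides index S and T from 1.

```python
def edit_distance_table(s, t):
    """Fill the (n+1) x (m+1) table D row by row and return it."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base case: S[1..i] against the empty string
    for j in range(m + 1):
        D[0][j] = j  # base case: the empty string against T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # d(i,j)
            D[i][j] = min(D[i - 1][j] + 1,       # S(i) opposite a dash
                          D[i][j - 1] + 1,       # T(j) opposite a dash
                          D[i - 1][j - 1] + d)   # S(i) opposite T(j)
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5, the edit distance D(n,m)
```

The value 5 agrees with the five non-matching steps used to turn vintner into writers in the earlier example.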
Computing D(i,j) values
D(i,j)         w   r   i   t   e   r   s
           0   1   2   3   4   5   6   7
       0
   v   1
   i   2
   n   3
   t   4
   n   5
   e   6
   r   7
Initialization: Base Case
D(i,j)         w   r   i   t   e   r   s
           0   1   2   3   4   5   6   7
       0   0   1   2   3   4   5   6   7
   v   1   1
   i   2   2
   n   3   3
   t   4   4
   n   5   5
   e   6   6
   r   7   7
Row i=1
D(i,j)         w   r   i   t   e   r   s
           0   1   2   3   4   5   6   7
       0   0   1   2   3   4   5   6   7
   v   1   1   1   2   3   4   5   6   7
   i   2   2
   n   3   3
   t   4   4
   n   5   5
   e   6   6
   r   7   7
Entry i=2, j=3
D(i,j)         w   r   i   t   e   r   s
           0   1   2   3   4   5   6   7
       0   0   1   2   3   4   5   6   7
   v   1   1   1   2   3   4   5   6   7
   i   2   2   2   2   ?
   n   3   3
   t   4   4
   n   5   5
   e   6   6
   r   7   7
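As a worked instance of the recurrence, the entry marked "?" can be filled in from its three neighbours. Here S(2) = i and T(3) = i, so d(2,3) = 0; the values D(1,3) = 3, D(2,2) = 2, and D(1,2) = 2 are read from the table above.

```latex
\begin{aligned}
D(2,3) &= \min\{\, D(1,3) + 1,\; D(2,2) + 1,\; D(1,2) + d(2,3) \,\} \\
       &= \min\{\, 3 + 1,\; 2 + 1,\; 2 + 0 \,\} = 2
\end{aligned}
```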
Calculation methodologies
• Location of edit distance
  – D(n,m)
• Example was to calculate row by row
• Can also calculate column by column
• Can also use antidiagonals
• Key is to build from upper left corner, so that D(i-1,j), D(i,j-1), and D(i-1,j-1) are known before D(i,j) is computed
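To illustrate that the fill order does not matter as long as the three neighbours are ready, here is a sketch that fills the table by antidiagonals (cells with i + j = k). The function name is hypothetical; the recurrence is the same one used for the row-by-row fill.

```python
def edit_distance_antidiagonal(s, t):
    """Edit distance computed by sweeping antidiagonals i + j = k."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Every cell on antidiagonal k depends only on antidiagonals k-1 and
    # k-2, which are complete by the time antidiagonal k is processed.
    for k in range(2, n + m + 1):
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + d)
    return D[n][m]


print(edit_distance_antidiagonal("vintner", "writers"))  # 5
```

Cells on one antidiagonal are independent of each other, which is why this order is the usual starting point when the computation is to be parallelized.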
Traceback
• Using table to construct optimal transcript
• Pointers in cell D(i,j)
  – Set a pointer from cell (i,j) to
    • cell (i, j-1) if D(i,j) = D(i, j-1) + 1
    • cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    • cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
  – Follow path of pointers from (n,m) back to (0,0)
• Example: Figure 11.3 on page 222
What the pointers mean
• horizontal pointer: cell (i,j) to cell (i, j-1)
  – Align T(j) with a space in S
  – Insert T(j) into S
• vertical pointer: cell (i,j) to cell (i-1, j)
  – Align S(i) with a space in T
  – Delete S(i) from S
• diagonal pointer: cell (i,j) to cell (i-1, j-1)
  – Align S(i) with T(j)
  – Replace S(i) with T(j), or match if S(i) = T(j)
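The pointer rules can be sketched as a traceback that recomputes each pointer test on the fly instead of storing pointers. This is an illustrative sketch: the function names are invented here, and the tie-breaking order (diagonal, then vertical, then horizontal) is an arbitrary choice, so it returns just one of possibly several optimal transcripts.

```python
def optimal_transcript(s, t):
    """Return one optimal edit transcript over the letters M, R, I, D."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        d = 1 if i > 0 and j > 0 and s[i - 1] != t[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append("R" if d else "M")  # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                # vertical pointer: delete S(i)
            i -= 1
        else:
            ops.append("I")                # horizontal pointer: insert T(j)
            j -= 1
    return "".join(reversed(ops))


def transcript_to_alignment(s, t, transcript):
    """Turn a transcript into the two rows of the matching alignment."""
    top, bottom, i, j = [], [], 0, 0
    for op in transcript:
        if op == "I":
            top.append("-"); bottom.append(t[j]); j += 1
        elif op == "D":
            top.append(s[i]); bottom.append("-"); i += 1
        else:  # M or R consume one character of each string
            top.append(s[i]); bottom.append(t[j]); i += 1; j += 1
    return "".join(top), "".join(bottom)


tr = optimal_transcript("vintner", "writers")
print(tr)
print(*transcript_to_alignment("vintner", "writers", tr), sep="\n")
```

The traceback visits at most n + m cells, which matches the O(n+m) bound for recovering a single optimal transcript.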
Table and transcripts
• The pointers represent all optimal transcripts
• Theorem:
  – Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
  – Conversely, any optimal transcript is specified by such a path.
  – The correspondence between paths and transcripts is one to one.
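Because paths and optimal transcripts correspond one to one, counting pointer paths counts the cooptimal transcripts. The sketch below does this with a second table; the function name and the `paths` table are illustrative assumptions, not part of the slides.

```python
def count_optimal_transcripts(s, t):
    """Count the pointer paths from (n,m) back to (0,0)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)
    # paths[i][j] = number of pointer paths from cell (i,j) to cell (0,0)
    paths = [[0] * (m + 1) for _ in range(n + 1)]
    paths[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0
            if i > 0 and D[i][j] == D[i - 1][j] + 1:
                total += paths[i - 1][j]       # vertical pointer
            if j > 0 and D[i][j] == D[i][j - 1] + 1:
                total += paths[i][j - 1]       # horizontal pointer
            if i > 0 and j > 0:
                d = 0 if s[i - 1] == t[j - 1] else 1
                if D[i][j] == D[i - 1][j - 1] + d:
                    total += paths[i - 1][j - 1]  # diagonal pointer
            paths[i][j] = total
    return paths[n][m]


print(count_optimal_transcripts("ab", "ba"))  # 3: RR, DMI, and IMD
```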
Running Time
• Initialization of table
  – O(n+m)
• Calculating table and pointers
  – O(nm)
• Traceback for one optimal transcript or optimal alignment
  – O(n+m)
Operation-Weight Edit Distance
• Consider S and T
• We can assign weights to the various operations
  – insertion/deletion of a character: cost d
  – substitution (replacement) of a character: cost r
  – matching: cost e
  – Previous case: d = r = 1, e = 0
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = i d
  – For 0 <= j <= m, D(0,j) = j d
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + d
    • D(i,j-1) + d
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = e if S(i) = T(j) and is r otherwise
  – Note: d(i,j) is the per-cell match/substitution term; it is distinct from the insertion/deletion cost d
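A minimal sketch of the operation-weight recurrence, with the weights d, r, and e passed as parameters (the function name is an assumption). With the defaults d = r = 1 and e = 0 it reduces to plain edit distance.

```python
def operation_weight_distance(s, t, d=1, r=1, e=0):
    """Operation-weight edit distance with indel cost d, substitution cost r, match cost e."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = e if s[i - 1] == t[j - 1] else r  # the term d(i,j)
            D[i][j] = min(D[i - 1][j] + d,
                          D[i][j - 1] + d,
                          D[i - 1][j - 1] + diag)
    return D[n][m]


print(operation_weight_distance("vintner", "writers"))  # 5
print(operation_weight_distance("a", "b", d=1, r=3))    # 2
```

In the second call a substitution costs 3, so deleting a and inserting b (total cost 2) is cheaper than replacing a with b.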
Alphabet-Weight Edit Distance
• Define weight of each possible substitution
  – r(a,b) where a is being replaced by b, for all a,b in the alphabet
  – For example, with DNA, maybe r(A,T) > r(A,G)
  – Likewise, the insertion/deletion weight I(a) may vary by character
• Operation-weight edit distance is a special case of this variation
• Weighted edit distance refers to this alphabet-weight setting
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = Σ_{1 <= k <= i} I(S(k))
  – For 0 <= j <= m, D(0,j) = Σ_{1 <= k <= j} I(T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + I(S(i))
    • D(i,j-1) + I(T(j))
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = r(S(i), T(j))
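The alphabet-weight recurrence can be sketched by passing r and I as functions. The DNA weights below are purely hypothetical values chosen so that r(A,T) > r(A,G), as in the example above; they are not taken from the textbook.

```python
def alphabet_weight_distance(s, t, r, indel):
    """Alphabet-weight edit distance; r(a, b) and indel(a) supply the weights."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel(s[i - 1])  # running sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel(t[j - 1])  # running sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel(s[i - 1]),
                          D[i][j - 1] + indel(t[j - 1]),
                          D[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return D[n][m]


PURINES = {"A", "G"}  # the remaining DNA bases, C and T, are pyrimidines


def dna_r(a, b):
    """Hypothetical weights: 0 for a match, 1 within a base class, 2 across classes."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


def dna_indel(a):
    """Hypothetical uniform insertion/deletion weight."""
    return 2


print(alphabet_weight_distance("A", "G", dna_r, dna_indel))  # 1
print(alphabet_weight_distance("A", "T", dna_r, dna_indel))  # 2
```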
Measuring Similarity of S and T
• Definitions
  – Let Σ be the alphabet for strings S and T
  – Let Σ’ be the alphabet with the space character - added
  – For any two characters x,y in Σ’, s(x,y) denotes the value (or score) obtained by aligning x with y
  – For a given alignment A of S and T, let S’ and T’ denote the strings after the chosen insertion of spaces, and l their new (common) length
  – The value of alignment A is Σ_{1 <= i <= l} s(S’(i),T’(i))
Example
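• Alignment
  – a b a a - b a b
  – a a a a a b - b
• Value: 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
• Scoring matrix s (symmetric; only the upper triangle is shown)

    s    a    b    -
    a    1   -2    0
    b         2   -1
    -              0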
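The value of this alignment can be checked mechanically. The sketch below stores the upper triangle of the scoring matrix in a dictionary and assumes the matrix is symmetric, which is what the example's use of s(b,a) = -2 suggests; the names `SCORES`, `s`, and `alignment_value` are illustrative.

```python
# Upper triangle of the example scoring matrix; "-" is the space character.
SCORES = {
    ("a", "a"): 1, ("a", "b"): -2, ("a", "-"): 0,
    ("b", "b"): 2, ("b", "-"): -1,
    ("-", "-"): 0,
}


def s(x, y):
    """Look up s(x, y), assuming the matrix is symmetric."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_value(top, bottom):
    """Sum s over the columns of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(s(x, y) for x, y in zip(top, bottom))


print(alignment_value("abaa-bab", "aaaaab-b"))  # 5
```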
String Similarity Problem
• Input
  – 2 strings S and T
  – Scoring matrix s for alphabet Σ’
• Task
  – Output optimal alignment value of S and T
    • The alignment of S and T with maximal, not minimal, value
  – Output this alignment
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, V(i,0) = Σ_{1 <= k <= i} s(S(k),-)
  – For 0 <= j <= m, V(0,j) = Σ_{1 <= k <= j} s(-,T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max
    • V(i-1,j) + s(S(i),-)
    • V(i,j-1) + s(-,T(j))
    • V(i-1,j-1) + s(S(i), T(j))
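The similarity recurrence differs from the distance recurrence only in using max and a general scoring function. The sketch below takes the scoring function as a parameter; `simple_score` (match +1, mismatch or space -1) is a made-up scheme used only to have something concrete to run.

```python
def similarity(s_str, t_str, score):
    """Optimal global alignment value V(n,m) under score(x, y); '-' is the space."""
    n, m = len(s_str), len(t_str)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s_str[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t_str[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s_str[i - 1], "-"),
                          V[i][j - 1] + score("-", t_str[j - 1]),
                          V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
    return V[n][m]


def simple_score(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != "-" else -1


print(similarity("ab", "b", simple_score))  # 0: one space (-1) and one match (+1)
```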
Longest Common Subsequence Problem
• Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
• The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
  – subsequence: characters need not be contiguous
  – different from substring
• O(nm) solution:
  – Use a scoring matrix with 1 for a match and 0 for a mismatch or a space
  – The matched characters in an alignment of maximal value form a longest common subsequence
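A sketch of the O(nm) solution: the similarity table is filled with match score 1 and every other score 0, and a traceback collects the matched characters. The function name `lcs` and the tie-breaking in the traceback are choices made here for illustration.

```python
def lcs(s, t):
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # Base cases stay 0 because spaces score 0 under this scheme.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)
    # Traceback: every diagonal step over equal characters is a matched pair.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("vintner", "writers"))  # iter
```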
Similarity and Distance
• If we are focused on aligning both entire strings, maximizing similarity is essentially identical to minimizing distance
  – Just need to modify scoring matrices appropriately
• When we consider substrings of uncertain length, maximizing similarity often makes more sense than minimizing distance
  – Overlapping strings
  – Local alignment
Overlapping Strings
• Find best alignment where the two strings overlap without penalizing for the unmatched ends
  – Application: sequence assembly problem
    • strings are likely to overlap without being substrings of each other
• Solution method
  – End-space free variant of dynamic programming
  – Change base conditions so that V(i,0) = V(0,j) = 0
  – Need to search over row n and column m for optimal value
    • Optimal value may not be in entry (n,m)
• Why is max similarity better than min distance?
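A sketch of the end-space free variant: the base cases are zero, and the answer is the largest value in the last row (row n) or the last column (column m, where m = |T|). The scoring function `simple_score` is again a hypothetical match +1, mismatch/space -1 scheme.

```python
def overlap_value(s_str, t_str, score):
    """Best end-space free alignment value of s_str and t_str."""
    n, m = len(s_str), len(t_str)
    # V(i,0) = V(0,j) = 0: leading spaces in either string are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s_str[i - 1], "-"),
                          V[i][j - 1] + score("-", t_str[j - 1]),
                          V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
    # Trailing spaces are free too, so search row n and column m.
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


def simple_score(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != "-" else -1


print(overlap_value("xxxabc", "abcyyy", simple_score))  # 3
```

Here the suffix abc of the first string overlaps the prefix abc of the second, and the optimum sits at cell (6,3) in row n rather than at cell (n,m).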
Maximally Similar Substrings
• Local alignment problem
  – Input
    • Two strings S and T
  – Task
    • Find substrings s and t of S and T that have the maximum possible alignment value, and report this value.
    • Let v* denote this value.
• Why is max similarity better than min distance?
• Read pages 230-231 for motivation
Local suffix alignments
• Define v(i,j) to be the value of the optimal alignment of any of the i+1 suffixes of S[1..i] with any of the j+1 suffixes of T[1..j].
  – We bound v(i,j) to be at least 0 by scoring the alignment of two empty suffixes to be 0
• Theorem
  – v* (the value of the optimal local alignment) = max{ v(i,j) | 1 <= i <= n, 1 <= j <= m }
Recurrences for local suffix alignments
• Base Case:
  – For 0 <= i <= n, v(i,0) = 0
  – For 0 <= j <= m, v(0,j) = 0
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – v(i,j) = max
    • 0
    • v(i-1,j) + s(S(i),-)
    • v(i,j-1) + s(-,T(j))
    • v(i-1,j-1) + s(S(i), T(j))
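A sketch of the local suffix alignment recurrence, assuming the base cases v(i,0) = v(0,j) = 0 (two empty suffixes score 0). It returns v* together with the cell where the maximum occurs, which is where a traceback would start. The scoring function is a hypothetical match +1, mismatch/space -1 scheme.

```python
def local_alignment_value(s_str, t_str, score):
    """Return (v*, (i, j)) where cell (i, j) holds the maximum of the table v."""
    n, m = len(s_str), len(t_str)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,  # the pair of empty suffixes
                          v[i - 1][j] + score(s_str[i - 1], "-"),
                          v[i][j - 1] + score("-", t_str[j - 1]),
                          v[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1]))
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    return best, best_cell


def simple_score(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != "-" else -1


print(local_alignment_value("xxabcxx", "yyabcyy", simple_score))  # (3, (5, 5))
```

The two occurrences of abc end at position 5 in each string, so the maximum is found in cell (5,5) and not in cell (n,m).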
Comments
• Traceback
  – No longer start from cell (n,m)
  – Search whole table for max value and start from there
  – Follow pointers back until a cell with value 0 is reached
  – Still O(mn) running time
• Terminology
  – In the literature, the distinction between problem statements and solution methods is not always clear
  – Global alignment is often referred to as Needleman-Wunsch alignment
    • Their solution method was cubic in terms of m,n
  – Smith-Waterman is often used to refer to both local alignments and their solution method
Comments continued
• Scoring schemes
  – The utility of optimal local alignments is highly dependent on the scoring scheme
  – Examples
    • matches 1, mismatches & spaces 0 leads to longest common subsequence
    • mismatches and spaces big negatives leads to longest common substring
  – Average score in matrix must be negative, otherwise local alignments tend to be global
  – There is a theory developed about scoring schemes that we will cover later.
Aligning with Gaps
• Gaps: Any maximal run of spaces in a single string of a given alignment
• Example
  – S = aaabbbcccdddeeefff
  – T = aaabbbdddeeefffggg
  – Alignment
    • aaabbbcccdddeeefff---
    • aaabbb---dddeeefffggg
Scoring with gaps
• Example Scoring
  – aaabbbcccdddeeefff---
  – aaabbb---dddeeefffggg
  – 6 matches, 1 gap, 9 matches, 1 gap: 6 - 1 + 9 - 1 = 13 (each match scores 1, each gap scores -1)
• Why include gaps in scoring schemes?
  – Read 236-240
  – When an insertion/deletion event occurs, often more than a single character is inserted or deleted.
  – A single gap cost helps model the fact that a sequence of insertions/deletions is really one mutational event
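The example score can be reproduced by counting gaps as maximal runs of spaces. In this sketch the match score of 1 and the gap cost of 1 come from the example; the mismatch score is an assumption (the example contains no mismatches), and the function name is illustrative.

```python
import re


def score_with_gaps(top, bottom, match=1, mismatch=-1, gap_cost=1):
    """Return (value, number_of_gaps) for an alignment given as two rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    # A gap is a maximal run of spaces within a single row.
    gaps = len(re.findall(r"-+", top)) + len(re.findall(r"-+", bottom))
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            continue  # individual spaces are free; only whole gaps cost
        total += match if x == y else mismatch
    return total - gap_cost * gaps, gaps


print(score_with_gaps("aaabbbcccdddeeefff---", "aaabbb---dddeeefffggg"))  # (13, 2)
```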
Constant gap weight model
• We present a series of possible gap weight models, each of which is a special case of the next one
• Constant gap weight model– each individual space is free (Ws = 0)
– each gap has constant cost Wg
– Alignment problem boils down to finding an alignment that maximizes
• Match scores - mismatch scores - Wg(# of gaps)
– Dynamic programming can still solve in O(nm) time
Affine gap weight model
• Gap opening versus gap extension penalties
  – each gap has constant cost Wg (gap opening)
  – each individual space has cost Ws (gap extension); typically Ws < Wg
• Alignment problem boils down to finding an alignment that maximizes
  – Match scores - mismatch scores - Wg(# of gaps) - Ws(# of spaces)
• Dynamic programming can still solve in O(nm) time
• Probably most commonly used model because of efficiency and generality of model
Convex gap weight model
• Extension penalty should not be a constant but rather decrease as length of gap increases
  – One example
    • each gap has cost Wg + log q where q is the length of the gap
• Solution now requires more than O(nm) time
  – Chapter 13 gives an O(nm log m) time solution
  – Further improvement is possible, but costly
Arbitrary gap weight model
• Gap cost is an arbitrary function of gap length
  – each gap has cost w(q) where q is the length of the gap
  – no properties of w(q) are assumed (for example, that its second derivative is negative)
• Solution time is now O(nm^2 + n^2 m)
  – cubic cost, similar to original Needleman-Wunsch solution
Recurrences for arbitrary gap weights
• Base Case:
  – V(0,0) = 0
  – For 0 < i <= n, V(i,0) = -w(i)
  – For 0 < j <= m, V(0,j) = -w(j)
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max
    • V(i-1,j-1) + s(S(i),T(j))
    • max over 0 <= k <= j-1 of [V(i,k) - w(j-k)]
      – Match S[1..i] with T[1..k], then place a gap of length j-k opposite the end of T[1..j]
    • max over 0 <= k <= i-1 of [V(k,j) - w(i-k)]
      – Match S[1..k] with T[1..j], then place a gap of length i-k opposite the end of S[1..i]
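A direct sketch of the arbitrary gap weight recurrence. It takes the base cases as V(i,0) = -w(i) and V(0,j) = -w(j) so that they agree with the subtracted gap terms in the recursive case, and lets k range up to j-1 and i-1 inclusive so that gaps of length 1 are allowed. The names and the match +1, mismatch -1 scoring are assumptions made for the example.

```python
def arbitrary_gap_similarity(s_str, t_str, score, w):
    """Optimal alignment value when a gap of length q costs w(q)."""
    n, m = len(s_str), len(t_str)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1])
            # Gap of length j-k opposite T[k+1..j]: O(m) work per cell.
            e = max(V[i][k] - w(j - k) for k in range(j))
            # Gap of length i-k opposite S[k+1..i]: O(n) work per cell.
            f = max(V[k][j] - w(i - k) for k in range(i))
            V[i][j] = max(g, e, f)
    return V[n][m]


def match_score(x, y):
    """Hypothetical character scores: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Constant gap weight: every gap costs 1 regardless of its length.
print(arbitrary_gap_similarity("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                               match_score, lambda q: 1))  # 13
```

The two inner maxima are what push the running time to O(nm^2 + n^2 m).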
Recurrences for affine gap weights
• Base Case:
  – V(0,0) = 0
  – For 0 < i <= n, V(i,0) = F(i,0) = -Wg - iWs
  – For 0 < j <= m, V(0,j) = E(0,j) = -Wg - jWs
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max [E(i,j), F(i,j), G(i,j)]
    • G(i,j) = V(i-1,j-1) + s(S(i),T(j))
    • E(i,j) = max [E(i,j-1), V(i,j-1) - Wg] - Ws
      – max checks if the gap opposite T(j) began earlier (first term) or begins at T(j) (second term)
    • F(i,j) = max [F(i-1,j), V(i-1,j) - Wg] - Ws
      – max checks if the gap opposite S(i) began earlier (first term) or begins at S(i) (second term)
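A sketch of the affine recurrences with three tables. It assumes V(0,0) = 0, V(i,0) = F(i,0) = -Wg - iWs, and V(0,j) = E(0,j) = -Wg - jWs, and it sets the undefined boundary entries E(i,0) and F(0,j) to negative infinity so they never win a max; that last convention is a choice made here, as are the function names and the match +1, mismatch -1 scoring.

```python
NEG_INF = float("-inf")


def affine_gap_similarity(s_str, t_str, score, wg, ws):
    """Optimal alignment value with gap opening cost wg and per-space cost ws."""
    n, m = len(s_str), len(t_str)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # T(j) opposite a space
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S(i) opposite a space
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend the gap that was already open, or open a new one.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            g = V[i - 1][j - 1] + score(s_str[i - 1], t_str[j - 1])
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]


def match_score(x, y):
    """Hypothetical character scores: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# wg = 1, ws = 0 is the constant gap weight model as a special case.
print(affine_gap_similarity("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                            match_score, wg=1, ws=0))  # 13
```

Each cell does constant work across the three tables, which is why the affine model keeps the O(nm) running time.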