joanna-woods
Inexact Matching of Strings
• General Problem
  – Input
    • Strings S and T
  – Questions
    • How distant is S from T?
    • How similar is S to T?
• Solution Technique
  – Dynamic programming with a cost/similarity/scoring matrix
Measuring Distance of S and T
• Consider S and T
• We can transform S into T using the following four operations
  – insertion of a character into S
  – deletion of a character from S
  – substitution (replacement) of a character in S by another character (typically in T)
  – matching (no operation)
Example
• S = vintner
• T = writers
• vintner
• wintner (Replace v with w)
• wrintner (Insert r)
• writner (Delete first n)
• writer (Delete second n)
• writers (Insert s)
Example
• Edit Transcript (or just transcript):
  – a string that describes the transformation of one string into the other
• Example
  – R I M D M D M M I
  – v   i n t n e r
  – w r i   t   e r s
Edit Distance
• Edit distance of strings S and T
  – The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
  – Also called Levenshtein distance; Levenshtein appears to have been the first to define this concept
• Optimal transcript
  – An edit transcript of S and T that has the minimum number of edit operations
  – When several transcripts achieve this minimum, they are called cooptimal transcripts
Alignment
• A global alignment of strings S and T is obtained
  – by inserting spaces (dashes) into S and T
    • the two strings should end up with the same number of characters (including dashes)
  – then placing the two strings one over the other, matching each character (or dash) in S with a unique character (or dash) in T
  – Note: ALL positions in both S and T are involved
Alignments and Edit transcripts
• Example Alignment
  – v-intner-
  – wri-t-ers
• Alignments and edit transcripts are interrelated
  – edit transcript: emphasizes process
    • the specific mutational events
  – alignment: emphasizes product
    • the relationship between the two strings
  – Alignments are often easier to work with and visualize
    • they also generalize better to more than 2 strings
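The relationship between the two views can be made concrete. The sketch below reads an alignment column by column and emits the transcript letters used earlier (R, I, D, M). The function name `alignment_to_transcript` is a hypothetical helper, and it assumes that "-" marks a space.

```python
def alignment_to_transcript(s_row: str, t_row: str) -> str:
    """Convert a global alignment (two equal-length rows, '-' = space)
    into an edit transcript over the letters I, D, R, M."""
    if len(s_row) != len(t_row):
        raise ValueError("alignment rows must have the same length")
    ops = []
    for a, b in zip(s_row, t_row):
        if a == "-" and b == "-":
            raise ValueError("a column cannot align two spaces")
        if a == "-":
            ops.append("I")   # space in S opposite a character of T: insert
        elif b == "-":
            ops.append("D")   # character of S opposite a space: delete
        elif a == b:
            ops.append("M")   # identical characters: match
        else:
            ops.append("R")   # different characters: replace
    return "".join(ops)


print(alignment_to_transcript("v-intner-", "wri-t-ers"))  # RIMDMDMMI
```

Applied to the example alignment, this reproduces the transcript from the earlier slide.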
Edit Distance Problem
• Input
  – 2 strings S and T
• Task
  – Output edit distance of S and T
  – Output optimal edit transcript
  – Output optimal alignment
• Solution method
  – Dynamic Programming
Definition of D(i,j)
• Let D(i,j) be the edit distance of S[1..i] and T[1..j]
  – The edit distance of the first i characters of S with the first j characters of T
  – Let |S| = n, |T| = m
• D(n,m) = edit distance of S and T
• We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m
Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = i
  – For 0 <= j <= m, D(0,j) = j
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + 1 (what does this mean?)
    • D(i,j-1) + 1 (what does this mean?)
    • D(i-1,j-1) + d(i,j) (what does this mean?)
  – d(i,j) = 0 if S(i) = T(j) and is 1 otherwise
What the various cases mean
• D(i,j) = min
  – D(i-1,j) + 1
    • Align S[1..i-1] with T[1..j] optimally
    • Match S(i) with a dash in T
  – D(i,j-1) + 1
    • Align S[1..i] with T[1..j-1] optimally
    • Match a dash in S with T(j)
  – D(i-1,j-1) + d(i,j)
    • Align S[1..i-1] with T[1..j-1] optimally
    • Match S(i) with T(j)
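The base case and the three-way minimum translate almost line for line into code. The following is a minimal sketch rather than a reference implementation; the function name `edit_distance_table` is an assumption, and Python's 0-based indexing means S(i) is written `s[i - 1]`.

```python
def edit_distance_table(s: str, t: str) -> list[list[int]]:
    """Fill the full (n+1) x (m+1) table D, where D[i][j] is the edit
    distance between the first i characters of s and the first j of t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete the i characters of s
    for j in range(m + 1):
        D[0][j] = j                      # insert the j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_ij = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # S(i) against a dash in T
                D[i][j - 1] + 1,         # a dash in S against T(j)
                D[i - 1][j - 1] + d_ij,  # S(i) against T(j)
            )
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

The result agrees with the five edit operations used in the vintner/writers example.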
Initialization: Base Case
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
         0  0  1  2  3  4  5  6  7
v        1  1
i        2  2
n        3  3
t        4  4
n        5  5
e        6  6
r        7  7
Row i=1
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
         0  0  1  2  3  4  5  6  7
v        1  1  1  2  3  4  5  6  7
i        2  2
n        3  3
t        4  4
n        5  5
e        6  6
r        7  7
Entry i=2, j=2
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
         0  0  1  2  3  4  5  6  7
v        1  1  1  2  3  4  5  6  7
i        2  2  2  ?
n        3  3
t        4  4
n        5  5
e        6  6
r        7  7
Entry i=2, j=3
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
         0  0  1  2  3  4  5  6  7
v        1  1  1  2  3  4  5  6  7
i        2  2  2  2  ?
n        3  3
t        4  4
n        5  5
e        6  6
r        7  7
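The two entries marked ? follow directly from the recurrence, using values already present in rows 1 and 2. A worked version, with S(2) = i, T(2) = r and T(3) = i, so that d(2,2) = 1 and d(2,3) = 0:

```latex
\begin{align*}
D(2,2) &= \min\{\, D(1,2)+1,\; D(2,1)+1,\; D(1,1)+d(2,2) \,\}
        = \min\{\, 2+1,\; 2+1,\; 1+1 \,\} = 2 \\
D(2,3) &= \min\{\, D(1,3)+1,\; D(2,2)+1,\; D(1,2)+d(2,3) \,\}
        = \min\{\, 3+1,\; 2+1,\; 2+0 \,\} = 2
\end{align*}
```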
Calculation methodologies
• Location of edit distance
  – D(n,m)
• The example calculated the table row by row
• Can also calculate column by column
• Can also use antidiagonals
• Key is to build from the upper left corner
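Any fill order works as long as the three neighbors of a cell (above, left, upper-left) are ready first. As an illustration, here is a sketch of the antidiagonal order; the function name `edit_distance_antidiagonal` is an assumption made for this example.

```python
def edit_distance_antidiagonal(s: str, t: str) -> int:
    """Edit distance computed by sweeping antidiagonals instead of rows."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Antidiagonal k holds the interior cells with i + j == k. Each cell
    # depends only on antidiagonals k-1 and k-2, which are already filled.
    for k in range(2, n + m + 1):
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            d_ij = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + d_ij)
    return D[n][m]


print(edit_distance_antidiagonal("vintner", "writers"))  # 5
```

Cells on the same antidiagonal do not depend on each other, which is why this order is often chosen when the cells are to be computed in parallel.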
Traceback
• Using the table to construct an optimal transcript
• Pointers in cell D(i,j)
  – Set a pointer from cell (i,j) to
    • cell (i, j-1) if D(i,j) = D(i, j-1) + 1
    • cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    • cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
  – Follow a path of pointers from (n,m) back to (0,0)
What the pointers mean
• horizontal pointer: cell (i,j) to cell (i, j-1)
  – Align T(j) with a space in S
  – Insert T(j) into S
• vertical pointer: cell (i,j) to cell (i-1, j)
  – Align S(i) with a space in T
  – Delete S(i) from S
• diagonal pointer: cell (i,j) to cell (i-1, j-1)
  – Align S(i) with T(j)
  – Replace S(i) with T(j), or match if S(i) = T(j)
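The pointers need not be stored: each one can be re-derived from the table values during the walk back. The sketch below recovers one optimal transcript this way. The name `optimal_transcript` and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions of this example; a different order may return a different cooptimal transcript.

```python
def optimal_transcript(s: str, t: str) -> str:
    """Return one optimal edit transcript (letters I, D, R, M) for s -> t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_ij = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + d_ij)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            ops.append("M" if s[i - 1] == t[j - 1] else "R")  # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                                   # vertical pointer
            i -= 1
        else:
            ops.append("I")                                   # horizontal pointer
            j -= 1
    return "".join(reversed(ops))  # the walk runs from (n,m) back to (0,0)


print(optimal_transcript("vintner", "writers"))
```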
Table and transcripts
• The pointers represent all optimal transcripts
• Theorem:
  – Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
  – Conversely, any optimal transcript is specified by such a path.
  – The correspondence between paths and transcripts is one to one.
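The theorem suggests a direct way to list every optimal transcript: follow all pointers out of each cell instead of just one. A sketch under that reading, with the hypothetical name `all_optimal_transcripts`; the number of paths can grow exponentially with the string lengths, so this is only meant for small examples.

```python
def all_optimal_transcripts(s: str, t: str) -> list[str]:
    """Enumerate every optimal transcript by following all pointers."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_ij = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + d_ij)

    def walk(i: int, j: int):
        """Yield the transcripts of all pointer paths from (i,j) to (0,0)."""
        if i == 0 and j == 0:
            yield ""
            return
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            op = "M" if s[i - 1] == t[j - 1] else "R"
            for prefix in walk(i - 1, j - 1):
                yield prefix + op
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            for prefix in walk(i - 1, j):
                yield prefix + "D"
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            for prefix in walk(i, j - 1):
                yield prefix + "I"

    return sorted(walk(n, m))


transcripts = all_optimal_transcripts("vintner", "writers")
print("RIMDMDMMI" in transcripts)  # True
```

The transcript from the earlier example appears in the list, as the theorem predicts for any transcript with the minimum number of operations.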
Running Time
• Initialization of table
  – O(n+m)
• Calculating table and pointers
  – O(nm)
• Traceback for one optimal transcript or optimal alignment
  – O(n+m)
Operation-Weight Edit Distance
• Consider S and T
• We can assign weights to the various operations
  – insertion/deletion of a character: cost d
  – substitution (replacement) of a character: cost r
  – matching: cost e
  – Previous case: d = r = 1, e = 0
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = i·d
  – For 0 <= j <= m, D(0,j) = j·d
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + d
    • D(i,j-1) + d
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = e if S(i) = T(j) and is r otherwise
  – Note: d(i,j) is the match/mismatch cost of cell (i,j), not the insertion/deletion cost d
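Only the constants change relative to the unit-cost recurrence, so the code changes very little. A minimal sketch; the function name `operation_weight_edit_distance` is an assumption, while the parameters `d`, `r` and `e` follow the slide.

```python
def operation_weight_edit_distance(s: str, t: str,
                                   d: float = 1, r: float = 1,
                                   e: float = 0) -> float:
    """Edit distance with insertion/deletion cost d, replacement cost r
    and match cost e. The defaults give the ordinary edit distance."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * d
    for j in range(m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_ij = e if s[i - 1] == t[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d,
                          D[i][j - 1] + d,
                          D[i - 1][j - 1] + d_ij)
    return D[n][m]


print(operation_weight_edit_distance("vintner", "writers"))  # 5
print(operation_weight_edit_distance("a", "b", d=1, r=3))    # 2
```

In the second call a replacement (cost 3) is more expensive than a deletion followed by an insertion (cost 2), so the optimum avoids it.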
Alphabet-Weight Edit Distance
• Define the weight of each possible substitution
  – r(a,b), where a is being replaced by b, for all a,b in the alphabet
  – For example, with DNA, maybe r(A,T) > r(A,G)
  – Likewise, the insertion/deletion weight I(a) may vary by character
• Operation-weight edit distance is a special case of this variation
• Weighted edit distance refers to this alphabet-weight setting
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, D(i,0) = Σ_{1 <= k <= i} I(S(k))
  – For 0 <= j <= m, D(0,j) = Σ_{1 <= k <= j} I(T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – D(i,j) = min
    • D(i-1,j) + I(S(i))
    • D(i,j-1) + I(T(j))
    • D(i-1,j-1) + d(i,j)
  – d(i,j) = r(S(i), T(j))
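In code, the weights I and r become two functions passed to the routine. The sketch below is illustrative only: the names `alphabet_weight_edit_distance`, `dna_indel` and `dna_sub` are hypothetical, and the DNA weights are invented for the example, chosen so that r(A,T) > r(A,G) as suggested above.

```python
from typing import Callable


def alphabet_weight_edit_distance(
    s: str, t: str,
    indel: Callable[[str], float],
    sub: Callable[[str, str], float],
) -> float:
    """Edit distance where indel(a) plays the role of I(a) and
    sub(a, b) the role of r(a, b), including the case a == b."""
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel(s[i - 1])   # running sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel(t[j - 1])   # running sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel(s[i - 1]),
                          D[i][j - 1] + indel(t[j - 1]),
                          D[i - 1][j - 1] + sub(s[i - 1], t[j - 1]))
    return D[n][m]


# Hypothetical DNA weights: substitutions within the purines {A, G} or
# within the pyrimidines {C, T} cost 1, all other substitutions cost 2.
PURINES = {"A", "G"}


def dna_sub(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 1.0 if (a in PURINES) == (b in PURINES) else 2.0


def dna_indel(a: str) -> float:
    return 2.0


print(alphabet_weight_edit_distance("A", "G", dna_indel, dna_sub))  # 1.0
print(alphabet_weight_edit_distance("A", "T", dna_indel, dna_sub))  # 2.0
```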
Measuring Similarity of S and T
• Definitions
  – Let Σ be the alphabet for strings S and T
  – Let Σ’ be the alphabet Σ with the character - added
  – For any two characters x,y in Σ’, s(x,y) denotes the value (or score) obtained by aligning x with y
  – For a given alignment A of S and T, let S’ and T’ denote the strings after the chosen insertion of spaces, and l their new length
  – The value of alignment A is Σ_{1 <= i <= l} s(S’(i),T’(i))
String Similarity Problem
• Input
  – 2 strings S and T
  – Scoring matrix s for alphabet Σ’
• Task
  – Output optimal alignment value of S and T
    • The alignment of S and T with maximal, not minimal, value
  – Output this alignment
Modified Recurrence Relation
• Base Case:
  – For 0 <= i <= n, V(i,0) = Σ_{1 <= k <= i} s(S(k),-)
  – For 0 <= j <= m, V(0,j) = Σ_{1 <= k <= j} s(-,T(k))
• Recursive Case:
  – 0 < i <= n, 0 < j <= m
  – V(i,j) = max
    • V(i-1,j) + s(S(i),-)
    • V(i,j-1) + s(-,T(j))
    • V(i-1,j-1) + s(S(i), T(j))
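The similarity recurrence has the same shape as the distance recurrence, with max in place of min and the scoring function s in place of the costs. A sketch follows; the names `alignment_value` and `simple_score` are assumptions, and the scores (+2 for a match, -1 for a mismatch or a space) are invented for illustration.

```python
from typing import Callable


def alignment_value(s: str, t: str,
                    score: Callable[[str, str], float]) -> float:
    """Value V(n,m) of an optimal global alignment of s and t, where
    score(x, y) is defined on the alphabet extended with '-'."""
    n, m = len(s), len(t)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    return V[n][m]


def simple_score(x: str, y: str) -> float:
    """Hypothetical scoring: +2 match, -1 mismatch, -1 against a space."""
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0


print(alignment_value("ab", "b", simple_score))  # 1.0
```

In the usage line, the best alignment puts a against a space (-1) and b against b (+2), for a total of 1.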
Longest Common Subsequence Problem
• Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
• The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
  – subsequence: characters need not be contiguous
  – different from a substring
• Can you use dynamic programming to solve the longest common subsequence problem?
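One possible answer, offered as a sketch rather than as the intended solution: treat the problem as string similarity with score 1 for a match and 0 for everything else, so that the alignment value counts matched characters. The function name `lcs` and the tie-breaking in the traceback are assumptions of this example.

```python
def lcs(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # L[i][j] = length of an lcs of the first i chars of s and first j of t.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Traceback from (n,m), collecting the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("vintner", "writers"))  # iter
```

For vintner and writers the only shared letters are i, t, e and r, each occurring once in vintner, so "iter" is the unique longest common subsequence.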
Computing alignments using linear space
• Hirschberg [1977]
• Suppose we only need the optimal similarity/distance value of S and T, without an alignment or transcript
• How can we conserve space?
  – Only save row i-1 when computing row i in the table
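Because row i depends only on row i-1, two rows are enough when only the value is needed. A minimal sketch for the unit-cost edit distance; the function name `edit_distance_two_rows` is an assumption.

```python
def edit_distance_two_rows(s: str, t: str) -> int:
    """Edit distance using O(m) space: only rows i-1 and i are kept."""
    m = len(t)
    prev = list(range(m + 1))            # row 0: D(0,j) = j
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * m             # D(i,0) = i
        for j in range(1, m + 1):
            d_ij = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # from the row above
                          curr[j - 1] + 1,   # from the left, same row
                          prev[j - 1] + d_ij)
        prev = curr                      # row i becomes row i-1
    return prev[m]


print(edit_distance_two_rows("vintner", "writers"))  # 5
```

The table needed for the traceback is discarded, which is why recovering an alignment in linear space takes the extra divide and conquer step.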
Linear space and an alignment
• Assume S has length 2n
• Divide and conquer approach
  – Compute value of optimal alignment of S[1..n] with all prefixes of T
    • Store row n only at end along with pointer values of row n
  – Compute value of optimal alignment of Sr[1..n] with all prefixes of Tr, where Sr and Tr are S and T reversed
    • Store only values in row n
• Find k such that
  – V(S[1..n],T[1..k]) + V(Sr[1..n],Tr[1..m-k])
  – is maximized over 0 <= k <= m
Illustration
[Figure, shown over several slides: the table for S[1..6] (rows 0 to 6, columns 0 to 18 of T) drawn above the table for Sr[1..6] (rows 6 to 0, columns 18 to 0 of Tr). Successive slides mark one candidate split point k each; the last slide carries no k label.]
• k=0, m-k=18: V(S[1..6], T[1..0]) and V(Sr[1..6], Tr[1..18])
• k=1, m-k=17: V(S[1..6], T[1..1]) and V(Sr[1..6], Tr[1..17])
• k=2, m-k=16: V(S[1..6], T[1..2]) and V(Sr[1..6], Tr[1..16])
• k=9, m-k=9: V(S[1..6], T[1..9]) and V(Sr[1..6], Tr[1..9])
• k=18, m-k=0: V(S[1..6], T[1..18]) and V(Sr[1..6], Tr[1..0])
Recursive Step
• Let k* be the k that maximizes
  – V(S[1..n],T[1..k]) + V(Sr[1..n],Tr[1..m-k])
• Record all steps on row n, including the one from row n-1 and the one to row n+1
• Recurse on the two subproblems
  – S[1..n-1] with T[1..j] where j <= k*
  – Sr[1..n] with Tr[1..q] where q <= m-k*
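The divide and conquer idea can be sketched in code. Note that this is the common textbook variant, which splits S in the middle, picks k*, and recurses directly on S[1..n] with T[1..k*] and on the remaining halves; it does not record the steps along row n as described above. The names `last_row`, `hirschberg` and `simple_score` are assumptions, and the scoring values are invented for illustration.

```python
from typing import Callable

Score = Callable[[str, str], float]


def simple_score(x: str, y: str) -> float:
    """Hypothetical scoring: +2 match, -1 mismatch, -1 against a space."""
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0


def last_row(s: str, t: str, score: Score) -> list[float]:
    """Values V(|s|, j) for 0 <= j <= |t|, keeping only two rows."""
    prev = [0.0]
    for b in t:
        prev.append(prev[-1] + score("-", b))
    for a in s:
        curr = [prev[0] + score(a, "-")]
        for j, b in enumerate(t, 1):
            curr.append(max(prev[j] + score(a, "-"),
                            curr[j - 1] + score("-", b),
                            prev[j - 1] + score(a, b)))
        prev = curr
    return prev


def hirschberg(s: str, t: str, score: Score) -> tuple[str, str]:
    """Return an optimal global alignment of s and t as two rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Either pair s with the best character of t, or with a space.
        p = max(range(len(t)), key=lambda q: score(s, t[q]) - score("-", t[q]))
        if score(s, t[p]) - score("-", t[p]) >= score(s, "-"):
            return "-" * p + s + "-" * (len(t) - p - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2
    left = last_row(s[:mid], t, score)                # prefixes of T
    right = last_row(s[mid:][::-1], t[::-1], score)   # prefixes of Tr
    k = max(range(len(t) + 1), key=lambda q: left[q] + right[len(t) - q])
    a1, b1 = hirschberg(s[:mid], t[:k], score)
    a2, b2 = hirschberg(s[mid:], t[k:], score)
    return a1 + a2, b1 + b2


top, bottom = hirschberg("vintner", "writers", simple_score)
print(top)
print(bottom)
```

The string slices keep the sketch short; an implementation that is careful about space would pass index ranges instead of copying substrings.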
Illustration
[Figure: the same two tables, rows 0 to 6 above rows 6 to 0, over columns 0 to 18.]