1
String Matching with Errors
Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee
2
In the following, we will present a problem related
to the notion of edit distance.
First, let us introduce edit distance.
3
In edit distance, there are three types of differences
between two strings X and Y:
Insertion: a symbol of Y is missing in X at a
corresponding position, with its cost being 1.
X: A - T
Y: A G T
Substitution: symbols at corresponding positions are
distinct, with its cost being 1.
X: A C C
Y: T C C
Deletion: a symbol of X is missing in Y at a
corresponding position, with its cost being 1.
X: G C A
Y: G - A
4
Given two strings X and Y, the edit distance
between X and Y is the minimum number of
insertions, deletions and substitutions needed to
transform Y to X.
5
String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X):
String X: A T G A A - - T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C - C C T - G
EDIT(X, Y) = 7 (2 insertions, 2 deletions and 3 substitutions).
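The operation counts of this alignment can be checked mechanically. The following is a minimal sketch, assuming the gap is written as an ASCII "-" and using the insertion/deletion convention of the definitions (a gap in X is an insertion, a gap in Y is a deletion); the function name `count_operations` is hypothetical and not part of the slides.

```python
def count_operations(x_aligned, y_aligned, gap="-"):
    """Count insertions, deletions and substitutions in a gapped alignment."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned strings must have the same length")
    counts = {"insertion": 0, "deletion": 0, "substitution": 0}
    for a, b in zip(x_aligned, y_aligned):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap:        # a symbol of Y is missing in X
            counts["insertion"] += 1
        elif b == gap:      # a symbol of X is missing in Y
            counts["deletion"] += 1
        elif a != b:        # distinct symbols at corresponding positions
            counts["substitution"] += 1
    return counts


x_aligned = "ATGAA--TCTTACCGCCTCG"
y_aligned = "ATGAGGCTCTGGCC-CCT-G"
counts = count_operations(x_aligned, y_aligned)
print(counts, sum(counts.values()))
```

The three counts sum to the cost of this particular alignment; the code does not prove that the alignment is optimal.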
6
Next, we will introduce a dynamic programming
method to compute the edit distance between
two strings X and Y.
7
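Dynamic Programming for Edit Distance:
\[
EDIT[i, j] = \min
\begin{cases}
EDIT[i-1, j] + 1 & \text{(Delete)} \\
EDIT[i, j-1] + 1 & \text{(Insert)} \\
EDIT[i-1, j-1] + \delta(x[i], y[j]) & \text{(Substitute)}
\end{cases}
\]
where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with
\[
EDIT[i, 0] = i, \quad EDIT[0, j] = j.
\]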
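The recurrence can be sketched in a few lines of Python. This is an illustrative sketch, not the speaker's code: the function name `edit_distance_table` is hypothetical, and the table is laid out with rows indexed by Y and columns by X, which is an assumption chosen to match the tables shown for X = abcabba and Y = cbabac.

```python
def edit_distance_table(x, y):
    """Fill the edit distance table; rows follow y, columns follow x."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):          # first column: EDIT[i, 0] = i
        table[i][0] = i
    for j in range(cols):          # first row: EDIT[0, j] = j
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # Delete
                table[i][j - 1] + 1,         # Insert
                table[i - 1][j - 1] + cost,  # Substitute or match
            )
    return table


table = edit_distance_table("abcabba", "cbabac")
for row in table:
    print(" ".join(str(value) for value in row))
print("EDIT(X, Y) =", table[-1][-1])
```

The bottom-right cell holds the edit distance of the two complete strings.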
8
Given X = abcabba, Y = cbabac
      a  b  c  a  b  b  a
   0  1  2  3  4  5  6  7
c  1
b  2
a  3
b  4
a  5
c  6
9
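Given X = abcabba and Y = cbabac, fill row 1 (symbol c of Y) from left to right.
EDIT[1, 1] = min(EDIT[0, 1] + 1, EDIT[1, 0] + 1, EDIT[0, 0] + 1) = min(2, 2, 1) = 1, since a and c differ.
Row 1 so far: 1 1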
10
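Given X = abcabba and Y = cbabac, continue with row 1 (symbol c of Y).
EDIT[1, 2] = min(EDIT[0, 2] + 1, EDIT[1, 1] + 1, EDIT[0, 1] + 1) = min(3, 2, 2) = 2, since b and c differ.
Row 1 so far: 1 1 2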
11
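Given X = abcabba and Y = cbabac, continue with row 1 (symbol c of Y).
EDIT[1, 3] = min(EDIT[0, 3] + 1, EDIT[1, 2] + 1, EDIT[0, 2] + 0) = min(4, 3, 2) = 2, since both symbols are c.
Row 1 so far: 1 1 2 2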
12
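Given X = abcabba and Y = cbabac, continue with row 1 (symbol c of Y).
EDIT[1, 4] = min(EDIT[0, 4] + 1, EDIT[1, 3] + 1, EDIT[0, 3] + 1) = min(5, 3, 4) = 3, since a and c differ.
Row 1 so far: 1 1 2 2 3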
13
Given X = abcabba, Y = cbabac
      a  b  c  a  b  b  a
   0  1  2  3  4  5  6  7
c  1  1  2  2  3  4  5  6
b  2  2  1  2  3  3  4  5
a  3  2  2  2  2  3  4  4
b  4  3  2  3  3  2  3  4
a  5  4  3  3  3  3  3  3
c  6  5  4  3  4  4  4  4
14
Given X = abcabba and Y = cbabac, EDIT(X, Y) = EDIT[6, 7] = 4.
Trace back from the bottom-right cell of the completed table:
EDIT[6, 7] = EDIT[5, 6] + 1 = 4: Substitute (a, c).
X: a
Y: c
15
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, next step in the same table:
EDIT[5, 6] = EDIT[4, 5] + 1 = 3: Substitute (b, a).
X: ba
Y: ac
16
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, next step in the same table:
EDIT[4, 5] = EDIT[3, 4] + 0 = 2: Match (b, b).
X: bba
Y: bac
17
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, next step in the same table:
EDIT[3, 4] = EDIT[2, 3] + 0 = 2: Match (a, a).
X: abba
Y: abac
18
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, next step in the same table:
EDIT[2, 3] = EDIT[2, 2] + 1 = 2: Insert (c).
X: cabba
Y: -abac
19
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, next step in the same table:
EDIT[2, 2] = EDIT[1, 1] + 0 = 1: Match (b, b).
X: bcabba
Y: b-abac
20
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Traceback, last step in the same table:
EDIT[1, 1] = EDIT[0, 0] + 1 = 1: Substitute (a, c).
X: abcabba
Y: cb-abac
The complete alignment uses 3 substitutions and 1 insertion, for a total cost of 4.
21
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
The same table contains other traceback paths of cost 4. For example:
X: a b c a b b a -
Y: c b - a b - a c
Operations, column by column: Substitute, Match, Insert, Match, Match, Insert, Match, Delete.
22
Given X = abcabba and Y = cbabac, EDIT(X, Y) = 4.
Another traceback path of cost 4 in the same table:
X: a b c a b b a -
Y: c b - a - b a c
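The traceback shown on the preceding slides can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `align` is hypothetical, the gap is written as an ASCII "-", and ties are broken by preferring the diagonal move, then Insert, then Delete. Other tie-breaking rules yield the other optimal alignments.

```python
def align(x, y, gap="-"):
    """Return one optimal alignment (x_aligned, y_aligned) and its cost."""
    rows, cols = len(y) + 1, len(x) + 1
    t = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        t[i][0] = i
    for j in range(cols):
        t[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            t[i][j] = min(t[i - 1][j] + 1, t[i][j - 1] + 1, t[i - 1][j - 1] + cost)

    # Walk back from the bottom-right cell to the top-left cell.
    i, j = len(y), len(x)
    top, bottom = [], []
    while i > 0 or j > 0:
        diagonal_cost = 0 if i > 0 and j > 0 and y[i - 1] == x[j - 1] else 1
        if i > 0 and j > 0 and t[i][j] == t[i - 1][j - 1] + diagonal_cost:
            top.append(x[j - 1]); bottom.append(y[i - 1])   # Match or Substitute
            i, j = i - 1, j - 1
        elif j > 0 and t[i][j] == t[i][j - 1] + 1:
            top.append(x[j - 1]); bottom.append(gap)        # Insert
            j -= 1
        else:
            top.append(gap); bottom.append(y[i - 1])        # Delete
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), t[len(y)][len(x)]


x_aligned, y_aligned, cost = align("abcabba", "cbabac")
print(x_aligned)
print(y_aligned)
print("cost:", cost)
```

With this tie-breaking order the sketch reproduces the first alignment traced on the slides, abcabba over cb-abac.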
23
The algorithm above computes the edit distance in
O(mn) time and O(mn) space, where n and m are the
lengths of the text and the pattern, respectively.
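As a quick check of this bound, note that the table has one cell for every pair of prefix lengths and that the recurrence fills each cell with a constant number of comparisons. A worked count, assuming the table layout of the example with m = 6 and n = 7:

\[
(m+1)(n+1) \cdot O(1) = O(mn), \qquad (6+1)(7+1) = 56 \text{ cells for } X = \texttt{abcabba},\ Y = \texttt{cbabac}.
\]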
24
Next, we will introduce the
"string matching with errors" problem.
25
The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.
Given: T = abcabba, P = cbabac
Find: S = cabba, with EDIT(S, P) = 3
P: cbabac
S: c-abba
For comparison, take another substring of T, K = bcabb, with the same T and P:
EDIT(K, P) = 4
P: -cbabac
K: bc-ab-b
26
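Dynamic Programming for the String Matching with Errors Problem:
\[
SE[i, j] = \min
\begin{cases}
SE[i-1, j] + 1 \\
SE[i, j-1] + 1 \\
SE[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
\]
where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with
\[
SE[0, 0] = 0, \quad SE[i, 0] = i, \quad SE[0, j] = 0.
\]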
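A minimal Python sketch of this recurrence follows. The function name `matching_table` is hypothetical, and the layout (rows indexed by the pattern, columns by the text) is an assumption chosen so that the zero boundary lies along the text. The only change from the edit distance computation is that the first row is left at 0.

```python
def matching_table(text, pattern):
    """Fill the SE table; rows follow the pattern, columns follow the text."""
    rows, cols = len(pattern) + 1, len(text) + 1
    se = [[0] * cols for _ in range(rows)]   # first row: SE[0, j] = 0
    for i in range(rows):                    # first column: SE[i, 0] = i
        se[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(
                se[i - 1][j] + 1,
                se[i][j - 1] + 1,
                se[i - 1][j - 1] + cost,
            )
    return se


se = matching_table("abcabba", "cbabac")
for row in se:
    print(" ".join(str(value) for value in row))
print("minimal EDIT(S, P) =", min(se[-1]))
```

The smallest value in the last row is the minimal edit distance between the pattern and any substring of the text.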
27
The only difference between EDIT[i, j] and SE[i, j] is the boundary condition on the first row: EDIT[0, j] = j for the edit distance problem, and SE[0, j] = 0 for the string matching with errors problem.
The dynamic programming approach for the edit distance problem:
\[
EDIT[i, j] = \min
\begin{cases}
EDIT[i-1, j] + 1 \\
EDIT[i, j-1] + 1 \\
EDIT[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
\]
where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with
\[
EDIT[i, 0] = i, \quad EDIT[0, j] = j.
\]
28
In the edit distance problem, we have EDIT[0, j]=j.
In the string matching with error problem, we set SE[0, j]=0.
29
T = abcabba
P = cbabac
      a  b  c  a  b  b  a
   0  0  0  0  0  0  0  0
c  1  1  1  0  1  1  1  1
b  2  2  1  1  1  1  1  2
a  3  2  2  2  1  2  2  1
b  4  3  2  3  2  1  2  2
a  5  4  3  3  3  2  2  2
c  6  5  4  3  4  3  3  3
A traceback path that starts in the bottom row at a cell of value 3 and ends in the top row, where SE[0, j] = 0, shows that there exists a substring S of T such that EDIT(S, P) = 3.
30
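We find the lowest value in the last row and trace
back from that cell to the top row.
The lowest value may occur in several cells, and a cell may have
several traceback paths, so the output may be several strings.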
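The procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `best_matches` is hypothetical, and for each minimal cell of the last row it follows only one traceback path (preferring the diagonal move, then the vertical move), so it reports one substring per end position rather than every possible one.

```python
def best_matches(text, pattern):
    """Return the minimal distance and one matching substring per end position."""
    rows, cols = len(pattern) + 1, len(text) + 1
    se = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        se[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(se[i - 1][j] + 1, se[i][j - 1] + 1, se[i - 1][j - 1] + cost)

    best = min(se[-1])
    matches = []
    for end in range(cols):
        if se[-1][end] != best:
            continue
        i, j = len(pattern), end
        while i > 0:                      # stop on reaching the top row
            cost = 0 if j > 0 and pattern[i - 1] == text[j - 1] else 1
            if j > 0 and se[i][j] == se[i - 1][j - 1] + cost:
                i, j = i - 1, j - 1       # diagonal move
            elif se[i][j] == se[i - 1][j] + 1:
                i -= 1                    # vertical move: pattern symbol unmatched
            else:
                j -= 1                    # horizontal move: text symbol unmatched
        matches.append((j, end, text[j:end]))
    return best, matches


best, matches = best_matches("abcabba", "cbabac")
print("minimal distance:", best)
for start, end, substring in matches:
    print(start, end, substring)
```

Each reported triple gives the start and end offsets in the text (end exclusive) and the corresponding substring.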
31
T = abcabba
P = cbabac
      a  b  c  a  b  b  a
   0  0  0  0  0  0  0  0
c  1  1  1  0  1  1  1  1
b  2  2  1  1  1  1  1  2
a  3  2  2  2  1  2  2  1
b  4  3  2  3  2  1  2  2
a  5  4  3  3  3  2  2  2
c  6  5  4  3  4  3  3  3
S = cabba
T: abc-abba
P:   cbabac
32
T = abcabba
P = cbabac
Edit distance table for S = cabba and P:
      c  a  b  b  a
   0  1  2  3  4  5
c  1  0  1  2  3  4
b  2  1  1  1  2  3
a  3  2  1  2  2  2
b  4  3  2  1  2  3
a  5  4  3  2  2  2
c  6  5  4  3  3  3
EDIT(S, P) = 3
S: c-abba
P: cbabac
33
T = abcabba
P = cbabac
A second traceback from SE[6, 7] = 3 in the same SE table gives S = cabba again, with a different alignment:
S: cabba-
P: cbabac
EDIT(S, P) = 3
34
T = abcabba
P = cbabac
A traceback from SE[6, 5] = 3 in the same SE table gives S = cab:
S: c-ab--
P: cbabac
EDIT(S, P) = 3
35
T = abcabba
P = cbabac
A traceback from SE[6, 3] = 3 in the same SE table gives S = abc:
S: --ab-c
P: cbabac
EDIT(S, P) = 3
36
References
For edit distance computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.
For string matching with errors:
[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
37
Thank you