String Matching with Errors

The Theory and Computation of Evolutionary Distances: Pattern Recognition, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359~373.

Speaker: C. C. Lin

Adviser: R. C. T. Lee

In the following, we will present a problem related to the notion of edit distance.

First, let us introduce edit distance.

In edit distance, there are three types of differences between two strings X and Y:

Insertion: a symbol of X is missing in Y at the corresponding position, with its cost being 1.

X: G C A
Y: G – A

Substitution: symbols at corresponding positions are distinct, with its cost being 1.

X: A C C
Y: T C C

Deletion: a symbol of Y is missing in X at the corresponding position, with its cost being 1.

X: A – T
Y: A G T

Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y to X.

String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG

Transformation (from string Y to string X):

String X: A T G A A – – T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C – C C T – G

EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 substitutions).
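The cost of a given alignment can be checked mechanically by counting the columns in which the two rows differ. The following is a minimal sketch, not part of the original slides; the function name `alignment_cost` and the use of an ASCII `-` for the gap symbol are our own assumptions. It only scores the alignment shown above and does not prove that the alignment is optimal.

```python
def alignment_cost(row_x: str, row_y: str, gap: str = "-") -> int:
    """Count the unit-cost columns (gap or mismatch) of a gapped alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    cost = 0
    for a, b in zip(row_x, row_y):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a != b:  # a gap against a symbol, or two distinct symbols
            cost += 1
    return cost


# The alignment of string X and string Y shown above (ASCII gaps).
row_x = "ATGAA--TCTTACCGCCTCG"
row_y = "ATGAGGCTCTGGCC-CCT-G"
print(alignment_cost(row_x, row_y))  # 7
```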

Next, we will introduce a dynamic programming method to compute the edit distance between two strings X and Y.

Dynamic Programming for Edit Distance:

\[
EDIT[i, j] = \min
\begin{cases}
EDIT[i, j-1] + 1 & \text{(Delete)} \\
EDIT[i-1, j] + 1 & \text{(Insert)} \\
EDIT[i-1, j-1] + \delta(x[i], y[j]) & \text{(Substitute)}
\end{cases}
\]

where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with boundary conditions \(EDIT[i, 0] = i\) and \(EDIT[0, j] = j\).
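The recurrence can be sketched in a few lines of Python. This is an illustrative sketch rather than the speaker's code: the names `edit_table` and `edit_distance` are our own, and we assume the table layout used in the worked example of these slides (one row per symbol of Y, one column per symbol of X).

```python
def edit_table(x: str, y: str) -> list[list[int]]:
    """Fill the dynamic programming table; rows follow y, columns follow x."""
    rows, cols = len(y), len(x)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows + 1):
        table[r][0] = r  # boundary: r symbols against the empty string
    for c in range(cols + 1):
        table[0][c] = c
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            mismatch = 0 if x[c - 1] == y[r - 1] else 1
            table[r][c] = min(
                table[r - 1][c] + 1,             # skip a symbol of y
                table[r][c - 1] + 1,             # skip a symbol of x
                table[r - 1][c - 1] + mismatch,  # match or substitute
            )
    return table


def edit_distance(x: str, y: str) -> int:
    """The edit distance is the bottom-right entry of the table."""
    return edit_table(x, y)[-1][-1]


print(edit_distance("abcabba", "cbabac"))  # 4
```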

Given

X=abcabba

Y=cbabac

The first row and the first column are initialized from the boundary conditions:

    a b c a b b a
  0 1 2 3 4 5 6 7
c 1
b 2
a 3
b 4
a 5
c 6

The remaining entries are filled in row by row, from left to right. The row for the first symbol of Y (c) becomes 1 1 2 2 3 4 5 6, and the completed table is:

    a b c a b b a
  0 1 2 3 4 5 6 7
c 1 1 2 2 3 4 5 6
b 2 2 1 2 3 3 4 5
a 3 2 2 2 2 3 4 4
b 4 3 2 3 3 2 3 4
a 5 4 3 3 3 3 3 3
c 6 5 4 3 4 4 4 4

Given

X=abcabba

Y=cbabac

EDIT(X, Y)=4 is the entry in the bottom-right corner of the completed table. Tracing back from this entry to the top-left corner recovers an alignment, one column at a time, from right to left:

Aligned part of X    Aligned part of Y    Operation
a                    c                    Substitute
ba                   ac                   Substitute
bba                  bac                  Match
abba                 abac                 Match
cabba                –abac                Insert
bcabba               b–abac               Match
abcabba              cb–abac              Substitute

The resulting alignment is:

X: a b c a b b a
Y: c b – a b a c

Other traceback paths give different alignments with the same cost, EDIT(X, Y)=4:

X: a b c a b b a –
Y: c b – a b – a c

(Substitute, Match, Insert, Match, Match, Insert, Match, Delete)

X: a b c a b b a –
Y: c b – a – b a c
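The traceback can also be sketched in code. The following is a minimal illustration under our own assumptions: the function name `align`, the ASCII `-` as gap symbol, and the tie-breaking order (diagonal move first, then a gap in Y, then a gap in X) are not taken from the slides. With this tie-breaking the sketch returns one of the optimal alignments; other orders may return the others.

```python
def align(x: str, y: str) -> tuple[str, str]:
    """Return one optimal alignment of x and y as two gapped strings."""
    rows, cols = len(y), len(x)
    t = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows + 1):
        t[r][0] = r
    for c in range(cols + 1):
        t[0][c] = c
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            t[r][c] = min(t[r - 1][c] + 1,
                          t[r][c - 1] + 1,
                          t[r - 1][c - 1] + (x[c - 1] != y[r - 1]))

    out_x, out_y = [], []
    r, c = rows, cols
    while r > 0 or c > 0:
        if r > 0 and c > 0 and t[r][c] == t[r - 1][c - 1] + (x[c - 1] != y[r - 1]):
            out_x.append(x[c - 1])  # match or substitute
            out_y.append(y[r - 1])
            r, c = r - 1, c - 1
        elif c > 0 and t[r][c] == t[r][c - 1] + 1:
            out_x.append(x[c - 1])  # gap in Y (labelled Insert on the slides)
            out_y.append("-")
            c -= 1
        else:
            out_x.append("-")       # gap in X (labelled Delete on the slides)
            out_y.append(y[r - 1])
            r -= 1
    return "".join(reversed(out_x)), "".join(reversed(out_y))


aligned_x, aligned_y = align("abcabba", "cbabac")
print(aligned_x)  # abcabba
print(aligned_y)  # cb-abac
```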

The above algorithm computes the edit distance in O(mn) time and O(mn) space, where n and m are the lengths of the two strings (later, the text and the pattern, respectively).
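If only the distance is needed and no traceback is performed, each row of the table depends only on the row before it, so two rows suffice. The sketch below illustrates this; the function name `edit_distance_two_rows` is our own, and swapping the strings so that the shorter one indexes the row is an optional choice that relies on the distance being symmetric under unit costs. Recovering an alignment by traceback still requires the full table.

```python
def edit_distance_two_rows(x: str, y: str) -> int:
    """Edit distance with O(min(m, n)) extra space; no alignment is recovered."""
    if len(x) < len(y):
        x, y = y, x  # keep the shorter string along the row
    previous = list(range(len(y) + 1))  # row for the empty prefix of x
    for i, xi in enumerate(x, start=1):
        current = [i]  # boundary entry of the new row
        for j, yj in enumerate(y, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (xi != yj)))
        previous = current
    return previous[-1]


print(edit_distance_two_rows("abcabba", "cbabac"))  # 4
```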

In the following, we will introduce the main topic, called the “string matching with errors” problem.

The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.

Given: T=abcabba

P=cbabac

Find: S=cabba

EDIT(S, P)=3

P=cbabac
S=c–abba

By comparison, another substring of T, K=bcabb, has EDIT(K, P)=4:

P=–cbabac
K=bc–ab–b
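The definition can be checked directly by brute force: compute the edit distance between P and every substring of T and keep the minimum. This is only a sketch for small inputs, with names of our own choosing (`edit_distance`, `best_substrings_brute_force`); it runs far slower than the dynamic programming method and serves only to make the problem statement concrete.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    previous = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        current = [i]
        for j, yj in enumerate(y, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (xi != yj)))
        previous = current
    return previous[-1]


def best_substrings_brute_force(text: str, pattern: str) -> tuple[int, set[str]]:
    """Return the minimal EDIT(S, P) over all substrings S of text, and those S."""
    best, winners = len(pattern), {""}  # the empty substring costs len(pattern)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            s = text[start:end]
            d = edit_distance(s, pattern)
            if d < best:
                best, winners = d, {s}
            elif d == best:
                winners.add(s)
    return best, winners


best, winners = best_substrings_brute_force("abcabba", "cbabac")
print(best)  # 3
```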

Dynamic Programming for the String Matching with Errors Problem:

\[
SE[i, j] = \min
\begin{cases}
SE[i, j-1] + 1 \\
SE[i-1, j] + 1 \\
SE[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
\]

where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with boundary conditions \(SE[0, 0] = 0\), \(SE[i, 0] = i\) and \(SE[0, j] = 0\).

The difference between EDIT[i, j] and SE[i, j] is that EDIT[0, j]=j for the edit distance problem, whereas SE[0, j]=0 for the string matching with errors problem.

The dynamic programming approach for the edit distance problem:

\[
EDIT[i, j] = \min
\begin{cases}
EDIT[i, j-1] + 1 \\
EDIT[i-1, j] + 1 \\
EDIT[i-1, j-1] + \delta(x[i], y[j])
\end{cases}
\]

where \(\delta(x[i], y[j]) = 0\) if \(x[i] = y[j]\), and \(\delta(x[i], y[j]) = 1\) otherwise, with boundary conditions \(EDIT[i, 0] = i\) and \(EDIT[0, j] = j\).

In the edit distance problem, we have EDIT[0, j]=j.

In the string matching with errors problem, we set SE[0, j]=0.
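The effect of the changed boundary condition can be sketched as follows. The function name `se_table` is our own, and we assume the pattern indexes the rows and the text indexes the columns, so that the all-zero top row lets a match begin at any position of the text at no cost.

```python
def se_table(text: str, pattern: str) -> list[list[int]]:
    """Fill the string-matching-with-errors table; rows follow the pattern."""
    rows, cols = len(pattern), len(text)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]  # top row stays 0
    for r in range(rows + 1):
        table[r][0] = r  # first column as in the edit distance table
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            mismatch = 0 if text[c - 1] == pattern[r - 1] else 1
            table[r][c] = min(table[r - 1][c] + 1,
                              table[r][c - 1] + 1,
                              table[r - 1][c - 1] + mismatch)
    return table


table = se_table("abcabba", "cbabac")
print(table[-1])       # [6, 5, 4, 3, 4, 3, 3, 3]
print(min(table[-1]))  # 3
```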

T=abcabba

P=cbabac

    a b c a b b a
  0 0 0 0 0 0 0 0
c 1 1 1 0 1 1 1 1
b 2 2 1 1 1 1 1 2
a 3 2 2 2 1 2 2 1
b 4 3 2 3 2 1 2 2
a 5 4 3 3 3 2 2 2
c 6 5 4 3 4 3 3 3

The traceback path starts at the bottom row with value 3 and ends at the top row, where SE[0, j]=0; this shows that there exists a substring S in T such that EDIT(S, P)=3.

We find the lowest value in the last row and trace back from that entry to the top row.

Since the lowest value may occur more than once, and an entry may have more than one traceback path, our output may be several strings.
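This procedure can be sketched as follows. The names `approximate_matches` and its return format are our own assumptions, as is the tie-breaking order of the traceback (diagonal first, then upward, then leftward); the sketch reports one substring per minimal entry of the last row, whereas a different tie-breaking order could report other substrings of the same cost.

```python
def approximate_matches(text: str, pattern: str) -> tuple[int, list[tuple[int, int, str]]]:
    """Return the minimal cost and one (start, end, substring) per minimal end."""
    rows, cols = len(pattern), len(text)
    t = [[0] * (cols + 1) for _ in range(rows + 1)]  # SE[0, j] = 0
    for r in range(rows + 1):
        t[r][0] = r
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            t[r][c] = min(t[r - 1][c] + 1,
                          t[r][c - 1] + 1,
                          t[r - 1][c - 1] + (text[c - 1] != pattern[r - 1]))

    best = min(t[rows])
    matches = []
    for end in range(cols + 1):
        if t[rows][end] != best:
            continue
        r, c = rows, end
        while r > 0:  # walk back until the top row is reached
            if c > 0 and t[r][c] == t[r - 1][c - 1] + (text[c - 1] != pattern[r - 1]):
                r, c = r - 1, c - 1
            elif t[r][c] == t[r - 1][c] + 1:
                r -= 1
            else:
                c -= 1
        matches.append((c, end, text[c:end]))  # the substring is text[c:end]
    return best, matches


best, matches = approximate_matches("abcabba", "cbabac")
print(best)                        # 3
print([s for _, _, s in matches])  # ['abc', 'cab', 'cabb', 'cabba']
```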

T=abcabba

P=cbabac

    a b c a b b a
  0 0 0 0 0 0 0 0
c 1 1 1 0 1 1 1 1
b 2 2 1 1 1 1 1 2
a 3 2 2 2 1 2 2 1
b 4 3 2 3 2 1 2 2
a 5 4 3 3 3 2 2 2
c 6 5 4 3 4 3 3 3

S=cabba

T: a b c – a b b a
P:     c b a b a c

T=abcabba

P=cbabac

The edit distance table for S=cabba and P=cbabac:

    c a b b a
  0 1 2 3 4 5
c 1 0 1 2 3 4
b 2 1 1 1 2 3
a 3 2 1 2 2 2
b 4 3 2 1 2 3
a 5 4 3 2 2 2
c 6 5 4 3 3 3

EDIT(S, P)=3

S: c – a b b a
P: c b a b a c

T=abcabba

P=cbabac

Other traceback paths in the same table give further alignments, each with EDIT(S, P)=3:

S: c a b b a –
P: c b a b a c

S: c – a b – –
P: c b a b a c

S: – – a b – c
P: c b a b a c

References

For edit distance computation:

[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.

For string matching with errors:

[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.

Thank you