String Matching Algorithms Based upon the Uniqueness Property

Advisor: Prof. R. C. T. Lee
Speaker: C. W. Lu

C. W. Lu and R. C. T. Lee, 2007, String Matching Algorithms Based upon the Uniqueness Property, The 24th Workshop on Combinatorial Mathematics and Computation Theory, pp. 385-392.



• String matching problem
  – Given a text string T of length n and a pattern string P of length m.
  – Find all occurrences of P in T.
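As a point of reference for the problem statement, a brute-force matcher can be sketched as follows. This is only a baseline under assumed names (`find_all` is hypothetical and not part of the presented algorithms); it compares P with every window of T.

```python
from typing import List


def find_all(text: str, pattern: str) -> List[int]:
    """Return every 0-based start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Try every window of length m and compare it with the pattern.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]
```

The algorithms in the rest of the talk aim to report the same positions while skipping most of these windows.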

Rule 1: The Suffix to Prefix Rule

• Suppose u is the longest suffix of the window that is also a prefix of P. Then we can move P so that the prefix u of P is aligned with the suffix u of the window.

[Figure: (a) the window of T ends with u, and P starts with u; (b) P moved so that its prefix u lies under the suffix u of the window.]
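Rule 1 can be sketched in a few lines. The function name `suffix_prefix_shift` is hypothetical, and the sketch assumes the window has the same length as P and that only overlaps shorter than P are considered, so that the move is always at least one step.

```python
def suffix_prefix_shift(window: str, pattern: str) -> int:
    """Number of steps to move P so that its prefix u meets the suffix u of the window."""
    m = len(pattern)
    # Try overlaps from the longest (m - 1) down to 1 (assumption: len(u) < m).
    for k in range(min(len(window), m) - 1, 0, -1):
        if window.endswith(pattern[:k]):
            return m - k
    # No suffix of the window is a prefix of P: move P past the whole window.
    return m
```

For instance, with the made-up window "xxcab" and pattern "abcab", the longest overlap is "ab", so the sketch returns a move of 3 steps.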


The Uniqueness Property of a String

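• A substring V of P is a unique substring if V occurs in P only once.

• When V matches some substring of T, we can move P in such a way that a prefix u of P is aligned with a suffix u of V.

[Figure: P before and after the move; V in P is matched against T, and after the move the prefix u of P is aligned with the suffix u of the matched V.]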
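The uniqueness test itself is easy to sketch. The helper names `occurrences` and `is_unique` are hypothetical; overlapping occurrences are counted, which is an assumption the slides do not spell out.

```python
from typing import List


def occurrences(p: str, v: str) -> List[int]:
    """0-based start positions of v in p, counting overlapping occurrences."""
    return [s for s in range(len(p) - len(v) + 1) if p.startswith(v, s)]


def is_unique(p: str, v: str) -> bool:
    """True if the non-empty string v occurs in p exactly once."""
    return len(v) > 0 and len(occurrences(p, v)) == 1
```

With the pattern catagtagcct, `is_unique` accepts "cc" and rejects "ta", which occurs at positions 2 and 5.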

Example

P = c a t a g t a g c c t

Suppose we use the substring “cc” as the unique substring.

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
T   a  c  g  c  c  g  c  g  c  c  c  g  c  g  c  t  c  a  a  a
P   c  a  t  a  g  t  a  g  c  c  t
    0  1  2  3  4  5  6  7  8  9 10

Algorithm 1: The Longest Prefix with Unique Suffix Matching Algorithm

• We further relax the uniqueness requirement: the substring does not have to be unique in the entire pattern P. It suffices that the substring is unique in a prefix of P.

• Therefore, we only have to find the longest prefix of P that has a unique suffix.

Example

P = CACTAGCCACTCTC

The substring TC occurs twice in P, but it is unique in the prefix CACTAGCCACTC.

T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…
P :     CACTAGCCACTCTC
P :                CACTAGCCACTCTC

Move P 11 steps.

Example

P = CACTAGCCACTCTC

The substring G is also unique in the prefix CACTAG.

T : CTAGCGTATGCCAGTCACGATCGAGCAGGCTAC…
P :     CACTAGCCACTCTC
P :           CACTAGCCACTCTC

Move P 6 steps.
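The two moves in these examples (11 steps for TC, 6 steps for G) can be reproduced with a small sketch. The function name `unique_suffix_shift` and its 0-based `end`/`size` parameters are assumptions made for illustration; the move is computed as the prefix length minus the longest proper suffix of the unique substring that is also a prefix of P.

```python
from typing import Optional


def unique_suffix_shift(p: str, end: int, size: int) -> Optional[int]:
    """Steps P may move when v = p[end-size+1 .. end] matches the text.

    Returns None if v is not unique within the prefix p[0 .. end].
    """
    start = end - size + 1
    if size < 1 or start < 0 or end >= len(p):
        return None
    v = p[start:end + 1]
    prefix = p[:end + 1]
    hits = [s for s in range(len(prefix) - size + 1) if prefix.startswith(v, s)]
    if hits != [start]:
        return None  # v also occurs earlier in the prefix
    # Longest proper suffix of v that is also a prefix of P.
    for k in range(size - 1, 0, -1):
        if p.startswith(v[size - k:]):
            return (end + 1) - k
    return end + 1
```

Called on CACTAGCCACTCTC, the sketch gives 11 for TC ending at position 11 and 6 for G at position 5, and it rejects the TC ending at position 13 because TC already occurs earlier.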

In the examples above, using the unique substring TC, we can move P 11 steps if TC matches TC in T; using the unique substring G, we can move P 6 steps if G matches G in T.

P = CACTAGCCACTCTC

Is the unique substring TC better than the unique substring G?

• Note that the algorithm is efficient when the unique substring matches in T many times.

• In general, assuming an alphabet of size 4 with equally likely characters, the probability that TC in P matches the aligned characters of T is 1/16, and the probability that G in P matches the aligned character of T is 1/4.

• Thus, the size of the unique substring is also important.

• For each time TC in P matches TC in T and moves P by 11 steps, G in P may match G in T four times, moving P by 6 steps each time. So we expect the substring G to be better than the substring TC in general.

P = CACTAGCCACTCTC

• We now define a ratio to determine which substring is better.

• Let Σ be the alphabet.

  $\sigma = \dfrac{\text{steps of moving } P}{|\Sigma|^{\text{size of substring}}}$

• The larger σ is, the better the efficiency that can be achieved in the searching phase.
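To make the ratio concrete, the following sketch evaluates it for the two candidates discussed earlier, TC (11 steps, size 2) and G (6 steps, size 1), with an alphabet of size 4. The function name `sigma` is only illustrative.

```python
def sigma(steps: int, size: int, alphabet_size: int) -> float:
    """Ratio of the move length to |alphabet|^(size of the unique substring)."""
    return steps / alphabet_size ** size


sigma_tc = sigma(11, 2, 4)  # 11 / 16 = 0.6875
sigma_g = sigma(6, 1, 4)    # 6 / 4 = 1.5
```

G obtains the larger ratio, which agrees with the expectation stated above that G is the better choice in general.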

Preprocessing Phase

P = CAGACGACCCCAACAGC

Σ = {A, C, G, T}, |Σ| = 4.

Find the longest prefix of P that has a unique suffix of size one.

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
T   A  C  G  C  C  G  C  G  C  C  C  G  C  G  C  T  C  A  A  A  …
P   C  A  G  A  C  G  A  C  C  C  C  A  A  C  A  G  C
P            C  A  G  A  C  G  A  C  C  C  C  A  A  C  A  G  C

$\sigma = \dfrac{\text{steps of moving } P}{|\Sigma|^{\text{size of substring}}} = \dfrac{3}{4^{1}} = \dfrac{3}{4}$

Preprocessing Phase

• We have found a unique substring of size 1, and we can use it to move P 3 steps.

• Next, we try to find a unique substring of size 2 that lets us move P more than 3*4 steps.

• Thus, we only consider the substrings of p12p13…p16.

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
T   A  C  G  C  C  G  C  G  C  C  C  G  C  G  C  G  C  A  A  A  …
P   C  A  G  A  C  G  A  C  C  C  C  A  A  C  A  G  C
P                                                   C  A  G  A  C  G  …

$\sigma = \dfrac{\text{steps of moving } P}{|\Sigma|^{\text{size of substring}}} = \dfrac{16}{4^{2}} = 1$
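The preprocessing phase can be sketched as a brute-force search over all candidate substrings, keeping the one with the largest ratio. This is an illustrative sketch rather than the authors' procedure: the names `shift_for` and `best_unique_substring` are hypothetical, positions are 0-based, and no attempt is made to be efficient beyond stopping once longer substrings can no longer win.

```python
from typing import Optional, Tuple


def shift_for(p: str, end: int, size: int) -> Optional[int]:
    """Move length for v = p[end-size+1 .. end], or None if v is not unique in p[0 .. end]."""
    start = end - size + 1
    v, prefix = p[start:end + 1], p[:end + 1]
    hits = [s for s in range(len(prefix) - size + 1) if prefix.startswith(v, s)]
    if hits != [start]:
        return None
    for k in range(size - 1, 0, -1):  # longest proper suffix of v that is a prefix of p
        if p.startswith(v[size - k:]):
            return (end + 1) - k
    return end + 1


def best_unique_substring(p: str, alphabet_size: int) -> Optional[Tuple[int, int, int, float]]:
    """Return (start, end, steps, ratio) of the candidate with the largest ratio."""
    best = None
    for size in range(1, len(p) + 1):
        # Even the largest possible move, len(p), cannot beat the current best.
        if best is not None and len(p) / alphabet_size ** size <= best[3]:
            break
        for end in range(size - 1, len(p)):
            steps = shift_for(p, end, size)
            if steps is None:
                continue
            ratio = steps / alphabet_size ** size
            if best is None or ratio > best[3]:
                best = (end - size + 1, end, steps, ratio)
    return best
```

For P = CAGACGACCCCAACAGC and an alphabet of size 4, the sketch selects the substring at positions 15 to 16 (GC) with a move of 16 steps and a ratio of 1, matching the values on the slide.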

Searching Phase

T   … C G C C G C G C C C G C G C G C A A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P   C  A  G  A  C  G  A  C  C  C  C  A  A  C  A  G  C

If the unique substring mismatches, move P one step.

[Figure: P under T before and after the move. Move 1 step.]

Searching Phase

T   … C G C C G C G C C C G C G C G C A A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
P   C  A  G  A  C  G  A  C  C  C  C  A  A  C  A  G  C

If the unique substring GC matches GC in T, move P 16 steps.

[Figure: P under T before and after the move. Move 16 steps.]
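The searching phase described on these two slides can be sketched as a loop that probes only the unique substring. The function name and parameters are hypothetical; the position of the unique substring and its move length are assumed to come from the preprocessing phase. The slides do not say where a full comparison of P takes place, so this sketch assumes it happens whenever the unique substring matches, just before P is moved.

```python
from typing import List


def search_with_unique_substring(text: str, pattern: str,
                                 start: int, end: int, steps: int) -> List[int]:
    """Report all occurrences of pattern, probing only pattern[start .. end] first."""
    n, m = len(text), len(pattern)
    v = pattern[start:end + 1]
    found, s = [], 0
    while s + m <= n:
        if text.startswith(v, s + start):
            # Unique substring matches: verify the whole window, then make the long move.
            if text.startswith(pattern, s):
                found.append(s)
            s += steps
        else:
            # Unique substring mismatches: move P one step.
            s += 1
    return found
```

With P = CAGACGACCCCAACAGC, the unique substring GC at positions 15 to 16 and a move of 16 steps, most windows are dismissed after comparing two characters.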

• As discussed above, the size of the unique substring is important.

• In the following, we introduce another algorithm, which uses a unique substring of size one.

Algorithm 2: Longest Substring with Unique Character Matching Algorithm

• Let x be any character of T in the window. For any meaningful matching of P with T after a move, we must find the same x in P to the left of the position aligned with x in T.

[Figure: (a) x in T within the window above P; (b) P moved so that an x in P is aligned with x in T.]

• In the preprocessing phase, we try to find the longest substring p’ in P whose last character x occurs in p’ only once. That is,

  $p' = p_i p_{i+1} \dots p_j$

  and pj occurs in p’ only once.

[Figure: (a) i = 1: p’ is a prefix of P ending with x. (b) i > 1: p’ ends with x and is immediately preceded by another x in P.]

• If the unique character x matches x in T, we can move P |p’| steps.

[Figure: (a) i = 1: P is moved |p’| steps, past the matched x in T. (b) i > 1: P is moved |p’| steps so that the x preceding p’ is aligned with the matched x in T.]

Example

    0  1  2  3  4  5  6  7  8  9 10 11 12 13
P   C  A  C  T  A  G  C  C  A  C  T  C  T  C

In this example, we find the longest substring p4p5…p10 with a unique character p10.

If the character p10 matches T, we can move P 7 steps.
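One way to sketch this preprocessing step is a single left-to-right scan that remembers where each character was last seen. The function name is hypothetical, positions are 0-based, and the pattern is assumed to be non-empty.

```python
from typing import Dict, Tuple


def longest_unique_char_substring(p: str) -> Tuple[int, int]:
    """Return (i, j) such that p[i .. j] is a longest substring whose last
    character p[j] occurs in it only once."""
    last_seen: Dict[str, int] = {}
    best = (0, 0)
    for j, ch in enumerate(p):
        # The substring may start right after the previous occurrence of ch.
        i = last_seen.get(ch, -1) + 1
        if j - i > best[1] - best[0]:
            best = (i, j)
        last_seen[ch] = j
    return best
```

For P = CACTAGCCACTCTC the sketch returns positions 4 and 10, that is, the substring p4p5…p10 of length 7 from the example.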

Searching Phase

T   … C G C C T C G C T C G C G T G C T A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13
P   C  A  C  T  A  G  C  C  A  C  T  C  T  C

If p10 mismatches, move P one step.

[Figure: P under T before and after the move. Move 1 step.]

Searching Phase

T   … C G C C T C G C T C G C G T G C T A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13
P   C  A  C  T  A  G  C  C  A  C  T  C  T  C

If p10 matches T, move P 7 steps.

[Figure: P under T before and after the move. Move 7 steps.]
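A possible sketch of this searching phase follows. The function name and the 0-based parameters `i` and `j` (the ends of p’) are hypothetical, and, as an assumption not stated on the slides, the whole window is verified whenever the probed character matches.

```python
from typing import List


def search_with_unique_char(text: str, pattern: str, i: int, j: int) -> List[int]:
    """Report all occurrences of pattern, probing only the character pattern[j]."""
    n, m = len(text), len(pattern)
    steps = j - i + 1  # |p'|
    found, s = [], 0
    while s + m <= n:
        if text[s + j] == pattern[j]:
            # Probe matches: verify the window, then move P by |p'| steps.
            if text.startswith(pattern, s):
                found.append(s)
            s += steps
        else:
            # Probe mismatches: move P one step.
            s += 1
    return found
```

The move of |p’| steps is safe because any shorter move would place a character of p’ other than its last one under the matched x, and none of those characters equals x.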

Algorithm 3: The Unique Pairwise Substring Algorithm

• The substring pipi+1…pj-1pj is called a unique pairwise substring if it occurs in the prefix p1p2…pj-1pj of P exactly once, and no substring pkpk+1…pk+j-i exists in p1p2…pj-1 such that pk = pi and pk+j-i = pj.

Example

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
P   C  A  C  T  C  A  G  C  C  A  C  T  C  G  C

The substring TCG (p11p12p13) is a unique pairwise substring because no pkpk+1pk+2 exists in p1p2…p12 such that pk = p11 = T and pk+2 = p13 = G.

The substring CAC (p8p9p10) is not a unique pairwise substring because there exists a substring p2p3p4 in p1p2…p9 such that p2 = p8 = C and p4 = p10 = C.
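The definition can be checked with a short sketch. The function name is hypothetical and positions are 0-based, as in the index row of the example. Only the condition on the two end characters is tested, on the reasoning that an earlier occurrence of the whole substring would also repeat both end characters at the same distance.

```python
def is_unique_pairwise(p: str, i: int, j: int) -> bool:
    """True if no k < i has p[k] == p[i] and p[k + (j - i)] == p[j]."""
    d = j - i
    return all(not (p[k] == p[i] and p[k + d] == p[j]) for k in range(i))
```

On P = CACTCAGCCACTCGC the sketch accepts p11p12p13 (TCG) and rejects p8p9p10 (CAC), in line with the example.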

• Suppose pipi+1…pj-1pj is a unique pairwise substring.

• If pi and pj match T, we have two cases to move P.

Case 1: There exists k such that pj = pk, where 0 ≤ k ≤ j-i-1 (if several k qualify, take the largest).

We can move P j-k steps.

[Figure: pi = x and pj = y in P matched against x and y in T; P is moved so that pk = y is aligned with y in T.]

Case 2: pj ≠ pk for every k, where 0 ≤ k ≤ j-i-1.

We can move P j+1 steps.

[Figure: pi = x and pj = y in P matched against x and y in T; P is moved past y in T.]
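The two cases above can be sketched as one function. The name `pairwise_shift` is hypothetical, positions are 0-based, and the substring p[i .. j] is assumed to have already been checked to be a unique pairwise substring.

```python
def pairwise_shift(p: str, i: int, j: int) -> int:
    """Steps P may move once p[i] and p[j] both match the text."""
    d = j - i
    # Case 1: the largest k in 0 .. d-1 with p[k] == p[j] gives the smallest safe move.
    for k in range(d - 1, -1, -1):
        if p[k] == p[j]:
            return j - k
    # Case 2: no such k exists, so P can move past position j.
    return j + 1
```

For P = CACTCAGCCACTCGC, `pairwise_shift(P, 11, 13)` falls into Case 2 and gives 14, while `pairwise_shift(P, 7, 9)` falls into Case 1 with k = 1 and gives 8.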

Example

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
P   C  A  C  T  C  A  G  C  C  A  C  T  C  G  C

If we choose p11p12p13 as the unique pairwise substring, we can move P 14 steps when p11 and p13 match T.

T   … C G C C T C G C T C G T G G G C T A A …

[Figure: P under T before and after moving 14 steps.]

• There may be many unique pairwise substrings in the pattern.

• We select the rightmost one in the pattern.

Example

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
P   C  A  C  T  C  A  G  C  G  A  C  T  C  G  C

The substrings p5p6, p7p8p9 and p11p12p13 are all unique pairwise substrings.

We select p11p12p13 because it gives the largest move.

Example

T   … C G C C T C G C T C G T G G G C T A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
P   C  A  C  T  C  A  G  C  C  A  C  T  C  G  C

If p11 or p13 mismatches, move P one step.

[Figure: P under T before and after moving 1 step.]

Example

T   … C G C C T C G C T C G T G G G C T A A …

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
P   C  A  C  T  C  A  G  C  C  A  C  T  C  G  C

If p11 and p13 match T, move P 14 steps.

[Figure: P under T before and after moving 14 steps.]
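Putting the two rules of this example together, the searching phase of Algorithm 3 might be sketched as follows. The function name and parameters are hypothetical, the move length is assumed to come from the case analysis for the chosen unique pairwise substring, and verifying the whole window when both probes match is an assumption of this sketch.

```python
from typing import List


def search_with_unique_pair(text: str, pattern: str,
                            i: int, j: int, steps: int) -> List[int]:
    """Report all occurrences of pattern, probing only pattern[i] and pattern[j]."""
    n, m = len(text), len(pattern)
    found, s = [], 0
    while s + m <= n:
        if text[s + i] == pattern[i] and text[s + j] == pattern[j]:
            # Both end characters match: verify the window, then make the long move.
            if text.startswith(pattern, s):
                found.append(s)
            s += steps
        else:
            # p_i or p_j mismatches: move P one step.
            s += 1
    return found
```

With P = CACTCAGCCACTCGC, the probes are p11 and p13 and the long move is 14 steps, so only two characters are compared in most windows.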


Thanks for your attention.