
1

Morris-Pratt algorithm

Advisor: Prof. R. C. T. Lee

Reporter: C. S. Ou

A linear pattern-matching algorithm, Technical Report 40, University of California, Berkeley, 1970.

Morris (Jr) J. H., Pratt V. R.

2

Morris-Pratt algorithm

Given a text T and a pattern P, we want to find all occurrences of P in T, performing the comparisons from left to right.

n : the length of T
m : the length of P

Example

T = A A A A A A T C A C A T T A G C A A A A (n = 20)
P = A T C A C A G T A T C A (m = 12)

3

Rule 1: The Partial Window Rule

Instead of a complete window whose size is equal to the size of the pattern, we may use a prefix of the complete window and match it against a prefix of the pattern.

[Figure: a complete window of T, of the same length as P, aligned with P]

How do we get the partial window?

4

The basic principle of the MP algorithm is still step-by-step comparison.

Initially, the length of the partial window is 1.

Initially, we compare T(1) with P(1). If T(1) ≠ P(1), we move the pattern one step to the right.

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  A  A  A  A  A  T  C  A  C  A  T  T  A  G  C  A  A  A  A
P   C  T  C  A  C  A  G  T  A  T  C  A
P      C  T  C  A  C  A  G  T  A  T  C  A

T(1) = A ≠ P(1) = C, so P moves one step to the right (second P row).

5

If T(1) = P(1), we extend the partial window until a mismatch is found.

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  T  C  A  C  A  G  C  A  C  A  T  T  A  G  C  A  A  A  A
P   A  T  C  A  C  A  G  T  A  T  C  A

Here P(1..7) = ATCACAG matches T(1..7); the first mismatch is P(8) = T against T(8) = C.
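The extension of the partial window can be sketched in a few lines of Python. This is only an illustration of Rule 1, not code from the original slides: the function name `partial_window_length` and the 0-based `start` parameter are assumptions made for the example.

```python
def partial_window_length(text, pattern, start=0):
    """Count how many leading characters of `pattern` match `text`
    when the pattern is aligned at 0-based position `start`."""
    length = 0
    while (length < len(pattern)
           and start + length < len(text)
           and text[start + length] == pattern[length]):
        length += 1          # extend the partial window by one character
    return length


# Text and pattern of the example above, written without spaces.
T = "ATCACAGCACATTAGCAAAA"
P = "ATCACAGTATCA"
print(partial_window_length(T, P))  # 7
```

The returned value is the length of the matched prefix; the mismatch, if any, is at the next pattern position.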

6

Suppose the following situation occurs. Should we move the pattern P only one step to the right?

The answer is no in this case, because we may use Rule 2, the Suffix of T to Prefix of P Rule.

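[Figure: T runs from position 1 to n and the window covers T(j..j+m-1). P runs from position 1 to m. The prefix P(1..i-1) matches the text, and a mismatch occurs between a = P(i) and b = T(i+j-1).]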

Example

T = A A A A A A T C A C A T T A G C A A A A (n = 20)
P = A T C A C A G T A T C A (m = 12)

7

Rule 2: The Suffix of T to Prefix of P Rule

For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

[Figure: a suffix of the window in T aligned with a prefix of P]

8

The Implication of Rule 2:

Find the longest suffix v of the window which is equal to some prefix of P. Skip the pattern as follows:

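[Figure: the suffix v of the window in T is equal to the prefix v of P; P is shifted so that its prefix v lies under that suffix.]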

9

Now we know that a prefix U of the window of T is equal to a prefix U of P. Thus, instead of searching T for the longest suffix of the partial window which is equal to a prefix of P, we may work on P alone: find the longest proper suffix v of U which is equal to a prefix of P.

[Figure: T contains U followed by b; P contains U followed by a; v is a suffix of U and a prefix of P.]

Example

T = A A A A A C A C A C A T T A G C A A A A (n = 20)
P = C A C A C A G T A T C A (m = 12)

10

Example

T = A A A A A C A C A C A T T A G C A A A A (n = 20)
P = C A C A C A G T A T C A (m = 12)

In this case, we can see the longest suffix of U which is equal to a prefix of P is CA.

Thus, we may apply Rule 2 to move P as follows:

[Figure: P shifted to the right so that its prefix lies under the matching suffix of U]
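The quantity used by Rule 2 depends only on the matched prefix U, so it can be computed without looking at T. The following brute-force sketch makes that concrete; the name `longest_border` is an assumption for this illustration, and the quadratic scan is chosen for readability, not efficiency.

```python
def longest_border(u):
    """Return the longest proper suffix of `u` that is also a prefix of `u`.
    Because U is itself a prefix of P, this is the longest proper suffix
    of U that is equal to a prefix of P."""
    for length in range(len(u) - 1, 0, -1):   # try the longest candidate first
        if u[:length] == u[-length:]:
            return u[:length]
    return ""                                  # no non-empty border exists


print(longest_border("CACA"))    # CA
print(longest_border("CACACA"))  # CACA
print(longest_border("ATCAC"))   # (empty string)
```

If U has length i - 1 and the returned string has length |v|, the pattern can be moved i - 1 - |v| positions to the right.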

11

The MP Algorithm

Assume that we have already found the longest prefix U of the current window of T which is equal to a prefix of P.

[Figure: T contains U followed by b; P contains U followed by a, with a ≠ b.]

12

The MP Algorithm

Skip the pattern by using Rule 1 and Rule 2.

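[Figure: before the shift, T contains the matched part ending in v, followed by b, and P contains the same part followed by a. After the shift, the prefix v of P lies under the suffix v of the matched part, and the next pattern character c is compared with b.]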

Given a prefix U of the window of T which is equal to a prefix of P, how do we find the longest proper suffix of U which is equal to some prefix of U? We do this by preprocessing P.

13

Preprocessing phase

Let $f^{1}(y) = f(y)$ and $f^{x}(y) = f(f^{x-1}(y))$ for $x > 1$. The prefix function f(j), 2 ≤ j ≤ m, for P(j) can be written as follows:

\[
f(j) =
\begin{cases}
f^{k}(j-1) + 1 & \text{if } j > 1 \text{ and there exists a smallest } k \ge 1 \text{ such that } P(f^{k}(j-1)+1) = P(j), \\
0 & \text{otherwise.}
\end{cases}
\]

The function g(j) is obtained from f:

\[
g(j) =
\begin{cases}
-1 & \text{if } j = 1, \\
f(j-1) & \text{if } 2 \le j \le m+1.
\end{cases}
\]

The MP algorithm uses j - g(j) - 1 as the distance by which the pattern P is shifted along the text T after a mismatch at P(j).

Example

j              1  2  3  4  5  6  7  8  9 10 11 12 13
P              A  T  C  A  C  A  T  C  A  T  C  A
f(j)           0  0  0  1  0  1  2  3  4  2  3  4
g(j)          -1  0  0  0  1  0  1  2  3  4  2  3  4
j - g(j) - 1   1  1  2  3  3  5  5  5  5  5  8  8  8
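The three rows f(j), g(j) and j - g(j) - 1 can be reproduced with a short Python sketch. The function names `prefix_function` and `shift_table` are assumptions for this illustration. Lists are padded at index 0 so that the code can use the 1-based positions of the slides, while the pattern string itself is 0-based (P(j) is `pattern[j - 1]`).

```python
def prefix_function(pattern):
    """Return f with f[j] = f(j) for j = 1..m (f[0] is unused padding)."""
    m = len(pattern)
    f = [0] * (m + 1)
    for j in range(2, m + 1):
        k = f[j - 1]                               # first candidate: f^1(j-1)
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = f[k]                               # next candidate: f^2(j-1), f^3(j-1), ...
        if pattern[k] == pattern[j - 1]:           # P(k+1) == P(j) in 1-based terms
            k += 1
        f[j] = k
    return f


def shift_table(pattern):
    """Return (f, g, shift) as 1-based lists; g and shift run up to j = m + 1."""
    m = len(pattern)
    f = prefix_function(pattern)
    g = [None, -1] + [f[j - 1] for j in range(2, m + 2)]
    shift = [None] + [j - g[j] - 1 for j in range(1, m + 2)]
    return f, g, shift


f, g, shift = shift_table("ATCACATCATCA")
print(f[1:])      # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
print(g[1:])      # [-1, 0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
print(shift[1:])  # [1, 1, 2, 3, 3, 5, 5, 5, 5, 5, 8, 8, 8]
```

The inner `while` loop is the code form of "the smallest k such that P(f^k(j-1)+1) = P(j)": each pass applies f once more.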

14

Example

j      1  2  3  4  5  6  7  8  9 10 11 12 13
P      A  T  C  A  C  A  T  C  A  T  C  A
f(j)   0  0  0  1

j = 1 → f(1) = 0
j = 2 → P(2) = 'T' ≠ P(f^1(2-1)+1) = P(1) = 'A' → f(2) = 0
j = 3 → P(3) = 'C' ≠ P(f^1(3-1)+1) = P(1) = 'A' → f(3) = 0
j = 4 → P(4) = 'A' = P(f^1(4-1)+1) = P(1) = 'A' → f(4) = 0 + 1 = 1

15

Example

j      1  2  3  4  5  6  7  8  9 10 11 12 13
P      A  T  C  A  C  A  T  C  A  T  C  A
f(j)   0  0  0  1  0  1  2  3  4

j = 5 → P(5) = 'C' ≠ P(f^1(5-1)+1) = P(1+1) = 'T', and P(5) = 'C' ≠ P(f^2(5-1)+1) = P(1) = 'A' → f(5) = 0
j = 6 → P(6) = 'A' = P(f^1(6-1)+1) = P(1) = 'A' → f(6) = 0 + 1 = 1
j = 7 → P(7) = 'T' = P(f^1(7-1)+1) = P(1+1) = 'T' → f(7) = 1 + 1 = 2
j = 8 → P(8) = 'C' = P(f^1(8-1)+1) = P(2+1) = 'C' → f(8) = 2 + 1 = 3
j = 9 → P(9) = 'A' = P(f^1(9-1)+1) = P(3+1) = 'A' → f(9) = 3 + 1 = 4

16

We have found that f(9) = 4. We now check whether P(10) = P(5). The answer is no. Does this mean that we should set f(10) to 0? No. We fall back to the next shorter candidate, f(f(9)) = f(4) = 1, and check whether P(10) = P(2).

Example

j      1  2  3  4  5  6  7  8  9 10 11 12 13
P      A  T  C  A  C  A  T  C  A  T  C  A
f(j)   0  0  0  1  0  1  2  3  4  2  3  4

j = 10 → P(10) = 'T' ≠ P(f^1(10-1)+1) = P(4+1) = 'C', but P(10) = 'T' = P(f^2(10-1)+1) = P(f(4)+1) = P(1+1) = P(2) = 'T' → f(10) = 1 + 1 = 2
j = 11 → P(11) = 'C' = P(f^1(11-1)+1) = P(2+1) = 'C' → f(11) = 2 + 1 = 3
j = 12 → P(12) = 'A' = P(f^1(12-1)+1) = P(3+1) = 'A' → f(12) = 3 + 1 = 4
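The fallback at j = 10 can be traced mechanically. The sketch below is an illustration only: `next_prefix_value` is a hypothetical helper, and the list `f` is filled in by hand with the values f(1) to f(9) found so far (index 0 is unused padding).

```python
P = "ATCACATCATCA"
f = [None, 0, 0, 0, 1, 0, 1, 2, 3, 4]   # f(1)..f(9), index 0 unused


def next_prefix_value(pattern, f, j):
    """Return (k, f(j)): k is the number of times f was applied to j - 1
    before the search stopped, and f(j) is the resulting value."""
    candidate = f[j - 1]                 # f^1(j-1)
    k = 1
    while True:
        # P(candidate + 1) == P(j) in the 1-based notation of the slides
        if pattern[candidate] == pattern[j - 1]:
            return k, candidate + 1
        if candidate == 0:
            return k, 0                  # no suitable k exists, so f(j) = 0
        candidate = f[candidate]         # f^(k+1)(j-1)
        k += 1


print(next_prefix_value(P, f, 10))  # (2, 2)
print(next_prefix_value(P, f, 5))   # (2, 0)
```

For j = 10 the first candidate f(9) = 4 fails and the second candidate f(f(9)) = 1 succeeds, which gives f(10) = 1 + 1 = 2.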

17

Then, after a shift, the comparisons can resume between the characters c = P(g(i) + 1) and T(i+j-1) = b, without missing any occurrence of P in T and without backtracking on the text.

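[Figure: T runs from position 1 to n and the window covers T(j..j+m-1). The window starts with u followed by b = T(i+j-1); P starts with u followed by a = P(i). After the shift, the prefix v of P lies under the suffix v of u, and the next pattern character is c.]

Example

T = A A A A A C A C A C A T T A G C A A A A (n = 20)
P = C A C A C A G T A T C A (m = 12)

[Figure: P drawn under T before and after the shift]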

18

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P   C  A  C  A  C  A  G  T  A  T  C  A
P      C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(1). P(1) = C ≠ T(1) = A, so the mismatch is at j = 1 and the table gives j - g(j) - 1 = 1. The second P row shows the pattern after the shift.

Shift by 1

Shift table (from the prefix function):

j              1  2  3  4  5  6  7  8  9 10 11 12 13
j - g(j) - 1   1  1  2  2  2  2  2  7  8  9 10 10 10

19

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P      C  A  C  A  C  A  G  T  A  T  C  A
P            C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(2). P(1..3) = CAC matches T(2..4), and P(4) = A ≠ T(5) = G, so the mismatch is at j = 4 and the table gives j - g(j) - 1 = 2.

Shift by 2

20

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P            C  A  C  A  C  A  G  T  A  T  C  A
P               C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(4). P(1) = C is already known to match T(4), so the comparison resumes at P(2) = A ≠ T(5) = G. The mismatch is at j = 2 and the table gives j - g(j) - 1 = 1.

Shift by 1

21

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P               C  A  C  A  C  A  G  T  A  T  C  A
P                  C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(5). P(1) = C ≠ T(5) = G, so the mismatch is at j = 1 and the table gives j - g(j) - 1 = 1.

Shift by 1

22

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P                  C  A  C  A  C  A  G  T  A  T  C  A
P                     C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(6). P(1) = C ≠ T(6) = T, so the mismatch is at j = 1 and the table gives j - g(j) - 1 = 1.

Shift by 1

23

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P                     C  A  C  A  C  A  G  T  A  T  C  A
P                        C  A  C  A  C  A  G  T  A  T  C  A

P is aligned at T(7). P(1) = C ≠ T(7) = A, so the mismatch is at j = 1 and the table gives j - g(j) - 1 = 1.

Shift by 1

24

Example

i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
T   A  C  A  C  G  T  A  C  A  C  A  C  A  G  T  A  T  C  A  A
P                        C  A  C  A  C  A  G  T  A  T  C  A

MATCH

P is aligned at T(8) and all 12 characters match T(8..19). After a complete match we use j = m + 1 = 13, and the table gives j - g(j) - 1 = 10. This moves the window past the end of T, so the search stops.

Shift by 10
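The whole trace above can be replayed with a compact Python sketch of the searching phase. It is written under stated assumptions and is not taken from the slides: the names `prefix_function` and `mp_search` are hypothetical, positions reported to the caller are 1-based to match the slides, and the function also records the shift taken at each step so that the result can be compared with the "Shift by" lines.

```python
def prefix_function(pattern):
    """f[j] = f(j) for j = 1..m, with f[0] as unused padding."""
    m = len(pattern)
    f = [0] * (m + 1)
    for j in range(2, m + 1):
        k = f[j - 1]
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = f[k]
        if pattern[k] == pattern[j - 1]:
            k += 1
        f[j] = k
    return f


def mp_search(text, pattern):
    """Return (matches, shifts): 1-based match positions and the shifts taken."""
    n, m = len(text), len(pattern)
    f = prefix_function(pattern)
    g = [None, -1] + [f[j - 1] for j in range(2, m + 2)]   # g(1..m+1)
    matches, shifts = [], []
    pos = 1          # 1-based text position aligned with P(1)
    matched = 0      # pattern characters already known to match
    while pos <= n - m + 1:
        # Extend the partial window until a mismatch or a full match.
        while matched < m and text[pos - 1 + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            matches.append(pos)
        j = matched + 1                  # mismatch position, m + 1 after a match
        shift = j - g[j] - 1
        shifts.append(shift)
        pos += shift
        matched = max(g[j], 0)           # the border v does not need rechecking
    return matches, shifts


T = "ACACGTACACACAGTATCAA"
P = "CACACAGTATCA"
print(mp_search(T, P))  # ([8], [1, 2, 1, 1, 1, 1, 10])
```

Because `matched` is set to g(j) after each shift, the text pointer `pos - 1 + matched` never moves backwards, which is the "no backtrack on the text" property.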

25

Time Complexity

The preprocessing phase takes O(m) time and space.

The searching phase takes O(n + m) time.
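The linear bound on the searching phase can be checked empirically by counting character comparisons. The sketch below uses the usual pointer-based formulation of the search instead of explicit window shifts; the name `mp_count_comparisons` is an assumption, and a non-empty pattern is assumed. A standard amortised argument shows that this loop performs at most 2n - 1 text character comparisons, because every fallback undoes at least one earlier successful comparison.

```python
def mp_count_comparisons(text, pattern):
    """Return (comparisons, matches) for a non-empty pattern.
    `matches` holds 1-based starting positions."""
    n, m = len(text), len(pattern)
    # Preprocessing: f[j] is the prefix function value f(j), f[0] unused.
    f = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = f[k]
        if pattern[k] == pattern[j - 1]:
            k += 1
        f[j] = k
    # Searching: each text character is visited exactly once.
    comparisons, matches, matched = 0, [], 0
    for i in range(n):
        while True:
            comparisons += 1
            if text[i] == pattern[matched]:
                matched += 1
                break
            if matched == 0:
                break
            matched = f[matched]         # shift the pattern, keep the text position
        if matched == m:
            matches.append(i - m + 2)    # convert the end index to a 1-based start
            matched = f[m]
    return comparisons, matches


T = "ACACGTACACACAGTATCAA"
P = "CACACAGTATCA"
count, found = mp_count_comparisons(T, P)
print(found, count <= 2 * len(T) - 1)  # [8] True
```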

26

References

AHO, A.V., HOPCROFT, J.E., ULLMAN, J.D., 1974, The design and analysis of computer algorithms, 2nd Edition, Chapter 9, pp. 317--361, Addison-Wesley Publishing Company.

BEAUQUIER, D., BERSTEL, J., CHRÉTIENNE, P., 1992, Éléments d'algorithmique, Chapter 10, pp 337-377, Masson, Paris.

CROCHEMORE, M., 1997. Off-line serial exact string searching, in Pattern Matching Algorithms, ed. A. Apostolico and Z. Galil, Chapter 1, pp 1-53, Oxford University Press.

HANCART, C., 1992, Une analyse en moyenne de l'algorithme de Morris et Pratt et de ses raffinements, in Théorie des Automates et Applications, Actes des 2e Journées Franco-Belges, D. Krob ed., Rouen, France, 1991, PUR 176, Rouen, France, 99-110.

HANCART, C., 1993, Analyse exacte et en moyenne d'algorithmes de recherche d'un motif dans un texte, Ph. D. Thesis, University Paris 7, France.

MORRIS (Jr) J.H., PRATT V.R., 1970, A linear pattern-matching algorithm, Technical Report 40, University of California, Berkeley.

27

Thanks for your attention.