The Zhu-Takaoka Algorithm. Advisor: Prof. R. C. T. Lee. Speaker: S. Y. Tang. On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177, 1987. R. F. ZHU, T. TAKAOKA

1

The Zhu-Takaoka Algorithm

 Advisor: Prof. R. C. T. Lee

Speaker: S. Y. Tang

On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177, 1987

R. F. ZHU, T. TAKAOKA

2

• The Zhu-Takaoka Algorithm is an algorithm which solves the string matching problem.

• String matching problem:

Input: a text string T of length n and a pattern string P of length m.

Output: all occurrences of P in T.
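
As a baseline, the problem statement can be written directly as a brute-force check. This is only a reference sketch for comparing results, not part of the Zhu-Takaoka algorithm; the function name `naive_search` is a hypothetical choice.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every index i such that text[i:i+m] equals pattern (brute force)."""
    n, m = len(text), len(pattern)
    # Try every possible starting position of the pattern in the text.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(naive_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```

Any faster matcher, including the one in these slides, should report the same positions as this check.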

3

• The Zhu-Takaoka Algorithm is a variant of the Boyer-Moore (BM) Algorithm. It improves only the bad character rule of the BM Algorithm.

• Zhu and Takaoka replaced the bad character rule with a 2-substring rule. The good suffix rules are still used.

4

The 2-Substring Rule

• Consider text = ACTGCCTAAGTA and pattern = CTAAG. With the pattern aligned at position 0, the window ends in GC, and no GC appears in P.

Index:    0  1  2  3  4  5  6  7  8  9 10 11
Text:     A  C  T  G  C  C  T  A  A  G  T  A
Pattern:  C  T  A  A  G

[Figure: two further panels repeat the text with the pattern at later positions.]

5

How can we know whether a specified 2-substring appears in P or not?

6

• Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  G  A  T  A  T  A  C  A  G  T  A  C  G
Pattern:  G  C  A  G  A  G  A  G
Shift 5:                 G  C  A  G  A  G  A  G
Shift 1:                    G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

Whenever a mismatch or a complete match occurs, we take the last 2-substring of the current text window and look up its rightmost occurrence in P, if it exists. The ztBc table is constructed for this lookup.

ztBc[C,A] = 5 means that the rightmost occurrence of CA in P ends 5 positions from the right end of P, so when the window ends in CA we can shift by 5. ztBc[G,A] = 1 means that the rightmost occurrence of GA ends 1 position from the right end. If GA is the 2-substring to be matched, we shift by 1.
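
The lookup on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the slides: the names `ZTBC`, `FIRST` and `zt_shift` are hypothetical, and the dictionary stores only the table entries that come from 2-substrings of the pattern.

```python
M = 8  # length of the pattern GCAGAGAG
# Table entries that come from 2-substrings of the pattern (copied from the slide).
ZTBC = {("A", "G"): 2, ("C", "A"): 5, ("G", "A"): 1, ("G", "C"): 6}
FIRST = "G"  # P[0]: an unlisted pair ending in P[0] shifts by M - 1, any other by M


def zt_shift(text: str, j: int) -> int:
    """Shift for the window text[j : j + M], read from its last two characters."""
    a, b = text[j + M - 2], text[j + M - 1]
    if (a, b) in ZTBC:
        return ZTBC[(a, b)]
    return M - 1 if b == FIRST else M


text = "GCATCGCAGAGGATATACAGTACG"  # the text used on this slide
print(zt_shift(text, 0))  # window ends in CA -> 5
print(zt_shift(text, 5))  # window ends in GA -> 1
```

Storing only the pairs that occur in the pattern is a design choice made here for brevity; the slides use a full table indexed by both characters.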

7

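The preprocessing phase of the algorithm consists of computing, for each pair of characters (a, b) of the alphabet, the rightmost occurrence of ab in x[0 .. m-2].

For all characters a, b:

\[
\mathrm{ztBc}[a,b] = k \iff
\begin{cases}
k \le m-2 \ \text{and}\ x[m-k-2 \,..\, m-k-1] = ab \ \text{and}\ ab \text{ does not occur in } x[m-k-1 \,..\, m-2], \text{ or} \\
k = m-1 \ \text{and}\ x[0] = b \ \text{and}\ ab \text{ does not occur in } x[0 \,..\, m-2], \text{ or} \\
k = m \ \text{and}\ x[0] \ne b \ \text{and}\ ab \text{ does not occur in } x[0 \,..\, m-2].
\end{cases}
\]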
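
The three cases of the definition can be evaluated directly, one pair at a time. The sketch below is a deliberately slow, literal reading of the definition, intended only for checking table entries; the function name `zt_bc_by_definition` is hypothetical and the pattern is assumed to have length at least 2.

```python
def zt_bc_by_definition(x: str, a: str, b: str) -> int:
    """Evaluate ztBc[a, b] for pattern x straight from the three-case definition."""
    m = len(x)
    ab = a + b
    # Case 1: try k = 1 .. m-2, smallest first, so the first hit is the
    # rightmost occurrence of ab in x[0 .. m-2].
    for k in range(1, m - 1):
        if x[m - k - 2:m - k] == ab:  # x[m-k-2 .. m-k-1] == ab
            return k
    # Cases 2 and 3: ab does not occur in x[0 .. m-2].
    return m - 1 if x[0] == b else m


x = "GCAGAGAG"
for a in "ACGT":  # T plays the role of "*": it does not occur in the pattern
    print(a, [zt_bc_by_definition(x, a, b) for b in "ACGT"])
```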

8

Preprocessing phase

Example: consider text = ATTGCCTAAGTA and pattern = CTAAG.

The alphabet of the pattern is {A, C, G, T}. The sign "*" denotes any character of the text that never appears in the pattern.

First, we fill every entry with m, the length of the pattern (here m = 5).

ztBc    A  C  G  T  *
  A     5  5  5  5  5
  C     5  5  5  5  5
  G     5  5  5  5  5
  T     5  5  5  5  5
  *     5  5  5  5  5

9

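Preprocessing phase

Then we handle the pairs ab that do not occur in P[0..m-2]: if P[0] = b, we set ztBc[a, b] = m-1 for every a. Here P[0] = C, so every entry in column C becomes 4.

ztBc    A  C  G  T  *     (row a, column b)
  A     5  4  5  5  5
  C     5  4  5  5  5
  G     5  4  5  5  5
  T     5  4  5  5  5
  *     5  4  5  5  5

Example:

T: A T T G C C T A A G T A
P: C T A A G
           C T A A G     (shifted by m-1 = 4, so that P[0] lines up with the C of GC)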

10

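Preprocessing phase

Finally, we set ztBc[a, b] = k if k ≤ m-2, P[m-k-2 .. m-k-1] = ab, and ab does not occur in P[m-k-1 .. m-2].

ztBc    A  C  G  T  *     (row a, column b)
  A     1  4  5  5  5
  C     5  4  5  3  5
  G     5  4  5  5  5
  T     2  4  5  5  5
  *     5  4  5  5  5

Example: in P = CTAAG, AA ends 1 position from the right end, TA ends 2 positions from it, and CT ends 3 positions from it, so ztBc[A,A] = 1, ztBc[T,A] = 2 and ztBc[C,T] = 3.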
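
The three preprocessing steps on the last three slides can be sketched as one short Python function. This is an illustrative sketch rather than the authors' code: the name `pre_zt_bc`, the dictionary representation, and the explicit `alphabet` argument (with "*" standing for any character absent from the pattern) are assumptions made for the example.

```python
def pre_zt_bc(p: str, alphabet: str) -> dict[tuple[str, str], int]:
    """Build the ztBc table for pattern p over the given alphabet in three steps."""
    m = len(p)
    # Step 1: fill every entry with m.
    table = {(a, b): m for a in alphabet for b in alphabet}
    # Step 2: every pair that ends in p[0] gets m - 1.
    for a in alphabet:
        table[(a, p[0])] = m - 1
    # Step 3: scan left to right so that later (more rightward) pairs overwrite
    # earlier ones; the pair p[i-1] p[i] ends m - 1 - i positions from the right end.
    for i in range(1, m - 1):
        table[(p[i - 1], p[i])] = m - 1 - i
    return table


alphabet = "ACGT*"  # "*" stands for any text character absent from the pattern
table = pre_zt_bc("CTAAG", alphabet)
print("  " + " ".join(alphabet))
for a in alphabet:
    print(a, " ".join(str(table[(a, b)]) for b in alphabet))
```

The printed rows should reproduce the final table for CTAAG shown above. Step 3 runs in O(m) and steps 1 and 2 depend only on the alphabet size.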

11

If ztBc[a, b] = k

• Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:  G  C  A  G  A  G  A  G
Shift 5:                 G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G

Case 1: $k \le m-2$ and $x[m-k-2 \,..\, m-k-1] = ab$ and $ab$ does not occur in $x[m-k-1 \,..\, m-2]$.

• ztBc[C,A] = 5, with k ≤ m-2, because x[8-5-2 .. 8-5-1] = x[1..2] = CA and "CA" does not occur in x[8-5-1 .. 8-2] = x[2..6].

12

If ztBc[a, b] = k

• Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  G  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:  G  C  A  G  A  G  A  G
Shift 7:                       G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G

Case 2: $k = m-1$ and $x[0] = b$ and $ab$ does not occur in $x[0 \,..\, m-2]$.

• ztBc[C,G] = 7, with k = m-1, because x[0] = b (G = G) and "CG" does not occur in x[0 .. 8-2] = x[0..6].

13

If ztBc[a, b] = k

• Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:  G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G

Case 3: $k = m$ and $x[0] \ne b$ and $ab$ does not occur in $x[0 \,..\, m-2]$.

• ztBc[A,C] = 8, with k = m, because x[0] ≠ b (G ≠ C) and "AC" does not occur in x[0 .. 8-2] = x[0..6].

14

• Full Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:  G  C  A  G  A  G  A  G
Shift 5:                 G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G
bmGs    7  7  7  2  7  4  7  1

In this step the mismatch occurs at i = 7. We shift by the ztBc value because ztBc[T[6], T[7]] = ztBc[C,A] = 5 > bmGs[7] = 1. The pattern shifts 5 positions to the right (case 1).

15

• Full Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:                 G  C  A  G  A  G  A  G
Shift 7:                                      G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G
bmGs    7  7  7  2  7  4  7  1

Exact match: the pattern occurs at position 5. In this step we shift by the bmGs value because ztBc[A,G] = 2 < bmGs[0] = 7.

16

• Full Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:                                      G  C  A  G  A  G  A  G
Shift 4:                                                  G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G
bmGs    7  7  7  2  7  4  7  1

In this step the mismatch occurs at i = 5. We shift by the bmGs value because ztBc[A,G] = 2 < bmGs[5] = 4.

17

• Full Example

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text:     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern:                                                  G  C  A  G  A  G  A  G

ztBc    A  C  G  *     (row a, column b)
  A     8  8  2  8
  C     5  8  7  8
  G     1  6  7  8
  *     8  8  7  8

i       0  1  2  3  4  5  6  7
x[i]    G  C  A  G  A  G  A  G
bmGs    7  7  7  2  7  4  7  1

In this step the mismatch occurs at i = 6. We can shift by either the ztBc function or the bmGs function, because ztBc[C,G] = 7 = bmGs[6].
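
The four steps of the full example can be replayed with a short Python sketch. Several points are assumptions made for illustration and are not taken from the slides: the function names are hypothetical, the good suffix table is computed by brute force rather than by the usual linear-time method, the ztBc table is stored sparsely, the pattern is assumed to have length at least 2, and after a complete match the shift is the larger of bmGs[0] and the ztBc value, following the comparison made on the exact-match slide.

```python
def good_suffix(x: str) -> list[int]:
    """Brute-force bmGs: smallest shift s consistent with the matched suffix x[i+1:]."""
    m = len(x)

    def ok(i: int, s: int) -> bool:
        # The matched suffix must agree with the pattern moved s places right ...
        fits = all(k - s < 0 or x[k - s] == x[k] for k in range(i + 1, m))
        # ... and the character now under position i must differ from x[i].
        differs = i - s < 0 or x[i - s] != x[i]
        return fits and differs

    return [next(s for s in range(1, m + 1) if ok(i, s)) for i in range(m)]


def zt_bc(x: str) -> dict[tuple[str, str], int]:
    """Sparse ztBc: only pairs occurring in x[0..m-2]; the rightmost occurrence wins."""
    m = len(x)
    return {(x[i - 1], x[i]): m - 1 - i for i in range(1, m - 1)}


def zhu_takaoka(text: str, x: str) -> tuple[list[int], list[int]]:
    """Return (match positions, shifts taken). Assumes len(x) >= 2."""
    n, m = len(text), len(x)
    gs, bc = good_suffix(x), zt_bc(x)
    matches, shifts = [], []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == text[j + i]:  # compare right to left
            i -= 1
        a, b = text[j + m - 2], text[j + m - 1]  # last 2-substring of the window
        zt = bc.get((a, b), m - 1 if b == x[0] else m)
        if i < 0:
            matches.append(j)
        gs_shift = gs[0] if i < 0 else gs[i]
        shifts.append(max(gs_shift, zt))
        j += shifts[-1]
    return matches, shifts


print(good_suffix("GCAGAGAG"))  # [7, 7, 7, 2, 7, 4, 7, 1]
print(zhu_takaoka("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # ([5], [5, 7, 4, 7])
```

The shifts 5, 7, 4 and 7 correspond to the four Full Example slides, and the single match is reported at position 5.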

18

Time complexity

• The preprocessing phase takes O(m + σ²) time and space, where σ is the number of characters in the alphabet of the text.

• The searching phase takes O(m × n) time.

19

References

1. ZHU, R. F. and TAKAOKA, T., 1987, On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177.

20

Thank you for your attention.