Accelerating Boyer Moore Searches on Binary Texts

Shmuel Tomi Klein, Miri Kopel Ben-Nissan

Bar Ilan University, ISRAEL

Outline

Background and motivation

Boyer Moore algorithm

New binary variant

Analysis

Experiments

Summary

Important application of Automata:

PATTERN MATCHING

KMP BDM BM

Boyer & Moore

this-is-a-sample-text---

pattern

Match backwards!

Boyer – Moore Algorithm

Mismatch – case 1: delta1

b does not occur in x

y: ... b u ...
x: ... a u      (x contains no b)

shift: x is moved completely past the b in y

Boyer – Moore Algorithm

Mismatch – case 2: delta1

b occurs in x

y: ... b u ...
x: ... b ... a u      (the part of x to the right of this b contains no b)

shift: the rightmost b in x is aligned with the b in y

Boyer – Moore Algorithm

Mismatch – case 3: delta2

u reoccurs in x preceded by c ≠ a

y: ... b u ...
x: ... c u ... a u

shift: the occurrence c u in x is aligned with u in y

Boyer – Moore Algorithm

Mismatch – case 4: delta2

Only a suffix v of u reoccurs in x

y: ... b u ...      (v is a suffix of u)
x: v ... a u

shift: the v at the start of x is aligned with the suffix v of u in y

Boyer – Moore Example

example

delta1

a  e  l  m  p  x  rest
4  0  1  3  2  5  7

delta2

e   x   a   m   p   l   e
12  11  10  9   8   7   1

here is a simple example
example

here is a simple example
       example

here is a simple example
         example

here is a simple example
               example

here is a simple example
                 example
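The tables and the five alignments of this example can be reproduced with a minimal sketch of the classical search loop, in which delta1 and delta2 are both increases of the text pointer. The function name `boyer_moore` and the returned list of alignments are illustrative choices; delta1 is built from the pattern, while the delta2 values are copied from the slide's table rather than computed.

```python
def boyer_moore(text, pattern, delta2):
    """Classical Boyer-Moore search; returns (match position or -1, alignments tried).

    delta2[j] is the text-pointer increase for a mismatch at pattern index j
    (0-based here); in this sketch it is passed in, not computed.
    """
    m, n = len(pattern), len(text)
    # delta1: distance of the rightmost occurrence of each character from the
    # right end of the pattern; characters not in the pattern get m.
    delta1 = {ch: m - 1 - idx for idx, ch in enumerate(pattern)}
    alignments = []
    i = m - 1                               # text pointer, under the last pattern character
    while i < n:
        alignments.append(i - (m - 1))      # where the pattern currently starts
        j = m - 1
        while j >= 0 and text[i] == pattern[j]:   # match backwards
            i -= 1
            j -= 1
        if j < 0:
            return i + 1, alignments
        i += max(delta1.get(text[i], m), delta2[j])
    return -1, alignments


position, tried = boyer_moore("here is a simple example", "example",
                              [12, 11, 10, 9, 8, 7, 1])
print(position, tried)   # 17 [0, 7, 9, 15, 17]
```

The third alignment is the only one where delta2 wins: four characters match, the mismatch is on "a", and delta2 = 10 beats the delta1 value 7 for the text character "i".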

Problems of Binary Boyer & Moore

delta1 useless

most work by delta1

0100101101011101000100110101001

1101100

this-is-a-sample-text---

pattern

Bit-level processing
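Why delta1 carries so little information on a two-letter alphabet can be seen by building the table for the binary pattern shown here. This is a small sketch using the classical delta1 definition applied to single bits; the helper name `delta1_table` is illustrative.

```python
def delta1_table(pattern):
    """Distance of the rightmost occurrence of each character from the pattern's right end."""
    m = len(pattern)
    return {ch: m - 1 - idx for idx, ch in enumerate(pattern)}


print(delta1_table("1101100"))   # {'1': 2, '0': 0}
print(delta1_table("example"))
```

With only two characters, the bit that ends the pattern always gets 0 and the other bit gets the length of the final run, so delta1 alone rarely allows a long jump.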

Need for Binary Boyer & Moore

Compressed Matching

Given E(T) and P look for E(P) in E(T)

rather than P in D(E(T))
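The idea of searching for E(P) in E(T) can be sketched with a toy prefix code. The code table `CODE` and the function names below are hypothetical, chosen only for illustration; the experiments in these slides use Huffman encoding, which this sketch does not build.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}   # hypothetical toy prefix code


def encode(s):
    """E(s): concatenate the codewords of the characters of s."""
    return "".join(CODE[ch] for ch in s)


def compressed_find(text, pattern):
    """Bit offset of the first occurrence of E(pattern) in E(text), or -1."""
    return encode(text).find(encode(pattern))


print(encode("abacabad"))                   # 01001100100111
print(compressed_find("abacabad", "cab"))   # 4
```

The text is never decoded, but the search now runs over bits rather than characters. A complete solution would probably also have to confirm that a reported offset falls on a codeword boundary, which this sketch does not check.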

Suggested Solution:

BBBM: Blocked Binary Boyer Moore

BBBM

Matching

[Figure: text block Text[i] compared with pattern block Pat[sh, j]; labels k, sh, sl]
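Block-wise comparison of Text[i] against Pat[sh, j] might look roughly like the sketch below. It assumes that k is the block size in bits, that sh is the bit offset of the pattern inside its first block, and that Pat[sh, j] is block j of the pattern shifted by sh bits; these readings, the mask that hides bits outside the pattern, and all function names are assumptions made for illustration, not details taken from the slides.

```python
def to_blocks(bits, k=8):
    """Pack a bit string into k-bit integers, padding the last block with zeros."""
    padded = bits.ljust(-(-len(bits) // k) * k, "0")
    return [int(padded[p:p + k], 2) for p in range(0, len(padded), k)]


def pattern_blocks(pattern, k=8):
    """table[sh][j] = (mask, value) for block j of the pattern shifted right by sh bits."""
    table = []
    for sh in range(k):
        care = "0" * sh + "1" * len(pattern)    # 1 wherever a pattern bit is present
        bits = "0" * sh + pattern
        table.append(list(zip(to_blocks(care, k), to_blocks(bits, k))))
    return table


def occurs_at(text_blocks, n_bits, table, m, bit_pos, k=8):
    """Compare whole blocks, from the last pattern block backwards."""
    if bit_pos + m > n_bits:
        return False
    first, sh = divmod(bit_pos, k)
    blocks = table[sh]
    for j in range(len(blocks) - 1, -1, -1):
        mask, value = blocks[j]
        if (text_blocks[first + j] & mask) != value:
            return False
    return True


text = "0100101101011101000100110101001"
pattern = "1101"
blocks, table = to_blocks(text), pattern_blocks(pattern)
hits = [p for p in range(len(text))
        if occurs_at(blocks, len(text), table, len(pattern), p)]
print(hits)
```

Precomputing one shifted copy of the pattern per value of sh trades memory for speed: each comparison then handles up to k bits at once and no bit of the text has to be extracted individually.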

BBBM

More information in binary case

ASCII: ffghabdgttiocbsbgghj

BINARY: 0110001001101010

BBBM

extended delta1

[Figure: blocks i – 1, i, i + 1 of T aligned against P; bit strings 101, 101, 101, 100 and 01]

$\mathit{delta}_1[sl, B] \le m, \quad 1 \le sl \le k, \quad 0 \le B < 2^{sl}$

BBBM

Total size of delta1 tables:

$\sum_{sl=1}^{k} 2^{sl} = 2^{k+1} - 2$
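One plausible way to fill a delta1 table indexed by a whole group of bits is sketched below. The rule used here, taking the smallest shift after which the sl text bits B agree with every pattern bit that lands on them, is an assumption that generalises cases 1 and 2 of the classical algorithm; it is not taken from the slides, and the function name `extended_delta1` is illustrative.

```python
def extended_delta1(pattern, sl):
    """table[B] = smallest shift s (0..m) for which the sl text bits B, lying under
    the last sl pattern bits, agree with every pattern bit that lands on them."""
    m = len(pattern)
    table = []
    for b in range(2 ** sl):
        block = format(b, "0{}b".format(sl))
        for s in range(m + 1):
            first = max(s, m - sl)          # first position of B still covered after the shift
            if all(pattern[p - s] == block[p - (m - sl)] for p in range(first, m)):
                table.append(s)             # s == m always succeeds (no overlap left)
                break
    return table


pattern = "1101100"
for sl in (1, 2, 3):
    print(sl, extended_delta1(pattern, sl))

# one table per value of sl: 2 + 4 + 8 + 16 entries for sl = 1..4
print(sum(len(extended_delta1(pattern, sl)) for sl in range(1, 5)))   # 30
```

For sl = 1 this reduces to the ordinary single-bit delta1. Larger values of sl see more of the text at once and can therefore justify longer shifts, at the price of a table with 2^sl entries.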

If too large, use limit value $K < k$

[Figure: T aligned against P; labels sl, k, K]

Size of delta1 tables reduced to $2^{K+1}$

BBBM

Original delta1: increase of text pointer

BBBM delta1: shift size

[Figure: T aligned against P]

Mismatch not in last block

Correct[sh, j]

BBBM

[Figure: T aligned against P]

delta2

delta2[m - matchlen]

j          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Pat[j]     1   0   1   0   1   0   1   0   1   1   1   0   1   1   0   1
delta2[j]  13  13  13  13  13  13  13  13  13  13  13  3   7   15  2   1
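The delta2 values for this 16-bit pattern, read as 13 for j = 1 to 11 followed by 3, 7, 15, 2, 1, are consistent with delta2 taken as a shift size under the rule of cases 3 and 4. The brute-force sketch below is written for clarity rather than speed, and the function name `delta2_shifts` is illustrative. Adding the number of matched characters turns the shift into the text-pointer increase used in the earlier "example" table.

```python
def delta2_shifts(pattern):
    """shifts[j - 1] = delta2, as a shift size, for a mismatch at position j (1-based)."""
    m = len(pattern)
    shifts = []
    for j in range(m):                      # 0-based mismatch position
        for s in range(1, m + 1):
            # the matched suffix pattern[j+1:] must agree with what slides under it
            suffix_ok = all(p - s < 0 or pattern[p - s] == pattern[p]
                            for p in range(j + 1, m))
            # the character sliding under the mismatch must differ from pattern[j]
            differs = j - s < 0 or pattern[j - s] != pattern[j]
            if suffix_ok and differs:
                shifts.append(s)            # s == m always qualifies
                break
    return shifts


print(delta2_shifts("1010101011101101"))
# [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 3, 7, 15, 2, 1]

m = len("example")
print([s + (m - 1 - j) for j, s in enumerate(delta2_shifts("example"))])
# [12, 11, 10, 9, 8, 7, 1]
```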

Analysis

Assumption: random input

Reasonable for compressed text

Expected # comparisons till mismatch:

Bit-wise: $\sum_{j=1}^{m} j \, 2^{-j} \approx 2$

Blocked: [formula in k, sl, m and t; not recoverable from the extracted slide text]

Analysis

Expected # bits shifted after mismatch:

Bit-wise: M

Blocked: M'

$E(M) = \sum_{j=1}^{m} 2^{-j} \min(2^{j}, m) \approx \log m$

$M' \ge M$
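The two bit-wise expressions can be evaluated numerically. The sketch below assumes they read as the sum of j·2^-j for the comparisons and the sum of 2^-j·min(2^j, m) for the shift, both over j = 1..m, and that the logarithm is base 2; the function names are illustrative.

```python
from math import log2


def expected_comparisons(m):
    """Sum of j * 2**-j for j = 1..m: comparisons until the first mismatch on random bits."""
    return sum(j / 2**j for j in range(1, m + 1))


def expected_shift(m):
    """Sum of 2**-j * min(2**j, m) for j = 1..m."""
    return sum(min(2**j, m) / 2**j for j in range(1, m + 1))


for m in (10, 100, 500):
    print(m, round(expected_comparisons(m), 4),
          round(expected_shift(m), 2), round(log2(m), 2))
```

The first sum has the closed form 2 - (m + 2)/2^m, so it approaches 2 very quickly. The second stays within an additive constant of log2 m: about log2 m of its terms equal 1 and the rest form a geometric tail.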

Experiments

English Bible (2.5MB)

World Factbook (1.5MB)

Text: Huffman encoded

Patterns: Random substrings of lengths 10 to 500

k = 8

Experiments: Average # comparisons between shifts

[Figure: Bit-wise and Blocked curves; x-axis: length of pattern, 100 to 500; y-axis: 1.1 to 1.5]

Experiments: Average size of shifts

[Figure: Bit-wise and Blocked curves; x-axis: length of pattern, 100 to 500; y-axis: 20 to 100]

Experiments: Average # comparisons for 1000 bits

[Figure: Bit-wise, Blocked and BDM curves; x-axis: length of pattern, 100 to 500; y-axis: 100 to 500]

Experiments: Time to locate first occurrence (ms)

[Figure: Bit-wise, Blocked, BDM and Turbo-BDM curves; x-axis: length of pattern, 100 to 500; y-axis: 50 to 300]

Summary

Blocked variant of BM

Faster than alternatives, Overhead 1-10 K

Extensions:

ASCII, words instead of characters

Thank you!
