17458 String Matching

8/14/2019 17458 String Matching


Outline

String Matching

Introduction

Naive Algorithm

Rabin-Karp Algorithm

Knuth-Morris-Pratt (KMP) Algorithm



Introduction

What is string matching?

Finding all occurrences of a pattern in a given text (or body of text)

Many applications:

While using an editor, word processor, or browser

Login name and password checking

Virus detection

Header analysis in data communications

DNA sequence analysis, Web search engines (e.g., Google), image analysis



String-Matching Problem

The text is in an array T[1..n] of length n

The pattern is in an array P[1..m] of length m

Elements of T and P are characters from a finite alphabet $\Sigma$

E.g., $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, \ldots, z\}$

Usually T and P are called strings of characters



String-Matching Problem (contd.)

We say that pattern P occurs with shift s in text T if:

a) $0 \le s \le n-m$, and

b) T[(s+1)..(s+m)] = P[1..m]

If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift

String-matching problem: finding all valid shifts for a given T and P



Example 1

index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  c
pattern P            a  b  a  a                     (s = 3)

Shift s = 3 is a valid shift ($n = 13$, $m = 4$, and $0 \le s \le n-m$ holds)



Example 2

index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  a
s = 3                a  b  a  a
s = 9                                  a  b  a  a

Pattern P = a b a a occurs with shifts s = 3 and s = 9



Naive String-Matching Algorithm

Input: Text strings T[1..n] and P[1..m]

Result: All valid shifts displayed

NAIVE-STRING-MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m
        if P[1..m] = T[(s+1)..(s+m)]
            print "pattern occurs with shift" s
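The pseudocode above can be sketched in Python roughly as follows. This is only an illustration: it assumes 0-based Python strings, so shift s compares P against `text[s:s+m]`, and the name `naive_string_matcher` is made up here. It returns the shifts in a list instead of printing them.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s, i.e. every s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try every shift s = 0, 1, ..., n-m (the range is empty if m > n).
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    # Text and pattern taken from Example 2.
    print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```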



Naive Algorithm

The naive algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there.

After each attempt, it shifts the pattern by exactly one position to the right.

Example (from left to right):

a b c a b c a
a b c a             (shift = 0)
  a b c a           (shift = 1)
    a b c a         (shift = 2)
      a b c a       (shift = 3)



Analysis: Worst-case Example

index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  a  a  a  a  a  a  a  a  a  a  a  a
pattern P   a  a  a  b

(Figure: P = a a a b shown aligned under T at several shifts.)



Worst-case Analysis

There are m comparisons for each shift in the worst case

There are n-m+1 shifts

So, the worst-case running time is $\Theta((n-m+1)m)$

In the example on the previous slide, we have $(13-4+1) \cdot 4 = 40$ comparisons in total

The naive method is inefficient because information from a shift is not used again
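To make the count concrete, here is a small sketch that tallies character comparisons for the naive method. It assumes the matcher stops comparing at the first mismatch within a shift; the function name `count_comparisons` is hypothetical.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # stop at the first mismatch for this shift
    return comparisons


if __name__ == "__main__":
    # Worst-case example: 13 a's against the pattern "aaab".
    print(count_comparisons("a" * 13, "aaab"))  # 40
```

Each of the 10 shifts matches three characters and fails on the fourth, which gives the 40 comparisons.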



Analysis

Brute-force pattern matching runs in time O(mn) in the worst case.

But most searches of ordinary text take O(m+n), which is very quick.



Brute-force Analysis (Best)

Best Case

Example 1: Found in first position of text

Text: 0000000000000000001

Pattern: 000

Cost = O(M)

Example 2: Pattern not found and always a mismatch on first character

Text: 0000000000000000001

Pattern: 11

Cost = O(N+M)



Naive Algorithm

Example (from right to left):

a b c a b c a
      a b c a       (shift = 3)
    a b c a         (shift = 2)
  a b c a           (shift = 1)
a b c a             (shift = 0)

The pattern occurs with shifts 0 and 3




Rabin-Karp Algorithm

Has a worst-case running time of O((n-m+1)m), but the average case is O(n+m)

Also works well in practice

Based on the number-theoretic notion of modular equivalence

We assume that $\Sigma = \{0, 1, 2, \ldots, 9\}$, i.e., each character is a decimal digit

In general, use radix-d, where $d = |\Sigma|$



Rabin-Karp Approach

We can view a string of k characters (digits) as a length-k decimal number

E.g., the string 31425 corresponds to the decimal number 31,425

Given a pattern P[1..m], let p denote the corresponding decimal value

Given a text T[1..n], let $t_s$ denote the decimal value of the length-m substring T[(s+1)..(s+m)] for $s = 0, 1, \ldots, n-m$



Rabin-Karp Approach (contd.)

$t_s = p$ iff T[(s+1)..(s+m)] = P[1..m]

s is a valid shift iff $t_s = p$

p can be computed in O(m) time using Horner's rule:

$p = P[m] + 10(P[m-1] + 10(P[m-2] + \cdots + 10(P[2] + 10P[1]) \cdots))$

$t_0$ can similarly be computed in O(m) time

The other values $t_1, t_2, \ldots, t_{n-m}$ can be computed in O(n-m) time, since $t_{s+1}$ can be computed from $t_s$ in constant time
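As an illustration of the O(m) computation of p, the nested expression can be evaluated from the innermost term outward, which means reading the digits left to right. The sketch below assumes the pattern is a Python string of decimal digits; `digits_to_value` is a hypothetical helper name.

```python
def digits_to_value(digits: str) -> int:
    """Evaluate a decimal digit string with Horner's rule in O(m) steps."""
    value = 0
    for ch in digits:
        # Shift the value so far one decimal place left, then add the next digit.
        value = 10 * value + int(ch)
    return value


if __name__ == "__main__":
    print(digits_to_value("31415"))  # 31415
```

The same helper applied to the first m digits of the text would give $t_0$.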



Rabin-Karp Approach (contd.)

$t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]$

E.g., if $T = \{\ldots, 3, 1, 4, 1, 5, 2, \ldots\}$, $m = 5$ and $t_s = 31415$, then $t_{s+1} = 10(31415 - 10000 \cdot 3) + 2 = 14152$

Thus we can compute p in $\Theta(m)$ time and can compute $t_0, t_1, t_2, \ldots, t_{n-m}$ in $\Theta(n-m+1)$ time

And we can find all occurrences of the pattern P[1..m] in text T[1..n] with $\Theta(m)$ preprocessing time and $\Theta(n-m+1)$ matching time

But a problem: this is assuming p and $t_s$ are small numbers

They may be too large to work with easily
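The constant-time update can be sketched as below, without any modulus yet. The code assumes a text made of decimal digits and 0-based indexing, so `text[s]` is the old high-order digit and `text[s + m]` the new low-order digit; `window_values` is a name invented for this example.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, ..., t_{n-m} for a string of decimal digits."""
    n = len(text)
    if m <= 0 or m > n:
        return []
    high = 10 ** (m - 1)       # weight of the high-order digit
    t = int(text[:m])          # t_0, computed directly
    values = [t]
    for s in range(n - m):
        # Drop digit text[s], shift left one place, append digit text[s + m].
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


if __name__ == "__main__":
    print(window_values("314152", 5))  # [31415, 14152]
```

Python integers grow without bound, so this runs for any m, but each update then costs more than constant time once the values exceed a machine word.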



Rabin-Karp Approach (contd.)

Solution: we can use modular arithmetic with a suitable modulus, q

E.g.,

$t_{s+1} \equiv (10(t_s - T[s+1]h) + T[s+m+1]) \pmod{q}$

where $h = 10^{m-1} \pmod{q}$

q is chosen as a small prime number; e.g., 13 for radix 10

Generally, if the radix is d, then dq should fit within one computer word



How values modulo 13 are computed

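Window 3 1 4 1 5 (old high-order digit 3) has value 7 mod 13; window 1 4 1 5 2 (new low-order digit 2) has value 8 mod 13

$14152 \equiv ((31415 - 3 \cdot 10000) \cdot 10 + 2) \pmod{13}$

$\equiv ((7 - 3 \cdot 3) \cdot 10 + 2) \pmod{13}$

$\equiv 8 \pmod{13}$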
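The same step can be checked with a few lines of Python. This is a sketch under the slide's assumptions (radix 10, q = 13, m = 5); the helper name `next_hash` is hypothetical. Python's `%` always returns a non-negative result for a positive modulus, so the intermediate value -18 reduces directly to 8; in languages where `%` can return a negative remainder, q would have to be added back first.

```python
def next_hash(t_s: int, old_digit: int, new_digit: int,
              h: int, q: int, d: int = 10) -> int:
    """Compute t_{s+1} mod q from t_s mod q in constant time."""
    return (d * (t_s - old_digit * h) + new_digit) % q


if __name__ == "__main__":
    q, m = 13, 5
    h = pow(10, m - 1, q)              # 10^(m-1) mod q, here 3
    t_s = 31415 % q                    # 7
    print(next_hash(t_s, 3, 2, h, q))  # 8
```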



Problem of Spurious Hits

$t_s \equiv p \pmod{q}$ does not imply that $t_s = p$

Modular equivalence does not necessarily mean that two integers are equal

A case in which $t_s \equiv p \pmod{q}$ when $t_s \ne p$ is called a spurious hit

On the other hand, if two integers are not modular equivalent, then they cannot be equal



Example

index         1  2  3  4  5  6  7  8  9 10 11 12 13 14
text          2  3  1  4  1  5  2  6  7  3  9  9  2  1
pattern       3  1  4  1  5   (7 mod 13)

shift s       0  1  2  3  4  5  6  7  8  9
t_s mod 13    1  7  8  4  5 10 11  7  9 11

s = 1 is a valid match (window 31415); s = 7 is a spurious hit (window 67399)




Rabin-Karp Algorithm

Basic structure like the naive algorithm, but uses modular arithmetic as described

For each hit, i.e., for each s where $t_s \equiv p \pmod{q}$, verify character by character whether s is a valid shift or a spurious hit

In the worst case, every shift is verified

Running time can be shown as O((n-m+1)m)

Average-case running time is O(n+m)
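Putting the pieces together, one possible sketch of the matcher for digit strings follows. It is not a definitive implementation: the function name `rabin_karp`, the defaults q = 13 and d = 10, and the choice to return spurious hits alongside valid shifts are all assumptions made here so the behaviour can be inspected.

```python
def rabin_karp(text: str, pattern: str, q: int = 13, d: int = 10):
    """Return (valid_shifts, spurious_hits) for strings of decimal digits."""
    n, m = len(text), len(pattern)
    valid, spurious = [], []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)                   # d^(m-1) mod q
    p = t = 0
    for i in range(m):                     # Horner's rule, reduced mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):
        if p == t:                         # hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                      # roll the window one position right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


if __name__ == "__main__":
    # Text and pattern from the mod 13 example.
    print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

With a larger prime q the spurious hit at s = 7 would most likely disappear, at the cost of larger intermediate values.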



Example 2

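Let T = a b c b a b and P = a b c

Take a = 97, b = 98, c = 99 (i.e., the ASCII values of the characters)

$|\Sigma| = 256$, so the radix is d = 256

Integer value of P: $p = c + 256(b + 256a) = 99 + 256(98 + 256 \cdot 97) = 6382179$

Reduced modulo 256, this gives $p \bmod 256 = 99$ (with q equal to the radix only the last character survives, which is why a prime q is preferred)

In similar fashion, we can calculate the hash value of each m-length window of the text and compare to check for a valid or spurious hit (as in previous slides)

Analysis

In the worst case, every shift is verified

Running time can be shown as O((n-m+1)m)

Average-case running time is O(n+m)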
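For the character example, the integer value of a string in radix 256 can be computed with the same Horner scheme, using character codes in place of digits. The sketch below is illustrative only: `string_value` is a hypothetical helper, q = 101 is an arbitrary prime assumed for the demonstration, and each window is hashed from scratch rather than rolled, to keep the code short.

```python
def string_value(s: str, d: int = 256) -> int:
    """Interpret s as a radix-d number built from character codes."""
    value = 0
    for ch in s:
        value = d * value + ord(ch)
    return value


if __name__ == "__main__":
    text, pattern = "abcbab", "abc"
    q = 101  # assumed prime modulus, chosen only for illustration
    m = len(pattern)
    p = string_value(pattern)
    print(p)  # 6382179, the value before any modular reduction
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        hit = string_value(window) % q == p % q
        # A hit is a valid shift only if the window really equals the pattern.
        print(s, window, hit, window == pattern)
```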



3. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).

But it shifts the pattern more intelligently than the brute force algorithm.



If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]



Example

(Figure: text T and pattern P aligned, with a mismatch at j = 5 and the new pattern position j_new = 2)



Why

With j == 5, find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is a suffix (end) of "b a a b" (P[1..j-1])

Answer: "a b"

Set j = 2 // the new j value



KMP Failure Function

KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

j = mismatch position in P[]

k = position before the mismatch (k = j-1).

The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].



Failure Function Example

P: "abaaba"
j:  012345

j      0  1  2  3  4
F(j)   0  0  1  1  2

(k == j-1)

F(k) is the size of the largest prefix.

In code, F() is represented by an array, like the table.



Why is F(4) == 2?

P: "abaaba"

F(4) means

find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

= find the size of the largest prefix of "abaab" that is also a suffix of "baab"

= find the size of "ab"

= 2
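The definition of F(k) can be turned into code almost word for word. The sketch below is a deliberately simple brute-force version that tries every candidate size; it is meant to mirror the definition, not to be the efficient construction KMP uses in practice. The name `failure_function_naive` is made up for this example.

```python
def failure_function_naive(pattern: str) -> list[int]:
    """F[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    table = []
    for k in range(len(pattern)):
        best = 0
        # Candidate sizes 1..k: a suffix of P[1..k] has at most k characters.
        for size in range(1, k + 1):
            if pattern[:size] == pattern[k - size + 1:k + 1]:
                best = size
        table.append(best)
    return table


if __name__ == "__main__":
    print(failure_function_naive("abaaba"))  # [0, 0, 1, 1, 2, 3]
```

The entry at index 4 is 2, matching the explanation of F(4) above.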



Using the Failure Function

The Knuth-Morris-Pratt algorithm modifies the brute-force algorithm.

If a mismatch occurs at P[j] (i.e. P[j] != T[i]), then

    k = j-1;
    j = F(k); // obtain the new j



Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k      0  1  2  3  4
F(k)   0  0  1  0  1

(Figure: P is slid along T in five alignments; the character comparisons are numbered 1 to 19.)



Why is F(4) == 1?

P: "abacab"

F(4) means

find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]

= find the size of the largest prefix of "abaca" that is also a suffix of "baca"

= find the size of "a"

= 1
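A compact sketch of the whole KMP search is given below, using the pattern "abacab" from this example. It is one common way to write the algorithm rather than the slides' own code: the function names are hypothetical, the failure table is built in O(m) by matching the pattern against itself, and the sample text in the demo is an assumption chosen for illustration.

```python
def failure_function(pattern: str) -> list[int]:
    """Compute F in O(m) by matching the pattern against itself."""
    fail = [0] * len(pattern)
    j = 0                          # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]        # fall back to a shorter prefix
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    j = 0                          # current position in the pattern
    for i, ch in enumerate(text):  # i only ever moves forward in the text
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]        # mismatch at P[j]: k = j-1, j = F(k)
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):      # full match ending at text[i]
            matches.append(i - j + 1)
            j = fail[j - 1]        # keep going to find later occurrences
    return matches


if __name__ == "__main__":
    print(failure_function("abacab"))                    # [0, 0, 1, 0, 1, 2]
    print(kmp_search("abacaabaccabacabaabb", "abacab"))  # [10]
```

Because the text index never decreases, the same loop works when the text arrives as a stream.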



KMP Advantages

KMP runs in optimal time: O(m+n), which is very fast

The algorithm never needs to move backwards in the input text, T

This makes the algorithm good for processing very large files that are read in from external devices or through a network stream



KMP Disadvantages

KMP doesn't work so well as the size of the alphabet increases

There is more chance of a mismatch (more possible mismatches)

Mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
