8/14/2019 17458 String Matching
Outline
String Matching
Introduction
Naïve Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, Web search engines (e.g. Google), image analysis
String-Matching Problem
The text is in an array T[1..n] of length n
The pattern is in an array P[1..m] of length m
Elements of T and P are characters from a finite alphabet Σ
E.g., Σ = {0, 1} or Σ = {a, b, …, z}
Usually T and P are called strings of characters
String-Matching Problem (contd.)
We say that pattern P occurs with shift s in text T if:
a) 0 ≤ s ≤ n-m and
b) T[(s+1)..(s+m)] = P[1..m]
If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift
String-matching problem: finding all valid shifts for a given T and P
Example 1
text T (positions 1..13): a b c a b a a b c a b a c
pattern P (positions 1..4): a b a a
Shift s = 3 is a valid shift: T[4..7] = a b a a = P[1..4]
(n = 13, m = 4 and 0 ≤ s ≤ n-m holds)
Example 2
text T (positions 1..13): a b c a b a a b c a b a a
pattern P (positions 1..4): a b a a
s = 3: T[4..7] = a b a a
s = 9: T[10..13] = a b a a
Naïve String-Matching Algorithm
Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
      print "pattern occurs with shift" s
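The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name naive_string_matcher is hypothetical, and it uses 0-based string indexing, so a shift s compares the slice text[s:s+m] with the pattern.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):        # s = 0 .. n-m
        # text[s:s+m] is T[(s+1)..(s+m)] in the slides' 1-based notation
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```

The two calls use the texts from Example 1 and Example 2.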
Naïve Algorithm
The naïve algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there.
After each attempt, it shifts the pattern by exactly one position to the right.
Example (from left to right):
a b c a b c a
a b c a (shift = 0)
a b c a (shift = 1)
a b c a (shift = 2)
a b c a (shift = 3)
Analysis: Worst-case Example
text T (positions 1..13): a a a a a a a a a a a a a
pattern P (positions 1..4): a a a b
At every shift the first three characters match and the mismatch is found only at the fourth.
Worst-case Analysis
There are m comparisons for each shift in the worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
In the example on the previous slide, we have (13-4+1)·4 = 40 comparisons in total
The naïve method is inefficient because information from a shift is not used again
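The comparison count can be checked with a small sketch. The helper count_comparisons is a hypothetical name introduced here for illustration; it compares character by character and stops at the first mismatch within each shift.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: move on to the next shift
    return comparisons


print(count_comparisons("a" * 13, "aaab"))  # 40
```

For the worst-case example, each of the 10 shifts costs 4 comparisons, which gives the 40 stated above.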
Analysis
Brute force pattern matching runs in time O(mn) in the worst case.
But most searches of ordinary text take O(m+n), which is very quick.
continued
Brute-force Analysis (Best)
Best Case
Example 1: Found in first position of text
Text: 0000000000000000001
Pattern: 000
Cost = O(M)
Example 2: Pattern not found and always a mismatch on first character
Text: 0000000000000000001
Pattern: 11
Cost = O(N+M)
Naïve Algorithm
Example (from right to left):
a b c a b c a
a b c a (shift = 3)
a b c a (shift = 2)
a b c a (shift = 1)
a b c a (shift = 0)
Pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m) but average-case is O(n+m)
Also works well in practice
Based on number-theoretic notion of modular equivalence
We assume that Σ = {0, 1, 2, …, 9}, i.e., each character is a decimal digit
In general, use radix-d where d = |Σ|
Rabin-Karp Approach
We can view a string of k characters (digits) as a length-k decimal number
E.g., the string 31425 corresponds to the decimal number 31,425
Given a pattern P[1..m], let p denote the corresponding decimal value
Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)] for s = 0, 1, …, (n-m)
Rabin-Karp Approach (contd.)
t_s = p iff T[(s+1)..(s+m)] = P[1..m]
s is a valid shift iff t_s = p
p can be computed in O(m) time:
p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10 P[1]) …))
t_0 can similarly be computed in O(m) time
The other values t_1, t_2, …, t_{n-m} can be computed in O(n-m) time, since t_{s+1} can be computed from t_s in constant time
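The nested formula for p is Horner's rule: process the digits left to right, multiplying the running value by the radix each time. A minimal sketch, assuming the string contains only decimal digits (horner_value is a hypothetical helper name, not from the slides):

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a number using Horner's rule, O(m) time."""
    value = 0
    for ch in digits:
        value = radix * value + int(ch)
    return value


print(horner_value("31425"))  # 31425
```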
Rabin-Karp Approach (contd.)
t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]
E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s = 31,415, then
t_{s+1} = 10(31415 - 10000·3) + 2 = 14152
Thus we can compute p in Θ(m) time and can compute t_0, t_1, t_2, …, t_{n-m} in Θ(n-m+1) time
And we can find all occurrences of the pattern P[1..m] in text T[1..n] with Θ(m) preprocessing time and Θ(n-m+1) matching time.
But a problem: this assumes p and t_s are small numbers
They may be too large to work with easily
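The constant-time update can be sketched as below, without any modulus yet. The function window_values is a hypothetical name; it assumes a text of decimal digits with at least m characters, uses 0-based indexing (text[s] plays the role of T[s+1]), and for brevity obtains t_0 with int() rather than Horner's rule.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0 .. t_{n-m}: the decimal value of every length-m window."""
    n = len(text)
    high = 10 ** (m - 1)              # weight of the high-order digit
    t = int(text[:m])                 # t_0
    values = [t]
    for s in range(n - m):
        # drop the old high-order digit, shift left, append the new digit
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```

Python integers never overflow, so this sketch hides the problem the slide points out: in a fixed-width machine word these values quickly become too large.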
Rabin-Karp Approach (contd.)
Solution: we can use modular arithmetic with a suitable modulus, q
E.g.,
t_{s+1} ≡ (10(t_s - T[s+1] h) + T[s+m+1]) (mod q)
where h = 10^{m-1} (mod q)
q is chosen as a small prime number; e.g., 13 for radix 10
Generally, if the radix is d, then dq should fit within one computer word
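How values modulo 13 are computed
Window 3 1 4 1 5 has value 7 (mod 13); dropping the old high-order digit 3 and appending the new low-order digit 2 gives window 1 4 1 5 2 with value 8 (mod 13):
14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
≡ ((7 - 3·3)·10 + 2) (mod 13)
≡ 8 (mod 13)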
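The modular update on this slide can be checked with a short sketch. The function next_hash and its parameter names are hypothetical; the values q = 13, m = 5 and the digits 3 and 2 come from the slide. Python's % operator returns a non-negative result for a positive modulus, so no extra correction is needed for the negative intermediate value.

```python
def next_hash(t_s: int, old_digit: int, new_digit: int,
              h: int, q: int, d: int = 10) -> int:
    """Compute t_{s+1} mod q from t_s mod q in constant time."""
    return (d * (t_s - old_digit * h) + new_digit) % q


q = 13
h = pow(10, 4, q)      # 10^(m-1) mod q with m = 5, which is 3
t = 31415 % q          # hash of the current window, which is 7
print(h, t, next_hash(t, 3, 2, h, q))  # 3 7 8
```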
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p
Modular equivalence does not necessarily mean that two integers are equal
A case in which t_s ≡ p (mod q) when t_s ≠ p is called a spurious hit
On the other hand, if two integers are not modular equivalent, then they cannot be equal
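Example
text T (positions 1..14): 2 3 1 4 1 5 2 6 7 3 9 9 2 1
pattern P: 3 1 4 1 5, with p mod 13 = 7
t_s mod 13 for s = 0..9: 1 7 8 4 5 10 11 7 9 11
s = 1: valid match (T[2..6] = 3 1 4 1 5)
s = 7: spurious hit (T[8..12] = 6 7 3 9 9, and 67399 ≡ 7 (mod 13))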
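The window values in this example can be reproduced with a short sketch. The helper window_hashes is a hypothetical name; the text, pattern and modulus 13 are the ones from the slide, and indexing is 0-based.

```python
def window_hashes(text: str, m: int, q: int = 13, d: int = 10) -> list[int]:
    """Return t_s mod q for every length-m window of a digit string."""
    h = pow(d, m - 1, q)
    t = 0
    for ch in text[:m]:                      # Horner's rule, reduced mod q
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(len(text) - m):           # rolling update
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


text, pattern = "23141526739921", "31415"
p = int(pattern) % 13
hashes = window_hashes(text, len(pattern))
print(hashes)  # [1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
for s, t in enumerate(hashes):
    if t == p:
        is_match = text[s:s + len(pattern)] == pattern
        print(s, "valid match" if is_match else "spurious hit")
```

The loop reports shift 1 as a valid match and shift 7 as a spurious hit.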
Rabin-Karp Algorithm
Basic structure like the naïve algorithm, but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
Running time can be shown as O((n-m+1)m)
Average-case running time is O(n+m)
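Putting the pieces together, a Rabin-Karp matcher might look like the sketch below. The function name rabin_karp and the defaults d = 256 and q = 101 are assumptions chosen for illustration, not values from the slides; characters are mapped to numbers with ord(), and every hash hit is verified character by character to rule out spurious hits.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all valid shifts of pattern in text (0-based)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                     # d^(m-1) mod q
    p = t = 0
    for i in range(m):                       # preprocessing: p and t_0
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # a hash hit is only a candidate; verify to exclude spurious hits
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                        # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcbab", "abc"))           # [0]
print(rabin_karp("abcabaabcabaa", "abaa"))   # [3, 9]
```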
Example 2
Let T = a b c b a b and P = a b c
Take a = 97, b = 98, c = 99 (i.e. ASCII value of characters).
Radix d = |Σ| = 256.
Integer value of P: p = c + 256(b + 256a) = 99 + 256(98 + 256·97) = 6,382,179
Hash value of P with modulus 256: p % 256 = 99
In similar fashion, we can calculate the hash value of each m-length text window and compare to check for a valid / spurious hit (as in previous slides).
Analysis
In the worst case, every shift is verified
Running time can be shown as O((n-m+1)m)
Average-case running time is O(n+m)
3. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
But it shifts the pattern more intelligently than the brute force algorithm.
continued
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]
Example
(Figure: text T aligned with pattern P; mismatch at j = 5, new value j_new = 2)
Why
j == 5 (mismatch position)
Find largest prefix (start) of: "a b a a b" (P[0..j-1])
which is suffix (end) of: "b a a b" (P[1..j-1])
Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
Failure Function Example
P:    a b a a b a
j:    0 1 2 3 4 5
Table:
j:    0 1 2 3 4
F(j): 0 0 1 1 2
F(k) is the size of the largest prefix (k == j-1).
In code, F() is represented by an array, like the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means
find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab"
= find the size of "ab"
= 2
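One common way to fill the failure array in O(m) time is sketched below. The function name failure_function is hypothetical; the construction reuses already computed entries to fall back when a character does not extend the current prefix, which is a standard technique rather than something the slides spell out.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    m = len(pattern)
    fail = [0] * m
    j = 0                                  # length of the current matched prefix
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]                # fall back to a shorter prefix
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
print(failure_function("abacab"))  # [0, 0, 1, 0, 1, 2]
```

For "abaaba" the entry at index 4 is 2, matching the reasoning above.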
Using the Failure Function
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.
If a mismatch occurs at P[j] (i.e. P[j] != T[i]), then
k = j-1;
j = F(k); // obtain the new j
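A complete matcher built on this rule might look like the following sketch. The name kmp_search and the sample text are illustrative assumptions; the slide's rule covers a mismatch with j > 0, and the sketch additionally assumes that on a mismatch at j == 0 the text index i simply advances.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Build the failure array F for the pattern.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text; i never moves backwards.
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                     # full match ending at i-1
                matches.append(i - m)
                j = fail[m - 1]
        elif j > 0:
            j = fail[j - 1]                # k = j-1; j = F(k)
        else:
            i += 1                         # mismatch at j == 0: advance the text
    return matches


print(kmp_search("abacaabaccabacabaabb", "abacab"))  # [10]
```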
Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b
k:    0 1 2 3 4
F(k): 0 0 1 0 1
(Figure: P slid along T at successive alignments, with the character comparisons numbered 1 to 19)
Why is F(4) == 1?
P: "abacab"
F(4) means
find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1
KMP Advantages
KMP runs in optimal time: O(m+n)
  very fast
The algorithm never needs to move backwards in the input text, T
  this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
KMP Disadvantages
KMP doesn't work so well as the size of the alphabet increases
  more chance of a mismatch (more possible mismatches)
  mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later