tobias-russell
Contest Algorithms, January 2016
Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp
13. String Searching
Contest Algorithms: 13. String Srch
Definition: given a text string T and a search string (pattern) P, find P inside T.
T: "the rain in spain stays mainly on the plain"
P: "n th"
Applications: text editors, Web search engines (e.g. Google), image analysis
1. What is String Searching?
Assume S is a string of size m.
A substring S[i .. j] of S is the string fragment between indexes i and j.
A prefix of S is a substring S[0 .. i].
A suffix of S is a substring S[i .. m-1].
i is any index between 0 and m-1
String Concepts
Examples (S = "andrew", indexes 0 to 5):
Substring S[1..3] == "ndr"
All possible prefixes of S ("start of S"): "andrew", "andre", "andr", "and", "an", "a"
All possible suffixes of S ("end of S"): "andrew", "ndrew", "drew", "rew", "ew", "w"
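The definitions above can be tried out directly. The following is a minimal sketch, assuming Java's `String.substring(begin, end)` (whose end index is exclusive, unlike the inclusive S[i .. j] notation used here); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: lists the prefixes and suffixes of a string.
class StringConcepts {
    // All prefixes S[0 .. i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(s.substring(0, i + 1));   // end index is exclusive
        return out;
    }

    // All suffixes S[i .. m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4));   // S[1..3]
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```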
Check each position in the text T to see if the pattern P starts in that position
2. The Brute Force Algorithm
T: a n d r e w
P: r e w (aligned at position 0, then position 1, ...)
P moves 1 char at a time through T.
Code (see BruteSearch.java)

public static int brute(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  int j;
  for (int i = 0; i <= (n-m); i++) {   // try each start position in the text
    j = 0;
    while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;   // no match
} // end of brute()
Easy to code.
No preprocessing needs to be done on the pattern.
Usually takes O(n+m) steps, which is not so bad (n = length of text; m = length of pattern).
Worst case is O(nm), e.g. when searching for aaab in aaaaaaaaaaaaaaaaaaaaaaaab
Properties of Brute-force Search
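To make the O(nm) worst case concrete, here is a small sketch that counts character comparisons. It follows the same loop structure as brute(), but the counter, the class name, and the text length (24 a's followed by b, chosen for illustration) are assumptions, not part of BruteSearch.java.

```java
// Hypothetical instrumented version of the brute-force search.
class BruteCount {
    static long comparisons;   // character comparisons made by the last search

    static int brute(String text, String pattern) {
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) break;
                j++;
            }
            if (j == m) return i;   // match at i
        }
        return -1;                  // no match
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < 24; k++) sb.append('a');
        sb.append('b');
        int pos = brute(sb.toString(), "aaab");
        System.out.println("match at " + pos + " after " + comparisons + " comparisons");
    }
}
```

Every one of the 22 start positions costs m = 4 comparisons here, so the total is (n-m+1)*m; on ordinary text most positions fail on the first character instead.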
The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.
shifts can be larger than a single character
3. The KMP Algorithm
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]
Example: a mismatch occurs at j = 5 (text position i); the new value is jnew = 2.
Why (j == 5):
Find the largest prefix (start) of "a b a a b" ( P[0 .. j-1] )
which is a suffix (end) of "b a a b" ( P[1 .. j-1] )
Answer: "a b". Set j = 2 // the new j value
KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
KMP Failure Function
Failure Function Example
P: "a b a a b a"
j:  0 1 2 3 4 5

k      0  1  2  3  4
F(k)   0  0  1  1  2
(k == j-1)

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like the table.
Why is F(4) == 2? P: "abaaba"
F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab"
= find the size of "ab"
= 2
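The definition of F(k) can be checked by brute force before looking at the efficient version. This is only a sketch that applies the definition literally (it is much slower than a linear-time computation); the class and method names are hypothetical.

```java
// Hypothetical checker: computes F(k) straight from its definition.
class NaiveFail {
    // F(k): size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int failure(String p, int k) {
        for (int len = k; len >= 1; len--) {                    // largest candidate first
            String prefix = p.substring(0, len);                // P[0 .. len-1]
            String suffix = p.substring(k - len + 1, k + 1);    // last len chars of P[0 .. k]
            if (prefix.equals(suffix)) return len;
        }
        return 0;   // no non-empty prefix qualifies
    }

    public static void main(String[] args) {
        String p = "abaaba";
        for (int k = 0; k < p.length(); k++)
            System.out.println("F(" + k + ") = " + failure(p, k));
    }
}
```

Because len never exceeds k, the suffix always starts at index 1 or later, which is what "suffix of P[1..k]" requires.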
Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.
if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then k = j-1; j = F(k); // obtain the new j
Using the Failure Function
Code (see KmpSearch.java)
Return index where pattern starts, or -1

int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  if (m == 0)
    return 0;   // guard added: computeFail() assumes a non-empty pattern

  int fail[] = computeFail(pattern);

  int i = 0;   // current position in the text
  int j = 0;   // current position in the pattern
  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];   // shift the pattern using the failure function
    else
      i++;
  }
  return -1;   // no match
} // end of kmpMatch()
Similar code to kmpMatch():

int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;

  int m = pattern.length();
  int j = 0;
  int i = 1;
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {            // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
} // end of computeFail()
Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k      0  1  2  3  4
F(k)   0  0  1  0  1

(Figure: P is shown at five successive alignments against T; the character comparisons are numbered 1 to 19.)
Why is F(4) == 1? P: "abacab"
F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1
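The example can be replayed in code. The sketch below follows the structure of kmpMatch() and computeFail(), with an added comparison counter; the counter, the empty-pattern guard, the class name, and the assumption that the text is the 20-character string "abacaabaccabacabaabb" are all illustrative and not taken from KmpSearch.java.

```java
// Hypothetical instrumented KMP: counts text/pattern comparisons during matching.
class KmpTrace {
    static int comparisons;   // comparisons made by the last kmpMatch() call

    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];                            // fall back to a shorter prefix
            } else {
                fail[i] = 0;                                // no match
                i++;
            }
        }
        return fail;
    }

    static int kmpMatch(String text, String pattern) {
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1) return i - m + 1;           // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return -1;                                          // no match
    }

    public static void main(String[] args) {
        int pos = kmpMatch("abacaabaccabacabaabb", "abacab");
        System.out.println("match at " + pos + ", comparisons = " + comparisons);
    }
}
```

Note that i never moves backwards: after a mismatch only j changes, which is why the matching phase is linear in the text length.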
Time to find match is only O(n) with O(m) preprocessing time
n = length of text; m = length of the pattern
Can be modified to search for multiple patterns in a single search.
Properties of KMP
KMP doesn’t work so well as the size of the alphabet increases:
more chance of a mismatch (more possible mismatches)
mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
KMP Disadvantage
The basic algorithm doesn't take into account the letter in the text that caused the mismatch.
KMP Extensions
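(Figure: pattern P = "a a a b b a" aligned against a text T that has the character x at the mismatch position, shown before and after a shift.)
Basic KMP does not do this.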
String search is based on a hash function applied to the pattern and substrings in the text
Look for a match by comparing the hash values, not substrings.
5. The Rabin-Karp Algorithm
long hash(String s)
{
  long h = 0;
  for (int j = 0; j < s.length(); j++)
    h = (R * h + s.charAt(j)) % Q;   // % acts as mod
  return h;
}
R == radix; often 10 for numeric data; 128 for ASCII, etc.
Q == a large prime number; e.g. 997
Typical hash function
hash("26535") == 613
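The value hash("26535") == 613 can be reproduced with R = 10 and Q = 997. The sketch below assumes each character is a decimal digit and uses its numeric value (0 to 9) rather than its character code; that assumption, and the class name, are mine.

```java
// Hypothetical digit-string version of the hash function (R = 10, Q = 997).
class DigitHash {
    static final int R = 10;     // radix for numeric data
    static final int Q = 997;    // prime modulus from the example

    // Horner's rule: treat the string as a base-R number, reduced mod Q at each step.
    static long hash(String digits) {
        long h = 0;
        for (int j = 0; j < digits.length(); j++)
            h = (R * h + (digits.charAt(j) - '0')) % Q;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535"));   // same value as 26535 % 997
    }
}
```

Reducing mod Q after every step keeps the intermediate value small, so the result never overflows even for long strings.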
Hash Function explained
T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...
P: p0 p1 p2 p3 ... pm-2 pm-1 (pattern has m chars), giving hash(P)
Examine m chars of text at a time (= Xi), giving hash(Xi)
The hash function calculates:
$$ \mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1} \right) \bmod Q $$
Example
T = "31415926535" and P = "26"
R = 10; Q = 11
hash("ab") = (a*10 + b) mod 11

T: 3 1 4 1 5 9 2 6 5 3 5
P: 2 6
hash(P) == hash("26") == 26 mod 11 = 4

Iterate through the Text
"31": 31 mod 11 = 9, not equal to 4
"14": 14 mod 11 = 3, not equal to 4
"41": 41 mod 11 = 8, not equal to 4
"15": 15 mod 11 = 4, equal to 4 -> wrong match
"59": 59 mod 11 = 4, equal to 4 -> wrong match
"92": 92 mod 11 = 4, equal to 4 -> wrong match
"26": 26 mod 11 = 4, equal to 4 -> correct match
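The walk-through above can be reproduced with a few lines of code. This is a sketch under the example's assumptions (R = 10, Q = 11, digit characters used as values 0 to 9); it recomputes each window hash from scratch for clarity, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: finds hash matches that are not real matches (Q = 11 is deliberately tiny).
class SpuriousHits {
    static final int R = 10, Q = 11;

    static int hash(String digits) {
        int h = 0;
        for (int j = 0; j < digits.length(); j++)
            h = (R * h + (digits.charAt(j) - '0')) % Q;
        return h;
    }

    // Offsets where the window hash equals the pattern hash but the strings differ.
    static List<Integer> wrongMatches(String text, String pat) {
        List<Integer> out = new ArrayList<>();
        int m = pat.length(), target = hash(pat);
        for (int i = 0; i + m <= text.length(); i++) {
            String window = text.substring(i, i + m);
            if (hash(window) == target && !window.equals(pat))
                out.add(i);
        }
        return out;
    }

    // First offset where both the hash and the characters agree, or -1.
    static int firstTrueMatch(String text, String pat) {
        int m = pat.length(), target = hash(pat);
        for (int i = 0; i + m <= text.length(); i++) {
            String window = text.substring(i, i + m);
            if (hash(window) == target && window.equals(pat))
                return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("wrong matches at " + wrongMatches("31415926535", "26"));
        System.out.println("true match at " + firstTrueMatch("31415926535", "26"));
    }
}
```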
The hash() function uses modulo Q, so the range of results is 0 to Q-1.
If Q is small then it is likely that two different strings will hash to the same result
probability is 1/Q
Solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q = 2^32 - 1, about 4.3 billion).
Also double-check the match using string operations
Why Wrong Matches?
This is an example of a Monte Carlo algorithm:
it's fast but may output an incorrect answer with a small probability (1/Q)
The "double-checking" approach is known as a Las Vegas algorithm:
it can be slow
Gambling Names
After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.
It is possible to calculate the next hash (e.g. hash(Xi+1)) based on the current hash value (hash(Xi))
much faster (O(m) --> O(1) running time)
less memory needed
Speeding up hash Calculation
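$$ \mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1} \right) \bmod Q $$
$$ \mathrm{hash}(X_{i+1}) = \left( t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \dots + t_{m-1} R + t_m \right) \bmod Q $$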
Connection between hash()s
T: t0 t1 t2 t3 ... tm-2 tm-1 tm tm+1 ...
Xi covers t0 .. tm-1; Xi+1 covers t1 .. tm
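Therefore:
$$
\begin{aligned}
\mathrm{hash}(X_{i+1}) &= \Big( \big( (\mathrm{hash}(X_i) - t_0 R^{m-1}) \bmod Q \big) \cdot R + t_m \bmod Q \Big) \bmod Q \\
&= \Big( \big( (\mathrm{hash}(X_i) + (t_0 Q - t_0 R^{m-1})) \bmod Q \big) \cdot R + t_m \bmod Q \Big) \bmod Q \\
&= \Big( \big( \mathrm{hash}(X_i) + t_0 (Q - (R^{m-1} \bmod Q)) \big) \cdot R + t_m \Big) \bmod Q
\end{aligned}
$$
t0 is the old front value; tm is the new end value.
The Q term is included so that the mod value is positive.
R^(m-1) mod Q is a constant, which can be pre-calculated.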
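As a worked check of the update rule, take the first two 5-digit windows of the text "3141592653589793" with m = 5, R = 10, Q = 997, assuming digits are used as character values. The numbers below are plain arithmetic on those assumptions.

```latex
\begin{aligned}
\mathrm{hash}(X_0) &= 31415 \bmod 997 = 508, \qquad R^{m-1} \bmod Q = 10^4 \bmod 997 = 30 \\
\mathrm{hash}(X_1) &= \big( (508 + 3\,(997 - 30)) \cdot 10 + 9 \big) \bmod 997 \\
                   &= (3409 \cdot 10 + 9) \bmod 997 = 34099 \bmod 997 = 201 \\
\text{check:}\quad 14159 \bmod 997 &= 201
\end{aligned}
```

Here t0 = 3 is the digit leaving the window and tm = 9 is the digit entering it.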
Using:
Modulo Properties
We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.
Creating the Hash
Hash of the Pattern
P: "26535" R = 10, Q = 997
(Figure: step-by-step computation of the hash value for the pattern.)
Hashing the Text Substrings
T: "3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3" M = 5, R = 10, Q = 997
In the code, RM = R^(M-1) mod Q
(Figure: the hash values for the M-char substrings.)
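The hash values for the M-char substrings can be regenerated with a rolling update. This sketch assumes the example's parameters (M = 5, R = 10, Q = 997) and treats each digit character as its numeric value; the class and helper names are illustrative only.

```java
// Hypothetical demo: rolling hashes of every M-char window of a digit string.
class RollingHashDemo {
    static final int R = 10;      // radix for decimal digits
    static final long Q = 997;    // small prime from the example

    static int digit(String s, int i) { return s.charAt(i) - '0'; }

    // Direct (non-rolling) hash of s[start .. start+m-1].
    static long hash(String s, int start, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + digit(s, start + j)) % Q;
        return h;
    }

    // out[k] is the hash of the window starting at offset k.
    static long[] rollingHashes(String text, int m) {
        int n = text.length();
        if (m <= 0 || n < m) return new long[0];
        long rm = 1;                                  // R^(m-1) mod Q
        for (int i = 1; i <= m - 1; i++) rm = (R * rm) % Q;
        long[] out = new long[n - m + 1];
        out[0] = hash(text, 0, m);
        for (int i = m; i < n; i++) {
            long h = out[i - m];
            h = (h + Q - rm * digit(text, i - m) % Q) % Q;   // remove leading digit
            h = (h * R + digit(text, i)) % Q;                // add trailing digit
            out[i - m + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "3141592653589793";
        long target = hash("26535", 0, 5);
        long[] hs = rollingHashes(text, 5);
        for (int k = 0; k < hs.length; k++)
            System.out.println(text.substring(k, k + 5) + " -> " + hs[k]
                               + (hs[k] == target ? "  (hash match)" : ""));
    }
}
```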
Code (see RabinKarp.java)

public static void main(String[] args)
{
  if (args.length != 2) {
    System.out.println("Usage: java RabinKarp <text> <pattern>");
    return;
  }

  RabinKarp searcher = new RabinKarp(args[1]);
  int pos = searcher.search(args[0]);
  showPos(args[0], args[1], pos);
} // end of main()
import java.math.BigInteger;
import java.util.Random;

public class RabinKarp
{
  private static final int R = 256;   // radix

  private String pat;      // the pattern; needs to be global for LV checking
  private long patHash;    // pattern hash value
  private int M;           // pattern length
  private long Q;          // a large prime, small enough to avoid long overflow
  private long RM;         // == R^(M-1) % Q

  public RabinKarp(String pat)
  {
    this.pat = pat;   // save pattern (needed only for Las Vegas)
    M = pat.length();
    Q = longRandomPrime();

    // precompute R^(M-1) % Q for use in removing leading digit
    RM = 1;
    for (int i = 1; i <= M - 1; i++)
      RM = (R * RM) % Q;
    patHash = hash(pat, M);
  } // end of RabinKarp()

  private static long longRandomPrime()
  // a random 31-bit probable prime
  {
    BigInteger prime = new BigInteger(31, 20, new Random());
    return prime.longValue();
  }

  private long hash(String key, int M)
  // Compute hash for key[0..M-1].
  {
    long h = 0;
    for (int j = 0; j < M; j++)
      h = (R * h + key.charAt(j)) % Q;
    return h;
  } // end of hash()

  public int search(String txt)
  {
    int N = txt.length();
    if (N < M)
      return -1;
    long txtHash = hash(txt, M);

    // hash match found at offset 0, so double-check
    if ((patHash == txtHash) && check(txt, 0))
      return 0;

    // iterate through the text
    for (int i = M; i < N; i++) {
      // Calculate new hash by removing leading digit, add trailing digit
      txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
      txtHash = (txtHash * R + txt.charAt(i)) % Q;

      // found a hash match, so double-check
      int offset = i - M + 1;
      if ((patHash == txtHash) && check(txt, offset))
        return offset;
    }
    return -1;   // no match found
  } // end of search()

  private boolean check(String txt, int i)
  // Las Vegas version: does pat[] match txt[i..i+M-1] ?
  {
    for (int j = 0; j < M; j++)
      if (pat.charAt(j) != txt.charAt(i + j))
        return false;
    return true;
  } // end of check()

} // end of RabinKarp class
Rabin-Karp has a poor worst-case running time (O(nm)), and so KMP is probably better for string searching.
Rabin-Karp's hashing technique allows the search algorithm to be used on other things than text
e.g. image, audio, video search
Rabin-Karp can be easily modified to do fast multiple pattern search.
check whether the hash of a string in the text belongs to a set of hash values of patterns
Properties of Rabin-Karp
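One way the multiple-pattern idea might look in code is sketched below. It assumes all patterns have the same length m, uses a fixed prime Q = 1,000,000,007 for reproducibility (an assumption; a random prime could be used instead), and confirms every hash hit with a string comparison. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical multi-pattern Rabin-Karp: all patterns must share one length.
class MultiRabinKarp {
    static final int R = 256;                 // radix
    static final long Q = 1_000_000_007L;     // fixed prime, chosen for illustration

    static long hash(String s, int m) {       // hash of s[0 .. m-1]
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + s.charAt(j)) % Q;
        return h;
    }

    // Returns the first offset in txt where any pattern occurs, or -1.
    static int search(String txt, String... patterns) {
        if (patterns.length == 0) return -1;
        int m = patterns[0].length();
        Set<String> pats = new HashSet<>(Arrays.asList(patterns));
        Set<Long> hashes = new HashSet<>();
        for (String p : patterns) {
            if (p.length() != m)
                throw new IllegalArgumentException("patterns must have equal length");
            hashes.add(hash(p, m));
        }
        int n = txt.length();
        if (m == 0 || n < m) return -1;

        long rm = 1;                           // R^(m-1) mod Q
        for (int i = 1; i <= m - 1; i++) rm = (R * rm) % Q;

        long h = hash(txt, m);
        for (int off = 0; ; off++) {
            // a set lookup replaces the single patHash comparison; then double-check
            if (hashes.contains(h) && pats.contains(txt.substring(off, off + m)))
                return off;
            if (off + m >= n) return -1;       // no more windows
            h = (h + Q - rm * txt.charAt(off) % Q) % Q;   // remove leading char
            h = (h * R + txt.charAt(off + m)) % Q;        // add trailing char
        }
    }

    public static void main(String[] args) {
        String t = "the rain in spain stays mainly on the plain";
        System.out.println(search(t, "spain", "plain"));
    }
}
```

The per-window cost stays O(1) on average regardless of how many patterns are in the set, which is the attraction of this variant.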
Algorithm            Preprocessing time     Matching time (average, worst)
Brute force          0 (no preprocessing)   O(n+m), O(nm)
Knuth-Morris-Pratt   O(m)                   O(n)
Rabin-Karp           O(m)                   O(n+m), O(nm)
n = text len; m = pat len.
6. Summary
35 algorithms with C code at http://www-igm.univ-mlv.fr/~lecroq/string/