HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity, Christian Schindelhauer

Search Algorithms
Winter Semester 2004/2005
25 Oct 2004, 3rd Lecture

Christian Schindelhauer
Chapter I
Searching Text
18 Oct 2004
Searching Text (Overview)

The task of string matching
– Easy as pie
The naive algorithm
– How would you do it?
The Rabin-Karp algorithm
– Ingenious use of primes and number theory
The Knuth-Morris-Pratt algorithm
– Let a (finite) automaton do the job
– This is optimal
The Boyer-Moore algorithm
– Bad letters allow us to jump through the text
– This is even better than optimal (in practice)
Literature
– Cormen, Leiserson, Rivest, "Introduction to Algorithms", chapter 36, string matching, The MIT Press, 1989, pp. 853-885.
The Naive Algorithm

Naive-String-Matcher(T,P)
1. n ← length(T)
2. m ← length(P)
3. for s ← 0 to n−m do
4.   if P[1..m] = T[s+1..s+m] then
5.     return "Pattern occurs with shift" s
6.   fi
7. od

Fact:
– The naive string matcher needs worst-case running time O((n−m+1)·m)
– For n = 2m this is O(n^2)
– The naive string matcher is not optimal, since string matching can be done in time O(m + n)
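As a quick illustration, the naive matcher can be sketched in Python (not from the slides; 0-based shifts, and it returns all occurrences instead of stopping at the first):

```python
def naive_string_matcher(text, pattern):
    """Try every shift s and compare the pattern to the window starting there."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:   # O(m) comparison per shift
            shifts.append(s)
    return shifts
```

Each of the n−m+1 shifts costs up to m comparisons, which is exactly the O((n−m+1)·m) worst case stated above.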
The Rabin-Karp Algorithm

Idea: Compute
– a checksum for the pattern P and
– a checksum for each substring of T of length m

[Figure: the text "manamanapatipitipi" with a checksum under each length-4 window, and the pattern "pati" with checksum 3. One window has the same checksum as the pattern without actually matching it (a spurious hit); the window where the pattern really occurs is the valid hit.]
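A sketch of the idea in Python (the base d = 256 and modulus q = 101 are illustrative choices, not from the slides). Equal checksums are verified by a direct comparison, which filters out spurious hits:

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Report all shifts where pattern occurs, using a rolling checksum mod q."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(d, m - 1, q)                  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):                    # checksums of pattern and first window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # equal checksums may be spurious, so verify by direct comparison
        if p_hash == t_hash and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:                     # roll the checksum to the next window
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits
```

The rolling update removes the outgoing character and appends the incoming one in O(1), so all window checksums together cost O(n) instead of O(n·m).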
Finite-Automaton-Matcher

The example automaton accepts at the end of occurrences of the pattern abba.
For every pattern of length m there exists an automaton with m+1 states that solves the pattern matching problem with the following algorithm:

Finite-Automaton-Matcher(T,δ,P)
1. n ← length(T)
2. m ← length(P)
3. q ← 0
4. for i ← 1 to n do
5.   q ← δ(q,T[i])
6.   if q = m then
7.     s ← i − m
8.     return "Pattern occurs with shift" s
9.   fi
10. od
The Finite-Automaton-Matcher

A finite automaton consists of:
– Q, a finite set of states
– q0 ∈ Q, the start state
– A ⊆ Q, a set of accepting states
– Σ, the input alphabet
– δ: Q × Σ → Q, the transition function

Transition table for the pattern abba (states 0–4, state 4 accepting):

          input
state     a    b
0         1    0
1         1    2
2         1    3
3         4    0
4         1    2

Example run on the input ababbabbaa:
states 0 1 2 1 2 3 4 2 3 4 1 — the accepting state 4 is reached after positions 6 and 9, the end positions of the two occurrences of abba.
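The table above can be built mechanically: δ(q,a) is the length of the longest pattern prefix that is a suffix of (already matched prefix)+a. A Python sketch (not from the slides; the straightforward construction, not an optimized one; 0-based shifts):

```python
def build_transition_table(pattern, alphabet):
    """delta[(q, a)] = longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)          # candidate next-state length
            while k > 0 and pattern[:k] != (pattern[:q] + a)[-k:]:
                k -= 1
            delta[(q, a)] = k
    return delta

def finite_automaton_matcher(text, pattern, alphabet):
    """Run the automaton over the text, reporting every shift where state m is reached."""
    delta = build_transition_table(pattern, alphabet)
    m, q, hits = len(pattern), 0, []
    for i, c in enumerate(text):
        q = delta[(q, c)]
        if q == m:
            hits.append(i - m + 1)
    return hits
```

Once the table is built, the scan itself is a single O(n) pass: one table lookup per text character.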
Knuth-Morris-Pratt Pattern Matching

KMP-Matcher(T,P)
1. n ← length(T)
2. m ← length(P)
3. π ← Compute-Prefix-Function(P)
4. q ← 0
5. for i ← 1 to n do
6.   while q > 0 and P[q+1] ≠ T[i] do
7.     q ← π[q] od
8.   if P[q+1] = T[i] then
9.     q ← q+1 fi
10.  if q = m then
11.    print "Pattern occurs with shift" i−m
12.    q ← π[q] fi
   od
[Figure: the pattern "mama" shifted step by step through a text; after each mismatch the pattern is moved on by reusing the already matched prefix, so the text pointer never moves backwards.]

Pattern: mama
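The pseudocode above translates almost line for line into Python (a sketch, not from the slides; the pattern is 0-indexed while π is kept 1-indexed as on the slide, and shifts are 0-based):

```python
def compute_prefix_function(pattern):
    """pi[q] = max k < q such that pattern[:k] is a suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * (m + 1)                 # pi[1..m]; pi[0] unused
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]                  # fall back to the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern):
    """Report all shifts of pattern in text in O(n + m) time."""
    n, m = len(text), len(pattern)
    pi = compute_prefix_function(pattern)
    q, hits = 0, []                    # q = number of characters matched so far
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = pi[q]                  # keep going for overlapping occurrences
    return hits
```

The amortized argument for O(n): q only grows by one per text character and every inner-loop iteration strictly decreases q, so the total number of fallbacks is bounded by n.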
Boyer-Moore: The ideas!

[Figure: the pattern "piti" shifted through the text "manamanapatipitipi", comparing right to left.]
– Start comparing at the end
– What's this? There is no "a" in the search pattern: we can shift m+1 letters
– An "a" again...
– First wrong letter! Do a large shift!
– Bingo! Do another large shift!
– That's it! 10 letters compared, and ready!
Boyer-Moore-Matcher(T,P,Σ)
1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P,m,Σ)
4. γ ← Compute-Good-Suffix(P,m)
5. s ← 0
6. while s ≤ n−m do
7.   j ← m
8.   while j > 0 and P[j] = T[s+j] do
9.     j ← j−1
   od
10. if j = 0 then
11.   print "Pattern occurs with shift" s
12.   s ← s + γ[0]
    else
13.   s ← s + max(γ[j], j − λ[T[s+j]])
    fi od

– We start comparing at the right end
– Line 13: the bad-character shift j − λ[T[s+j]]
– Only valid shifts s are tried
– Success! Now do a valid shift
– Shift as far as possible, as indicated by the bad-character heuristic or the good-suffix heuristic
Boyer-Moore: Last-occurrence

[Figure: the pattern "piti" compared right to left against the text at four positions, with the resulting bad-character shifts:]
– j=4, mismatch on "a": there is no "a" in the search pattern, so we can shift by j − λ[a] = 4 − 0 = 4 letters
– j=4, mismatch on "t": "t" occurs in "piti" at the 3rd position, so shift by j − λ[t] = 4 − 3 = one step
– j=4, mismatch on "p": "p" occurs in "piti" at the first position, so shift by j − λ[p] = 4 − 1 = 3 letters
– j=2, mismatch on "a": there is no "a" in the search pattern, so we can shift by at least j − λ[a] = 2 − 0 = 2 letters
Compute-Last-Occurrence-Function(P,m,Σ)
1. for each character a ∈ Σ do
2.   λ[a] ← 0
   od
3. for j ← 1 to m do
4.   λ[P[j]] ← j
   od
5. return λ

Running time: O(|Σ| + m)

Example for P = piti: λ[a] = 0, λ[i] = 4, λ[p] = 1, λ[t] = 3
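A direct transcription into Python (a sketch, not from the slides; positions are kept 1-based as on the slide, with 0 meaning "does not occur"):

```python
def compute_last_occurrence(pattern, alphabet):
    """lam[a] = rightmost position of a in pattern (1-based), 0 if a is absent."""
    lam = {a: 0 for a in alphabet}          # default: character never occurs
    for j, a in enumerate(pattern, start=1):
        lam[a] = j                          # later occurrences overwrite earlier ones
    return lam
```

Because later positions overwrite earlier ones, a single left-to-right pass suffices, giving the O(|Σ| + m) running time.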
The Prefix Function

π[q] := max {k : k < q and Pk is a suffix of Pq}

[Figure: text and the pattern "abaabaaaa" aligned; the prefixes P8, P7, P6, P5 are slid against P7 until one fits. P4 = "abaa" is the longest prefix that is also a suffix of P7 = "abaabaa", hence π[7] = 4.]
π[q] := max {k : k < q and Pk is a suffix of Pq}

Pattern: a b a a b a a a a

q:     1  2  3  4  5  6  7  8  9
π[q]:  0  0  1  1  2  3  4  1  1

[Figure: for each q, the prefix Pπ[q] shown aligned as a suffix of Pq.]
Computing π

Compute-Prefix-Function(P)
1. m ← length(P)
2. π[1] ← 0
3. k ← 0
4. for q ← 2 to m do
5.   while k > 0 and P[k+1] ≠ P[q] do
6.     k ← π[k]
   od
7.   if P[k+1] = P[q] then
8.     k ← k+1
   fi
9.   π[q] ← k
   od

– If Pk+1 is not a suffix of Pq ... shift the pattern to the next reasonable position (given by smaller values of π)
– If the letter fits, then increment the position (otherwise k = 0)
– We have found the position such that π[q] := max {k : k < q and Pk is a suffix of Pq}
Boyer-Moore: Good Suffix - the far jump

[Figure: an 8-letter pattern slid against itself after the first mismatch at position j = 6 (the suffix P[7..8] has already matched). The questions asked while sliding:
– Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or: P5 a suffix of P8)?
– Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?]

π[q] := max {k : k < q and Pk is a suffix of Pq}

Here π[8] = 4, so the shift is m − π[8] = 8 − 4 = 4, with j = 6.
Boyer-Moore: Good Suffix - the small jump

[Figure: an 8-letter pattern, first mismatch again at position j = 6; this time a far jump would skip over an overlapping occurrence, so only a small jump is allowed. The questions asked:
– Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or: P5 a suffix of P8)?
– Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?]

f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}
π'[q] := max {k : k < q and Rev(P)k is a suffix of Rev(P)q}

Here f[6] = 8, so the shift is f[j] − j = 8 − 6 = 2, with j = 6.
Why is it the same?

π'[k] := max {j : j < k and Rev(P)j is a suffix of Rev(P)k}
f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}

[Figure: the boolean matrix of the relation "Rev(P)j is a suffix of Rev(P)k", with rows j and columns k; π' reads off the maximum along one axis and f the minimum along the other, so both describe the same entries.]
Compute-Good-Suffix-Function(P,m)
1. π ← Compute-Prefix-Function(P)
2. P' ← reverse(P)
3. π' ← Compute-Prefix-Function(P')
4. for j ← 0 to m do
5.   γ[j] ← m − π[m]          (the far jump)
   od
6. for l ← 1 to m do
7.   j ← m − π'[l]
8.   if γ[j] > l − π'[l] then   (or is it a small jump?)
9.     γ[j] ← l − π'[l]
   fi od
10. return γ

Running time: O(m)
Boyer-Moore-Matcher(T,P,Σ)
1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P,m,Σ)
4. γ ← Compute-Good-Suffix(P,m)
5. s ← 0
6. while s ≤ n−m do
7.   j ← m
8.   while j > 0 and P[j] = T[s+j] do
9.     j ← j−1
   od
10. if j = 0 then
11.   print "Pattern occurs with shift" s
12.   s ← s + γ[0]
    else
13.   s ← s + max(γ[j], j − λ[T[s+j]])
    fi od

Running time: O((n−m+1)·m) in the worst case
In practice: O(n/m + v·m + m + |Σ|) for v hits in the text
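Putting the pieces together in Python (a sketch, not from the slides; 0-based shifts, λ as a dict with default 0 for text characters absent from the pattern, and γ computed from π and π' exactly as in Compute-Good-Suffix-Function):

```python
def prefix_function(p):
    """pi[q] = max k < q with p[:k] a suffix of p[:q]; pi is 1-indexed, pi[0] unused."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi

def good_suffix(p):
    """gamma[j] = admissible shift after a mismatch at pattern position j (1-based)."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)          # the far jump
    for l in range(1, m + 1):              # the small jumps
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma

def boyer_moore(text, pattern):
    """Report all shifts of pattern in text, right-to-left comparison per window."""
    n, m = len(text), len(pattern)
    lam = {}
    for j, a in enumerate(pattern, start=1):
        lam[a] = j                         # last-occurrence function
    gamma = good_suffix(pattern)
    hits, s = [], 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            hits.append(s)
            s += gamma[0]
        else:
            # lam.get(..., 0) handles characters that never occur in the pattern
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return hits
```

The max() in the mismatch case is the line-13 rule above: take the larger of the good-suffix shift γ[j] and the bad-character shift j − λ[T[s+j]], both of which are always valid.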
Chapter II
Searching in Compressed Text
25 Oct 2004
Searching in Compressed Text (Overview)

What is Text Compression?
– Definition
– The Shannon Bound
– Huffman Codes
– The Kolmogorov Measure
Searching in Non-adaptive Codes
– KMP in Huffman Codes
Searching in Adaptive Codes
– The Lempel-Ziv Codes
– Pattern Matching in Z-Compressed Files
– Adapting Compression for Searching
What is Text Compression?

First approach:
– Given a text s ∈ Σ^n
– Find a compressed version c ∈ Σ^m such that m < n
– such that s can be derived from c
Formal:
– A compression function f : Σ* → Σ*
• is one-to-one (injective) and efficiently invertible
Fact:
– Most text is incompressible
Proof:
– There are (|Σ|^(m+1) − 1)/(|Σ| − 1) strings of length at most m
– There are |Σ|^n strings of length n
– Of these, at most (|Σ|^(m+1) − 1)/(|Σ| − 1) strings can be compressed to length at most m
– This is a fraction of at most |Σ|^(m−n+1)/(|Σ| − 1)
– E.g. for |Σ| = 256 and m = n−10 this is 8.3 × 10^-25
• which implies that only a fraction of 8.3 × 10^-25 of all files of n bytes can be compressed to a string of length n−10
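The counting bound can be checked numerically; a small Python sketch (the text length n = 100 is an arbitrary example, the fraction is essentially independent of n):

```python
sigma = 256                    # alphabet size (bytes)
n = 100                        # text length, arbitrary example
m = n - 10                     # target compressed length

# number of strings of length at most m: (sigma^(m+1) - 1) / (sigma - 1)
num_short = (sigma ** (m + 1) - 1) // (sigma - 1)

# at most num_short of the sigma^n strings of length n can map injectively
# to a string of length <= m
fraction = num_short / sigma ** n
print(fraction)                # about 8.3e-25, matching the slide
```

The pigeonhole argument is visible in the code: an injective f can send at most one long string to each short string, so the compressible fraction is bounded by num_short / sigma**n.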
Why does Text Compression work?

Texts usually use letters with different frequencies
– Relative frequencies of letters in general English plain text, from Cryptological Mathematics by Robert Edward Lewand:
• e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%
• ...
• k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%
– Special characters like $, %, # occur even less frequently
– Some character encodings are (nearly) unused, e.g. bytecode 0 of ASCII
Text obeys a lot of rules
– Words are (usually) the same (collected in dictionaries)
– Not all words can be used in combination
– Sentences are structured (grammar)
– Program code uses code words
– Digitally encoded pictures have smooth areas, where colors change gradually
– Patterns repeat
Information Theory: The Shannon Bound

C. E. Shannon, in his 1948 paper "A Mathematical Theory of Communication", derives his definition of entropy.

The entropy rate of a data source is the average number of bits per symbol needed to encode it.

Example text: ababababab
– Entropy: 1
– Encoding:
• Use 0 for a
• Use 1 for b
– Code: 0101010101
Huffman codes are a way to reach this Shannon bound (for sufficiently large texts)
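The empirical entropy of a text can be computed directly from its letter frequencies; a minimal Python sketch:

```python
import math

def entropy(text):
    """Shannon entropy in bits per symbol, from empirical letter frequencies."""
    freq = {}
    for c in text:
        freq[c] = freq.get(c, 0) + 1
    n = len(text)
    # H = -sum p * log2(p) over all letters
    return -sum((f / n) * math.log2(f / n) for f in freq.values())
```

For "ababababab" both letters have probability 1/2, so H = 1 bit per symbol, matching the one-bit-per-letter encoding above.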
Huffman Code

A Huffman code
– is adapted to each text (but not within the text)
– consists of
• a dictionary, which maps each letter of the text to a binary string, and
• the code, given as a prefix-free binary encoding
Prefix-free code
– uses strings s1, s2, ..., sm of variable length such that no string si is a prefix of sj

Example of Huffman encoding:
– Text: manamanapatipitipi

Letter   Frequency   Code
a        5           10
i        4           01
p        3           111
m        2           000
t        2           001
n        2           110

– Encoding: 000 10 110 10 000 10 110 10 111 10 001 01 111 01 001 01 111 01
Computing Huffman Codes

Compute the letter frequencies
Build root nodes labeled with the frequencies
repeat
– Build a node connecting the two least frequent unlinked nodes
– Mark the sons with 0 and 1
– The father node carries the sum of the frequencies
until one tree is left
The path to each letter carries its code

Letter   Frequency   Code
a        5           10
i        4           01
p        3           111
m        2           000
t        2           001
n        2           110

[Figure: the Huffman tree: m(2)+t(2) → 4, n(2)+p(3) → 5, i(4)+4 → 8, a(5)+5 → 10, 8+10 → 18; reading the 0/1 edge labels from the root to each leaf gives the codes above.]
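One way to sketch the construction in Python, using a heap of partial trees. Tie-breaking among equal frequencies is arbitrary, so the exact code words may differ from the table above, but the code lengths and the total encoded size (45 bits for the example text) come out the same:

```python
import heapq

def huffman_code(freq):
    """Build a prefix-free code {letter: bitstring} from letter frequencies."""
    # heap entries: (frequency, tiebreak id, {letter: code-so-far})
    heap = [(f, i, {a: ""}) for i, (a, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # mark the sons with 0 and 1 by prepending a bit to every code below them
        merged = {a: "0" + s for a, s in c1.items()}
        merged.update({a: "1" + s for a, s in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))  # father carries the sum
        count += 1
    return heap[0][2]
```

Because bits are prepended at each merge, each letter's final string is exactly its root-to-leaf path, and no code can be a prefix of another (letters sit at the leaves).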
Searching in Huffman Codes

Let u be the size of the compressed text.
Let v be the size of the pattern, Huffman-encoded according to the text dictionary.

KMP can search in Huffman codes in time O(u + v + m):
– Encoding the pattern takes O(v + m) steps
– Building the prefix function takes time O(v)
– Searching the text on the bit level takes time O(u + v)

Problems:
– This algorithm works bit-wise, not byte-wise
• Exercise: Develop a byte-wise strategy
The Downside of Huffman Codes

Example: Consider the 128-byte text
– abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
– It will be encoded using 16 bytes (and an extra byte for the dictionary) as
– 0110011001100110011001100110011001100110011001100110011001100110 0110011001100110011001100110011001100110011001100110011001100110
– This does not use the full compression possibilities of this text
– E.g. using "(abba)^32" would need only 9 bytes
The perfect code:
– A self-extracting program for a string x is a program that, started without input, produces the output x and then halts.
– So, the smallest self-extracting program is the ultimate encoding.
The Kolmogorov complexity K(x) of a string x denotes the length of a shortest self-extracting program for x.
Kolmogorov Complexity

Does the Kolmogorov complexity depend on the programming language?
– No, as long as the programming language is universal, e.g. can simulate any Turing machine

Lemma
Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then for a constant c and all strings x:
K1(x) ≤ K2(x) + c

Is the Kolmogorov complexity useful?
– No:

Theorem
K(x) is not recursive.
Proof of Lemma

Lemma
Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then for a constant c and all strings x:
K1(x) ≤ K2(x) + c

Proof
Let M1 be the self-extracting program for x with respect to the first language.
Let U be a universal program in the second language that simulates a given machine M1 of the first language.
The output of U(M1, ε) is x.
Then one can find a machine M2 of length |U| + |M1| + O(1) that has the same functionality as U(M1, ε)
– by using the S-m-n theorem.
Since U is a fixed (constant-size) machine, this proves the statement.
Proof of the Theorem

Theorem
K(x) is not recursive.

Proof
Assume K(x) is recursive.
For every string length n let xn denote the smallest string of length n such that K(xn) ≥ |xn| = n (such a string exists, since not all strings of length n are compressible).
We can enumerate xn:
– Compute for all strings x of size n the Kolmogorov complexity K(x) and output the first string x with K(x) ≥ n
Let M be the program computing xn on input n.
We can efficiently encode xn:
– Combine M with the binary encoding of n:
K(xn) ≤ log n + |M| = log n + O(1)
For large enough n this is a contradiction to K(xn) ≥ n.
Thanks for your attention
End of 3rd lecture
Next lecture: Mo 8 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 25 Oct 2004, 1.15 pm, F0.530 or We 27 Oct 2004, 1.00 pm, E2.316