HEINZ NIXDORF INSTITUTE, University of Paderborn
Algorithms and Complexity, Christian Schindelhauer

Search Algorithms
Winter Semester 2004/2005
25 Oct 2004, 3rd Lecture

Christian Schindelhauer
Chapter I
Searching Text
18 Oct 2004
Searching Text (Overview)

The task of string matching
– Easy as pie
The naive algorithm
– How would you do it?
The Rabin-Karp algorithm
– Ingenious use of primes and number theory
The Knuth-Morris-Pratt algorithm
– Let a (finite) automaton do the job
– This is optimal
The Boyer-Moore algorithm
– Bad letters allow us to jump through the text
– This is even better than optimal (in practice)
Literature
– Cormen, Leiserson, Rivest, "Introduction to Algorithms", chapter 36, string matching, The MIT Press, 1989, pp. 853-885.
The Naive Algorithm

Naive-String-Matcher(T,P)
1. n ← length(T)
2. m ← length(P)
3. for s ← 0 to n−m do
4.   if P[1..m] = T[s+1..s+m] then
5.     return "Pattern occurs with shift" s
6.   fi
7. od

Fact:
– The naive string matcher needs worst-case running time O((n−m+1)·m)
– For n = 2m this is O(n^2)
– The naive string matcher is not optimal, since string matching can be done in time O(m + n)
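As a quick illustration, the naive matcher can be sketched in Python (not from the slides; 0-based shifts, and it returns all occurrences instead of stopping at the first):

```python
def naive_string_matcher(text, pattern):
    """Try every shift s and compare the pattern to the window starting there."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:   # O(m) comparison per shift
            shifts.append(s)
    return shifts
```

Each of the n−m+1 shifts costs up to m comparisons, which is exactly the O((n−m+1)·m) worst case stated above.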
The Rabin-Karp Algorithm

Idea: Compute
– a checksum for the pattern P and
– a checksum for each substring of T of length m

[Figure: the text "manamanapatipitipi" with a checksum under each length-4 window, and the pattern "pati" with checksum 3. One window has the same checksum as the pattern without actually matching it (a spurious hit); the window where the pattern really occurs is the valid hit.]
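A sketch of the idea in Python (the base d = 256 and modulus q = 101 are illustrative choices, not from the slides). Equal checksums are verified by a direct comparison, which filters out spurious hits:

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Report all shifts where pattern occurs, using a rolling checksum mod q."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(d, m - 1, q)                  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):                    # checksums of pattern and first window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # equal checksums may be spurious, so verify by direct comparison
        if p_hash == t_hash and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:                     # roll the checksum to the next window
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits
```

The rolling update removes the outgoing character and appends the incoming one in O(1), so all window checksums together cost O(n) instead of O(n·m).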
Finite-Automaton-Matcher

The example automaton accepts at the end of occurrences of the pattern abba.
For every pattern of length m there exists an automaton with m+1 states that solves the pattern matching problem with the following algorithm:

Finite-Automaton-Matcher(T,δ,P)
1. n ← length(T)
2. m ← length(P)
3. q ← 0
4. for i ← 1 to n do
5.   q ← δ(q,T[i])
6.   if q = m then
7.     s ← i − m
8.     return "Pattern occurs with shift" s
9.   fi
10. od
The Finite-Automaton-Matcher

A finite automaton consists of:
– Q, a finite set of states
– q0 ∈ Q, the start state
– A ⊆ Q, a set of accepting states
– Σ, the input alphabet
– δ: Q × Σ → Q, the transition function

Transition table for the pattern abba (states 0–4, state 4 accepting):

          input
state     a    b
0         1    0
1         1    2
2         1    3
3         4    0
4         1    2

Example run on the input ababbabbaa:
states 0 1 2 1 2 3 4 2 3 4 1 — the accepting state 4 is reached after positions 6 and 9, the end positions of the two occurrences of abba.
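The table above can be built mechanically: δ(q,a) is the length of the longest pattern prefix that is a suffix of (already matched prefix)+a. A Python sketch (not from the slides; the straightforward construction, not an optimized one; 0-based shifts):

```python
def build_transition_table(pattern, alphabet):
    """delta[(q, a)] = longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)          # candidate next-state length
            while k > 0 and pattern[:k] != (pattern[:q] + a)[-k:]:
                k -= 1
            delta[(q, a)] = k
    return delta

def finite_automaton_matcher(text, pattern, alphabet):
    """Run the automaton over the text, reporting every shift where state m is reached."""
    delta = build_transition_table(pattern, alphabet)
    m, q, hits = len(pattern), 0, []
    for i, c in enumerate(text):
        q = delta[(q, c)]
        if q == m:
            hits.append(i - m + 1)
    return hits
```

Once the table is built, the scan itself is a single O(n) pass: one table lookup per text character.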
Knuth-Morris-Pratt Pattern Matching

KMP-Matcher(T,P)
1. n ← length(T)
2. m ← length(P)
3. π ← Compute-Prefix-Function(P)
4. q ← 0
5. for i ← 1 to n do
6.   while q > 0 and P[q+1] ≠ T[i] do
7.     q ← π[q] od
8.   if P[q+1] = T[i] then
9.     q ← q+1 fi
10.  if q = m then
11.    print "Pattern occurs with shift" i−m
12.    q ← π[q] fi
   od
[Figure: the pattern "mama" shifted step by step through a text; after each mismatch the pattern is moved on by reusing the already matched prefix, so the text pointer never moves backwards.]

Pattern: mama
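The pseudocode above translates almost line for line into Python (a sketch, not from the slides; the pattern is 0-indexed while π is kept 1-indexed as on the slide, and shifts are 0-based):

```python
def compute_prefix_function(pattern):
    """pi[q] = max k < q such that pattern[:k] is a suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * (m + 1)                 # pi[1..m]; pi[0] unused
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]                  # fall back to the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern):
    """Report all shifts of pattern in text in O(n + m) time."""
    n, m = len(text), len(pattern)
    pi = compute_prefix_function(pattern)
    q, hits = 0, []                    # q = number of characters matched so far
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = pi[q]                  # keep going for overlapping occurrences
    return hits
```

The amortized argument for O(n): q only grows by one per text character and every inner-loop iteration strictly decreases q, so the total number of fallbacks is bounded by n.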
Boyer-Moore: The ideas!

[Figure: the pattern "piti" shifted through the text "manamanapatipitipi", comparing right to left.]
– Start comparing at the end
– What's this? There is no "a" in the search pattern: we can shift m+1 letters
– An "a" again...
– First wrong letter! Do a large shift!
– Bingo! Do another large shift!
– That's it! 10 letters compared, and ready!
Boyer-Moore-Matcher(T,P,Σ)
1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P,m,Σ)
4. γ ← Compute-Good-Suffix(P,m)
5. s ← 0
6. while s ≤ n−m do
7.   j ← m
8.   while j > 0 and P[j] = T[s+j] do
9.     j ← j−1
   od
10. if j = 0 then
11.   print "Pattern occurs with shift" s
12.   s ← s + γ[0]
    else
13.   s ← s + max(γ[j], j − λ[T[s+j]])
    fi od

– We start comparing at the right end
– Line 13: the bad-character shift j − λ[T[s+j]]
– Only valid shifts s are tried
– Success! Now do a valid shift
– Shift as far as possible, as indicated by the bad-character heuristic or the good-suffix heuristic
Boyer-Moore: Last-occurrence

[Figure: the pattern "piti" compared right to left against the text at four positions, with the resulting bad-character shifts:]
– j=4, mismatch on "a": there is no "a" in the search pattern, so we can shift by j − λ[a] = 4 − 0 = 4 letters
– j=4, mismatch on "t": "t" occurs in "piti" at the 3rd position, so shift by j − λ[t] = 4 − 3 = one step
– j=4, mismatch on "p": "p" occurs in "piti" at the first position, so shift by j − λ[p] = 4 − 1 = 3 letters
– j=2, mismatch on "a": there is no "a" in the search pattern, so we can shift by at least j − λ[a] = 2 − 0 = 2 letters
Compute-Last-Occurrence-Function(P,m,Σ)
1. for each character a ∈ Σ do
2.   λ[a] ← 0
   od
3. for j ← 1 to m do
4.   λ[P[j]] ← j
   od
5. return λ

Running time: O(|Σ| + m)

Example for P = piti: λ[a] = 0, λ[i] = 4, λ[p] = 1, λ[t] = 3
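A direct transcription into Python (a sketch, not from the slides; positions are kept 1-based as on the slide, with 0 meaning "does not occur"):

```python
def compute_last_occurrence(pattern, alphabet):
    """lam[a] = rightmost position of a in pattern (1-based), 0 if a is absent."""
    lam = {a: 0 for a in alphabet}          # default: character never occurs
    for j, a in enumerate(pattern, start=1):
        lam[a] = j                          # later occurrences overwrite earlier ones
    return lam
```

Because later positions overwrite earlier ones, a single left-to-right pass suffices, giving the O(|Σ| + m) running time.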
The Prefix Function

π[q] := max {k : k < q and Pk is a suffix of Pq}

[Figure: text and the pattern "abaabaaaa" aligned; the prefixes P8, P7, P6, P5 are slid against P7 until one fits. P4 = "abaa" is the longest prefix that is also a suffix of P7 = "abaabaa", hence π[7] = 4.]
π[q] := max {k : k < q and Pk is a suffix of Pq}

Pattern: a b a a b a a a a

q:     1  2  3  4  5  6  7  8  9
π[q]:  0  0  1  1  2  3  4  1  1

[Figure: for each q, the prefix Pπ[q] shown aligned as a suffix of Pq.]
Computing π

Compute-Prefix-Function(P)
1. m ← length(P)
2. π[1] ← 0
3. k ← 0
4. for q ← 2 to m do
5.   while k > 0 and P[k+1] ≠ P[q] do
6.     k ← π[k]
   od
7.   if P[k+1] = P[q] then
8.     k ← k+1
   fi
9.   π[q] ← k
   od

– If Pk+1 is not a suffix of Pq ... shift the pattern to the next reasonable position (given by smaller values of π)
– If the letter fits, then increment the position (otherwise k = 0)
– We have found the position such that π[q] := max {k : k < q and Pk is a suffix of Pq}
Boyer-Moore: Good Suffix - the far jump

[Figure: an 8-letter pattern slid against itself after the first mismatch at position j = 6 (the suffix P[7..8] has already matched). The questions asked while sliding:
– Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or: P5 a suffix of P8)?
– Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?]

π[q] := max {k : k < q and Pk is a suffix of Pq}

Here π[8] = 4, so the shift is m − π[8] = 8 − 4 = 4, with j = 6.
Boyer-Moore: Good Suffix - the small jump

[Figure: an 8-letter pattern, first mismatch again at position j = 6; this time a far jump would skip over an overlapping occurrence, so only a small jump is allowed. The questions asked:
– Is Rev(P)5 a suffix of Rev(P)6? Of Rev(P)7? Of Rev(P)8 (or: P5 a suffix of P8)?
– Is P4 a suffix of P8? Is P3? Is P2? Is P1? Is P0?]

f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}
π'[q] := max {k : k < q and Rev(P)k is a suffix of Rev(P)q}

Here f[6] = 8, so the shift is f[j] − j = 8 − 6 = 2, with j = 6.
Why is it the same?

π'[k] := max {j : j < k and Rev(P)j is a suffix of Rev(P)k}
f[j] := min {k : k > j and Rev(P)j is a suffix of Rev(P)k}

[Figure: the boolean matrix of the relation "Rev(P)j is a suffix of Rev(P)k", with rows j and columns k; π' reads off the maximum along one axis and f the minimum along the other, so both describe the same entries.]
Compute-Good-Suffix-Function(P,m)
1. π ← Compute-Prefix-Function(P)
2. P' ← reverse(P)
3. π' ← Compute-Prefix-Function(P')
4. for j ← 0 to m do
5.   γ[j] ← m − π[m]          (the far jump)
   od
6. for l ← 1 to m do
7.   j ← m − π'[l]
8.   if γ[j] > l − π'[l] then   (or is it a small jump?)
9.     γ[j] ← l − π'[l]
   fi od
10. return γ

Running time: O(m)
Boyer-Moore-Matcher(T,P,Σ)
1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P,m,Σ)
4. γ ← Compute-Good-Suffix(P,m)
5. s ← 0
6. while s ≤ n−m do
7.   j ← m
8.   while j > 0 and P[j] = T[s+j] do
9.     j ← j−1
   od
10. if j = 0 then
11.   print "Pattern occurs with shift" s
12.   s ← s + γ[0]
    else
13.   s ← s + max(γ[j], j − λ[T[s+j]])
    fi od

Running time: O((n−m+1)·m) in the worst case
In practice: O(n/m + v·m + m + |Σ|) for v hits in the text
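Putting the pieces together in Python (a sketch, not from the slides; 0-based shifts, λ as a dict with default 0 for text characters absent from the pattern, and γ computed from π and π' exactly as in Compute-Good-Suffix-Function):

```python
def prefix_function(p):
    """pi[q] = max k < q with p[:k] a suffix of p[:q]; pi is 1-indexed, pi[0] unused."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi

def good_suffix(p):
    """gamma[j] = admissible shift after a mismatch at pattern position j (1-based)."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)          # the far jump
    for l in range(1, m + 1):              # the small jumps
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma

def boyer_moore(text, pattern):
    """Report all shifts of pattern in text, right-to-left comparison per window."""
    n, m = len(text), len(pattern)
    lam = {}
    for j, a in enumerate(pattern, start=1):
        lam[a] = j                         # last-occurrence function
    gamma = good_suffix(pattern)
    hits, s = [], 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            hits.append(s)
            s += gamma[0]
        else:
            # lam.get(..., 0) handles characters that never occur in the pattern
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return hits
```

The max() in the mismatch case is the line-13 rule above: take the larger of the good-suffix shift γ[j] and the bad-character shift j − λ[T[s+j]], both of which are always valid.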
Chapter II
Searching in Compressed Text
25 Oct 2004
Searching in Compressed Text (Overview)

What is Text Compression?
– Definition
– The Shannon Bound
– Huffman Codes
– The Kolmogorov Measure
Searching in Non-adaptive Codes
– KMP in Huffman Codes
Searching in Adaptive Codes
– The Lempel-Ziv Codes
– Pattern Matching in Z-Compressed Files
– Adapting Compression for Searching
What is Text Compression?

First approach:
– Given a text s ∈ Σ^n
– Find a compressed version c ∈ Σ^m such that m < n
– such that s can be derived from c
Formal:
– A compression function f : Σ* → Σ*
• is one-to-one (injective) and efficiently invertible
Fact:
– Most text is incompressible
Proof:
– There are (|Σ|^(m+1) − 1)/(|Σ| − 1) strings of length at most m
– There are |Σ|^n strings of length n
– Of these, at most (|Σ|^(m+1) − 1)/(|Σ| − 1) strings can be compressed to length at most m
– This is a fraction of at most |Σ|^(m−n+1)/(|Σ| − 1)
– E.g. for |Σ| = 256 and m = n−10 this is 8.3 × 10^-25
• which implies that only a fraction of 8.3 × 10^-25 of all files of n bytes can be compressed to a string of length n−10
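The counting bound can be checked numerically; a small Python sketch (the text length n = 100 is an arbitrary example, the fraction is essentially independent of n):

```python
sigma = 256                    # alphabet size (bytes)
n = 100                        # text length, arbitrary example
m = n - 10                     # target compressed length

# number of strings of length at most m: (sigma^(m+1) - 1) / (sigma - 1)
num_short = (sigma ** (m + 1) - 1) // (sigma - 1)

# at most num_short of the sigma^n strings of length n can map injectively
# to a string of length <= m
fraction = num_short / sigma ** n
print(fraction)                # about 8.3e-25, matching the slide
```

The pigeonhole argument is visible in the code: an injective f can send at most one long string to each short string, so the compressible fraction is bounded by num_short / sigma**n.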
Why does Text Compression work?

Texts usually use letters with different frequencies
– Relative frequencies of letters in general English plain text, from Cryptological Mathematics by Robert Edward Lewand:
• e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%
• ...
• k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%
– Special characters like $, %, # occur even less frequently
– Some character encodings are (nearly) unused, e.g. bytecode 0 of ASCII
Text obeys a lot of rules
– Words are (usually) the same (collected in dictionaries)
– Not all words can be used in combination
– Sentences are structured (grammar)
– Program code uses code words
– Digitally encoded pictures have smooth areas, where colors change gradually
– Patterns repeat
Information Theory: The Shannon Bound

C. E. Shannon, in his 1948 paper "A Mathematical Theory of Communication", derives his definition of entropy.

The entropy rate of a data source is the average number of bits per symbol needed to encode it.

Example text: ababababab
– Entropy: 1
– Encoding:
• Use 0 for a
• Use 1 for b
– Code: 0101010101
Huffman codes are a way to reach this Shannon bound (for sufficiently large texts)
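The empirical entropy of a text can be computed directly from its letter frequencies; a minimal Python sketch:

```python
import math

def entropy(text):
    """Shannon entropy in bits per symbol, from empirical letter frequencies."""
    freq = {}
    for c in text:
        freq[c] = freq.get(c, 0) + 1
    n = len(text)
    # H = -sum p * log2(p) over all letters
    return -sum((f / n) * math.log2(f / n) for f in freq.values())
```

For "ababababab" both letters have probability 1/2, so H = 1 bit per symbol, matching the one-bit-per-letter encoding above.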
Huffman Code

A Huffman code
– is adapted to each text (but not within the text)
– consists of
• a dictionary, which maps each letter of the text to a binary string, and
• the code, given as a prefix-free binary encoding
Prefix-free code
– uses strings s1, s2, ..., sm of variable length such that no string si is a prefix of sj

Example of Huffman encoding:
– Text: manamanapatipitipi

Letter   Frequency   Code
a        5           10
i        4           01
p        3           111
m        2           000
t        2           001
n        2           110

– Encoding: 000 10 110 10 000 10 110 10 111 10 001 01 111 01 001 01 111 01
Computing Huffman Codes

Compute the letter frequencies
Build root nodes labeled with the frequencies
repeat
– Build a node connecting the two least frequent unlinked nodes
– Mark the sons with 0 and 1
– The father node carries the sum of the frequencies
until one tree is left
The path to each letter carries its code

Letter   Frequency   Code
a        5           10
i        4           01
p        3           111
m        2           000
t        2           001
n        2           110

[Figure: the Huffman tree: m(2)+t(2) → 4, n(2)+p(3) → 5, i(4)+4 → 8, a(5)+5 → 10, 8+10 → 18; reading the 0/1 edge labels from the root to each leaf gives the codes above.]
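One way to sketch the construction in Python, using a heap of partial trees. Tie-breaking among equal frequencies is arbitrary, so the exact code words may differ from the table above, but the code lengths and the total encoded size (45 bits for the example text) come out the same:

```python
import heapq

def huffman_code(freq):
    """Build a prefix-free code {letter: bitstring} from letter frequencies."""
    # heap entries: (frequency, tiebreak id, {letter: code-so-far})
    heap = [(f, i, {a: ""}) for i, (a, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # mark the sons with 0 and 1 by prepending a bit to every code below them
        merged = {a: "0" + s for a, s in c1.items()}
        merged.update({a: "1" + s for a, s in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))  # father carries the sum
        count += 1
    return heap[0][2]
```

Because bits are prepended at each merge, each letter's final string is exactly its root-to-leaf path, and no code can be a prefix of another (letters sit at the leaves).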
Searching in Huffman Codes

Let u be the size of the compressed text.
Let v be the size of the pattern, Huffman-encoded according to the text dictionary.

KMP can search in Huffman codes in time O(u + v + m):
– Encoding the pattern takes O(v + m) steps
– Building the prefix function takes time O(v)
– Searching the text on the bit level takes time O(u + v)

Problems:
– This algorithm works bit-wise, not byte-wise
• Exercise: Develop a byte-wise strategy
The Downside of Huffman Codes

Example: Consider the 128-byte text
– abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba abbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabbaabba
– It will be encoded using 16 bytes (and an extra byte for the dictionary) as
– 0110011001100110011001100110011001100110011001100110011001100110 0110011001100110011001100110011001100110011001100110011001100110
– This does not use the full compression possibilities of this text
– E.g. using "(abba)^32" would need only 9 bytes
The perfect code:
– A self-extracting program for a string x is a program that, started without input, produces the output x and then halts.
– So, the smallest self-extracting program is the ultimate encoding.
The Kolmogorov complexity K(x) of a string x denotes the length of a shortest self-extracting program for x.
Kolmogorov Complexity

Does the Kolmogorov complexity depend on the programming language?
– No, as long as the programming language is universal, e.g. can simulate any Turing machine

Lemma
Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then for a constant c and all strings x:
K1(x) ≤ K2(x) + c

Is the Kolmogorov complexity useful?
– No:

Theorem
K(x) is not recursive.
Proof of Lemma

Lemma
Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then for a constant c and all strings x:
K1(x) ≤ K2(x) + c

Proof
Let M1 be the self-extracting program for x with respect to the first language.
Let U be a universal program in the second language that simulates a given machine M1 of the first language.
The output of U(M1, ε) is x.
Then one can find a machine M2 of length |U| + |M1| + O(1) that has the same functionality as U(M1, ε)
– by using the S-m-n theorem.
Since U is a fixed (constant-size) machine, this proves the statement.
Proof of the Theorem

Theorem
K(x) is not recursive.

Proof
Assume K(x) is recursive.
For every string length n let xn denote the smallest string of length n such that K(xn) ≥ |xn| = n (such a string exists, since not all strings of length n are compressible).
We can enumerate xn:
– Compute for all strings x of size n the Kolmogorov complexity K(x) and output the first string x with K(x) ≥ n
Let M be the program computing xn on input n.
We can efficiently encode xn:
– Combine M with the binary encoding of n:
K(xn) ≤ log n + |M| = log n + O(1)
For large enough n this is a contradiction to K(xn) ≥ n.
Thanks for your attention
End of 3rd lecture
Next lecture: Mo 8 Nov 2004, 11.15 am, FU 116
Next exercise class: Mo 25 Oct 2004, 1.15 pm, F0.530 or We 27 Oct 2004, 1.00 pm, E2.316