
Chapter 7: Selected Algorithms

7.1 External Search

7.2 External Sorting

7.3 Text searching


7.2 External Sorting

Problem: sorting a large amount of data that, as in external searching, is stored in blocks (pages).

Efficiency: the number of page accesses should be kept low.

Strategy: use a sorting algorithm that processes the data sequentially (no frequent page exchanges): MergeSort.

General form of MergeSort:

mergesort(S)  # returns S in sorted order
{
    if (S is empty or has only 1 element)
        return(S);
    else {
        divide S into two halves A and B;
        A' = mergesort(A);
        B' = mergesort(B);
        return(merge(A', B'));
    }
}

Start: n records in a file g1, divided into pages of size b:

Page 1: $s_1, \dots, s_b$

Page 2: $s_{b+1}, \dots, s_{2b}$

…

Page k: $s_{(k-1)b+1}, \dots, s_n$

where $k = \lceil n/b \rceil$. When the file is processed sequentially, only k page accesses are needed instead of n.

Variation of MergeSort for external sorting

MergeSort is a divide-and-conquer algorithm.

For external sorting it is used without the divide step; only merging is performed.

Definition: run := an ordered subsequence within a file.

Strategy: merge runs into increasingly longer runs until everything is sorted.

Algorithm

Step 1: From the sequence in the input file g1, generate "starting runs" and distribute them over two files f1 and f2, with the same number of runs (±1) in each. (There are several strategies for this; they are discussed later.)

From now on we use four files: f1, f2, g1, g2.

Step 2 (main step):

While the number of runs > 1, repeat: {

• Merge one run from f1 with one run from f2 into a run of double length, writing the merged runs alternately to g1 and g2, until no runs remain in f1 and f2.

• Merge one run from g1 with one run from g2 into a run of double length, writing the merged runs alternately to f1 and f2, until no runs remain in g1 and g2.

}

Each loop iteration = two phases
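The two steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: in-memory lists of runs stand in for the files f1, f2, g1, g2, the starting runs have length 1, and the names BalancedMergeSketch, mergeRuns, phase and sort are hypothetical (Java 9 or later is assumed for List.of).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of balanced two-way merging; lists of runs stand in for the files. */
final class BalancedMergeSketch {

    /** Merges two sorted runs into one sorted run. */
    static List<Integer> mergeRuns(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    /** One phase: merge the i-th runs of in1 and in2, writing alternately to out1 and out2. */
    static void phase(List<List<Integer>> in1, List<List<Integer>> in2,
                      List<List<Integer>> out1, List<List<Integer>> out2) {
        out1.clear();
        out2.clear();
        int pairs = Math.max(in1.size(), in2.size());
        for (int i = 0; i < pairs; i++) {
            List<Integer> a = i < in1.size() ? in1.get(i) : List.of();
            List<Integer> b = i < in2.size() ? in2.get(i) : List.of();
            (i % 2 == 0 ? out1 : out2).add(mergeRuns(a, b));
        }
        in1.clear();
        in2.clear();
    }

    /** Step 1 with runs of length 1, then the main step until one run is left. */
    static List<Integer> sort(List<Integer> data) {
        List<List<Integer>> f1 = new ArrayList<>(), f2 = new ArrayList<>();
        List<List<Integer>> g1 = new ArrayList<>(), g2 = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            (i % 2 == 0 ? f1 : f2).add(List.of(data.get(i)));
        }
        while (f1.size() + f2.size() > 1) {
            phase(f1, f2, g1, g2);   // first phase of the loop
            phase(g1, g2, f1, f2);   // second phase of the loop
        }
        return f1.isEmpty() ? List.of() : f1.get(0);
    }
}
```

A real implementation would read and write the runs page by page instead of holding them in lists; the sketch only shows the order in which runs are merged and distributed.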

Example

Start:

g1: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50

Step 1 (length of the starting runs = 1):

f1: 64 | 3 | 79 | 19 | 67 | 8 | 50

f2: 17 | 99 | 78 | 13 | 34 | 12

Main step, 1st loop, part 1 (1st phase):

g1: 17, 64 | 78, 79 | 34, 67 | 50

g2: 3, 99 | 13, 19 | 8, 12

1st loop, part 2 (2nd phase):

f1: 3, 17, 64, 99 | 8, 12, 34, 67 |

f2: 13, 19, 78, 79 | 50 |


2nd loop, part 1 (3rd phase):

g1: 3, 13, 17, 19, 64, 78, 79, 99 |

g2: 8, 12, 34, 50, 67 |

2nd loop, part 2 (4th phase):

f1: 3, 8, 12, 13, 17, 19, 34, 50, 64, 67, 78, 79, 99 |

f2:


Implementation:

For each of the files f1, f2, g1, g2, at least one page is kept in main memory (RAM); even better, a second page per file can be kept as a buffer.

Read/write operations are performed page-wise.

Costs

Page accesses during step 1 and during each phase: $O(n/b)$.

In each phase the number of runs is halved, thus:

Total number of page accesses: $O((n/b) \log n)$ when starting with runs of length 1.

Internal computing time in step 1 and in each phase: $O(n)$.

Total internal computing time: $O(n \log n)$.
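As a worked check of these bounds (a sketch, assuming the number of runs is halved and rounded up in every phase, starting from n runs of length 1):

```latex
\begin{align*}
\text{runs after phase } p &= \left\lceil \frac{n}{2^{p}} \right\rceil, \\
\text{number of phases} &= \lceil \log_2 n \rceil, \\
\text{page accesses} &= \lceil \log_2 n \rceil \cdot O\!\left(\frac{n}{b}\right)
  = O\!\left(\frac{n}{b} \log n\right).
\end{align*}
```

For the example with n = 13 keys this gives $\lceil \log_2 13 \rceil = 4$ phases, with 13, 7, 4, 2 and finally 1 run, which agrees with the four phases shown in the example.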


Two variants of the first step: creation of the starting runs

• A) Direct merging

Sort as many records as possible in primary memory ("internally"), for example m records.

Each starting run has a (fixed!) length m, thus there are r := n/m starting runs.

The total number of page accesses is then $O((n/b) \log r)$.


• B) Natural merging

Creates starting runs of variable length.

Advantage: ordered subsequences that the file already contains can be exploited.

Noteworthy: with the replacement-selection method, starting runs can become longer than the number of records that fit into primary storage.

Replacement-Selection

Read m records from the input file into primary memory (array).

repeat {
    mark all records in the array as "now";
    start a new run;
    while there is a record marked "now" in the array {
        • select the record with the smallest key among all records marked "now",
        • write it to the output file,
        • replace it in the array with a record read from the input file (if any remain); mark the new record "now" if it is greater than or equal to the record just written, otherwise mark it "not now".
    }
} until the input file is exhausted and the array is empty.
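The procedure above can be sketched in Java as follows. This is a minimal illustration, not a definitive implementation: a list stands in for the input file, a PriorityQueue holds the records marked "now", a second list holds the "not now" records, m is assumed to be at least 1, and the names ReplacementSelectionSketch and makeRuns are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of replacement selection; a list stands in for the input file. */
final class ReplacementSelectionSketch {

    /** Splits the input into sorted starting runs using a memory of m records (m >= 1). */
    static List<List<Integer>> makeRuns(List<Integer> input, int m) {
        Iterator<Integer> in = input.iterator();
        PriorityQueue<Integer> now = new PriorityQueue<>();   // records marked "now"
        List<Integer> notNow = new ArrayList<>();             // records marked "not now"
        List<List<Integer>> runs = new ArrayList<>();

        // Read the first m records into primary memory.
        while (notNow.size() < m && in.hasNext()) notNow.add(in.next());

        while (!notNow.isEmpty()) {
            // Mark everything "now" and start a new run.
            now.addAll(notNow);
            notNow.clear();
            List<Integer> run = new ArrayList<>();
            while (!now.isEmpty()) {
                int smallest = now.poll();
                run.add(smallest);
                if (in.hasNext()) {
                    int next = in.next();
                    // A record smaller than the last output must wait for the next run.
                    if (next >= smallest) now.add(next); else notNow.add(next);
                }
            }
            runs.add(run);
        }
        return runs;
    }
}
```

For the input 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50 and m = 3, the sketch produces the three runs 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50.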

Example: the array in primary storage has a capacity of 3. The input file contains the following data:

64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50

Runs: 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50

Contents of the array ("not now" data written in parentheses):

Run 1             Run 2             Run 3
64 17 3           19 13 67          8 12 50
64 17 99          19 34 67          12 50
64 79 99          (8) 34 67         50
78 79 99          (8) (12) 67
(19) 79 99        (8) (12) (50)
(19) (13) 99
(19) (13) (67)

Implementation:

In an array:

• At the front: a heap for the data marked "now",

• At the back: the refilled "not now" data.

Note: all "now" elements go into the run currently being generated.

Expected length of the starting runs using the replacement-selection method:

• 2·m (m = size of the array in primary storage = number of records that fit into primary storage), assuming uniformly distributed keys

• Even longer if the input is already partially sorted!

Multi-way merging

Instead of using two input and two output files (alternating f1, f2 and g1, g2):

Use k input and k output files, in order to always be able to merge k runs into one.

In each step: take the smallest element among the heads of the k runs and write it to the current output file.
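A single k-way merge step can be sketched in Java as follows. This is an assumption-laden illustration: lists stand in for the k input files, a PriorityQueue selects the smallest head element, and the names MultiwayMergeSketch and mergeRuns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of merging k sorted runs into one; lists stand in for the k input files. */
final class MultiwayMergeSketch {

    static List<Integer> mergeRuns(List<List<Integer>> runs) {
        // Heap entries are {value, run index, position in run}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[] {runs.get(r).get(0), r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();            // smallest head among the k runs
            out.add(top[0]);
            int next = top[2] + 1;
            List<Integer> run = runs.get(top[1]);
            // Refill the heap from the run that just lost its head element.
            if (next < run.size()) heap.add(new int[] {run.get(next), top[1], next});
        }
        return out;
    }
}
```

Selecting the minimum through a heap of at most k entries costs O(log k) per output element, which is one way to reach an internal cost of O(n log k) per phase.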

Cost:

In each phase the number of runs is divided by k.

Thus, with r starting runs we need only $\log_k r$ phases (instead of $\log_2 r$).

Total number of page accesses: $O((n/b) \log_k r)$.

Internal computing time for each phase: $O(n \log_2 k)$.

Total internal computing time: $O(n \log_2 k \cdot \log_k r) = O(n \log_2 r)$.

Chapter 7.3

• Text searching according to Boyer and Moore
– Position index and matching direction
– ShiftRight as a static function
– Bad-character heuristic
– Good-suffix heuristic

Text searching

Problem: test if a pattern (string) s appears in a text (string) t or not.

With a naive approach: an algorithm taking O(|s| · |t|).

Better algorithms:
• Knuth, Morris, Pratt (1977)
• Boyer and Moore (1977)

Naive Algorithm

Operations: ohne1(String) → String returns the string without its first character; anf1(String) → char returns the first character.

Algorithm prefix(s, t: String) → Boolean {
    if (s empty) then { output true; exit };
    if (t empty) then { output false; exit };
    if anf1(s) = anf1(t) then output prefix(ohne1(s), ohne1(t))
    else output false
}

Algorithm SubString(s, t: String) → Boolean {
    res := false;
    while (t not empty) and (res = false) perform {
        if prefix(s, t) then res := true
        else t := ohne1(t)
    };
    output res
}

Cost: O(|s| · |t|)
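The same idea can be written as runnable Java. The following is a minimal sketch under stated assumptions: index offsets replace ohne1 (so no strings are copied), and the names NaiveSearchSketch, prefixAt and subString are hypothetical. Unlike the pseudocode, it also reports an empty pattern as found in an empty text.

```java
/** Sketch of the naive search: try every start position of s in t. */
final class NaiveSearchSketch {

    /** Returns true if s matches t beginning at offset start. */
    static boolean prefixAt(String s, String t, int start) {
        if (s.length() > t.length() - start) return false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != t.charAt(start + i)) return false;
        }
        return true;
    }

    /** Returns true if s occurs somewhere in t; up to |s| comparisons per start position. */
    static boolean subString(String s, String t) {
        for (int start = 0; start <= t.length() - s.length(); start++) {
            if (prefixAt(s, t, start)) return true;
        }
        return false;
    }
}
```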

The Knuth, Morris, Pratt (KMP) Algorithm

The naive algorithm shifts the pattern 1 position to the right when a mismatch happens. The KMP algorithm exploits the structure of the pattern in order to shift it to the right as far as possible.

How?

If some sub-patterns are repeated within the pattern, we can use them in the following way:

Knuth, Morris, Pratt (2)

• Let us assume that, when comparing the text with the pattern at a certain point, the first j characters match but the character at position j+1 does not.

[Figure: the text aligned with the pattern; the first j pattern characters match, character j+1 does not]


If the last i characters of the matched part of the pattern (that is, characters j-i+1 to j, both included) coincide with the first i characters of the pattern, then we can shift the pattern to the right so that this prefix lines up with them, and resume checking at pattern position i+1.

[Figure: a pattern whose prefix of length i equals a suffix of length i of its matched part]


• An interesting characteristic of this approach is that we can calculate beforehand (before starting the search) how far the pattern can be shifted, because this depends only on the pattern itself (does it contain repetitions?).

• We define the so-called "failure function":
– Let the pattern consist of the characters $b_1 b_2 \dots b_m$.
– $f(j) = \max\{\, i < j \mid b_1 \dots b_i = b_{j-i+1} \dots b_j \,\}$ for $j = 1, \dots, m$.

The Knuth, Morris, Pratt Algorithm (5)

• With this function the algorithm can be explained as follows: start comparing the text with the pattern from left to right at text position k = 0. If at a certain text position k the character of the text does not match the character of the pattern at position j+1, continue comparing from k on with the pattern at position f(j)+1 (because we know that the f(j) characters before it already match).

[Figure: the pattern shifted along the text]

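The Algorithm (in pseudo-JAVA), assuming f(i) has already been calculated

// n = length of the text
// m = length of the pattern
// indexes start from 1 (text[1..n], pattern[1..m])
int k = 0;   // number of text characters consumed
int j = 0;   // number of pattern characters currently matched
while (k < n && j < m) {
    // on a mismatch, fall back to the next shorter matching prefix
    while (j > 0 && text[k+1] != pattern[j+1]) {
        j = f[j];
    }
    if (text[k+1] == pattern[j+1]) {
        j++;
    }
    k++;
}
// j == m => match (ending at text position k); otherwise k == n => failure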

Construction of the f(i) function

// m = length of the pattern
// indexes begin with 1
int[] f = new int[m + 1];   // f[1..m] is used; index 0 is unused
f[1] = 0;
int j = 1;
int i;
while (j < m) {
    i = f[j];
    while (i > 0 && pattern[i+1] != pattern[j+1]) {
        i = f[i];
    }
    if (pattern[i+1] == pattern[j+1]) {
        f[j+1] = i + 1;
    } else {
        f[j+1] = 0;
    }
    j++;
}
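Because Java strings are indexed from 0, the two routines need a small index translation before they run. The following sketch shows one possible translation, under the assumption that pattern[j+1] in the 1-based notation corresponds to pattern.charAt(j); the names KmpSketch, failure and indexOf are hypothetical.

```java
/** Sketch of KMP with 0-based Java indexing. */
final class KmpSketch {

    /** f[j] = length of the longest proper prefix of the first j characters that is also their suffix. */
    static int[] failure(String pattern) {
        int m = pattern.length();
        int[] f = new int[m + 1];           // f[0] and f[1] stay 0
        for (int j = 1; j < m; j++) {
            int i = f[j];
            while (i > 0 && pattern.charAt(i) != pattern.charAt(j)) i = f[i];
            f[j + 1] = (pattern.charAt(i) == pattern.charAt(j)) ? i + 1 : 0;
        }
        return f;
    }

    /** Returns the 0-based start of the first occurrence of pattern in text, or -1. */
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] f = failure(pattern);
        int j = 0;                          // number of pattern characters matched so far
        for (int k = 0; k < n; k++) {
            while (j > 0 && text.charAt(k) != pattern.charAt(j)) j = f[j];
            if (text.charAt(k) == pattern.charAt(j)) j++;
            if (j == m) return k - m + 1;
        }
        return -1;
    }
}
```

The text index k never moves backwards, which is where the linear running time in the length of the text comes from.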

The Boyer and Moore Algorithm

Ideas:
• Shift the word s gradually from left to right, but
• compare the word s with the text t from right to left within s.

Two heuristics for shifting the search string s:
• Bad-character heuristic
• Good-suffix heuristic

Cost: also O(|t| + |s|).

Heuristics

[Figure: bad-character and good-suffix shifts for the search string "reminiscence", panels a) to c)]

Explanation of the figure

In a), the search string "reminiscence" is compared with the text from right to left. The suffix "ce" matches, but the "bad character" "i" does not match the corresponding "n" of the search string. In b), the search string is shifted to the right according to the bad-character heuristic until the bad character "i" is aligned with the rightmost occurrence of "i" in the search string. In c), according to the good-suffix heuristic, the "good suffix" "ce" that was found is compared with the search string. If this suffix occurs once more in the search string, the search string can be shifted until that further occurrence is aligned with the text.

The "bad-character heuristic"

Mismatch at position j with s[j] ≠ t[pos+j], 1 ≤ j ≤ d (pos is the position just before the current start of the search string, d is the length of the search string).

1) The wrong character t[pos+j] does not occur in the search string. Then we can shift the search string by j without missing a match.

2) The wrong character t[pos+j] occurs in the search string. Let k be the largest index with 1 ≤ k ≤ d for which s[k] = t[pos+j]. If k < j, we shift the search string by j - k. We then have at least one match, namely in the character s[k] = t[pos+j]. The value k can be determined in advance for every distinct character of the search string as a function b(a), where a is a character of the allowed alphabet. b(a) gives the position of the rightmost occurrence of the character a in the search string. The shift is therefore j - k = j - b(t[pos+j]).

3) If, however, k > j, the heuristic yields a negative shift j - k, which is ignored; the search string is then shifted by 1.
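The three cases can be sketched in Java as follows. This is an illustration under stated assumptions: pattern positions are 1-based as in the text, characters that do not occur in the search string are treated as b(a) = 0 so that case 1 falls out of the same formula, and the names BadCharacterSketch, rightmost and shift are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the bad-character heuristic with 1-based pattern positions. */
final class BadCharacterSketch {

    /** b(a): 1-based position of the rightmost occurrence of a in s; absent characters are not stored. */
    static Map<Character, Integer> rightmost(String s) {
        Map<Character, Integer> b = new HashMap<>();
        for (int k = 1; k <= s.length(); k++) b.put(s.charAt(k - 1), k);   // later positions overwrite earlier ones
        return b;
    }

    /** Shift after a mismatch at 1-based pattern position j against the text character bad. */
    static int shift(Map<Character, Integer> b, int j, char bad) {
        int k = b.getOrDefault(bad, 0);   // 0 means: bad does not occur in s (case 1)
        return Math.max(1, j - k);        // a negative shift is replaced by 1 (case 3)
    }
}
```

For the search string "reminiscence" with a mismatch at position j = 10 against the text character "i", the rightmost "i" is at position 6, so the sketch returns a shift of 4.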

Table of the rightmost occurrences in the (blue) search string

[Figure; source: http://wwwmayr.informatik.tu-muenchen.de/lehre/1999SS/proseminar/jakob/]

Example: bad-character heuristic (BCH)

Find the rightmost occurrence in the search string

"Good-suffix heuristic"

Suppose we have found a mismatch at position j with s[j] ≠ t[pos+j], 0 ≤ j ≤ d (so the characters further to the right match; pos is the current position in t). If j = d, we simply shift the search string by one position. If j < d, however, we have d-j matching characters: the suffix of the search string s of length d-j and the aligned part of the text t (s is aligned from position pos+1 on) agree in the d-j characters up to position pos+d.

[Figure: the search string s aligned at pos, with the positions 0, j, j+1 and d marked]

The "good-suffix" function

Now we compute for every position j in the search string the quantity

$g[j] := d - \max\{\, k : 0 \le k < d;\ s[j+1 \dots d] \text{ is a suffix of } s_k \text{ or } s_k \text{ is a suffix of } s[j+1 \dots d] \,\}$,

where $s_k$ denotes the prefix of s of length k.

g is called the "good-suffix" function and can be computed in advance for all 0 ≤ j ≤ d. It gives the smallest number of characters by which we can shift the search string s to the right without losing matches with the text.

Example: s = nennen, with s_1 = n, s_2 = ne, s_3 = nen, s_4 = nenn, s_5 = nenne, s_6 = nennen and s[6..6] = n, s[5..6] = en, s[4..6] = nen, s[3..6] = nnen, s[2..6] = ennen. This gives g[0] = 6 - max{1, 3} = 3, g[1] = 3, g[2] = 3, g[3] = 3, g[4] = 3, g[5] = 6 - 4 = 2.

[Figure: the search string s with the positions 0, k, j, j+1 and d marked]
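The definition of g can be checked with a direct Java transcription. This is a brute-force sketch that follows the definition literally and makes no claim to be the efficient preprocessing normally used; s_k is taken to be the prefix of s of length k, and the names GoodSuffixSketch and goodSuffix are hypothetical.

```java
/** Sketch: computes g[j] directly from the definition (brute force). */
final class GoodSuffixSketch {

    /** Returns g[0..d] for a search string s of length d. */
    static int[] goodSuffix(String s) {
        int d = s.length();
        int[] g = new int[d + 1];
        for (int j = 0; j <= d; j++) {
            String suffix = s.substring(j);          // s[j+1..d] in 1-based notation
            int best = 0;                            // k = 0 (empty prefix) always qualifies
            for (int k = 0; k < d; k++) {
                String prefix = s.substring(0, k);   // s_k, the prefix of length k
                if (prefix.endsWith(suffix) || suffix.endsWith(prefix)) best = k;
            }
            g[j] = d - best;
        }
        return g;
    }
}
```

For s = nennen the sketch yields g[0..5] = 3, 3, 3, 3, 3, 2 and g[6] = 1, the latter corresponding to the rule that the search string is shifted by one position when j = d.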

Good suffix, alternative formulation

L'[ ] and l'[ ] for the example search pattern:

l'[pos] := length of the longest suffix of pattern[pos..n] that is also a prefix of the pattern.

L'[pos] := right end of the rightmost copy of pattern[pos..n].

Good-suffix example

Caution: shift by 1. Length d = 11.

pos = 0, j = 6, g(6) = 11 - 6 = 5

pos = 7, j = 5, g(5) = 11 - 3 = 8

k < d, g(0) = 11 - 3 = 8

Conclusion: total length 11. The given heuristic works well.

Further examples:

Text: Wir kennen keinen nennenswerten Fall
Search string: nennen

Here d = 6, j = 4, and the letter k does not occur in the search string. According to the bad-character heuristic we can therefore shift the string by 4 places. Good-suffix heuristic: the good suffix is "en"; shift by 3 positions.

Text: Wir kennen keinen nennenswerten Fall
Search string: nennen

Now the mismatch letter n occurs four times in the search string. Its rightmost occurrence is k = 6. We therefore have to apply the good-suffix heuristic. We have computed g[5] = 6 - 4 = 2 in advance and can shift the search string two places to the right.

Text: Wir kennen keinen nennenswerten Fall
Search string: nennen

Here j = 1. The bad-character heuristic only allows us to shift the string by one position to the right. However, the good suffix is "ennen", and the prefix "nen" of the search string is a suffix of the good suffix. We have therefore already computed g[1] = 6 - 3 = 3. The good-suffix heuristic thus allows us to shift the search string three positions to the right.
