Sorting Really Big Files Sorting Part 3

Page 1: Sorting  Really Big Files

Sorting Really Big Files

Sorting Part 3

Page 2: Sorting  Really Big Files

Using K Temporary Files

Given
  N records in file F
  M records will fit into internal memory

Use K temp files, where K = N / M

Create K sorted files from F, then merge them

Problems
  computers compare 2 values at once, not K values
  merging only 2 of K runs at once creates LOTS of temp files
  in the illustration on the next page, notice that we soon begin merging small runs with big temp files
    too many comparisons
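The run-creation step above can be sketched as follows. This is a minimal illustration, not the presenter's code: in-memory lists stand in for the temp files, the function name `make_sorted_runs` and the sample data are hypothetical, and K is taken as N / M rounded up so that a final partial load still gets its own run.

```python
import math

def make_sorted_runs(records, m):
    """Split records into runs of at most m records and sort each run.

    Each returned list stands in for one sorted temp file.
    """
    runs = []
    for start in range(0, len(records), m):
        chunk = records[start:start + m]   # at most m records fit in memory
        chunk.sort()                       # internal sort of one memory load
        runs.append(chunk)
    return runs

records = [27, 3, 14, 8, 31, 1, 19, 6, 22, 11]   # N = 10 (hypothetical data)
m = 3                                            # M = 3 records fit in memory
runs = make_sorted_runs(records, m)
k = math.ceil(len(records) / m)                  # K = N / M, rounded up
```

Each run is sorted on its own, but the runs still have to be merged to produce one sorted file, which is where the problems listed above come from.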

Page 3: Sorting  Really Big Files

Alternative Merging Strategy

[Figure: two merge trees built from runs R1, R2, ..., intermediate files T1, T2, T3 and S1, S2, and the final file F. R1 = Run 1, R2 = Run 2, etc.]

What would these trees look like with 8 runs?

Page 4: Sorting  Really Big Files

N-Way Merge

We can create that tree using just 4 temp files
  2 are input and 2 are output; the pairs alternate being input and output files

Algorithm
  Write Run 1 into T1
  Write Run 2 into T2
  Write Run 3 into T1
  Write Run 4 into T2
  ...
  Merge first runs in T1 and T2 into T3
  Merge second runs in T1 and T2 into T4
  Merge third runs in T1 and T2 into T3
  ...
  Merge first runs in T3 and T4 into T1
  Merge second runs in T3 and T4 into T2
  ...
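The algorithm above might be sketched like this. It is a simplified model under stated assumptions: each "file" is a Python list of runs rather than a file on disk, the names `merge_two` and `balanced_merge_sort` are hypothetical, and a run with no partner in the other input file is assumed to be copied through unchanged.

```python
def merge_two(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def balanced_merge_sort(runs):
    """Merge sorted runs using two input files and two output files."""
    if not runs:
        return []
    # Distribute: runs 1, 3, 5, ... go to T1; runs 2, 4, 6, ... go to T2.
    inputs = [runs[0::2], runs[1::2]]
    while len(inputs[0]) + len(inputs[1]) > 1:
        outputs = [[], []]
        left, right = inputs
        for idx in range(len(left)):
            # A run without a partner is merged with an empty run.
            partner = right[idx] if idx < len(right) else []
            # Merged runs alternate between the two output files.
            outputs[idx % 2].append(merge_two(left[idx], partner))
        inputs = outputs   # the output pair becomes the next input pair
    return inputs[0][0]

sorted_records = balanced_merge_sort([[3, 14, 27], [1, 8, 31], [6, 19, 22], [11]])
```

Only two runs are compared at any moment, and no more than four files exist at once, regardless of how many runs there are.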

Page 5: Sorting  Really Big Files

N-Way Merge

Step Number   Files Contain Runs

1             T1 - R1 R3 R5 R7 R9
              T2 - R2 R4 R6 R8 R10
              T3 -
              T4 -

2             T1 -
              T2 -
              T3 - R1-R2 R5-R6 R9-R10
              T4 - R3-R4 R7-R8

3             T1 - R1-R4 R9-R10
              T2 - R5-R8
              T3 -
              T4 -

4             T1 -
              T2 -
              T3 - R1-R8
              T4 - R9-R10

5             T1 - R1-R10
              T2 -
              T3 -
              T4 -
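The table can be reproduced by tracking only run numbers instead of record data. The sketch below is illustrative: `trace_balanced_merge` is a hypothetical helper, each merged run is represented as a (first, last) pair of run numbers, and each step lists only the two files that currently hold runs.

```python
def trace_balanced_merge(num_runs):
    """Return, for each step, the run ranges held by the two non-empty files."""
    # Step 1: odd-numbered runs in one file, even-numbered runs in the other.
    pair = [[(r, r) for r in range(1, num_runs + 1, 2)],
            [(r, r) for r in range(2, num_runs + 1, 2)]]
    steps = [pair]
    while len(pair[0]) + len(pair[1]) > 1:
        out = [[], []]
        for idx, (first, last) in enumerate(pair[0]):
            if idx < len(pair[1]):
                last = pair[1][idx][1]   # merged run ends where its partner ends
            out[idx % 2].append((first, last))
        pair = out
        steps.append(pair)
    return steps

steps = trace_balanced_merge(10)
for number, (file_a, file_b) in enumerate(steps, start=1):
    print(number, file_a, file_b)
```

Calling `trace_balanced_merge(8)` in the same way gives the four steps needed for 8 runs.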

[Figure: file F and the temp files T1, T2, T3, T4; the pairs T1/T2 and T3/T4 alternate as input and output.]

Page 6: Sorting  Really Big Files

Analysis

Number of Comparisons:
  N-Way Merge -- O(n log2 n)
  K Temp Files -- O(n^2)
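The gap between the two bounds can be checked by counting comparisons directly. The sketch below rests on an assumption about the slides: that the "K Temp Files" figure refers to merging one run at a time into a single growing result. The function names and the input (64 one-record runs, already in order) are hypothetical.

```python
def merge_count(a, b):
    """Two-way merge that also returns the number of key comparisons made."""
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out, comparisons

def one_run_at_a_time(runs):
    """Merge each run into one growing result; returns (sorted, comparisons)."""
    result, total = [], 0
    for run in runs:
        result, c = merge_count(result, run)
        total += c
    return result, total

def balanced(runs):
    """Merge runs pairwise, pass after pass; returns (sorted, comparisons)."""
    total = 0
    while len(runs) > 1:
        merged = []
        for idx in range(0, len(runs), 2):
            partner = runs[idx + 1] if idx + 1 < len(runs) else []
            out, c = merge_count(runs[idx], partner)
            merged.append(out)
            total += c
        runs = merged
    return (runs[0] if runs else []), total

n = 64
runs = [[x] for x in range(n)]   # hypothetical input: n one-record runs
_, seq = one_run_at_a_time(runs)
_, bal = balanced(runs)
print(seq, bal)
```

For this input the one-at-a-time strategy makes n(n-1)/2 = 2016 comparisons, while the balanced strategy makes (n/2) log2 n = 192.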

Disk Space

Could the run size be one record? In other words, is the internal sort necessary?