Algorithms for Information Retrieval
Is algorithmic design a 5-minute thinking task ???
Toy problem #1: Max Subarray
Algorithm (solution 1):
1. Compute P[1,n], the array of prefix sums over A.
2. Compute M[1,n], the array of prefix minima over P.
3. Find end such that P[end] - M[end] is maximum; start is the position ≤ end where P attains that minimum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Goal: Find the time window achieving the best “market performance”.
Math Problem: Find the subarray of maximum sum.
P = 2 -3 3 4 2 6 9 -4 5 -1 6
M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4
Note:
• max_{x≤y} sum A[x..y]
  = max_{x≤y} ( P[y] − P[x] )
  = max_y [ P[y] − ( min_{x≤y} P[x] ) ]
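Solution 1 can be sketched in Python, folding the P and M arrays into two running scalars (the function name `max_subarray_prefix` is ours, not from the slides):

```python
def max_subarray_prefix(A):
    # Solution 1: prefix sums P and prefix minima M, kept as scalars.
    best = A[0]
    P = 0   # running prefix sum P[y]
    m = 0   # minimum prefix sum seen so far (P of the empty prefix is 0)
    for a in A:
        P += a
        best = max(best, P - m)   # best subarray ending at this position
        m = min(m, P)
    return best
```

On the slides' example array it returns 12, the sum of the window 6 1 -2 4 3.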
Toy problem #1 (solution 2)
Algorithm:
sum = 0
for i = 1, ..., n do
  if (sum + A[i] ≤ 0) sum = 0;
  else { max_sum = MAX(max_sum, sum + A[i]); sum += A[i]; }
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Note:
• sum = 0 right where the optimum starts
• sum > 0 within the optimum
[Figure: array A with the optimum segment marked; prefixes ending before it sum ≤ 0, prefixes inside it sum ≥ 0]
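A minimal Python sketch of solution 2 (like the slide's formulation, it assumes the maximum sum is positive; the function name is ours):

```python
def max_subarray_scan(A):
    # Solution 2: one scan, O(n) time, O(1) space.
    # Assumes at least one positive entry (max_sum starts at 0).
    max_sum = 0
    s = 0
    for a in A:
        if s + a <= 0:
            s = 0   # the optimum cannot start inside a <=0-sum stretch
        else:
            s += a
            max_sum = max(max_sum, s)
    return max_sum
```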
Toy problem #2: Top-freq elements
Algorithm: Use a pair of variables <X,C>.
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Goal: Top queries over a stream of n items (large).
Math Problem: Find the item y whose frequency is > n/2, using the smallest possible space (i.e., the mode, provided it occurs > n/2 times).
Proof. If the final X ≠ y, then every one of y's occurrences was cancelled by a distinct "negative" mate. Hence the mates number ≥ #occ(y), so n ≥ 2 · #occ(y), contradicting #occ(y) > n/2.
A = b a c c c d c b a a a c c b c c c
<X,C> trace: <b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>
Problems arise if the most frequent item occurs ≤ n/2 times: the returned X need not be the mode, so a verification pass is required.
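The <X,C> scheme above (the Boyer-Moore majority vote), sketched in Python using the equivalent check-counter-first formulation; `majority_candidate` is an illustrative name:

```python
def majority_candidate(stream):
    # One pass, one <X,C> pair: returns the majority item if some item
    # occurs > n/2 times; otherwise the result is arbitrary and must be
    # verified by a second counting pass.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X
```

On the slides' stream b a c c c d c b a a a c c b c c c it returns 'c', which indeed occurs 9 times out of 17.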
Toy problem #3 : Indexing
Consider the following TREC collection:
• size N = 6 × 10^9 bytes
• n = 10^6 documents
• TotT = 10^9 total term occurrences (avg term length is 6 chars)
• t = 5 × 10^5 distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
1 if play contains word, 0 otherwise
           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1               1              0          0       0        1
Brutus            1               1              0          1       0        0
Caesar            1               1              0          1       1        1
Calpurnia         0               1              0          0       0        0
Cleopatra         1               0              0          0       0        0
mercy             1               0              1          1       1        1
worser            1               0              1          1       1        0
With t = 500K terms and n = 1 million documents, space is 500 Gb!
Solution 2: Inverted index
Brutus    → 1 2 3 5 8 13 21 34
Calpurnia → 2 4 8 16 32 64 128
Caesar    → 13 16
1. Typically a <termID, docID, pos> triple uses about 12 bytes.
2. We have 10^9 total term occurrences → at least 12 Gb of space.
3. Compressing the 6 Gb of documents gets 1.5 Gb of data.
A better index, but still >10 times the compressed text!
We can do still better: i.e., 30-50% of the original text.
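A toy Python sketch of the inverted-index idea: a term → sorted-docID postings map, plus sorted-list intersection for AND queries. Names and the toy documents are ours, not the TREC collection:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict docID -> text. Map each term to the sorted list of
    # docIDs containing it (its posting list).
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            index[term].append(doc_id)
    return dict(index)

def intersect(p1, p2):
    # AND query = linear-time merge of two sorted posting lists.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```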
Toy problem #4 : sorting
How to sort tuples (objects) on disk? 10^9 objects of 12 bytes each, hence 12 Gb.
Key observation: the array A to sort is an "array of pointers to objects". Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]. If we use qsort, this is an indirect sort: Θ(n log n) random memory accesses!! (And how many I/Os?)
[Figure: array A of pointers into the memory containing the tuples (objects)]
Cost of Quicksort on large data
Some typical parameter settings:
• N = 10^9 tuples of 12 bytes each
• Typical disk (Seagate Cheetah, 150 Gb): seek time ≈ 5 ms
Analysis of qsort on disk: qsort is an indirect sort → Θ(n log2 n) random memory accesses:
[5 ms] × n log2 n = 10^9 × log2(10^9) × 5 ms ≥ 3 years
In practice a little bit better because of caching, but...
B-trees for sorting ?
Using a well-tuned B-tree library (Berkeley DB): n = 10^9 insertions → data gets distributed arbitrarily over the leaves!
[Figure: tuples on disk; B-tree internal nodes; B-tree leaves storing "tuple pointers"]
What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 × 5 ms ≈ 2 months.
Binary Merge-Sort
Merge-Sort(A)
01 if length(A) > 1 then
02   Copy the first half of A into array A1
03   Copy the second half of A into array A2
04   Merge-Sort(A1)
05   Merge-Sort(A2)
06   Merge(A, A1, A2)
Divide (lines 02-03), Conquer (lines 04-05), Combine (line 06)
Merge-Sort Recursion Tree
[Recursion tree: the 16 keys are split down to singletons, then merged pairwise level by level]
10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11
2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
log2 n merge levels
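The Merge-Sort pseudocode above, as a runnable Python sketch:

```python
def merge_sort(A):
    # Divide: split A in half; Conquer: sort each half recursively;
    # Combine: merge the two sorted halves.
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    A1, A2 = merge_sort(A[:mid]), merge_sort(A[mid:])
    out, i, j = [], 0, 0
    while i < len(A1) and j < len(A2):
        if A1[i] <= A2[j]:
            out.append(A1[i]); i += 1
        else:
            out.append(A2[j]); j += 1
    return out + A1[i:] + A2[j:]   # append the leftover tail
```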
How do we exploit the disk features ??
External Binary Merge-Sort
Increase the size of initial runs to be merged!
10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
  ↓ main-memory sort of each run of M items
1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
  ↓ external two-way merge
1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
  ↓ external two-way merge
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
N/M initial runs; each merge level is 2 passes (read + write) over the data.
Cost of External Binary Merge-Sort
Some typical parameter settings:
• n = 10^9 tuples of 12 bytes each, N = 12 Gb of data
• Typical disk (Seagate): seek time ≈ 8 ms; avg transfer rate 100 Mb/sec = 10^-8 secs/byte
Analysis of binary merge-sort on disk (M = 10 Mb ≈ 10^6 tuples):
• Data divided into N/M runs: ≈ 10^3 runs
• #levels is log2(N/M) ≈ 10
• It executes 2 × log2(N/M) ≈ 20 passes (R/W) over the data
• I/O-scanning cost: 20 × [12 × 10^9] × 10^-8 ≈ 2400 sec = 40 min
Multi-way Merge-Sort
Sort N items using internal memory M and disk pages of size B:
• Pass 1: produce N/M sorted runs.
• Passes 2, ...: merge X = M/B runs per pass.
[Figure: X = M/B input buffers (INPUT 1 ... INPUT X) of B items each and one OUTPUT buffer in main memory, streaming between disk and disk]
Multiway Merging
[Figure: runs 1..X = M/B, each with a current page Bf_i and cursor p_i; output buffer Bf_o with cursor p_o, flushed to the merged run in the out file]
At each step, append min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to Bf_o.
Fetch the next page of run i when p_i = B; flush Bf_o when it is full; stop at EOF of all runs.
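The min(Bf1[p1], ..., BfX[pX]) selection is what a min-heap over the runs' current front elements provides in O(log X) time per output item. A sketch, with in-memory lists standing in for the disk-resident runs (Python's stdlib `heapq.merge` implements the same idea):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs in one pass; the heap holds one
    # (value, run index, position) entry per non-exhausted run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                      # emit the current minimum
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```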
Cost of Multi-way Merge-Sort
Number of merge passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
Cost of a pass = 2 × (N/B) I/Os
Note: increasing the fan-out M/B (by shrinking B) increases the #I/Os per pass!
Parameters: M = 10 Mb; B = 8 Kb; N = 12 Gb → N/M ≈ 10^3 runs; #merge passes = log_{M/B}(N/M) = 1 !!!
• I/O-scanning: from 20 passes (40 min) down to 2 passes (4 min)
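The parameter arithmetic above can be checked in a few lines (reading Mb/Kb/Gb as decimal units, 10^6/10^3/10^9 bytes, is our assumption):

```python
import math

M = 10 * 10**6    # internal memory: 10 Mb
B = 8 * 10**3     # disk-page size: 8 Kb
N = 12 * 10**9    # data size: 12 Gb

runs = math.ceil(N / M)                        # initial sorted runs
fanout = M // B                                # runs merged per pass
merge_passes = math.ceil(math.log(runs, fanout))
scan_seconds = 2 * N * 1e-8                    # 2 passes (R/W) at 10^-8 s/byte

print(runs, fanout, merge_passes, scan_seconds)  # 1200 1250 1 240.0
```

So a single merge pass suffices, and the total scanning time is 240 sec = 4 min, matching the slide.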
Tuning depends on disk features.
Can compression help? Goal: enlarge M (fit more data in memory) and reduce N.
#passes = O(log_{M/B}(N/M)); cost of a pass = O(N/B).
Please!! Do not underestimate the features of disks in algorithmic design.