Algorithms for Information Retrieval
Is algorithmic design a 5-minute thinking task ???
Toy problem #1: Max Subarray
Algorithm (solution 1):
1. Compute P[1,n], the array of prefix sums over A.
2. Compute M[1,n], the array of prefix minima over P.
3. Find end such that P[end] - M[end] is maximum; start is the position ≤ end where P attains that minimum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Goal: Find the time window achieving the best “market performance”.
Math Problem: Find the subarray of maximum sum.
P = 2 -3 3 4 2 6 9 -4 5 -1 6
M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4
Note:
• max_{x≤y} sum A[x..y]
  = max_{x≤y} ( P[y] − P[x] )
  = max_y [ P[y] − ( min_{x≤y} P[x] ) ]
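Solution 1 can be sketched in Python, folding the P and M arrays into two running scalars (the function name `max_subarray_prefix` is ours, not from the slides):

```python
def max_subarray_prefix(A):
    # Solution 1: prefix sums P and prefix minima M, kept as scalars.
    best = A[0]
    P = 0   # running prefix sum P[y]
    m = 0   # minimum prefix sum seen so far (P of the empty prefix is 0)
    for a in A:
        P += a
        best = max(best, P - m)   # best subarray ending at this position
        m = min(m, P)
    return best
```

On the slides' example array it returns 12, the sum of the window 6 1 -2 4 3.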
Toy problem #1 (solution 2)
Algorithm:
sum = 0
for i = 1, ..., n do
  if (sum + A[i] ≤ 0) sum = 0;
  else { max_sum = MAX(max_sum, sum + A[i]); sum += A[i]; }
A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Note:
• sum = 0 right where the optimum starts
• sum > 0 within the optimum
[Figure: array A with the optimum segment marked; prefixes ending before it sum ≤ 0, prefixes inside it sum ≥ 0]
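A minimal Python sketch of solution 2 (like the slide's formulation, it assumes the maximum sum is positive; the function name is ours):

```python
def max_subarray_scan(A):
    # Solution 2: one scan, O(n) time, O(1) space.
    # Assumes at least one positive entry (max_sum starts at 0).
    max_sum = 0
    s = 0
    for a in A:
        if s + a <= 0:
            s = 0   # the optimum cannot start inside a <=0-sum stretch
        else:
            s += a
            max_sum = max(max_sum, s)
    return max_sum
```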
Toy problem #2: Top-freq elements
Algorithm: Use a pair of variables <X,C>.
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
Goal: Top queries over a stream of n items (large).
Math Problem: Find the item y whose frequency is > n/2, using the smallest possible space (i.e., the mode, provided it occurs > n/2 times).
Proof. If the final X ≠ y, then every one of y's occurrences was cancelled by a distinct "negative" mate. Hence the mates number ≥ #occ(y), so n ≥ 2 · #occ(y), contradicting #occ(y) > n/2.
A = b a c c c d c b a a a c c b c c c
<X,C> trace: <b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>
Problems arise if the most frequent item occurs ≤ n/2 times: the returned X need not be the mode, so a verification pass is required.
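The <X,C> scheme above (the Boyer-Moore majority vote), sketched in Python using the equivalent check-counter-first formulation; `majority_candidate` is an illustrative name:

```python
def majority_candidate(stream):
    # One pass, one <X,C> pair: returns the majority item if some item
    # occurs > n/2 times; otherwise the result is arbitrary and must be
    # verified by a second counting pass.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X
```

On the slides' stream b a c c c d c b a a a c c b c c c it returns 'c', which indeed occurs 9 times out of 17.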
Toy problem #3 : Indexing
Consider the following TREC collection:
• size N = 6 × 10^9 bytes
• n = 10^6 documents
• TotT = 10^9 total term occurrences (avg term length is 6 chars)
• t = 5 × 10^5 distinct terms
What kind of data structure should we build to support word-based searches?
Solution 1: Term-Doc matrix
1 if play contains word, 0 otherwise
           Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1               1              0          0       0        1
Brutus            1               1              0          1       0        0
Caesar            1               1              0          1       1        1
Calpurnia         0               1              0          0       0        0
Cleopatra         1               0              0          0       0        0
mercy             1               0              1          1       1        1
worser            1               0              1          1       1        0
With t = 500K terms and n = 1 million documents, space is 500 Gb!
Solution 2: Inverted index
Brutus    → 1 2 3 5 8 13 21 34
Calpurnia → 2 4 8 16 32 64 128
Caesar    → 13 16
1. Typically a <termID, docID, pos> triple uses about 12 bytes.
2. We have 10^9 total term occurrences → at least 12 Gb of space.
3. Compressing the 6 Gb of documents gets 1.5 Gb of data.
A better index, but still >10 times the compressed text!
We can do still better: i.e., 30-50% of the original text.
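A toy Python sketch of the inverted-index idea: a term → sorted-docID postings map, plus sorted-list intersection for AND queries. Names and the toy documents are ours, not the TREC collection:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict docID -> text. Map each term to the sorted list of
    # docIDs containing it (its posting list).
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            index[term].append(doc_id)
    return dict(index)

def intersect(p1, p2):
    # AND query = linear-time merge of two sorted posting lists.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```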
Toy problem #4 : sorting
How to sort tuples (objects) on disk? 10^9 objects of 12 bytes each, hence 12 Gb.
Key observation: the array A to sort is an "array of pointers to objects". Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed to by A[i] and A[j]. If we use qsort, this is an indirect sort: Θ(n log n) random memory accesses!! (And how many I/Os?)
[Figure: array A of pointers into the memory containing the tuples (objects)]
Cost of Quicksort on large data
Some typical parameter settings:
• N = 10^9 tuples of 12 bytes each
• Typical disk (Seagate Cheetah, 150 Gb): seek time ≈ 5 ms
Analysis of qsort on disk: qsort is an indirect sort → Θ(n log2 n) random memory accesses:
[5 ms] × n log2 n = 10^9 × log2(10^9) × 5 ms ≥ 3 years
In practice a little bit better because of caching, but...
B-trees for sorting ?
Using a well-tuned B-tree library (Berkeley DB): n = 10^9 insertions → data gets distributed arbitrarily over the leaves!
[Figure: tuples on disk; B-tree internal nodes; B-tree leaves storing "tuple pointers"]
What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 × 5 ms ≈ 2 months.
Binary Merge-Sort
Merge-Sort(A)
01 if length(A) > 1 then
02   Copy the first half of A into array A1
03   Copy the second half of A into array A2
04   Merge-Sort(A1)
05   Merge-Sort(A2)
06   Merge(A, A1, A2)
Divide (lines 02-03), Conquer (lines 04-05), Combine (line 06)
Merge-Sort Recursion Tree
[Recursion tree: the 16 keys are split down to singletons, then merged pairwise level by level]
10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11
2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
log2 n merge levels
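The Merge-Sort pseudocode above, as a runnable Python sketch:

```python
def merge_sort(A):
    # Divide: split A in half; Conquer: sort each half recursively;
    # Combine: merge the two sorted halves.
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    A1, A2 = merge_sort(A[:mid]), merge_sort(A[mid:])
    out, i, j = [], 0, 0
    while i < len(A1) and j < len(A2):
        if A1[i] <= A2[j]:
            out.append(A1[i]); i += 1
        else:
            out.append(A2[j]); j += 1
    return out + A1[i:] + A2[j:]   # append the leftover tail
```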
How do we exploit the disk features ??
External Binary Merge-Sort
Increase the size of initial runs to be merged!
10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
  ↓ main-memory sort of each run of M items
1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
  ↓ external two-way merge
1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
  ↓ external two-way merge
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
N/M initial runs; each merge level is 2 passes (read + write) over the data.
Cost of External Binary Merge-Sort
Some typical parameter settings:
• n = 10^9 tuples of 12 bytes each, N = 12 Gb of data
• Typical disk (Seagate): seek time ≈ 8 ms; avg transfer rate 100 Mb/sec = 10^-8 secs/byte
Analysis of binary merge-sort on disk (M = 10 Mb ≈ 10^6 tuples):
• Data divided into N/M runs: ≈ 10^3 runs
• #levels is log2(N/M) ≈ 10
• It executes 2 × log2(N/M) ≈ 20 passes (R/W) over the data
• I/O-scanning cost: 20 × [12 × 10^9] × 10^-8 ≈ 2400 sec = 40 min
Multi-way Merge-Sort
Sort N items using internal memory M and disk pages of size B:
• Pass 1: produce N/M sorted runs.
• Passes 2, ...: merge X = M/B runs per pass.
[Figure: X = M/B input buffers (INPUT 1 ... INPUT X) of B items each and one OUTPUT buffer in main memory, streaming between disk and disk]
Multiway Merging
[Figure: runs 1..X = M/B, each with a current page Bf_i and cursor p_i; output buffer Bf_o with cursor p_o, flushed to the merged run in the out file]
At each step, append min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to Bf_o.
Fetch the next page of run i when p_i = B; flush Bf_o when it is full; stop at EOF of all runs.
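The min(Bf1[p1], ..., BfX[pX]) selection is what a min-heap over the runs' current front elements provides in O(log X) time per output item. A sketch, with in-memory lists standing in for the disk-resident runs (Python's stdlib `heapq.merge` implements the same idea):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs in one pass; the heap holds one
    # (value, run index, position) entry per non-exhausted run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                      # emit the current minimum
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```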
Cost of Multi-way Merge-Sort
Number of merge passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
Cost of a pass = 2 × (N/B) I/Os
Note: increasing the fan-out M/B (by shrinking B) increases the #I/Os per pass!
Parameters: M = 10 Mb; B = 8 Kb; N = 12 Gb → N/M ≈ 10^3 runs; #merge passes = log_{M/B}(N/M) = 1 !!!
• I/O-scanning: from 20 passes (40 min) down to 2 passes (4 min)
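The parameter arithmetic above can be checked in a few lines (reading Mb/Kb/Gb as decimal units, 10^6/10^3/10^9 bytes, is our assumption):

```python
import math

M = 10 * 10**6    # internal memory: 10 Mb
B = 8 * 10**3     # disk-page size: 8 Kb
N = 12 * 10**9    # data size: 12 Gb

runs = math.ceil(N / M)                        # initial sorted runs
fanout = M // B                                # runs merged per pass
merge_passes = math.ceil(math.log(runs, fanout))
scan_seconds = 2 * N * 1e-8                    # 2 passes (R/W) at 10^-8 s/byte

print(runs, fanout, merge_passes, scan_seconds)  # 1200 1250 1 240.0
```

So a single merge pass suffices, and the total scanning time is 240 sec = 4 min, matching the slide.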
Tuning depends on disk features.
Can compression help? Goal: enlarge M (fit more data in memory) and reduce N.
#passes = O(log_{M/B}(N/M)); cost of a pass = O(N/B).
Please!! Do not underestimate the features of disks in algorithmic design.