
INF 2914 Information Retrieval and Web Search

Lecture 6: Index Construction

These slides are adapted from Stanford's class CS276 / LING 286, Information Retrieval and Web Mining.

(Offline) Search Engine Data Flow

(Figure: the offline indexing pipeline.)
1. Crawler → web pages
2. Parse & Tokenize: parse, tokenize, per-page analysis → tokenized web pages
3. Global Analysis (runs in the background): dup detection, static rank, anchor text, spam analysis, ... → dup table, rank table, anchor text, spam table
4. Index Build: scan tokenized web pages, anchor text, etc.; generate text index → inverted text index

Inverted index

For each term T, we must store a list of all documents that contain T.

Dictionary     Postings lists
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Caesar    → 1 2 3 5 8 13 21 34

Each entry in a postings list is called a posting. Postings are sorted by docID (more later on why).
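A minimal sketch of this structure in Python, using the terms and docIDs from the figure (a dict for the dictionary, sorted lists for the postings):

    # Minimal in-memory inverted index: the dictionary maps each term
    # to its postings list, kept sorted by docID.
    index = {
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "calpurnia": [13, 16],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    }
    print(index["brutus"])  # docIDs of all documents containing "brutus"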

Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends Romans Countrymen
        ↓ Linguistic modules
Modified tokens:          friend roman countryman
        ↓ Indexer
Inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Indexer steps

Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2

Sort by terms: the core indexing step.

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2

Multiple term entries in a single document are merged, and frequency information is added.

Term      Doc #  Term freq
ambitious 2      1
be        2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
did       1      1
enact     1      1
hath      2      1
I         1      2
i'        1      1
it        2      1
julius    1      1
killed    1      2
let       2      1
me        1      1
noble     2      1
so        2      1
the       1      1
the       2      1
told      2      1
you       2      1
was       1      1
was       2      1
with      2      1

Why frequency? Will discuss later.

The result is split into a Dictionary file and a Postings file.

Dictionary (Term, N docs, Coll freq) → Postings (Doc #, Freq):

ambitious 1 1 → (2,1)
be        1 1 → (2,1)
brutus    2 2 → (1,1) (2,1)
capitol   1 1 → (1,1)
caesar    2 3 → (1,1) (2,2)
did       1 1 → (1,1)
enact     1 1 → (1,1)
hath      1 1 → (2,1)
I         1 2 → (1,2)
i'        1 1 → (1,1)
it        1 1 → (2,1)
julius    1 1 → (1,1)
killed    1 2 → (1,2)
let       1 1 → (2,1)
me        1 1 → (1,1)
noble     1 1 → (2,1)
so        1 1 → (2,1)
the       2 2 → (1,1) (2,1)
told      1 1 → (2,1)
you       1 1 → (2,1)
was       2 2 → (1,1) (2,1)
with      1 1 → (2,1)
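A hedged Python sketch of these indexer steps (sort the (term, docID) pairs, merge duplicates into frequencies, split into dictionary and postings); the function and field names are illustrative, not from the slides:

    from collections import Counter
    from itertools import groupby

    def build_index(pairs):
        # pairs: iterable of (term, doc_id), one per token occurrence
        freqs = Counter(pairs)              # merge duplicates, count term freq
        entries = sorted(freqs.items())     # sort by (term, doc_id)
        dictionary, postings = {}, []
        for term, group in groupby(entries, key=lambda e: e[0][0]):
            plist = [(doc, tf) for (_, doc), tf in group]
            dictionary[term] = {
                "n_docs": len(plist),
                "coll_freq": sum(tf for _, tf in plist),
                "offset": len(postings),    # start of this term's postings
            }
            postings.extend(plist)
        return dictionary, postings

    d, p = build_index([("caesar", 1), ("i", 1), ("caesar", 2), ("caesar", 2)])
    off, n = d["caesar"]["offset"], d["caesar"]["n_docs"]
    print(d["caesar"], p[off:off + n])  # n_docs=2, coll_freq=3 -> [(1, 1), (2, 2)]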


The index we just built

How do we process a query?

Query processing: AND

Consider processing the query: Brutus AND Caesar
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- "Merge" the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
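As a hedged sketch, the merge translates to a few lines of Python (the function name is ours):

    def intersect(p1, p2):
        # Linear-time intersection of two docID-sorted postings lists: O(x+y).
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128],
                    [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]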

Index construction

How do we construct an index? What strategies can we use with limited main memory?

Our corpus for this lecture

- Number of docs = n = 1M
- Each doc has 1K terms
- Number of distinct terms = m = 500K
- 667 million postings entries

How many postings?

Number of 1's in the i-th block = nJ/i. Summing this over the m/J blocks, we have

  Σ_{i=1..m/J} nJ/i ≈ nJ ln(m/J)

For our numbers, this should be about 667 million postings.

Recall index construction

Documents are processed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

This yields the same sequence of (term, Doc #) pairs shown earlier.

Key step

After all documents have been processed, the inverted file is sorted by terms (producing the term-ordered list shown earlier).

We focus on this sort step. We have 667M items to sort.

Index construction

At 10-12 bytes per postings entry, sorting demands several temporary gigabytes.

System parameters for design

- Disk seek ~ 10 milliseconds
- Block transfer from disk ~ 1 microsecond per byte (following a seek)
- All other ops ~ 1 microsecond
  - E.g., compare two postings entries and decide their merge order

Bottleneck

- Build postings entries one doc at a time
- Now sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow – must sort N=667M records

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

Disk-based sorting

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? 12.4 years!!!

- Build postings entries one doc at a time
- Now sort postings entries by term
- Doing this with random disk seeks would be too slow – must sort N=667M records
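The 12.4-year figure follows directly from the stated parameters; a quick back-of-the-envelope check:

    import math

    N = 667e6                           # records to sort
    comparisons = N * math.log2(N)      # ~2e10 comparisons
    seconds = comparisons * 2 * 0.010   # 2 disk seeks per comparison, 10 ms each
    print(seconds / (3600 * 24 * 365))  # ~12.4 years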

Sorting with fewer disk seeks

- 12-byte (4+4+4) records (term, doc, freq)
- These are generated as we process docs
- Must now sort 667M such 12-byte records by term
- Define a Block ~ 10M such records: can "easily" fit a couple into memory
- Will have 64 such blocks to start with
- Will sort within blocks first, then merge the blocks into one long sorted order

Sorting 64 blocks of 10M records

- First, read each block and sort within:
  - Quicksort takes 2N ln N expected steps
  - In our case, 2 x (10M ln 10M) steps
  - Time to Quicksort each block = 320 seconds
- Total time to read each block from disk and write it back: 120M x 2 x 10^-6 = 240 seconds
- 64 times these estimates gives us 64 sorted runs of 10M records each:
  - Total Quicksort time = 5.6 hours
  - Total read+write time = 4.2 hours
  - Total for this phase ~ 10 hours
- Need 2 copies of data on disk, throughout
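These round numbers can be checked against the system parameters from before (1 microsecond per in-memory step and per transferred byte):

    import math

    block = 10e6                               # records per block
    qsort_steps = 2 * block * math.log(block)  # 2N ln N expected steps
    print(qsort_steps * 1e-6)                  # ~322 s per block ("320 seconds")

    block_bytes = block * 12                   # 12-byte records -> 120 MB
    print(block_bytes * 2 * 1e-6)              # read + write: 240 s

    print(64 * qsort_steps * 1e-6 / 3600)      # total Quicksort: ~5.7 h (slide rounds to 5.6)
    print(64 * block_bytes * 2 * 1e-6 / 3600)  # total read+write: ~4.3 h (slide rounds to 4.2)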

Merging 64 sorted runs

- Merge tree of log2 64 = 6 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

(Figure: sorted runs on disk are read in pairs, merged in memory, and each merged run is written back to disk.)

Merge tree

Bottom level of tree: 64 sorted runs of 10M records each (1, 2, ..., 63, 64). After each layer of merging:
- 32 runs, 20M/run
- 16 runs, 40M/run
- 8 runs, 80M/run
- 4 runs ... ?
- 2 runs ... ?
- 1 run ... ?

Merging 64 runs

- Time estimate for disk transfer: 6 x time to read+write 64 blocks = 6 x 4.2 hours ~ 25 hours
- Time estimate for the merge operation: 6 x 640M x 10^-6 = 1 hour
- Time estimate for the overall algorithm: sort time + merge time ~ 10 + 26 ~ 36 hours
- Lower bound (main memory sort):
  - Time to read+write = 4.2 hours
  - Time to sort in memory = 10.7 hours
  - Total time ~ 15 hours

Some indexing numbers

(Chart: indexing throughput in K documents per hour, on a 0-4500 scale, for Trevi, Lucy, Melnik et al., Long and Suel, and Lucene.)

How to improve indexing time?

- Compression of the sorted runs
- Multi-way merge: heap-merge all runs
- Radix sort (linear time sorting)
- Pipelining the reading, sorting, and writing phases

Multi-way merge

(Figure: sorted runs 1, 2, ..., 63, 64 feed a single heap, which repeatedly emits the smallest head record, so all runs are merged in one pass.)
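A minimal sketch with Python's heapq, assuming each run is an iterable of (term, doc, freq) records in sorted order:

    import heapq

    def multiway_merge(runs):
        # heapq.merge keeps one head record per run in a heap and pops the
        # globally smallest each step: one pass instead of 6 merge layers.
        yield from heapq.merge(*runs)

    r1 = [("brutus", 1, 1), ("caesar", 1, 1)]
    r2 = [("brutus", 2, 1), ("caesar", 2, 2)]
    r3 = [("ambitious", 2, 1)]
    print(list(multiway_merge([r1, r2, r3])))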

Indexing improvements

- Radix sort
  - Linear time sorting
  - Flexibility in defining the sort criteria
  - Bigger sort buffers increase performance, contradicting previous literature (see the VLDB paper in the references)
- Pipelining the read and the sort + write phases

(Figure: two buffers B1 and B2 alternate over time; while one buffer is being read, the other is being sorted and written.)

Positional indexing

Given documents:
D1: This is a test
D2: Is this a test
D3: This is not a test

Reorganize by term:

TERM  DOC  LOC  DATA (caps)
this  1    0    1
is    1    1    0
a     1    2    0
test  1    3    0
is    2    0    1
this  2    1    0
a     2    2    0
test  2    3    0
this  3    0    1
is    3    1    0
not   3    2    0
a     3    3    0
test  3    4    0

Positional indexing

Sort by <term, doc, loc>:

TERM  DOC  LOC  DATA (caps)
a     1    2    0
a     2    2    0
a     3    3    0
is    1    1    0
is    2    0    1
is    3    1    0
not   3    2    0
test  1    3    0
test  2    3    0
test  3    4    0
this  1    0    1
this  2    1    0
this  3    0    1

In "postings list" format:

a    → (1,2,0), (2,2,0), (3,3,0)
is   → (1,1,0), (2,0,1), (3,1,0)
not  → (3,2,0)
test → (1,3,0), (2,3,0), (3,4,0)
this → (1,0,1), (2,1,0), (3,0,1)

Positional indexing with radix sort

- Radix key = token hash = 8 bytes
- Document ID = 8 bytes
- Location = 4 bytes, but no need to sort by location, since radix sort is stable!
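A hedged sketch of the idea (byte widths as on the slide; the helper itself is ours). Because each counting pass is stable, records that enter in location order leave in location order, so only the 16 key bytes need to be sorted on:

    def radix_sort(postings):
        # postings: list of (term_hash, doc_id, loc); term_hash and doc_id
        # are unsigned 64-bit ints, together forming a 16-byte radix key.
        key = lambda p: (p[0] << 64) | p[1]
        for b in range(16):                 # LSD: least significant byte first
            shift = 8 * b
            buckets = [[] for _ in range(256)]
            for p in postings:
                buckets[(key(p) >> shift) & 0xFF].append(p)
            postings = [p for bucket in buckets for p in bucket]
        return postings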

Distributed indexing

- Maintain a master machine directing the indexing job – considered "safe"
- Break up indexing into sets of (parallel) tasks
- Master machine assigns each task to an idle machine from a pool

Parallel tasks

- We will use two sets of parallel tasks: Parsers and Inverters
- Break the input document corpus into splits; each split is a subset of documents
- Master assigns a split to an idle parser machine
- Parser reads a document at a time and emits (term, doc) pairs

Parallel tasks

- Parser writes pairs into j partitions, each for a range of terms' first letters (e.g., a-f, g-p, q-z); here j=3
- Now to complete the index inversion

Data flow

(Figure: the Master assigns splits to Parsers and partitions to Inverters. Each Parser writes its (term, doc) pairs into the j=3 partitions a-f, g-p, and q-z; each Inverter reads one partition and writes its postings.)

Inverters

- Collect all (term, doc) pairs for one partition
- Sort them and write postings lists
- Each partition contains a set of postings

The above process flow is a special case of MapReduce.
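A hedged sketch of index inversion phrased as map/reduce; the driver loop stands in for the distributed runtime, and all names are illustrative:

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Parser: emit a (term, doc) pair per token occurrence.
        for term in text.lower().split():
            yield term, doc_id

    def reduce_fn(term, doc_ids):
        # Inverter: collect one term's pairs into a sorted postings list.
        return term, sorted(set(doc_ids))

    docs = {1: "i did enact julius caesar", 2: "brutus killed caesar"}
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_fn(doc_id, text):
            grouped[term].append(d)          # shuffle: group pairs by term
    index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
    print(index["caesar"])  # [1, 2]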

MapReduce

- Model for processing large data sets
- Contains Map and Reduce functions
- Runs on a large cluster of machines
- A lot of MapReduce programs are executed on Google's cluster every day

Motivation

- Input data is large: the whole Web, billions of pages
- Lots of machines: use them efficiently

A real example

- Term frequencies through the whole Web repository
- Count of URL access frequency
- Reverse web-link graph
- ...

Programming model

- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
  - map(in_key, in_value) -> list(out_key, intermediate_value)
    - Processes an input key/value pair
    - Produces a set of intermediate pairs
  - reduce(out_key, list(intermediate_value)) -> list(out_value)
    - Combines all intermediate values for a particular key
    - Produces a set of merged output values (usually just one)


Example

Page 1: the weather is good Page 2: today is good Page 3: good weather is good.

Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));


Map output

Worker 1: (the 1), (weather 1), (is 1), (good 1).

Worker 2: (today 1), (is 1), (good 1).

Worker 3: (good 1), (weather 1), (is 1), (good 1).


Reduce Input

Worker 1: (the 1)

Worker 2: (is 1), (is 1), (is 1)

Worker 3: (weather 1), (weather 1)

Worker 4: (today 1)

Worker 5: (good 1), (good 1), (good 1), (good 1)


Reduce Output

Worker 1: (the 1)

Worker 2: (is 3)

Worker 3: (weather 2)

Worker 4: (today 1)

Worker 5: (good 4)


Fault tolerance

- Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
- Storage is on local IDE disks
- GFS: distributed file system manages data (SOSP '03)
- Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
- Implementation is a C++ library linked into user programs

Fault tolerance

- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion committed through master
- On master failure:
  - Could handle, but don't yet (master failure unlikely)


Performance

Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): 150 seconds.

Sort 10^10 100-byte records (modeled after TeraSort benchmark): 839 seconds.


More and more MapReduce


Experience: Rewrite of Production Indexing System

- Rewrote Google's production indexing system using MapReduce
- Set of 24 MapReduce operations
- New code is simpler, easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines


MapReduce Overview

MapReduce has proven to be a useful abstraction

Greatly simplifies large-scale computations at Google

Fun to use: focus on problem, let library deal w/ messy details

Resources

- MG (Managing Gigabytes), Chapter 5
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
- Indexing Shared Content in Information Retrieval Systems, A. Broder et al., EDBT 2006
- High Performance Index Build Algorithms for Intranet Search Engines, M. Fontoura et al., VLDB 2004