
INF 2914 Information Retrieval and Web Search

Lecture 6: Index Construction

These slides are adapted from Stanford's class CS276 / LING 286, Information Retrieval and Web Mining.

(Offline) Search Engine Data Flow

(Figure: the offline indexing pipeline.)
1. Crawler → web pages
2. Parse & Tokenize: parse, tokenize, per-page analysis → tokenized web pages
3. Global Analysis (runs in the background): dup detection, static rank, anchor text, spam analysis, ... → dup table, rank table, anchor text, spam table
4. Index Build: scan tokenized web pages, anchor text, etc.; generate text index → inverted text index

Inverted index

For each term T, we must store a list of all documents that contain T.

Dictionary     Postings lists
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Caesar    → 1 2 3 5 8 13 21 34

Each entry in a postings list is called a posting. Postings are sorted by docID (more later on why).
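A minimal sketch of this structure in Python, using the terms and docIDs from the figure (a dict for the dictionary, sorted lists for the postings):

    # Minimal in-memory inverted index: the dictionary maps each term
    # to its postings list, kept sorted by docID.
    index = {
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "calpurnia": [13, 16],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    }
    print(index["brutus"])  # docIDs of all documents containing "brutus"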

Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends Romans Countrymen
        ↓ Linguistic modules
Modified tokens:          friend roman countryman
        ↓ Indexer
Inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Indexer steps

Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2

Sort by terms: the core indexing step.

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2

Multiple term entries in a single document are merged, and frequency information is added.

Term      Doc #  Term freq
ambitious 2      1
be        2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
did       1      1
enact     1      1
hath      2      1
I         1      2
i'        1      1
it        2      1
julius    1      1
killed    1      2
let       2      1
me        1      1
noble     2      1
so        2      1
the       1      1
the       2      1
told      2      1
you       2      1
was       1      1
was       2      1
with      2      1

Why frequency? Will discuss later.

The result is split into a Dictionary file and a Postings file.

Dictionary (Term, N docs, Coll freq) → Postings (Doc #, Freq):

ambitious 1 1 → (2,1)
be        1 1 → (2,1)
brutus    2 2 → (1,1) (2,1)
capitol   1 1 → (1,1)
caesar    2 3 → (1,1) (2,2)
did       1 1 → (1,1)
enact     1 1 → (1,1)
hath      1 1 → (2,1)
I         1 2 → (1,2)
i'        1 1 → (1,1)
it        1 1 → (2,1)
julius    1 1 → (1,1)
killed    1 2 → (1,2)
let       1 1 → (2,1)
me        1 1 → (1,1)
noble     1 1 → (2,1)
so        1 1 → (2,1)
the       2 2 → (1,1) (2,1)
told      1 1 → (2,1)
you       1 1 → (2,1)
was       2 2 → (1,1) (2,1)
with      1 1 → (2,1)
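A hedged Python sketch of these indexer steps (sort the (term, docID) pairs, merge duplicates into frequencies, split into dictionary and postings); the function and field names are illustrative, not from the slides:

    from collections import Counter
    from itertools import groupby

    def build_index(pairs):
        # pairs: iterable of (term, doc_id), one per token occurrence
        freqs = Counter(pairs)              # merge duplicates, count term freq
        entries = sorted(freqs.items())     # sort by (term, doc_id)
        dictionary, postings = {}, []
        for term, group in groupby(entries, key=lambda e: e[0][0]):
            plist = [(doc, tf) for (_, doc), tf in group]
            dictionary[term] = {
                "n_docs": len(plist),
                "coll_freq": sum(tf for _, tf in plist),
                "offset": len(postings),    # start of this term's postings
            }
            postings.extend(plist)
        return dictionary, postings

    d, p = build_index([("caesar", 1), ("i", 1), ("caesar", 2), ("caesar", 2)])
    off, n = d["caesar"]["offset"], d["caesar"]["n_docs"]
    print(d["caesar"], p[off:off + n])  # n_docs=2, coll_freq=3 -> [(1, 1), (2, 2)]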


The index we just built

How do we process a query?

Query processing: AND

Consider processing the query: Brutus AND Caesar
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- "Merge" the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
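As a hedged sketch, the merge translates to a few lines of Python (the function name is ours):

    def intersect(p1, p2):
        # Linear-time intersection of two docID-sorted postings lists: O(x+y).
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128],
                    [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]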

Index construction

How do we construct an index? What strategies can we use with limited main memory?

Our corpus for this lecture

- Number of docs = n = 1M
- Each doc has 1K terms
- Number of distinct terms = m = 500K
- 667 million postings entries

How many postings?

Number of 1's in the i-th block = nJ/i. Summing this over the m/J blocks, we have

  Σ_{i=1..m/J} nJ/i ≈ nJ ln(m/J)

For our numbers, this should be about 667 million postings.

Recall index construction

Documents are processed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

This yields the same sequence of (term, Doc #) pairs shown earlier.

Key step

After all documents have been processed, the inverted file is sorted by terms (producing the term-ordered list shown earlier).

We focus on this sort step. We have 667M items to sort.

Index construction

At 10-12 bytes per postings entry, sorting demands several temporary gigabytes.

System parameters for design

- Disk seek ~ 10 milliseconds
- Block transfer from disk ~ 1 microsecond per byte (following a seek)
- All other ops ~ 1 microsecond
  - E.g., compare two postings entries and decide their merge order

Bottleneck

- Build postings entries one doc at a time
- Now sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow – must sort N=667M records

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

Disk-based sorting

If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? 12.4 years!!!

- Build postings entries one doc at a time
- Now sort postings entries by term
- Doing this with random disk seeks would be too slow – must sort N=667M records
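The 12.4-year figure follows directly from the stated parameters; a quick back-of-the-envelope check:

    import math

    N = 667e6                           # records to sort
    comparisons = N * math.log2(N)      # ~2e10 comparisons
    seconds = comparisons * 2 * 0.010   # 2 disk seeks per comparison, 10 ms each
    print(seconds / (3600 * 24 * 365))  # ~12.4 years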

Sorting with fewer disk seeks

- 12-byte (4+4+4) records (term, doc, freq)
- These are generated as we process docs
- Must now sort 667M such 12-byte records by term
- Define a Block ~ 10M such records: can "easily" fit a couple into memory
- Will have 64 such blocks to start with
- Will sort within blocks first, then merge the blocks into one long sorted order

Sorting 64 blocks of 10M records

- First, read each block and sort within:
  - Quicksort takes 2N ln N expected steps
  - In our case, 2 x (10M ln 10M) steps
  - Time to Quicksort each block = 320 seconds
- Total time to read each block from disk and write it back: 120M x 2 x 10^-6 = 240 seconds
- 64 times these estimates gives us 64 sorted runs of 10M records each:
  - Total Quicksort time = 5.6 hours
  - Total read+write time = 4.2 hours
  - Total for this phase ~ 10 hours
- Need 2 copies of data on disk, throughout
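These round numbers can be checked against the system parameters from before (1 microsecond per in-memory step and per transferred byte):

    import math

    block = 10e6                               # records per block
    qsort_steps = 2 * block * math.log(block)  # 2N ln N expected steps
    print(qsort_steps * 1e-6)                  # ~322 s per block ("320 seconds")

    block_bytes = block * 12                   # 12-byte records -> 120 MB
    print(block_bytes * 2 * 1e-6)              # read + write: 240 s

    print(64 * qsort_steps * 1e-6 / 3600)      # total Quicksort: ~5.7 h (slide rounds to 5.6)
    print(64 * block_bytes * 2 * 1e-6 / 3600)  # total read+write: ~4.3 h (slide rounds to 4.2)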

Merging 64 sorted runs

- Merge tree of log2 64 = 6 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

(Figure: sorted runs on disk are read in pairs, merged in memory, and each merged run is written back to disk.)

Merge tree

Bottom level of tree: 64 sorted runs of 10M records each (1, 2, ..., 63, 64). After each layer of merging:
- 32 runs, 20M/run
- 16 runs, 40M/run
- 8 runs, 80M/run
- 4 runs ... ?
- 2 runs ... ?
- 1 run ... ?

Merging 64 runs

- Time estimate for disk transfer: 6 x time to read+write 64 blocks = 6 x 4.2 hours ~ 25 hours
- Time estimate for the merge operation: 6 x 640M x 10^-6 = 1 hour
- Time estimate for the overall algorithm: sort time + merge time ~ 10 + 26 ~ 36 hours
- Lower bound (main memory sort):
  - Time to read+write = 4.2 hours
  - Time to sort in memory = 10.7 hours
  - Total time ~ 15 hours

Some indexing numbers

(Chart: indexing throughput in K documents per hour, on a 0-4500 scale, for Trevi, Lucy, Melnik et al., Long and Suel, and Lucene.)

How to improve indexing time?

- Compression of the sorted runs
- Multi-way merge: heap-merge all runs
- Radix sort (linear time sorting)
- Pipelining the reading, sorting, and writing phases

Multi-way merge

(Figure: sorted runs 1, 2, ..., 63, 64 feed a single heap, which repeatedly emits the smallest head record, so all runs are merged in one pass.)
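A minimal sketch with Python's heapq, assuming each run is an iterable of (term, doc, freq) records in sorted order:

    import heapq

    def multiway_merge(runs):
        # heapq.merge keeps one head record per run in a heap and pops the
        # globally smallest each step: one pass instead of 6 merge layers.
        yield from heapq.merge(*runs)

    r1 = [("brutus", 1, 1), ("caesar", 1, 1)]
    r2 = [("brutus", 2, 1), ("caesar", 2, 2)]
    r3 = [("ambitious", 2, 1)]
    print(list(multiway_merge([r1, r2, r3])))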

Indexing improvements

- Radix sort
  - Linear time sorting
  - Flexibility in defining the sort criteria
  - Bigger sort buffers increase performance, contradicting previous literature (see the VLDB paper in the references)
- Pipelining the read and the sort + write phases

(Figure: two buffers B1 and B2 alternate over time; while one buffer is being read, the other is being sorted and written.)

Positional indexing

Given documents:
D1: This is a test
D2: Is this a test
D3: This is not a test

Reorganize by term:

TERM  DOC  LOC  DATA (caps)
this  1    0    1
is    1    1    0
a     1    2    0
test  1    3    0
is    2    0    1
this  2    1    0
a     2    2    0
test  2    3    0
this  3    0    1
is    3    1    0
not   3    2    0
a     3    3    0
test  3    4    0

Positional indexing

Sort by <term, doc, loc>:

TERM  DOC  LOC  DATA (caps)
a     1    2    0
a     2    2    0
a     3    3    0
is    1    1    0
is    2    0    1
is    3    1    0
not   3    2    0
test  1    3    0
test  2    3    0
test  3    4    0
this  1    0    1
this  2    1    0
this  3    0    1

In "postings list" format:

a    → (1,2,0), (2,2,0), (3,3,0)
is   → (1,1,0), (2,0,1), (3,1,0)
not  → (3,2,0)
test → (1,3,0), (2,3,0), (3,4,0)
this → (1,0,1), (2,1,0), (3,0,1)

Positional indexing with radix sort

- Radix key = token hash = 8 bytes
- Document ID = 8 bytes
- Location = 4 bytes, but no need to sort by location, since radix sort is stable!
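A hedged sketch of the idea (byte widths as on the slide; the helper itself is ours). Because each counting pass is stable, records that enter in location order leave in location order, so only the 16 key bytes need to be sorted on:

    def radix_sort(postings):
        # postings: list of (term_hash, doc_id, loc); term_hash and doc_id
        # are unsigned 64-bit ints, together forming a 16-byte radix key.
        key = lambda p: (p[0] << 64) | p[1]
        for b in range(16):                 # LSD: least significant byte first
            shift = 8 * b
            buckets = [[] for _ in range(256)]
            for p in postings:
                buckets[(key(p) >> shift) & 0xFF].append(p)
            postings = [p for bucket in buckets for p in bucket]
        return postings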

Distributed indexing

- Maintain a master machine directing the indexing job – considered "safe"
- Break up indexing into sets of (parallel) tasks
- Master machine assigns each task to an idle machine from a pool

Parallel tasks

- We will use two sets of parallel tasks: Parsers and Inverters
- Break the input document corpus into splits; each split is a subset of documents
- Master assigns a split to an idle parser machine
- Parser reads a document at a time and emits (term, doc) pairs

Parallel tasks

- Parser writes pairs into j partitions, each for a range of terms' first letters (e.g., a-f, g-p, q-z); here j=3
- Now to complete the index inversion

Data flow

(Figure: the Master assigns splits to Parsers and partitions to Inverters. Each Parser writes its (term, doc) pairs into the j=3 partitions a-f, g-p, and q-z; each Inverter reads one partition and writes its postings.)

Inverters

- Collect all (term, doc) pairs for one partition
- Sort them and write postings lists
- Each partition contains a set of postings

The above process flow is a special case of MapReduce.
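A hedged sketch of index inversion phrased as map/reduce; the driver loop stands in for the distributed runtime, and all names are illustrative:

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Parser: emit a (term, doc) pair per token occurrence.
        for term in text.lower().split():
            yield term, doc_id

    def reduce_fn(term, doc_ids):
        # Inverter: collect one term's pairs into a sorted postings list.
        return term, sorted(set(doc_ids))

    docs = {1: "i did enact julius caesar", 2: "brutus killed caesar"}
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_fn(doc_id, text):
            grouped[term].append(d)          # shuffle: group pairs by term
    index = dict(reduce_fn(t, ds) for t, ds in grouped.items())
    print(index["caesar"])  # [1, 2]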

MapReduce

- Model for processing large data sets
- Contains Map and Reduce functions
- Runs on a large cluster of machines
- A lot of MapReduce programs are executed on Google's cluster every day

Motivation

- Input data is large: the whole Web, billions of pages
- Lots of machines: use them efficiently

A real example

- Term frequencies through the whole Web repository
- Count of URL access frequency
- Reverse web-link graph
- ...

Programming model

- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
  - map(in_key, in_value) -> list(out_key, intermediate_value)
    - Processes an input key/value pair
    - Produces a set of intermediate pairs
  - reduce(out_key, list(intermediate_value)) -> list(out_value)
    - Combines all intermediate values for a particular key
    - Produces a set of merged output values (usually just one)


Example

Page 1: the weather is good Page 2: today is good Page 3: good weather is good.

Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));


Map output

Worker 1: (the 1), (weather 1), (is 1), (good 1).

Worker 2: (today 1), (is 1), (good 1).

Worker 3: (good 1), (weather 1), (is 1), (good 1).


Reduce Input

Worker 1: (the 1)

Worker 2: (is 1), (is 1), (is 1)

Worker 3: (weather 1), (weather 1)

Worker 4: (today 1)

Worker 5: (good 1), (good 1), (good 1), (good 1)


Reduce Output

Worker 1: (the 1)

Worker 2: (is 3)

Worker 3: (weather 2)

Worker 4: (today 1)

Worker 5: (good 4)


Fault tolerance

- Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
- Storage is on local IDE disks
- GFS: distributed file system manages data (SOSP '03)
- Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
- Implementation is a C++ library linked into user programs

Fault tolerance

- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion committed through master
- On master failure:
  - Could handle, but don't yet (master failure unlikely)


Performance

Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): 150 seconds.

Sort 10^10 100-byte records (modeled after TeraSort benchmark): 839 seconds.


More and more MapReduce


Experience: Rewrite of Production Indexing System

- Rewrote Google's production indexing system using MapReduce
- Set of 24 MapReduce operations
- New code is simpler, easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines


MapReduce Overview

MapReduce has proven to be a useful abstraction

Greatly simplifies large-scale computations at Google

Fun to use: focus on problem, let library deal w/ messy details

Resources

- MG (Managing Gigabytes), Chapter 5
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
- Indexing Shared Content in Information Retrieval Systems, A. Broder et al., EDBT 2006
- High Performance Index Build Algorithms for Intranet Search Engines, M. Fontoura et al., VLDB 2004