INF 2914: Information Retrieval and Web Search
Lecture 6: Index Construction
These slides are adapted from Stanford's class CS276 / LING 286, Information Retrieval and Web Mining.
(Offline) Search Engine Data Flow
[Figure: offline data flow. (1) Crawler fetches web pages. (2) Parse & Tokenize: parse, tokenize, and run per-page analysis, producing the tokenized web pages. (3) Global Analysis, running in the background: dup detection, static rank, anchor text, spam analysis, etc., producing the dup table, rank table, anchor text, and spam table. (4) Index Build: scan the tokenized web pages, anchor text, etc., and generate the inverted text index.]
Inverted index
For each term T, we must store a list of all documents that contain T.

  Brutus    -> 2 4 8 16 32 64 128
  Caesar    -> 1 2 3 5 8 13 21 34
  Calpurnia -> 13 16

The terms on the left form the Dictionary; each row on the right is a postings list, and each entry in it is a posting. Postings lists are sorted by docID (more later on why).
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
    | Tokenizer
Token stream: Friends Romans Countrymen
    | Linguistic modules
Modified tokens: friend roman countryman
    | Indexer
Inverted index:
  friend     -> 2 4
  roman      -> 1 2
  countryman -> 13 16
Sequence of (modified token, document ID) pairs

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

The (term, docID) pairs, in order of occurrence:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
Indexer steps
Sort by terms; this is the core indexing step. The sequence of (term, docID) pairs above becomes:
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)
Multiple entries for the same term in a single document are merged, and frequency information is added (why frequency? we will discuss this later). TERM (DOC, FREQ):

ambitious (2,1); be (2,1); brutus (1,1) (2,1); capitol (1,1); caesar (1,1) (2,2); did (1,1); enact (1,1); hath (2,1); I (1,2); i' (1,1); it (2,1); julius (1,1); killed (1,2); let (2,1); me (1,1); noble (2,1); so (2,1); the (1,1) (2,1); told (2,1); you (2,1); was (1,1) (2,1); with (2,1)
The result is split into a Dictionary file and a Postings file.

Dictionary (TERM, N DOCS, COLL FREQ):
ambitious (1,1); be (1,1); brutus (2,2); capitol (1,1); caesar (2,3); did (1,1); enact (1,1); hath (1,1); I (1,2); i' (1,1); it (1,1); julius (1,1); killed (1,2); let (1,1); me (1,1); noble (1,1); so (1,1); the (2,2); told (1,1); you (1,1); was (2,2); with (1,1)

Postings (DOC, FREQ), concatenated in dictionary order:
(2,1) (2,1) (1,1) (2,1) (1,1) (1,1) (2,2) (1,1) (1,1) (2,1) (1,2) (1,1) (2,1) (1,1) (1,2) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (2,1) (1,1) (2,1) (2,1)
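To make these steps concrete, here is a minimal sketch in Python (the function name build_index and the lowercasing are our own simplifications, standing in for the tokenizer and linguistic modules): it generates (term, docID) pairs, sorts them, and merges duplicates into per-term postings with frequencies.

    from collections import Counter

    def build_index(docs):
        # docs: dict mapping docID -> document text
        pairs = []
        for doc_id, text in docs.items():
            for token in text.split():
                term = token.lower().strip(".,;:")  # crude stand-in for linguistic modules
                pairs.append((term, doc_id))
        pairs.sort()  # the core indexing step: sort by term, then by docID
        # Merge duplicate (term, doc) entries, recording term frequencies.
        index = {}
        for (term, doc_id), freq in sorted(Counter(pairs).items()):
            index.setdefault(term, []).append((doc_id, freq))
        return index

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
    }
    index = build_index(docs)
    print(index["caesar"])  # [(1, 1), (2, 2)]
    print(index["brutus"])  # [(1, 1), (2, 1)]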
The index we just built
How do we process a query?
Query processing: AND
Consider processing the query Brutus AND Caesar:
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- "Merge" the two postings lists:

  Brutus -> 2 4 8 16 32 64 128
  Caesar -> 1 2 3 5 8 13 21 34
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:

  Brutus -> 2 4 8 16 32 64 128
  Caesar -> 1 2 3 5 8 13 21 34
  Merge  -> 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.
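A minimal sketch of this merge in Python (the function name intersect is ours): advance a cursor in whichever list currently has the smaller docID, so the total work is O(x+y).

    def intersect(p1, p2):
        # p1, p2: postings lists sorted by docID
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))  # [2, 8]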
Index construction
How do we construct an index? What strategies can we use with limited main memory?
Our corpus for this lecture
- Number of docs: n = 1M
- Each doc has 1K terms
- Number of distinct terms: m = 500K
- 667 million postings entries
How many postings?
[Figure: the m-by-n term-document matrix, with terms sorted by decreasing frequency and the rows grouped into m/J blocks of J terms each.]
Number of 1's in the i-th block = nJ/i. Summing this over the m/J blocks, we have

  sum over i = 1 .. m/J of nJ/i = nJ * H_{m/J} ≈ nJ ln(m/J)

For our numbers, this should be about 667 million postings.
Recall index construction
Documents are processed to extract words, and these are saved with the document ID:

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

This yields the sequence of (term, docID) pairs shown earlier.
Key step
After all documents have been processed, the inverted file is sorted by term: the (term, docID) pairs in occurrence order become the term-sorted list shown earlier. We focus on this sort step; we have 667M items to sort.
Index construction
At 10-12 bytes per postings entry, the sort demands several gigabytes of temporary disk space.
System parameters for design
- Disk seek: ~10 milliseconds
- Block transfer from disk: ~1 microsecond per byte (following a seek)
- All other ops: ~1 microsecond, e.g., comparing two postings entries and deciding their merge order
Bottleneck
- Build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort N = 667M records.
If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
Disk-based sorting
- If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, it would take 12.4 years!!!
- Build postings entries one doc at a time; then sort postings entries by term.
- Doing this with random disk seeks is far too slow: we must sort N = 667M records without paying a seek per comparison.
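A quick sanity check of the 12.4-year figure (back-of-the-envelope only, using the system parameters above):

    from math import log2

    N = 667e6
    comparisons = N * log2(N)           # ~2.0e10 comparisons
    seconds = comparisons * 2 * 0.010   # 2 seeks x 10 ms per comparison
    print(seconds / (3600 * 24 * 365))  # ~12.4 years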
Sorting with fewer disk seeks
- 12-byte (4+4+4) records (term, doc, freq), generated as we process docs.
- Must now sort 667M such 12-byte records by term.
- Define a block as ~10M such records; we can "easily" fit a couple into memory.
- We will have 64 such blocks to start with.
- Sort within blocks first, then merge the blocks into one long sorted order.
Sorting 64 blocks of 10M records
First, read each block and sort within:
- Quicksort takes 2N ln N expected steps; in our case 2 x (10M ln 10M) ≈ 3.2 x 10^8 steps.
- At 1 microsecond per step, time to quicksort each block ≈ 320 seconds.
- Time to read each block (120 MB) from disk and write it back: 120M bytes x 2 x 10^-6 s/byte = 240 seconds.
64 times these estimates gives us 64 sorted runs of 10M records each:
- Total quicksort time ≈ 5.6 hours.
- Total read+write time ≈ 4.2 hours.
- Total for this phase: ~10 hours.
Need 2 copies of the data on disk throughout.
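A quick check of those per-block numbers (back-of-the-envelope only; constants from the system-parameters slide):

    from math import log

    sort_steps = 2 * 10e6 * log(10e6)      # quicksort: 2N ln N ~ 3.2e8 steps
    print(sort_steps * 1e-6)               # ~322 s per block (slides round to 320 s)
    print(120e6 * 2 * 1e-6)                # read+write 120 MB at 1 microsecond/byte: 240 s
    print(64 * sort_steps * 1e-6 / 3600)   # ~5.7 h total quicksort (slides: 5.6 h)
    print(64 * 240 / 3600)                 # ~4.3 h total read+write (slides: 4.2 h)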
Merging 64 sorted runs
- Merge tree of log2 64 = 6 layers.
- During each layer, read runs into memory in blocks of 10M, merge, and write back.
[Figure: runs 1-4 sit on disk; pairs of runs are read in, merged, and the merged run is written back to disk.]
Merge tree
[Figure: merge tree over the 64 sorted runs (the bottom level of the tree), pairwise merged layer by layer: 32 runs of 20M records each, then 16 runs of 40M, 8 runs of 80M, 4 runs of 160M, 2 runs of 320M, and finally 1 run of 640M.]
Merging 64 runs
- Time estimate for disk transfer: 6 x (time to read+write 64 blocks) = 6 x 4.2 hours ≈ 25 hours.
- Time estimate for the merge operation: 6 layers x 640M records x 10^-6 s ≈ 1 hour.
- Time estimate for the overall algorithm: sort time + merge time ≈ 10 + 26 ≈ 36 hours.
- Lower bound (if the sort fit in main memory): time to read+write = 4.2 hours; time to sort in memory = 10.7 hours; total ~15 hours.
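A toy, in-memory sketch of the whole two-phase algorithm (the function names are ours; a real implementation streams blocks and runs to and from disk rather than holding them in Python lists):

    def merge_two(run1, run2):
        # One node of the merge tree: 2-way merge of two sorted runs.
        merged, i, j = [], 0, 0
        while i < len(run1) and j < len(run2):
            if run1[i] <= run2[j]:
                merged.append(run1[i])
                i += 1
            else:
                merged.append(run2[j])
                j += 1
        return merged + run1[i:] + run2[j:]

    def external_sort(records, block_size):
        # Phase 1: sort each block, producing sorted runs.
        runs = [sorted(records[k:k + block_size])
                for k in range(0, len(records), block_size)]
        # Phase 2: log2(#runs) layers, pairwise merging until one run remains.
        while len(runs) > 1:
            runs = [merge_two(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                    for k in range(0, len(runs), 2)]
        return runs[0] if runs else []

    records = [("caesar", 2, 2), ("brutus", 1, 1), ("ambitious", 2, 1), ("be", 2, 1)]
    print(external_sort(records, block_size=2))  # sorted by (term, doc)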
Some indexing numbers
[Figure: bar chart of indexing throughput, in K documents per hour (y-axis from 0 to 4500), comparing Trevi, Lucy, Melnik et al., Long and Suel, and Lucene.]
How to improve indexing time?
- Compression of the sorted runs.
- Multi-way merge: heap-merge all runs at once (see the sketch below).
- Radix sort (linear-time sorting).
- Pipelining the reading, sorting, and writing phases.
Multi-way merge
[Figure: sorted runs 1 through 64 all feed a heap, which emits the final merged output in a single pass.]
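A minimal sketch of the multi-way merge using Python's standard heapq.merge, which keeps one cursor per run in a heap and produces the merged output in a single pass instead of six layers:

    import heapq

    runs = [  # each run is already sorted, as produced by the block-sort phase
        [("ambitious", 2), ("brutus", 1)],
        [("brutus", 2), ("caesar", 1)],
        [("caesar", 2), ("capitol", 1)],
    ]
    merged = list(heapq.merge(*runs))  # one pass; the heap holds one head per run
    print(merged)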
Indexing improvements
- Radix sort: linear-time sorting, with flexibility in defining the sort criteria.
- Bigger sort buffers increase performance (contradicting previous literature; see the VLDB paper in the references).
- Pipelining the read and the sort+write phases:
[Figure: pipeline timeline. Buffers B1 and B2 alternate over time: while B1 is being sorted and written, B2 is being read, and vice versa, so reading overlaps with sorting and writing.]
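A toy sketch of the read / sort+write pipelining (our own scaffolding: a bounded queue stands in for the B1/B2 buffer pair; the real payoff comes from overlapping disk I/O with CPU work):

    import threading
    from queue import Queue

    def read_blocks(blocks, q):
        # Producer: "reads" blocks from disk into a bounded queue.
        for block in blocks:
            q.put(block)
        q.put(None)  # end-of-input marker

    def sort_and_write(q, runs):
        # Consumer: sorts each block and "writes" the run,
        # overlapping with the next read.
        while True:
            block = q.get()
            if block is None:
                break
            runs.append(sorted(block))

    blocks = [[("caesar", 2), ("brutus", 1)], [("be", 2), ("ambitious", 2)]]
    q, runs = Queue(maxsize=2), []  # two in-flight buffers, like B1 and B2
    reader = threading.Thread(target=read_blocks, args=(blocks, q))
    sorter = threading.Thread(target=sort_and_write, args=(q, runs))
    reader.start(); sorter.start()
    reader.join(); sorter.join()
    print(runs)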
Positional indexing
Given documents:
D1: This is a test
D2: Is this a test
D3: This is not a test
Reorganize by term, recording the location and a caps bit:

TERM  DOC  LOC  DATA(caps)
this   1    0    1
is     1    1    0
a      1    2    0
test   1    3    0
is     2    0    1
this   2    1    0
a      2    2    0
test   2    3    0
this   3    0    1
is     3    1    0
not    3    2    0
a      3    3    0
test   3    4    0
Positional indexing
Sort by <term, doc, loc>:

TERM  DOC  LOC  DATA(caps)
a      1    2    0
a      2    2    0
a      3    3    0
is     1    1    0
is     2    0    1
is     3    1    0
not    3    2    0
test   1    3    0
test   2    3    0
test   3    4    0
this   1    0    1
this   2    1    0
this   3    0    1

In "postings list" format:
a    -> (1,2,0) (2,2,0) (3,3,0)
is   -> (1,1,0) (2,0,1) (3,1,0)
not  -> (3,2,0)
test -> (1,3,0) (2,3,0) (3,4,0)
this -> (1,0,1) (2,1,0) (3,0,1)
Positional indexing with radix sort
- Radix key = token hash = 8 bytes.
- Document ID = 8 bytes.
- Location = 4 bytes, but there is no need to sort by location, since radix sort is stable!
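A small illustration of why stability makes the location sort unnecessary (Python's built-in stable sort stands in for the radix sort here): sorting the (term, doc, loc) records by (term, doc) alone leaves each document's locations in their original, already-increasing order.

    records = [  # (term, doc, loc), in order of occurrence as emitted by the parser
        ("this", 1, 0), ("is", 1, 1), ("a", 1, 2), ("test", 1, 3),
        ("is", 2, 0), ("this", 2, 1), ("a", 2, 2), ("test", 2, 3),
    ]
    records.sort(key=lambda r: (r[0], r[1]))  # stable: ties keep their loc order
    print(records[:4])  # [('a', 1, 2), ('a', 2, 2), ('is', 1, 1), ('is', 2, 0)]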
Distributed indexing
- Maintain a master machine directing the indexing job; it is considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.
Parallel tasks
- We will use two sets of parallel tasks: parsers and inverters.
- Break the input document corpus into splits; each split is a subset of documents.
- The master assigns a split to an idle parser machine.
- A parser reads one document at a time and emits (term, doc) pairs.
Parallel tasks
- The parser writes the pairs into j partitions, one for each range of terms' first letters (e.g., a-f, g-p, q-z; here j = 3).
- Now to complete the index inversion.
Data flow
[Figure: the master assigns splits to parsers and assigns partitions to inverters. Each parser reads its split and writes (term, doc) pairs into the a-f, g-p, and q-z partition files; each inverter collects one partition (a-f, g-p, or q-z) and produces its postings.]
The process flow above is a special case of MapReduce.
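A sketch of the parser-side routing (the a-f / g-p / q-z boundaries come from the slide; the function name is ours):

    def partition(term):
        # Route a term to one of the j = 3 partitions by its first letter.
        c = term[0].lower()
        if c <= "f":
            return "a-f"
        if c <= "p":
            return "g-p"
        return "q-z"

    print(partition("brutus"), partition("julius"), partition("the"))  # a-f g-p q-z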
Inverters
- Collect all (term, doc) pairs for one partition.
- Sort them and write them out as postings lists.
- Each partition then contains a set of complete postings lists.
MapReduce
- A model for processing large data sets.
- Contains Map and Reduce functions.
- Runs on a large cluster of machines.
- A lot of MapReduce programs are executed on Google's cluster every day.
Motivation
- Input data is large: the whole Web, billions of pages.
- Lots of machines: use them efficiently.
Real examples
- Term frequencies through the whole Web repository.
- Count of URL access frequency.
- Reverse web-link graph.
- ...
Programming model
Input & output: each is a set of key/value pairs. The programmer specifies two functions:

map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair; produces a set of intermediate pairs.

reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key; produces a set of merged output values (usually just one).
Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
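A minimal, runnable simulation of this word count in Python (our own scaffolding, not Google's library): the map phase emits (word, 1) pairs, a shuffle step groups them by key, and the reduce phase sums each group.

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Emit an intermediate (word, 1) pair for each word
        # (stripping the trailing period, as the slides do).
        return [(w.strip("."), 1) for w in contents.split()]

    def reduce_fn(word, counts):
        # Combine all intermediate values for one key.
        return word, sum(counts)

    pages = {
        "Page 1": "the weather is good",
        "Page 2": "today is good",
        "Page 3": "good weather is good.",
    }

    intermediate = defaultdict(list)
    for name, text in pages.items():          # map phase (one worker per page)
        for word, count in map_fn(name, text):
            intermediate[word].append(count)  # shuffle: group by key

    for word in sorted(intermediate):          # reduce phase (one worker per key)
        print(reduce_fn(word, intermediate[word]))  # good 4, is 3, the 1, ...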
Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)
Reduce input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
Implementation overview
- Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory each.
- Storage is on local IDE disks.
- GFS: a distributed file system manages the data (SOSP'03).
- Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines.
- The implementation is a C++ library linked into user programs.
Fault tolerance
On worker failure:
- Detect failure via periodic heartbeats.
- Re-execute completed and in-progress map tasks.
- Re-execute in-progress reduce tasks.
- Task completion is committed through the master.
On master failure:
- Could handle it, but don't yet (master failure is unlikely).
Performance
- Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): 150 seconds.
- Sort 10^10 100-byte records (modeled after the TeraSort benchmark): 839 seconds.
More and more MapReduce
[Figure: chart of MapReduce usage at Google growing over time.]
Experience: Rewrite of Production Indexing System
- Rewrote Google's production indexing system using MapReduce: a set of 24 MapReduce operations.
- The new code is simpler and easier to understand.
- MapReduce takes care of failures and slow machines.
- Easy to make indexing faster by adding more machines.
MapReduce overview
- MapReduce has proven to be a useful abstraction.
- It greatly simplifies large-scale computations at Google.
- Fun to use: focus on the problem, let the library deal with the messy details.
Resources
- MG (Managing Gigabytes), Chapter 5.
- MapReduce: Simplified Data Processing on Large Clusters, J. Dean and S. Ghemawat, OSDI 2004.
- Indexing Shared Content in Information Retrieval Systems, A. Broder et al., EDBT 2006.
- High Performance Index Build Algorithms for Intranet Search Engines, M. Fontoura et al., VLDB 2004.