INVERTED INDEX, COMPRESSING THE INVERTED INDEX, AND COMPUTING SCORES IN A COMPLETE SEARCH SYSTEM
CHINTAN MISTRY
MRUGANK DALAL
Indexing in a Search Engine

Query-processing flow:
1. The user issues a query
2. Linguistic preprocessing produces normalized terms
3. Look up the documents that contain the terms in the already-built inverted index
4. Rank the returned documents according to their relevancy
5. Return the results to the user
What is an INVERTED INDEX? First, look at the FORWARD INDEX!

A forward index maps each document to the list of words it contains:

Document 1: hat, dog, the, cow, is, now
Document 2: cow, run, away, morning, in, tree
Document 3: what, family, at, some, is, take

Querying the forward index requires iterating sequentially through every document, and through every word in it, to verify a match. Too much time, memory, and resources required!
What is an inverted index?

As opposed to the forward index, it stores the list of documents per word, so we can directly access the set of documents containing a given word. Each (term, document) entry is one posting; the documents listed for a term form its postings list.
How to build an inverted index? (1/3)

Build the index in advance:
1. Collect the documents
2. Turn each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents (i.e. the postings) for each word (i.e. the dictionary)
How to build an inverted index? (2/3)

Given two documents:

Document 1: This is first document. Microsoft's products are office, visio, and sql server.

Document 2: This is second document. Google's services are gmail, google labs and google code.
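The whole pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the indexer the slides describe; the tokenizer here (lowercasing and splitting on non-letters) stands in for real linguistic preprocessing:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-letters: a crude stand-in for real
    # linguistic preprocessing (stemming, stop-word removal, etc.).
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_inverted_index(docs):
    # docs: mapping of docID -> raw text.
    # Returns term -> sorted list of docIDs (the postings list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}
index = build_inverted_index(docs)
print(index["document"])  # [1, 2]
print(index["google"])    # [2]
```

Looking up a term now returns its postings list directly, with no scan over the documents.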
How to build an inverted index? (3/3)

Sort-based indexing:
1. Sort the terms alphabetically
2. Group instances of the same term by word and then by documentID
3. Separate out the terms and the documentIDs

This reduces the storage requirement: the dictionary is commonly kept in memory while the postings lists are kept on disk.
Blocked sort-based indexing (BSBI)

Use a termID instead of the term. Main memory is insufficient to collect all termID-docID pairs, so we need an external sorting algorithm that uses disk:
1. Segment the collection into parts of equal size
2. Sort and group the termID-docID pairs of each part in memory
3. Store the intermediate result on disk
4. Merge all intermediate results into the final index

Running time: O(T log T), where T is the number of termID-docID pairs.
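The four steps above can be sketched as follows. In this toy version the "disk" runs are just in-memory lists, and `heapq.merge` plays the role of the multi-way merge of intermediate files:

```python
import heapq
from itertools import groupby

def bsbi(pairs, block_size):
    # pairs: stream of (termID, docID) tuples in arrival order.
    # 1) Segment the stream into blocks of equal size,
    # 2) sort each block in memory (a "run" written to disk in a real system),
    # 3) k-way merge all sorted runs,
    # 4) group each termID's docIDs into its postings list.
    blocks, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:
            blocks.append(sorted(block))
            block = []
    if block:
        blocks.append(sorted(block))
    merged = heapq.merge(*blocks)
    return {term_id: sorted({d for _, d in group})
            for term_id, group in groupby(merged, key=lambda p: p[0])}

pairs = [(2, 1), (1, 1), (2, 2), (1, 2), (3, 2), (1, 3)]
print(bsbi(pairs, block_size=2))  # {1: [1, 2, 3], 2: [1, 2], 3: [2]}
```

The O(T log T) bound comes from sorting the T pairs; the merge itself is linear in T.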
Single-pass in-memory indexing (SPIMI)

SPIMI uses the term itself instead of a termID. It writes each block's dictionary to disk, and then starts a new dictionary for the next block.

Assume we have a stream of term-docID pairs. Tokens are processed one by one; when a term occurs for the first time, it is added to the dictionary, and a new postings list is created for it.
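A minimal sketch of the idea, with disk writes replaced by `yield` (the block boundary and the adjacent-duplicate check are my own simplifications):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_pairs_per_block):
    # Process term-docID pairs one by one. The first time a term is seen
    # in the current block, defaultdict creates its new postings list;
    # postings are appended directly, so no sorting of pairs is needed.
    dictionary = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        if not dictionary[term] or dictionary[term][-1] != doc_id:
            dictionary[term].append(doc_id)
        count += 1
        if count == max_pairs_per_block:
            # Block full: sort the terms and "write the block to disk".
            yield dict(sorted(dictionary.items()))
            dictionary = defaultdict(list)
            count = 0
    if dictionary:
        yield dict(sorted(dictionary.items()))

stream = [("cow", 1), ("dog", 1), ("cow", 2), ("tree", 2)]
blocks = list(spimi_invert(stream, max_pairs_per_block=10))
print(blocks[0])  # {'cow': [1, 2], 'dog': [1], 'tree': [2]}
```

Only the terms are sorted when a block is flushed; the pairs themselves are never sorted, which is where the O(T) behaviour comes from.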
Difference between BSBI and SPIMI

SPIMI:
1. Adds postings directly to the postings list
2. Faster than BSBI because no sorting of pairs is necessary
3. Saves memory because no termID needs to be stored
4. Time complexity O(T)

BSBI:
1. Collects term-docID pairs, sorts them, and then creates the postings lists
2. Slower than SPIMI
3. Requires storing termIDs, so needs more space
4. Time complexity O(T log T)
Distributed Indexing (1/4)

We cannot perform web-scale index construction on a single computer; web search engines use distributed indexing algorithms for index construction.

Partition the work across several machines, using the MapReduce architecture:
- A general architecture for distributed computing
- Divide the work into chunks that can easily be assigned and reassigned
- Map and Reduce phases
Distributed Indexing (2/4)
Distributed Indexing (3/4)
MAP PHASE:
- Map the splits of the input data to key-value pairs
- The machines doing this are called parsers
- Each parser writes its output to local segment files

REDUCE PHASE:
- Partition the keys into j term partitions and have the parsers write the key-value pairs for each term partition into a separate file
- Each parser thus writes the corresponding segment files, one for each term partition
Distributed Indexing (4/4)

REDUCE PHASE (cont.):
- Collecting all values (docIDs) for a given key (termID) into one list is the task of the inverter
- The master assigns each term partition to a different inverter
- Finally, the list of values is sorted for each key and written to the final sorted postings list
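Stripped of the distribution machinery, the two phases reduce to a map over document splits and a grouping step. The sketch below runs everything in one process; in a real deployment each `map_phase` call would run on a different parser machine and each term partition would go to a different inverter:

```python
from collections import defaultdict

def map_phase(split):
    # Parser: turn one split of the documents into (term, docID) pairs.
    pairs = []
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def reduce_phase(all_pairs):
    # Inverter: collect all docIDs for each term and sort them
    # into the final postings list.
    postings = defaultdict(set)
    for term, doc_id in all_pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

splits = [[(1, "cow runs"), (2, "cow sleeps")], [(3, "dog runs")]]
pairs = [p for split in splits for p in map_phase(split)]
final_index = reduce_phase(pairs)
print(final_index)  # {'cow': [1, 2], 'runs': [1, 3], 'sleeps': [2], 'dog': [3]}
```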
Dynamic indexing

Motivation: what we have seen so far assumed a static collection of documents. What if a document is added, updated, or deleted?

Maintain 2 indexes: main and auxiliary.
- The auxiliary index is kept in memory; searches are run across both indexes, and the results are merged
- When the auxiliary index becomes too large, merge it into the main index
- Deleted documents can be filtered out while returning the results
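A minimal sketch of the main/auxiliary scheme, with both indexes as plain dicts of term to postings (real systems keep the main index on disk):

```python
def search(term, main_index, aux_index, deleted):
    # Searches run across both indexes; results are merged and
    # deleted documents are filtered out on the way back.
    docs = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(docs - deleted)

def merge_aux_into_main(main_index, aux_index):
    # When the in-memory auxiliary index grows too large,
    # fold its postings into the main index and start a fresh one.
    for term, postings in aux_index.items():
        merged = set(main_index.get(term, [])) | set(postings)
        main_index[term] = sorted(merged)
    aux_index.clear()

main = {"cow": [1, 2]}
aux = {"cow": [5], "dog": [5]}                # newly added document 5
print(search("cow", main, aux, deleted={2}))  # [1, 5]
merge_aux_into_main(main, aux)
print(main)  # {'cow': [1, 2, 5], 'dog': [5]}
```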
Querying distributed indexes (1/2)

Partition by terms:
- Partition the dictionary of index terms into subsets, each stored along with the postings lists of its terms
- A query is routed only to the nodes holding its terms, which allows greater concurrency
- But long postings lists must be sent between nodes for merging; this cost is very high and outweighs the greater concurrency

Partition by documents:
- Each node contains the index for a subset of all documents
- A query is distributed to all nodes, then the results are merged
Querying distributed indexes (2/2)

Partition by documents (cont.):
- Problem: idf must be calculated over the entire collection, even though the index at a single node covers only a subset of the documents
- The query is broadcast to each of the nodes, and the top K results from each node are merged to find the top K documents for the query
Index compression (1/8)

Compression techniques for the dictionary and the postings lists. Advantages:
- Less disk space
- Better use of caching: frequently used terms can be cached in memory for faster processing, and compression allows more terms to be kept in memory
- Faster data transfer from disk to memory: the total time to transfer compressed data from disk and decompress it is less than the time to transfer the uncompressed data
Index compression (2/8)

Dictionary compression:
- The dictionary is small compared to the postings lists, so why compress it?
- Because when a large part of the dictionary (think of the millions of terms in it!) sits on disk, many more disk seeks are necessary
- The goal is to fit the dictionary into memory for fast response times
Index compression (3/8)

1. Dictionary as an array:
- The dictionary can be stored in an array of fixed-width entries: 20 bytes for the term, 4 bytes for its document frequency, and 4 bytes for the pointer to its postings list
- For example, with 400,000 terms in the dictionary: 400,000 * (20 + 4 + 4) bytes = 11.2 MB
Index compression (4/8)

Any problem in storing the dictionary as an array?
1. The average length of a term in English is about eight characters, so we waste about 12 of the 20 characters per entry
2. There is no way to store terms longer than 20 characters, like "hydrochlorofluorocarbons"

SOLUTION?

2. Dictionary as a string:
- Store the dictionary as one long string of characters
- A term pointer marks the end of the preceding term and the beginning of the next
Index compression (5/8)

2. Dictionary as a string (cont.):
- Per term: 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term itself
- 400,000 * (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to 11.2 MB earlier)
Index compression (6/8)

3. Blocked storage:
- Group the terms in the string into blocks of size k and keep a term pointer only for the first term of each block
- With k = 4: we save (k - 1) * 3 = 9 bytes of term pointers per block, but need 4 additional bytes per block for term lengths (one byte per term)
- Net saving: 400,000 * (1/4) * 5 bytes = 0.5 MB, giving 7.1 MB (compared to 7.6 MB)
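The string-plus-blocking layout can be sketched as below. For readability the "one byte" term length is stored as a character via `chr`, and pointers are plain Python ints; a real implementation would pack everything into a byte array:

```python
def build_blocked_string(terms, k=4):
    # Store the sorted dictionary as one long string. Each term is
    # prefixed by its length (one byte), and only the first term of
    # each block of k terms gets a term pointer.
    parts, block_pointers, pos = [], [], 0
    for i, term in enumerate(terms):
        if i % k == 0:
            block_pointers.append(pos)   # pointer for the block head only
        parts.append(chr(len(term)))     # 1-byte term length
        parts.append(term)
        pos += 1 + len(term)
    return "".join(parts), block_pointers

def read_block(s, start, k):
    # Decode up to k terms of a block by walking the length bytes.
    terms, pos = [], start
    while pos < len(s) and len(terms) < k:
        length = ord(s[pos])
        terms.append(s[pos + 1:pos + 1 + length])
        pos += 1 + length
    return terms

terms = ["auto", "best", "car", "cow", "dog", "hat"]
s, pointers = build_blocked_string(terms, k=4)
print(pointers)                       # [0, 18]: one pointer per block, not per term
print(read_block(s, pointers[1], 4))  # ['dog', 'hat']
```

Lookup becomes a binary search over the block pointers followed by a short linear scan inside the block: the price of the saved pointers.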
Index compression (7/8)

4. Blocked storage with front coding:
- Consecutive terms in a sorted dictionary share common prefixes; store each prefix once and only the differing suffixes of the following terms
- According to an experiment conducted by the author: size reduced to 5.9 MB (compared to 7.1 MB)
Index compression (8/8)

Postings file compression:
- Encode gaps: the gaps between consecutive docIDs in a postings list are much shorter than the docIDs themselves, so we store the gaps rather than the postings
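Gap encoding in two small functions; the docIDs below are made-up values chosen only to show how large IDs shrink to small gaps:

```python
def encode_gaps(postings):
    # Store the first docID, then each docID's gap to its predecessor.
    # In long postings lists the gaps are small and compress well
    # (e.g. with variable-byte or gamma codes on top of this).
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def decode_gaps(gaps):
    # Recover the original docIDs by running a cumulative sum.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283154, 283159, 283202]
gaps = encode_gaps(postings)
print(gaps)                           # [283154, 5, 43]
print(decode_gaps(gaps) == postings)  # True
```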
Review: scoring and term weighting

Metadata: information about a document. Metadata generally consists of "fields", e.g. date of creation, authors, title.

Zone: similar to a field. Difference: a zone contains arbitrary free text, e.g. an abstract or overview.
Review: scoring and term weighting

Term frequency (tf_t,d): number of occurrences of the term in the document.
Problem: document size skews raw counts, leading to inappropriate ranking.

Document frequency (df_t): number of documents in the collection that contain term t from the query.

Inverse document frequency: idf_t = log(N / df_t), where N is the total number of documents.

Significance of idf:
- If low, it is a common term (e.g. a stop word)
- If high, it is a rare word (e.g. "apothecary")
Review: scoring and term weighting

tf-idf weighting: tf-idf_t,d = tf_t,d * idf_t
- High: when the term occurs many times in a small number of documents
- Lower: when it occurs fewer times in a document, or occurs in many documents
- Lowest: when the term occurs in almost all documents

Score of a document: Score(q, d) = Σ_(t ∈ q) tf-idf_t,d
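The score formula above can be computed directly. A small sketch (raw term counts as tf and base-10 log for idf are my choices; real systems use smoothed variants):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    # docs: docID -> list of tokens.
    # Scores every document by Score(q, d) = sum over t in q of tf(t, d) * idf(t).
    n = len(docs)
    tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, tf in tfs.items():
        score = 0.0
        for t in query_terms:
            df = sum(1 for counts in tfs.values() if t in counts)
            if df:
                score += tf[t] * math.log10(n / df)
        scores[doc_id] = score
    return scores

docs = {
    1: "the cow runs in the morning".split(),
    2: "the dog sleeps".split(),
    3: "cow cow cow".split(),
}
scores = tf_idf_scores(["cow"], docs)
print(max(scores, key=scores.get))  # 3: highest tf for a moderately rare term
```

Note that a term occurring in every document (like "the" here) gets idf = log(N/N) = 0 and contributes nothing, exactly the "lowest" case above.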
Computing the score in a complete search system

Inexact top K document retrieval

Motivation: reduce the cost of calculating scores for all N documents. We calculate scores ONLY for those documents whose scores are likely to be high w.r.t. the given query.

How:
1. Find a set A of documents that are contenders, where K < |A| << N
2. Return the top K scoring documents from A
Index elimination

Preset idf threshold:
- Only traverse postings for terms with high idf
- Benefit: low-idf terms have long postings lists, so we remove them from the score computation

Require many query terms:
- Only traverse documents that contain many (or all) of the query terms
- Danger: we may end up with fewer than K documents in the end
Champion lists

Champion list = fancy list = top docs: for each term t in the dictionary, a precomputed set of the r documents in which the weight for t is highest.

How to create set A:
- Take the union of the champion lists of each term in the query
- Compute scores only for documents in that union

How and when to decide r:
- Highly application dependent
- The lists are created at indexing time

Problem: the union may still contain fewer than K documents.
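The retrieval step is then just a union plus a restricted scoring pass. A sketch, where the champion lists and the toy per-document scores are invented for illustration:

```python
def retrieve_with_champions(query_terms, champion_lists, score, k):
    # Set A = union of the precomputed champion lists of the query terms;
    # only documents in A are scored, and the top K of those are returned.
    a = set()
    for t in query_terms:
        a |= set(champion_lists.get(t, []))
    return sorted(a, key=score, reverse=True)[:k]

# Hypothetical champion lists (r = 2 highest-weight docs per term)
# and a toy scoring function keyed by docID.
champion_lists = {"auto": [1, 4], "car": [2, 4]}
toy_scores = {1: 0.9, 2: 0.4, 4: 0.7}
top = retrieve_with_champions(["auto", "car"], champion_lists,
                              score=lambda d: toy_scores[d], k=2)
print(top)  # [1, 4]
```

Only the three documents in the union are scored, however large the full collection is.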
Static quality scores and ordering

Many search engines maintain a measure of quality g(d) for each document. The net score is calculated as a combination of g(d) and the tf-idf score.

How to exploit this:
- Keep each postings list in decreasing order of g(d), so we only need to traverse the first few documents in each list
- Global champion lists: choose the r documents with the highest value of g(d) + tf-idf
Cluster pruning (1/2)

Cluster the documents in a preprocessing step:
1. Pick √N documents; call them "leaders"
2. For each document that is not a leader, compute its nearest leader

Followers: documents that are not leaders. Each leader has approximately √N followers.

Cluster pruning (2/2)

How it helps:
- Given a query q, find the leader L nearest to q, i.e. calculate scores for only √N documents
- Set A then contains L together with its √N followers, so again only about √N documents are scored
Tiered indexes

[Figure: a tiered inverted index for the terms "auto", "best", and "car". Tier 1 holds each term's highest-scoring postings (preset threshold value set to 20); Tier 2 holds the lower-scoring postings (preset threshold value set to 10).]

Tiered indexes address the issue of the set A of contenders containing fewer than K documents: if the top tier yields too few results, fall back to the next tier.
A complete search system

[Figure: components of a complete search system]
- Documents are parsed and linguistically processed, then fed to the indexers
- The indexers build: metadata in zone and field indexes, a tiered inverted positional index, and k-gram indexes
- A user query passes through the free text query parser and spell correction, then scoring and ranking (inexact top K retrieval, driven by scoring parameters and MLR trained on a training set)
- The result page is produced, with caching of documents and results
Questions?
Thank you