INVERTED INDEX, COMPRESSING THE INVERTED INDEX, AND COMPUTING SCORES IN A COMPLETE SEARCH SYSTEM
CHINTAN MISTRY
MRUGANK DALAL
Indexing in a Search Engine

Query-processing flow:
1. The user issues a query
2. Linguistic preprocessing produces normalized terms
3. Look up the documents that contain the terms in the already-built inverted index
4. Rank the returned documents according to their relevancy
5. Return the results to the user
What is an INVERTED INDEX? First, look at the FORWARD INDEX!

A forward index maps each document to the list of words it contains:

Document 1: hat, dog, the, cow, is, now
Document 2: cow, run, away, morning, in, tree
Document 3: what, family, at, some, is, take

Querying the forward index requires iterating sequentially through every document, and through every word in it, to verify a match. Too much time, memory, and resources required!
What is an inverted index?

As opposed to the forward index, it stores the list of documents per word, so we can directly access the set of documents containing a given word. Each (term, document) entry is one posting; the documents listed for a term form its postings list.
How to build an inverted index? (1/3)

Build the index in advance:
1. Collect the documents
2. Turn each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents (i.e. the postings) for each word (i.e. the dictionary)
How to build an inverted index? (2/3)

Given two documents:

Document 1: This is first document. Microsoft's products are office, visio, and sql server.

Document 2: This is second document. Google's services are gmail, google labs and google code.
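The whole pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the indexer the slides describe; the tokenizer here (lowercasing and splitting on non-letters) stands in for real linguistic preprocessing:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-letters: a crude stand-in for real
    # linguistic preprocessing (stemming, stop-word removal, etc.).
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_inverted_index(docs):
    # docs: mapping of docID -> raw text.
    # Returns term -> sorted list of docIDs (the postings list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}
index = build_inverted_index(docs)
print(index["document"])  # [1, 2]
print(index["google"])    # [2]
```

Looking up a term now returns its postings list directly, with no scan over the documents.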
How to build an inverted index? (3/3)

Sort-based indexing:
1. Sort the terms alphabetically
2. Group instances of the same term by word and then by documentID
3. Separate out the terms and the documentIDs

This reduces the storage requirement: the dictionary is commonly kept in memory while the postings lists are kept on disk.
Blocked sort-based indexing (BSBI)

Use a termID instead of the term. Main memory is insufficient to collect all termID-docID pairs, so we need an external sorting algorithm that uses disk:
1. Segment the collection into parts of equal size
2. Sort and group the termID-docID pairs of each part in memory
3. Store the intermediate result on disk
4. Merge all intermediate results into the final index

Running time: O(T log T), where T is the number of termID-docID pairs.
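The four steps above can be sketched as follows. In this toy version the "disk" runs are just in-memory lists, and `heapq.merge` plays the role of the multi-way merge of intermediate files:

```python
import heapq
from itertools import groupby

def bsbi(pairs, block_size):
    # pairs: stream of (termID, docID) tuples in arrival order.
    # 1) Segment the stream into blocks of equal size,
    # 2) sort each block in memory (a "run" written to disk in a real system),
    # 3) k-way merge all sorted runs,
    # 4) group each termID's docIDs into its postings list.
    blocks, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:
            blocks.append(sorted(block))
            block = []
    if block:
        blocks.append(sorted(block))
    merged = heapq.merge(*blocks)
    return {term_id: sorted({d for _, d in group})
            for term_id, group in groupby(merged, key=lambda p: p[0])}

pairs = [(2, 1), (1, 1), (2, 2), (1, 2), (3, 2), (1, 3)]
print(bsbi(pairs, block_size=2))  # {1: [1, 2, 3], 2: [1, 2], 3: [2]}
```

The O(T log T) bound comes from sorting the T pairs; the merge itself is linear in T.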
Single-pass in-memory indexing (SPIMI)

SPIMI uses the term itself instead of a termID. It writes each block's dictionary to disk, and then starts a new dictionary for the next block.

Assume we have a stream of term-docID pairs. Tokens are processed one by one; when a term occurs for the first time, it is added to the dictionary, and a new postings list is created for it.
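A minimal sketch of the idea, with disk writes replaced by `yield` (the block boundary and the adjacent-duplicate check are my own simplifications):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_pairs_per_block):
    # Process term-docID pairs one by one. The first time a term is seen
    # in the current block, defaultdict creates its new postings list;
    # postings are appended directly, so no sorting of pairs is needed.
    dictionary = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        if not dictionary[term] or dictionary[term][-1] != doc_id:
            dictionary[term].append(doc_id)
        count += 1
        if count == max_pairs_per_block:
            # Block full: sort the terms and "write the block to disk".
            yield dict(sorted(dictionary.items()))
            dictionary = defaultdict(list)
            count = 0
    if dictionary:
        yield dict(sorted(dictionary.items()))

stream = [("cow", 1), ("dog", 1), ("cow", 2), ("tree", 2)]
blocks = list(spimi_invert(stream, max_pairs_per_block=10))
print(blocks[0])  # {'cow': [1, 2], 'dog': [1], 'tree': [2]}
```

Only the terms are sorted when a block is flushed; the pairs themselves are never sorted, which is where the O(T) behaviour comes from.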
Difference between BSBI and SPIMI

SPIMI:
1. Adds postings directly to the postings list
2. Faster than BSBI because no sorting of pairs is necessary
3. Saves memory because no termID needs to be stored
4. Time complexity O(T)

BSBI:
1. Collects term-docID pairs, sorts them, and then creates the postings lists
2. Slower than SPIMI
3. Requires storing termIDs, so needs more space
4. Time complexity O(T log T)
Distributed Indexing (1/4)

We cannot perform web-scale index construction on a single computer; web search engines use distributed indexing algorithms for index construction.

Partition the work across several machines, using the MapReduce architecture:
- A general architecture for distributed computing
- Divide the work into chunks that can easily be assigned and reassigned
- Map and Reduce phases
Distributed Indexing (2/4)
Distributed Indexing (3/4)
MAP PHASE:
- Map the splits of the input data to key-value pairs
- The machines doing this are called parsers
- Each parser writes its output to local segment files

REDUCE PHASE:
- Partition the keys into j term partitions and have the parsers write the key-value pairs for each term partition into a separate file
- Each parser thus writes the corresponding segment files, one for each term partition
Distributed Indexing (4/4)

REDUCE PHASE (cont.):
- Collecting all values (docIDs) for a given key (termID) into one list is the task of the inverter
- The master assigns each term partition to a different inverter
- Finally, the list of values is sorted for each key and written to the final sorted postings list
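Stripped of the distribution machinery, the two phases reduce to a map over document splits and a grouping step. The sketch below runs everything in one process; in a real deployment each `map_phase` call would run on a different parser machine and each term partition would go to a different inverter:

```python
from collections import defaultdict

def map_phase(split):
    # Parser: turn one split of the documents into (term, docID) pairs.
    pairs = []
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def reduce_phase(all_pairs):
    # Inverter: collect all docIDs for each term and sort them
    # into the final postings list.
    postings = defaultdict(set)
    for term, doc_id in all_pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

splits = [[(1, "cow runs"), (2, "cow sleeps")], [(3, "dog runs")]]
pairs = [p for split in splits for p in map_phase(split)]
final_index = reduce_phase(pairs)
print(final_index)  # {'cow': [1, 2], 'runs': [1, 3], 'sleeps': [2], 'dog': [3]}
```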
Dynamic indexing

Motivation: what we have seen so far assumed a static collection of documents. What if a document is added, updated, or deleted?

Maintain 2 indexes: main and auxiliary.
- The auxiliary index is kept in memory; searches are run across both indexes, and the results are merged
- When the auxiliary index becomes too large, merge it into the main index
- Deleted documents can be filtered out while returning the results
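A minimal sketch of the main/auxiliary scheme, with both indexes as plain dicts of term to postings (real systems keep the main index on disk):

```python
def search(term, main_index, aux_index, deleted):
    # Searches run across both indexes; results are merged and
    # deleted documents are filtered out on the way back.
    docs = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(docs - deleted)

def merge_aux_into_main(main_index, aux_index):
    # When the in-memory auxiliary index grows too large,
    # fold its postings into the main index and start a fresh one.
    for term, postings in aux_index.items():
        merged = set(main_index.get(term, [])) | set(postings)
        main_index[term] = sorted(merged)
    aux_index.clear()

main = {"cow": [1, 2]}
aux = {"cow": [5], "dog": [5]}                # newly added document 5
print(search("cow", main, aux, deleted={2}))  # [1, 5]
merge_aux_into_main(main, aux)
print(main)  # {'cow': [1, 2, 5], 'dog': [5]}
```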
Querying distributed indexes (1/2)

Partition by terms:
- Partition the dictionary of index terms into subsets, each stored along with the postings lists of its terms
- A query is routed only to the nodes holding its terms, which allows greater concurrency
- But long postings lists must be sent between nodes for merging; this cost is very high and outweighs the greater concurrency

Partition by documents:
- Each node contains the index for a subset of all documents
- A query is distributed to all nodes, then the results are merged
Querying distributed indexes (2/2)

Partition by documents (cont.):
- Problem: idf must be calculated over the entire collection, even though the index at a single node covers only a subset of the documents
- The query is broadcast to each of the nodes, and the top K results from each node are merged to find the top K documents for the query
Index compression (1/8)

Compression techniques for the dictionary and the postings lists. Advantages:
- Less disk space
- Better use of caching: frequently used terms can be cached in memory for faster processing, and compression allows more terms to be kept in memory
- Faster data transfer from disk to memory: the total time to transfer compressed data from disk and decompress it is less than the time to transfer the uncompressed data
Index compression (2/8)

Dictionary compression:
- The dictionary is small compared to the postings lists, so why compress it?
- Because when a large part of the dictionary (think of the millions of terms in it!) sits on disk, many more disk seeks are necessary
- The goal is to fit the dictionary into memory for fast response times
Index compression (3/8)

1. Dictionary as an array:
- The dictionary can be stored in an array of fixed-width entries: 20 bytes for the term, 4 bytes for its document frequency, and 4 bytes for the pointer to its postings list
- For example, with 400,000 terms in the dictionary: 400,000 * (20 + 4 + 4) bytes = 11.2 MB
Index compression (4/8)

Any problem in storing the dictionary as an array?
1. The average length of a term in English is about eight characters, so we waste about 12 of the 20 characters per entry
2. There is no way to store terms longer than 20 characters, like "hydrochlorofluorocarbons"

SOLUTION?

2. Dictionary as a string:
- Store the dictionary as one long string of characters
- A term pointer marks the end of the preceding term and the beginning of the next
Index compression (5/8)

2. Dictionary as a string (cont.):
- Per term: 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term itself
- 400,000 * (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to 11.2 MB earlier)
Index compression (6/8)

3. Blocked storage:
- Group the terms in the string into blocks of size k and keep a term pointer only for the first term of each block
- With k = 4: we save (k - 1) * 3 = 9 bytes of term pointers per block, but need 4 additional bytes per block for term lengths (one byte per term)
- Net saving: 400,000 * (1/4) * 5 bytes = 0.5 MB, giving 7.1 MB (compared to 7.6 MB)
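The string-plus-blocking layout can be sketched as below. For readability the "one byte" term length is stored as a character via `chr`, and pointers are plain Python ints; a real implementation would pack everything into a byte array:

```python
def build_blocked_string(terms, k=4):
    # Store the sorted dictionary as one long string. Each term is
    # prefixed by its length (one byte), and only the first term of
    # each block of k terms gets a term pointer.
    parts, block_pointers, pos = [], [], 0
    for i, term in enumerate(terms):
        if i % k == 0:
            block_pointers.append(pos)   # pointer for the block head only
        parts.append(chr(len(term)))     # 1-byte term length
        parts.append(term)
        pos += 1 + len(term)
    return "".join(parts), block_pointers

def read_block(s, start, k):
    # Decode up to k terms of a block by walking the length bytes.
    terms, pos = [], start
    while pos < len(s) and len(terms) < k:
        length = ord(s[pos])
        terms.append(s[pos + 1:pos + 1 + length])
        pos += 1 + length
    return terms

terms = ["auto", "best", "car", "cow", "dog", "hat"]
s, pointers = build_blocked_string(terms, k=4)
print(pointers)                       # [0, 18]: one pointer per block, not per term
print(read_block(s, pointers[1], 4))  # ['dog', 'hat']
```

Lookup becomes a binary search over the block pointers followed by a short linear scan inside the block: the price of the saved pointers.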
Index compression (7/8)

4. Blocked storage with front coding:
- Consecutive terms in a sorted dictionary share common prefixes; store each prefix once and only the differing suffixes of the following terms
- According to an experiment conducted by the author: size reduced to 5.9 MB (compared to 7.1 MB)
Index compression (8/8)

Postings file compression:
- Encode gaps: the gaps between consecutive docIDs in a postings list are much shorter than the docIDs themselves, so we store the gaps rather than the postings
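Gap encoding in two small functions; the docIDs below are made-up values chosen only to show how large IDs shrink to small gaps:

```python
def encode_gaps(postings):
    # Store the first docID, then each docID's gap to its predecessor.
    # In long postings lists the gaps are small and compress well
    # (e.g. with variable-byte or gamma codes on top of this).
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def decode_gaps(gaps):
    # Recover the original docIDs by running a cumulative sum.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283154, 283159, 283202]
gaps = encode_gaps(postings)
print(gaps)                           # [283154, 5, 43]
print(decode_gaps(gaps) == postings)  # True
```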
Review: scoring and term weighting

Metadata: information about a document. Metadata generally consists of "fields", e.g. date of creation, authors, title.

Zone: similar to a field. Difference: a zone contains arbitrary free text, e.g. an abstract or overview.
Review: scoring and term weighting

Term frequency (tf_t,d): number of occurrences of the term in the document.
Problem: document size skews raw counts, leading to inappropriate ranking.

Document frequency (df_t): number of documents in the collection that contain term t from the query.

Inverse document frequency: idf_t = log(N / df_t), where N is the total number of documents.

Significance of idf:
- If low, it is a common term (e.g. a stop word)
- If high, it is a rare word (e.g. "apothecary")
Review: scoring and term weighting

tf-idf weighting: tf-idf_t,d = tf_t,d * idf_t
- High: when the term occurs many times in a small number of documents
- Lower: when it occurs fewer times in a document, or occurs in many documents
- Lowest: when the term occurs in almost all documents

Score of a document: Score(q, d) = Σ_(t ∈ q) tf-idf_t,d
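The score formula above can be computed directly. A small sketch (raw term counts as tf and base-10 log for idf are my choices; real systems use smoothed variants):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    # docs: docID -> list of tokens.
    # Scores every document by Score(q, d) = sum over t in q of tf(t, d) * idf(t).
    n = len(docs)
    tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, tf in tfs.items():
        score = 0.0
        for t in query_terms:
            df = sum(1 for counts in tfs.values() if t in counts)
            if df:
                score += tf[t] * math.log10(n / df)
        scores[doc_id] = score
    return scores

docs = {
    1: "the cow runs in the morning".split(),
    2: "the dog sleeps".split(),
    3: "cow cow cow".split(),
}
scores = tf_idf_scores(["cow"], docs)
print(max(scores, key=scores.get))  # 3: highest tf for a moderately rare term
```

Note that a term occurring in every document (like "the" here) gets idf = log(N/N) = 0 and contributes nothing, exactly the "lowest" case above.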
Computing the score in a complete search system

Inexact top K document retrieval

Motivation: reduce the cost of calculating scores for all N documents. We calculate scores ONLY for those documents whose scores are likely to be high w.r.t. the given query.

How:
1. Find a set A of documents that are contenders, where K < |A| << N
2. Return the top K scoring documents from A
Index elimination

Preset idf threshold:
- Only traverse postings for terms with high idf
- Benefit: low-idf terms have long postings lists, so we remove them from the score computation

Require many query terms:
- Only traverse documents that contain many (or all) of the query terms
- Danger: we may end up with fewer than K documents in the end
Champion lists

Champion list = fancy list = top docs: for each term t in the dictionary, a precomputed set of the r documents in which the weight for t is highest.

How to create set A:
- Take the union of the champion lists of each term in the query
- Compute scores only for documents in that union

How and when to decide r:
- Highly application dependent
- The lists are created at indexing time

Problem: the union may still contain fewer than K documents.
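The retrieval step is then just a union plus a restricted scoring pass. A sketch, where the champion lists and the toy per-document scores are invented for illustration:

```python
def retrieve_with_champions(query_terms, champion_lists, score, k):
    # Set A = union of the precomputed champion lists of the query terms;
    # only documents in A are scored, and the top K of those are returned.
    a = set()
    for t in query_terms:
        a |= set(champion_lists.get(t, []))
    return sorted(a, key=score, reverse=True)[:k]

# Hypothetical champion lists (r = 2 highest-weight docs per term)
# and a toy scoring function keyed by docID.
champion_lists = {"auto": [1, 4], "car": [2, 4]}
toy_scores = {1: 0.9, 2: 0.4, 4: 0.7}
top = retrieve_with_champions(["auto", "car"], champion_lists,
                              score=lambda d: toy_scores[d], k=2)
print(top)  # [1, 4]
```

Only the three documents in the union are scored, however large the full collection is.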
Static quality scores and ordering

Many search engines maintain a measure of quality g(d) for each document. The net score is calculated as a combination of g(d) and the tf-idf score.

How to exploit this:
- Keep each postings list in decreasing order of g(d), so we only need to traverse the first few documents in each list
- Global champion lists: choose the r documents with the highest value of g(d) + tf-idf
Cluster pruning (1/2)

Cluster the documents in a preprocessing step:
1. Pick √N documents; call them "leaders"
2. For each document that is not a leader, compute its nearest leader

Followers: documents that are not leaders. Each leader has approximately √N followers.

Cluster pruning (2/2)

How it helps:
- Given a query q, find the leader L nearest to q, i.e. calculate scores for only √N documents
- Set A then contains L together with its √N followers, so again only about √N documents are scored
Tiered indexes

[Figure: a tiered inverted index for the terms "auto", "best", and "car". Tier 1 holds each term's highest-scoring postings (preset threshold value set to 20); Tier 2 holds the lower-scoring postings (preset threshold value set to 10).]

Tiered indexes address the issue of the set A of contenders containing fewer than K documents: if the top tier yields too few results, fall back to the next tier.
A complete search system

[Figure: components of a complete search system]
- Documents are parsed and linguistically processed, then fed to the indexers
- The indexers build: metadata in zone and field indexes, a tiered inverted positional index, and k-gram indexes
- A user query passes through the free text query parser and spell correction, then scoring and ranking (inexact top K retrieval, driven by scoring parameters and MLR trained on a training set)
- The result page is produced, with caching of documents and results
Questions?
Thank you