Efficient Processing of Complex Features for Information Retrieval
Dissertation by Trevor Strohman
Presented by Laura C. Vandivier
For ITCS6050, UNCC, Fall 2008
Overview
• Indexing
• Ranking
• Query Expansion
• Query Evaluation
• Tupleflow
Topics Not Covered
• Binned Probabilities
• Score-Sorted Index Optimization
• Document-Sorted Index Optimization
• Navigational Search with Complex Features
Document Indexing
• Inverted List: a mapping from a single word to the set of documents that contain the word
• Inverted Index: a set of inverted lists
Inverted Index
• Contains one inverted list for each term in the document collection
• Often omits frequently occurring words such as “a,” “and,” and “the”
Inverted Index Example
Sample Documents
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.

Inverted Index
cats   → 1, 2
dogs   → 1, 2
fish   → 4
goats  → 3
sheep  → 2, 3
whales → 3, 4

Query          Answer
cats           1, 2
sheep + dogs   2
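As a sketch, the index above can be built in a few lines of Python (lowercase whitespace tokenization and the `query_and` helper are illustrative assumptions, not part of the dissertation):

```python
from collections import defaultdict

# The slide's four sample documents, with punctuation already stripped.
docs = {
    1: "cats dogs dogs",
    2: "dogs cats sheep",
    3: "whales sheep goats",
    4: "fish whales whales",
}

index = defaultdict(set)  # term -> set of document ids (one inverted list per term)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def query_and(*terms):
    """Documents containing every query term (the 'sheep + dogs' query)."""
    lists = [index[t] for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

print(query_and("cats"))           # → [1, 2]
print(query_and("sheep", "dogs"))  # → [2]
```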
Expanding Inverted Indexes
• Include term frequency: more occurrences of a term suggest the document is more “about” it

cats   → (1,1), (2,1)
dogs   → (1,2), (2,1)
fish   → (4,1)
goats  → (3,1)
sheep  → (2,1), (3,1)
whales → (3,1), (4,2)

Each entry is (document, count).
Expanding Inverted Indexes (cont.)
• Add word position information: facilitates phrase searching

cats   → (1,1): 1; (2,1): 2
dogs   → (1,2): 2, 3; (2,1): 1
fish   → (4,1): 1
goats  → (3,1): 3
sheep  → (2,1): 3; (3,1): 2
whales → (3,1): 1; (4,2): 2, 3

Each entry is (document, count): positions.
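A minimal sketch of how positions enable phrase search, over the same four sample documents (the `phrase_search` helper and its consecutive-position check are illustrative assumptions):

```python
from collections import defaultdict

docs = {
    1: "cats dogs dogs",
    2: "dogs cats sheep",
    3: "whales sheep goats",
    4: "fish whales whales",
}

# Positional index: term -> {doc_id: [positions]}, 1-based as on the slide.
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split(), start=1):
        index[term][doc_id].append(pos)

def phrase_search(phrase):
    """Documents where the phrase's terms appear at consecutive positions."""
    terms = phrase.lower().split()
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])          # must contain every term
    results = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Check that term k occurs at position start + k.
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                results.append(doc_id)
                break
    return results

print(phrase_search("cats dogs"))  # doc 1: "cats" at 1, "dogs" at 2
```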
Inverted Index Statistics
• Compressed inverted indexes containing only word counts
– About 5% of the document collection in size
– Built and queried faster
• Compressed inverted indexes containing word counts and positions
– About 20% of the document collection in size
– Essential for high effectiveness, even in queries not using phrases
Document Ranking
• Documents are returned in order of relevance
• Perfect ranking is impossible
• Retrieval systems calculate the probability that a document is relevant
Computing Relevance
• Assume a “bag of words” model with term independence
• Simple estimate: score each query term by (# occurrences) / (document length)
• Problems
1. If a document does not contain all the words of a multi-word query, it will not be retrieved: a document containing none of the words scores the same as one containing some of them.
2. All words are treated equally. For the query “Maltese falcon,” document(maltese:2, falcon:1) scores the same as document(maltese:1, falcon:2) for documents of similar length.
• Smoothing can help
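The slides only say “smoothing can help”; one standard way to realize it is linear (Jelinek-Mercer) interpolation with a collection-wide background model, sketched below. The document counts, collection counts, and the λ = 0.8 weight are all illustrative assumptions:

```python
import math

def score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lam=0.8):
    """Log-likelihood of the query under a smoothed document language model."""
    s = 0.0
    for w in query_terms:
        p_doc = doc_counts.get(w, 0) / doc_len      # maximum-likelihood estimate
        p_coll = coll_counts.get(w, 0) / coll_len   # collection background model
        # Interpolation keeps the probability nonzero even when the
        # document is missing a query word, fixing problem 1 above.
        s += math.log(lam * p_doc + (1 - lam) * p_coll)
    return s

coll = {"maltese": 3, "falcon": 3, "bird": 10}   # toy collection counts (len 16)
d1 = {"maltese": 2, "falcon": 1}                 # document of length 3
d2 = {"maltese": 1}                              # length 3, missing "falcon"

# d2 still gets a finite score, but d1 ranks higher.
print(score(["maltese", "falcon"], d1, 3, coll, 16) >
      score(["maltese", "falcon"], d2, 3, coll, 16))  # → True
```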
Computing Relevance (cont.)
• Add additional features
– Position/field in the document, e.g. the title
– Proximity of query terms
– Combinations of the above
Computing Relevance (cont.)
Add query-independent information
• # of links from other documents
• URL depth: shorter implies general, longer implies specific
• User clicks: may match expectations but not relevance
• Dwell time
• Document quality models: unusual term distribution implies poor grammar, so the document is not a good retrieval candidate
Query Expansion
Stemming: groups words that express the same concept, based on natural-language rules. ex: run, runs, running, ran
• Aggressive stemmer: may group words that are not related. ex: marine, marinate
• Conservative stemmer: may fail to group words that are related. ex: run, ran
• Statistical stemmer: uses word co-occurrence data to determine whether words are related; would probably avoid the marine/marinate mistake
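A tiny rule-based stemmer sketch illustrates both failure modes. The suffix list and minimum-stem length are made-up toy rules, far cruder than a real Porter stemmer:

```python
# Hypothetical rule list; order matters ("ning" must be tried before "ing").
SUFFIXES = ["ning", "ing", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

print([stem(w) for w in ["run", "runs", "running"]])  # → ['run', 'run', 'run']
# Rule-based stemmers miss irregular forms: "ran" stays unrelated to "run".
print(stem("ran"))  # → 'ran'
```

Too aggressive a rule list would conflate unrelated words (the marine/marinate mistake); too conservative a list misses pairs like run/ran, which is exactly the trade-off the slide describes.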
Query Expansion (cont.)
Synonyms: group terms that express the same concept
• Problem: synonym sets differ by context
US: president = head of state = commander in chief
UK: prime minister = head of state
Corporation: president = chief executive (maybe)
• Solutions
– Include synonyms in the query but prefer exact term matches
– Use context from the whole query: “president of canada” → “prime minister”
Query Expansion (cont.)
Relevance Feedback: the user selects relevant documents, and they are used to find similar documents.

Pseudo Relevance Feedback: the system assumes the first few documents retrieved are relevant and uses them to search for more. No user involvement, so not as precise.
Evaluation
• Effectiveness
• Efficiency
Effectiveness
• Precision: # of relevant results / # of results
• Success: whether the first document was relevant
• Recall: # of relevant documents found / # of relevant documents that exist
• Mean Average Precision (MAP): average precision over all relevant documents
• Normalized Discounted Cumulative Gain (NDCG): calculated as a sum over result ranks
Calculating MAP
Assume a retrieval set of 10 documents, with those at ranks 1, 5, 7, 8, and 10 relevant.

Rank   Precision
1      1/1 = 1
5      2/5 = .4
7      3/7 = .43
8      4/8 = .5
10     5/10 = .5

If there were only 5 relevant documents, then (1 + .4 + .43 + .5 + .5) / 5 = .57.
If we retrieved only 5 of 6 relevant documents, then (1 + .4 + .43 + .5 + .5) / 6 = .47.
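The calculation above can be sketched in Python (note that precision at rank 5 is 2/5 = .4, so the exact averages come out to .57 and .47):

```python
def average_precision(relevant_ranks, total_relevant):
    """Mean of precision@rank over the ranks where relevant docs appear.

    relevant_ranks: ranks (1-based) at which relevant documents were retrieved.
    total_relevant: number of relevant documents that exist for the query.
    """
    total = 0.0
    for found_so_far, rank in enumerate(sorted(relevant_ranks), start=1):
        total += found_so_far / rank  # precision at this rank
    return total / total_relevant

# The slide's example: relevant documents at ranks 1, 5, 7, 8, and 10.
print(round(average_precision([1, 5, 7, 8, 10], 5), 2))  # → 0.57
print(round(average_precision([1, 5, 7, 8, 10], 6), 2))  # → 0.47
```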
NDCG
• Uses graded relevance values rather than just relevant/not relevant, with 0 being not relevant and 4 being most relevant.
• Calculated as

NDCG = N · Σi (2^r(i) − 1) / log(1 + i)

where i is the rank, r(i) is the relevance value at that rank, and N normalizes the sum so that a perfect ranking scores 1.
• Example, comparing MAP and NDCG over a mixed list of relevant and non-relevant results:

i     MAP    NDCG
1     1.00   1.00
10    .51    .79
20    .33    .55
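The formula above can be sketched directly (a base-2 log and normalization by the ideal ordering's DCG are assumptions; the slide does not fix the log base):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of (2^r(i) - 1) / log2(1 + i)."""
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (best-first) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([4, 3, 2, 1, 0]))  # perfect ordering → 1.0
print(ndcg([0, 1, 2, 3, 4]))  # worst ordering scores strictly lower
```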
Efficiency
• Throughput: # of queries processed per second; comparisons must use identical systems.
• Latency: time between when the user issues a query and the system delivers a response; < 150 ms is considered “instantaneous.”
• Generally, improving one implies worsening the other.
Measuring Efficiency
• Direct: attempt to create a real-world system and measure statistics. Straightforward, but limited by the access the experimenter has.
• Simulation: system operation is simulated in software. Repeatable, but only as good as its model.
Query Evaluation
• Document-at-a-time: evaluate each term for a document before moving to the next document.
• Term-at-a-time: evaluate each document for a term before moving to the next term.
Document-at-a-Time
• Produces complete document scores early, so it can quickly display partial results.
• Can incrementally fetch the inverted list data, so it uses less memory.
Document-at-a-Time Algorithm
procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
        li ← InvertedList(wi)
        L.add(li)
    end for
    for all documents D in the collection do
        for all inverted lists li in L do
            sD ← sD + f(Q,C,wi)(c(wi; D))    # update the document score
        end for
        sD ← sD · d(Q,C)(|D|)                # multiply by a document-dependent factor
        R.add(sD, D)
    end for
    return the top n results from R
end procedure
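A runnable Python sketch of the same loop structure. The abstract weighting functions f(Q,C,wi) and d(Q,C) are simplified here to a plain term-frequency sum with 1/|D| length normalization, and the toy index is an assumption:

```python
import heapq

def document_at_a_time(query, index, doc_lengths, n=10):
    """Score each document completely before moving to the next one.

    index: term -> {doc_id: term count}; doc_lengths: doc_id -> |D|.
    """
    lists = [index.get(w, {}) for w in query]   # one inverted list per term
    heap = []                                   # bounded priority queue R
    for doc_id, length in doc_lengths.items():
        score = sum(l.get(doc_id, 0) for l in lists)  # update per inverted list
        score /= length                               # document-dependent factor
        heapq.heappush(heap, (score, doc_id))
        if len(heap) > n:
            heapq.heappop(heap)                       # keep only the top n
    return sorted(heap, reverse=True)

index = {"cats": {1: 1, 2: 1}, "dogs": {1: 2, 2: 1}}
lengths = {1: 3, 2: 3, 3: 3, 4: 3}
print(document_at_a_time(["cats", "dogs"], index, lengths, n=2))
```

Because each document's score is final once its loop iteration ends, partial result lists can be shown before the whole collection has been processed, as the slide notes.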
Term-at-a-Time
• Does not jump between inverted lists, so it saves branching.
• The inner loop iterates over documents, so it executes for a long time and is easier to optimize.
• Efficient query processing strategies have been developed for term-at-a-time evaluation.
• Preferred for efficient system implementation.
Term-at-a-Time Algorithm
procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
        li ← InvertedList(wi)
        for all documents D in li do
            A[D] ← A[D] + f(Q,C,wi)(c(wi; D))
        end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
        sD ← A[D] · d(Q,C)(|D|)    # normalize the accumulator value
        R.add(sD, D)
    end for
    return the top n results from R
end procedure
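The same algorithm in runnable Python, with the accumulator table as a dict and the same simplified weighting assumed in the document-at-a-time sketch (term counts summed per document, then normalized by document length):

```python
from collections import defaultdict
import heapq

def term_at_a_time(query, index, doc_lengths, n=10):
    """Process one whole inverted list at a time, accumulating partial scores.

    index: term -> {doc_id: term count}; doc_lengths: doc_id -> |D|.
    """
    acc = defaultdict(float)                  # accumulator table A
    for w in query:                           # one inverted list per term
        for doc_id, count in index.get(w, {}).items():
            acc[doc_id] += count              # A[D] ← A[D] + contribution
    results = [(score / doc_lengths[d], d)    # normalize accumulator values
               for d, score in acc.items()]
    return heapq.nlargest(n, results)         # top n by normalized score

index = {"cats": {1: 1, 2: 1}, "dogs": {1: 2, 2: 1}}
lengths = {1: 3, 2: 3, 3: 3, 4: 3}
print(term_at_a_time(["cats", "dogs"], index, lengths, n=2))
```

Note the inner loop touches only one inverted list at a time, which is the locality property the bullets above credit for its efficiency.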
Optimization Types
• Unoptimized
• Unsafe
• Set Safe
• Rank Safe
• Score Safe
Unoptimized
• Compare the query to each document and calculate the score.
• Sort the documents. Documents with the same score may appear in any order.
• Return results in ranked order; the “top k documents” could therefore differ between runs.
Optimized
• Unsafe: documents returned have no guaranteed set of properties.
• Set Safe: documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results.
• Rank Safe: documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results.
• Score Safe: documents are guaranteed to be in the result set and have the same scores as the unoptimized results.
Tupleflow
Distributed computing framework for indexing.
• Flexibility: settings are made in parameter files; no code changes required
• Scalability: independent tasks are spread across processors
• Disk abstraction: streaming data model
• Low abstraction penalty: code can handle custom hashing, sorting, and serialization
Traditional Indexing Approach
Create a word occurrence model by counting the unique terms in each document.
• Serial processing: parse one document, then move to the next
• Large memory requirements for the unique-word hash over a large document set (words, misspellings, numbers, URLs, etc.)
• Different code required for each document type (documents, web pages, databases, etc.)
Tupleflow Approach
Break processing into steps
• Count terms (countsMaker)
• Sort terms
• Combine counts (countsReducer)
Tupleflow Example
“The cat in the hat.”

countsMaker     sort            countsReducer
Word  Count     Word  Count     Word  Count
the   1         cat   1         cat   1
cat   1         hat   1         hat   1
in    1         in    1         in    1
the   1         the   1         the   2
hat   1         the   1
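The three stages can be sketched as plain Python functions; the names `counts_maker` and `counts_reducer` mirror the slide's stage names, and the punctuation stripping is an assumption:

```python
from itertools import groupby

def counts_maker(text):
    """Emit a (word, 1) pair for each token, as the countsMaker stage does."""
    return [(w.strip("."), 1) for w in text.lower().split()]

def counts_reducer(sorted_pairs):
    """Combine counts for runs of equal words; the input must be sorted."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_pairs, key=lambda p: p[0])]

pairs = counts_maker("The cat in the hat.")   # countsMaker stage
pairs.sort()                                  # sort stage
print(counts_reducer(pairs))
# → [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

Because the reducer only needs adjacent equal keys, each stage can stream its output to the next, which is the disk abstraction the previous slide describes.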
Tupleflow Execution Graph
• Single Processor: filenames → read text → parse text → count words
• Multiple Processors: filenames are distributed to several parallel pipelines (read text → parse text → count words), whose outputs feed a single combine counts step
Summary
Document indexing and querying are time- and resource-intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resources and maximize efficiency. Tupleflow is one example of efficient indexing through parallelization.
Questions?