8/3/2019 Ranking in IR & Www Paper1
February 15, 2005
Ranking in IR and WWW
Modern Information Retrieval: A Brief Overview
- Amit Singhal
Presented by Parin Sangoi

Outline
Introduction
History
Models and Implementation
  Vector Space Model
  Probabilistic Models
  Inference Network Model
  Implementation
Evaluation
Key Techniques
  Term Weighting
  Query Modification
Other Techniques and Applications
References

Introduction
What is Information Retrieval?
"An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request."
- F.W. Lancaster

Introduction (contd.)
[Figure: a typical IR system]

History
Vannevar Bush (1945)
  "As We May Think" (gave birth to the idea of automatic access)
H.P. Luhn (1957)
  Proposed words as indexing units for documents, and word overlap as a criterion for retrieval.
Gerard Salton and students
  SMART system (improved search quality)
Cyril Cleverdon
  Cranfield evaluations, a methodology still in use for evaluating IR systems.

History (contd.)
1970s and 1980s
  Many developments built on the advances of the 1960s.
  Many retrieval models were developed.
  These were shown to be effective on small text collections.
1992
  Text Retrieval Conference (TREC)
  Aims at encouraging research in IR on large text collections.
  Old techniques were modified and new techniques developed.

Models and Implementation
Boolean systems
  Queries are built from ANDs, ORs, and NOTs.
  There is no ranking, and it is difficult for a user to form a good search request.
  Many users still use Boolean systems because they feel more in control of the retrieval process.
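
As a rough illustration of Boolean retrieval, the sketch below evaluates AND, OR, and NOT as set operations over the documents that contain each term. The toy collection and the helper name `postings` are assumptions made for this example; note that the results are unordered sets, which is the "no ranking" limitation mentioned above.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval ranking",
    2: "boolean retrieval systems",
    3: "ranking web pages",
}

def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

# retrieval AND ranking
both = postings("retrieval") & postings("ranking")
# ranking OR boolean
either = postings("ranking") | postings("boolean")
# retrieval AND NOT boolean
and_not = postings("retrieval") - postings("boolean")
```

Every matching document is returned with equal standing; nothing in the result says which one is the better match.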

Models and Implementation (contd.)
Vector Space Model
  Text is represented by a vector of terms.
  If words are chosen as terms, then every word in the vocabulary becomes an independent dimension of the vector space.
  A text vector has a non-zero value in a term's dimension if the term belongs to the text.
  Text vectors are very sparse, as a vocabulary can contain millions of terms.
  The similarity between the query vector and the document vector is used to assign a numeric score to a document for a query.

Vector Space Model (contd.)
The angle between two vectors is used as a measure of divergence between them, and the cosine of the angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors).
Alternatively, the dot product of the two vectors can be used to measure similarity. If all vectors are of unit length, the cosine of the angle is the same as the dot product.
If $V_q$ is the query vector and $V_d$ is the document vector, then the similarity of document D to query Q is:
$\mathrm{Sim}(V_d, V_q) = \sum_{i} W_{t_i}(V_q) \cdot W_{t_i}(V_d)$
where $W_{t_i}(V_q)$ is the ith component of the query vector $V_q$ and $W_{t_i}(V_d)$ is the ith component of the document vector $V_d$.
[Figure: document vector $V_d$ and query vector $V_q$]
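
The similarity computation can be sketched as follows. This is a minimal example, assuming sparse vectors are stored as `{term: weight}` dictionaries so that only non-zero components are kept; the function names and the sample weights are illustrative and not part of the original slides.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical query and document vectors.
v_q = {"ranking": 1.0, "retrieval": 1.0}
v_d1 = {"ranking": 2.0, "retrieval": 2.0}   # same direction as the query
v_d2 = {"web": 1.0}                         # shares no term with the query

sim_same_direction = cosine(v_q, v_d1)
sim_orthogonal = cosine(v_q, v_d2)
```

Because `v_d1` points in the same direction as the query, its cosine is 1 (up to floating-point rounding) even though its length differs, while `v_d2` shares no term with the query and scores 0.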

Probabilistic Models
One of the main principles of an IR system is that retrieved documents should be ranked. The probabilistic model ranks documents by decreasing probability of their relevance to a query.
Let $P(R|D)$ be the probability of relevance of document D. As the ranking criterion is monotonic under the log-odds transformation, we can rank documents by $\log \frac{P(R|D)}{P(\bar{R}|D)}$, where $P(\bar{R}|D)$ is the probability that the document is non-relevant.
Applying Bayes' theorem to this ratio, we get
$\log \frac{P(D|R)\,P(R)}{P(D|\bar{R})\,P(\bar{R})}$

Probabilistic Model (contd.)
Assuming $P(R)$ is independent of the document under consideration and is thus constant, $P(R)$ and $P(\bar{R})$ are just scaling factors and can be eliminated. Thus the above formula can be simplified to:
$\log \frac{P(D|R)}{P(D|\bar{R})}$
Independence Assumption
  Terms are assumed to occur in a document independently of one another.
If $p_i$ denotes $P(t_i|R)$ and $q_i$ denotes $P(t_i|\bar{R})$, the above log formula reduces to:
$\sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$

Probabilistic Model (contd.)
Croft and Harper assume that $p_i$ is the same for every query term, so $p_i / (1 - p_i)$ is a constant and can be ignored.
They also assume that almost all documents in a collection are non-relevant to a query (as collections are very large), and estimate $q_i$ by $n_i / N$, where N is the collection size and $n_i$ is the number of documents containing term i.
Thus we get a scoring function:
$\sum_{t_i \in Q \cap D} \log \frac{N - n_i}{n_i}$
which is similar to the IDF function.
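
To make the scoring function concrete, here is a small sketch that sums $\log((N - n_i)/n_i)$ over the terms shared by the query and the document. The function name, the document-frequency table, and the collection size are assumptions for illustration; the code also assumes $0 < n_i < N$ for every scored term, since the logarithm is undefined otherwise.

```python
import math

def idf_like_score(query_terms, doc_terms, doc_freq, n_docs):
    """Sum log((N - n_i) / n_i) over terms shared by the query and the document.

    Assumes 0 < doc_freq[term] < n_docs for every shared term.
    """
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        n_i = doc_freq[term]
        score += math.log((n_docs - n_i) / n_i)
    return score

# Hypothetical statistics: "retrieval" is rare, "the" is very common.
doc_freq = {"retrieval": 10, "the": 900}
N = 1000
doc = ["retrieval", "the"]

rare = idf_like_score(["retrieval"], doc, doc_freq, N)
common = idf_like_score(["the"], doc, doc_freq, N)
```

A rare term contributes a large positive amount, while a term found in more than half of the collection contributes a negative one, which is the IDF-like behaviour noted above.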

Inference Network Model
In the simplest implementation, a document instantiates a term with a certain strength, and the credit from multiple terms is accumulated given a query to compute the equivalent of a numeric score for the document.
If the strength is considered to be the weight of the term in the document, then the ranking is similar to that of the Vector Space Model or the Probabilistic Model.
The model itself does not define the strength of the instantiation; any formulation can be used.

Implementation
Inverted list data structure
  Fast access to a list of documents that contain a term, along with some additional information (weight, relative position, etc.).
Inverted Index
Stop Words (ignored)
  the, in, of, a, ...
Stemming
  retrieval, retrieve, retrieved, retrieving, retriever
  Stemming is poor if it causes wrong documents to be returned.
  Is it good enough?
Multi-Word Phrases
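
The inverted list idea can be sketched in a few lines. This is a minimal example under stated assumptions: whitespace tokenization, the four stop words listed above, and term positions as the "additional information" stored with each document. Stemming and phrase handling are left out, and the function name and sample documents are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "of", "a"}  # the stop words listed on the slide

def build_inverted_index(docs):
    """Map each non-stop-word term to {doc_id: [positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            if token in STOP_WORDS:
                continue  # stop words are ignored at indexing time
            index[token].setdefault(doc_id, []).append(position)
    return index

# Hypothetical two-document collection.
docs = {1: "the history of retrieval", 2: "retrieval in the web"}
index = build_inverted_index(docs)
retrieval_postings = index["retrieval"]
```

Looking up a term is a single dictionary access that returns every document containing it, so query processing never has to scan the documents themselves.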

Evaluation
Measurable Quantities:
1. The coverage of the collection, that is, the extent to which the system includes relevant matter;
2. The time lag, that is, the average interval between the time the search request is made and the time an answer is given;
3. The form of presentation of the output;
4. The effort involved on the part of the user in obtaining answers to his search requests;
5. The recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request;
6. The precision of the system, that is, the proportion of retrieved material that is actually relevant.
Of these, the last two measure the effectiveness of the retrieval system.

Precision and Recall
[Table: contingency table]

Precision and Recall (contd.)
Recall is the proportion of relevant documents retrieved by the system.
Precision is the proportion of retrieved documents that are relevant.
Fallout is the proportion of non-relevant documents retrieved by the system.
A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents).
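
The three measures can be computed directly from the retrieved set and the relevant set. The sketch below assumes document ids are hashable and that the collection size is known, which fallout needs; the function name and the sample numbers are made up for illustration.

```python
def evaluate(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                  # relevant and retrieved
    non_relevant_total = collection_size - len(relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fallout = (
        (len(retrieved) - hits) / non_relevant_total
        if non_relevant_total
        else 0.0
    )
    return precision, recall, fallout

# Hypothetical query: 4 documents retrieved, 3 relevant, 10 in the collection.
precision, recall, fallout = evaluate(
    retrieved=[1, 2, 3, 4], relevant=[2, 4, 6], collection_size=10
)
```

Here two of the four retrieved documents are relevant (precision 1/2), two of the three relevant documents were found (recall 2/3), and two of the seven non-relevant documents were retrieved (fallout 2/7).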

Precision and Recall (contd.)
Unfortunately, the two goals are quite contradictory.
Average Precision
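
The slide names average precision without defining it. One common formulation, assumed here, is the non-interpolated variant: average the precision observed at the rank of each relevant document, counting relevant documents that were never retrieved as zero. The function name and the sample ranking are hypothetical.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of a ranked list for one query."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    return precision_sum / len(relevant)

# Hypothetical ranking: relevant documents 3 and 4 appear at ranks 1 and 4.
ap = average_precision(ranking=[3, 1, 7, 4], relevant=[3, 4])
```

A single number of this kind rewards systems that place relevant documents near the top, which makes it easier to compare rankings than a separate precision and recall pair.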

Key Techniques
Term Weighting
Both the Probabilistic Model and the Vector Space Model need a weight function to determine the ranked relevance.
Three main factors affect the weight formulation:
Term Frequency (tf)
  Words that repeat multiple times in a document are considered salient.
Document Frequency (df)
  Words that appear in many documents are considered common and are not very indicative of document content. A weighting method based on this is called inverse document frequency (idf) weighting.
Document Length
  When collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated for by normalizing for document length in the term weighting method.

Term Weighting (contd.)
After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function) is a better weighting metric.
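
The three factors can be combined in one weighting function. The sketch below is one possible combination, not the formula from the slides: a dampened term frequency `1 + log(tf)`, an idf of `log(N / df)`, and cosine (unit-length) normalization for document length. All names and the sample statistics are assumptions.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """Return a unit-length {term: weight} vector using log-tf times idf."""
    counts = Counter(doc_tokens)
    weights = {
        term: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
    }
    # Length normalization: divide by the Euclidean norm of the vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {term: w / norm for term, w in weights.items()}

# Hypothetical statistics: "the" occurs in all 4 documents, "retrieval" in 2.
doc_freq = {"retrieval": 2, "the": 4}
vec = tfidf_vector(
    ["the", "retrieval", "retrieval", "the", "the"], doc_freq, n_docs=4
)
```

Although "the" is the most frequent token in this document, its idf is zero, so all of the normalized weight falls on "retrieval".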

Query Modification
Synonyms
  Earlier systems relied on a thesaurus.
  Newer ones build their own thesauri by analyzing word co-occurrence.
Relevance Feedback
  Users are the best judges of whether a retrieved document is relevant or non-relevant to their query.
Pseudo-Feedback
  Relevance feedback is applied to the top few retrieved documents, which are assumed to be relevant, to generate a new query.
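
Pseudo-feedback can be sketched as a simple query expansion step. The slides do not give a formula, so this example assumes the simplest possible variant: treat the top `top_k` documents of an initial ranking as relevant and add their most frequent new terms to the query. The function name, parameters, and sample data are all hypothetical.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Add the most frequent new terms from the top_k ranked documents."""
    counts = Counter()
    for tokens in ranked_docs[:top_k]:
        # Only count terms that are not already part of the query.
        counts.update(t for t in tokens if t not in query_terms)
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

# Hypothetical initial ranking for the query ["retrieval"], best match first.
ranked_docs = [
    ["ranking", "retrieval", "vector"],
    ["retrieval", "vector", "cosine", "vector"],
    ["cooking", "recipes"],
]
new_query = expand_query(["retrieval"], ranked_docs, top_k=2, n_new_terms=1)
```

The expanded query is then run again. If the top documents were not actually relevant, the added terms can pull the query off topic, which is the main risk of doing this without user judgments.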

Other Techniques and Applications
Techniques
  Cluster Hypothesis
    Documents that are very similar to each other will have a similar relevance profile for a given query.
    Limited success
    Aided in the development of browsing and searching interfaces
  Natural Language Processing
    Limited success
Applications
  Information Filtering
  Topic Detection and Tracking (TDT)
  Speech Retrieval
  Cross-Language Retrieval
  Question Answering
  ...

References
Modern Information Retrieval: A Brief Overview - Amit Singhal
Information Retrieval - C.J. van Rijsbergen
Dr. Gautam Das's Lecture Notes