Ranking in IR & Www Paper1


February 15, 2005

Ranking in IR and WWW

Modern Information Retrieval: A Brief Overview

Amit Singhal

Presented by Parin Sangoi



Outline

- Introduction
- History
- Models and Implementation
  - Vector Space Model
  - Probabilistic Models
  - Inference Network Model
  - Implementation
- Evaluation



Outline (contd.)

- Key Techniques
  - Term Weighting
  - Query Modification
- Other Techniques and Applications
- References



Introduction

What is Information Retrieval?

"An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." - F.W. Lancaster


Introduction (contd.)

[Figure: a typical IR system]


History

- Vannevar Bush (1945)
  - "As We May Think" (gave birth to the idea of automatic access)
- H.P. Luhn (1957)
  - Indexing units for documents, and measuring word overlap as a criterion for retrieval.
- Gerard Salton and students
  - SMART system (improved search quality)
- Cyril Cleverdon
  - Cranfield evaluations, whose methodology is still in use for evaluating IR systems.


History (contd.)

- 1970s and 1980s
  - Many developments built on the advances of the 1960s.
  - Many retrieval models were developed.
  - These were effective on small text collections.
- 1992
  - Text Retrieval Conference (TREC)
  - Aims at encouraging research in IR on large text collections.
  - Old techniques were modified and new techniques developed.


Models and Implementation

Boolean systems

- Queries are built from ANDs, ORs, and NOTs.
- No ranking, and it is difficult for a user to form a good search request.
- Many users still use Boolean systems because they feel more in control of the retrieval process.
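
As a rough illustration of the Boolean model, the sketch below evaluates AND, OR, and NOT as set operations over per-term document sets. The toy collection, the `postings` dictionary, and the `term` helper are hypothetical names introduced only for this example. Note that the result is an unordered set, which is the "no ranking" limitation mentioned above.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval ranking",
    2: "boolean retrieval systems",
    3: "ranking web pages",
}

# Map each term to the set of document ids that contain it.
postings = {}
for doc_id, text in docs.items():
    for word in text.split():
        postings.setdefault(word, set()).add(doc_id)

all_ids = set(docs)


def term(t):
    """Documents containing term t (empty set if the term is unseen)."""
    return postings.get(t, set())


# AND -> intersection, OR -> union, NOT -> complement against the collection.
and_result = term("retrieval") & term("ranking")             # retrieval AND ranking
or_result = term("boolean") | term("web")                    # boolean OR web
not_result = term("retrieval") & (all_ids - term("boolean")) # retrieval AND NOT boolean
```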


Models and Implementation (contd.)

Vector Space Model

- Text is represented by a vector of terms.
- If words are chosen as terms, then every word in the vocabulary becomes an independent dimension of the vector space.
- A non-zero value is assigned to a component of a text vector if the corresponding term belongs to the text.
- Text vectors are very sparse, as there can be millions of terms in a vocabulary.
- Similarity between the query vector and the document vector is used to assign a numeric score to a document for a query.



Vector Space Model (contd.)

- The angle between the two vectors is used as a measure of divergence between them, and the cosine of the angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors).
- Alternatively, the dot product between the two vectors can be used to measure similarity. If all the vectors are of unit length, then the cosine of the angle is the same as the dot product.
- If $V_q$ is the query vector and $V_d$ is the document vector, then the similarity of document D to query q is:

$$\mathrm{Sim}(V_d, V_q) = \sum_{i} W_{t_i}(V_q) \cdot W_{t_i}(V_d)$$

where $W_{t_i}(V_q)$ is the i-th component of the query vector $V_q$ and $W_{t_i}(V_d)$ is the i-th component of the document vector $V_d$.

[Figure: vectors Vd and Vq]
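
A minimal sketch of the dot product and cosine similarity for sparse text vectors follows. Storing each vector as a `{term: weight}` dictionary is an assumption made here for illustration, and the example weights are invented.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


v_q = {"ranking": 1.0, "retrieval": 1.0}      # hypothetical query vector
v_d = {"ranking": 2.0, "retrieval": 2.0}      # same direction as v_q
v_other = {"speech": 1.0}                     # shares no terms with v_q

same_direction = cosine(v_q, v_d)       # close to 1
orthogonal = cosine(v_q, v_other)       # exactly 0.0
```

Only the terms present in both vectors contribute to the sum, which is why the sparse representation stays cheap even with a very large vocabulary.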


Probabilistic Models

- One of the main principles of an IR system is that its output should be ranked. The probabilistic model ranks documents by decreasing probability of their relevance to a query.
- Let $P(R \mid D)$ be the probability of relevance of document D. As the ranking criterion is monotonic under the log-odds transformation, we can rank documents by $\log \frac{P(R \mid D)}{P(\bar{R} \mid D)}$, where $P(\bar{R} \mid D)$ is the probability that the document is non-relevant.
- Applying Bayes' theorem to this ratio we get

$$\log \frac{P(D \mid R)\, P(R)}{P(D \mid \bar{R})\, P(\bar{R})}$$



Probabilistic Model (contd.)

- Assuming $P(R)$ is independent of the document under consideration and is thus constant, $P(R)$ and $P(\bar{R})$ are just scaling factors and can be eliminated. Thus the above formula can be simplified to $\log \frac{P(D \mid R)}{P(D \mid \bar{R})}$.
- Independence Assumption: terms are assumed to be mutually independent, so $P(D \mid R)$ factors into a product of per-term probabilities.
- If $p_i$ denotes $P(t_i \mid R)$ and $q_i$ denotes $P(t_i \mid \bar{R})$, the above log formula reduces, up to a constant that does not depend on the document, to a sum over the terms $t_i$ present in both the query Q and the document D:

$$\sum_{t_i \in Q, D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$$


Probabilistic Model (contd.)

- Croft and Harper assume that $p_i$ is the same for all query terms, so that $p_i/(1-p_i)$ is a constant and can be ignored.
- They also assume that almost all the documents in a collection are non-relevant to a query (as collections are very large) and estimate $q_i$ by $n_i/N$, where N is the collection size and $n_i$ is the number of documents containing term i.
- Thus we get a scoring function:

$$\sum_{t_i \in Q, D} \log \frac{N - n_i}{n_i}$$

which is similar to the IDF function.
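
The sketch below assumes the scoring function is the sum of log((N - n_i) / n_i) over the query terms that also occur in the document, which is what the estimate q_i = n_i / N gives. The function name `idf_like_score` and the document-frequency numbers are hypothetical.

```python
import math


def idf_like_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum log((N - n_i) / n_i) over query terms present in the document."""
    score = 0.0
    for t in set(query_terms) & set(doc_terms):
        n_i = doc_freq[t]
        if 0 < n_i < num_docs:  # skip terms for which the log is undefined
            score += math.log((num_docs - n_i) / n_i)
    return score


# Hypothetical statistics: "the" is in 900 of 1000 documents, "retrieval" in 10.
doc_freq = {"the": 900, "retrieval": 10}
N = 1000
doc = ["retrieval", "the"]

rare = idf_like_score(["retrieval"], doc, doc_freq, N)   # log(990 / 10)
common = idf_like_score(["the"], doc, doc_freq, N)       # log(100 / 900)
```

A rare term contributes a large positive amount, while a term found in more than half of the collection contributes a negative amount, which mirrors the behaviour of inverse document frequency.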


Inference Network Model

- In the simplest implementation, a document instantiates a term with a certain strength, and the credit from multiple terms is accumulated given a query to compute the equivalent of a numeric score for the document.
- If the strength is taken to be the weight of the term in the document, then the ranking is similar to that of the Vector Space Model or the Probabilistic Model.
- The model itself does not define the strength of instantiation, so any formulation can be used.


Implementation

- Inverted list data structure
  - Fast access to the list of documents that contain a term, along with some additional information (weight, relative position, etc.).
- Inverted Index
- Stop Words (ignored)
  - the, in, of, a, ...
- Stemming
  - retrieval, retrieve, retrieved, retrieving, retriever
  - Stemming is poor if it returns wrong documents.
  - Is it good enough?
- Multi-Word Phrases
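
A minimal sketch of an inverted index with stop-word removal follows. It stores, for each term, the documents that contain it and the word positions inside each document. The names `STOP_WORDS` and `build_inverted_index` and the two sample documents are assumptions for this example, and stemming is deliberately left out.

```python
from collections import defaultdict

# Only the stop words listed above; real stop lists are much longer.
STOP_WORDS = {"the", "in", "of", "a"}


def build_inverted_index(docs):
    """Map each non-stop-word term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token in STOP_WORDS:
                continue  # stop words are ignored, but positions still count them
            index[token].setdefault(doc_id, []).append(pos)
    return index


docs = {
    1: "the history of retrieval",
    2: "retrieval in the web",
}
index = build_inverted_index(docs)
retrieval_postings = index["retrieval"]  # documents and positions for one term
```

Keeping positions in the postings is one way to support multi-word phrase matching later, since adjacent positions in the same document indicate a phrase.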


Evaluation

Measurable Quantities:

1. The coverage of the collection, that is, the extent to which the system includes relevant matter;
2. The time lag, that is, the average interval between the time the search request is made and the time an answer is given;
3. The form of presentation of the output;
4. The effort involved on the part of the user in obtaining answers to his search requests;
5. The recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request;
6. The precision of the system, that is, the proportion of retrieved material that is actually relevant.

Of these, the last two measure the effectiveness of the retrieval system.


Precision and Recall

[Table: contingency table]


Precision and Recall (contd.)

- Recall is the proportion of relevant documents retrieved by the system.
- Precision is the proportion of retrieved documents that are relevant.
- Fallout is the proportion of non-relevant documents retrieved by the system.
- A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents).
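
The three measures can be computed directly from the set of retrieved documents, the set of relevant documents, and the collection size. The sketch below uses a hypothetical `evaluate` helper and invented document ids.

```python
def evaluate(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)               # relevant and retrieved
    non_relevant = collection_size - len(relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return precision, recall, fallout


# Hypothetical query: 4 documents retrieved, 5 relevant, 25 in the collection.
precision, recall, fallout = evaluate(
    retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10], collection_size=25
)
```

Here two of the four retrieved documents are relevant, two of the five relevant documents were found, and two of the twenty non-relevant documents were retrieved.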


Precision and Recall (contd.)

- Unfortunately the two goals are quite contradictory.
- Average Precision
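
The slide names average precision without defining it. The sketch below assumes the common non-interpolated variant: the precision is taken at the rank of each relevant document that is retrieved, and these values are averaged over all relevant documents. The function name and document ids are hypothetical.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of a ranked list for one query."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)


ap = average_precision(
    ranking=["d3", "d7", "d1", "d9"], relevant={"d3", "d1", "d5"}
)
```

In this example the relevant documents appear at ranks 1 and 3, giving precisions of 1 and 2/3, and the missing third relevant document pulls the average down.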


Key Techniques

Term Weighting

- Both the Probabilistic Model and the Vector Space Model need a weight function to determine the ranked relevance.
- Three main factors affect the weight formulation:
  - Term Frequency (tf)
    - Words that repeat multiple times in a document are considered salient.
  - Document Frequency (idf)
    - Words that appear in many documents are considered common and are not very indicative of document content. A weighting method based on this is called inverse document frequency (idf) weighting.
  - Document Length
    - When collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated for by normalizing for document length in the term weighting method.


Term Weighting (contd.)

- After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function) is a better weighting metric.
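
One way to combine term frequency, document frequency, and document length is sketched below. It assumes a logarithmic tf of the form 1 + ln(tf), an idf of the form ln(N / n), and cosine (unit-length) normalization. These are common choices, not necessarily the exact formulas the presentation had in mind, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def weight_documents(docs):
    """Return {doc_id: {term: weight}} with unit-length weight vectors."""
    n_docs = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for c in counts.values() for t in c)
    weighted = {}
    for d, c in counts.items():
        raw = {
            t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
            for t, tf in c.items()
        }
        # Cosine normalization compensates for document length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted[d] = {t: (w / norm if norm else 0.0) for t, w in raw.items()}
    return weighted


docs = {
    "d1": "ranking ranking ranking web search",
    "d2": "web search",
    "d3": "web retrieval retrieval",
}
weights = weight_documents(docs)
```

In this toy collection "web" occurs in every document, so its idf and therefore its weight is zero, while the repeated and rare term "ranking" dominates the vector for `d1`.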


Query Modification

- Synonyms
  - Earlier systems relied on a thesaurus.
  - Newer ones build their own thesauri by analyzing word co-occurrence.
- Relevance Feedback
  - Users are the best judges of whether a document is relevant or non-relevant.
- Pseudo-Feedback
  - Relevance feedback applied to the top few retrieved documents, which are assumed to be relevant, to generate a new query.
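
A deliberately simplified sketch of pseudo-feedback is shown below: the top few documents of an initial ranking are treated as relevant and their most frequent new terms are appended to the query. Selecting terms by raw count is an assumption made to keep the example short (practical systems usually weight candidate terms), and `expand_query`, `top_k`, and `num_new_terms` are hypothetical names.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, num_new_terms=2):
    """Add the most frequent unseen terms from the top_k documents to the query."""
    pool = Counter()
    for text in ranked_docs[:top_k]:
        pool.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending count, then alphabetically, so ties break deterministically.
    candidates = sorted(pool.items(), key=lambda kv: (-kv[1], kv[0]))
    new_terms = [t for t, _ in candidates[:num_new_terms]]
    return list(query_terms) + new_terms


# Hypothetical initial ranking for the query "jaguar", best document first.
ranked_docs = [
    "jaguar car engine speed",
    "jaguar car dealer",
    "jaguar cat habitat",
]
expanded = expand_query(["jaguar"], ranked_docs)
```

The third document is ignored because only the top two are assumed relevant, so the expanded query drifts toward the sense that dominates the top of the ranking.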


Other Techniques and Applications

- Techniques
  - Cluster Hypothesis
    - Documents that are very similar to each other will have a similar relevance profile for a given query.
    - Limited success
    - Aided in the development of browsing and searching interfaces
  - Natural Language Processing
    - Limited success
- Applications
  - Information Filtering
  - Topic Detection and Tracking (TDT)
  - Speech Retrieval
  - Cross Language Retrieval
  - Question Answering
  - ...


References

- Modern Information Retrieval: A Brief Overview - Amit Singhal
- Information Retrieval - C.J. van Rijsbergen
- Dr. Gautam Das's Lecture Notes