Ranking in IR and WWW

  • Modern Information Retrieval: A Brief Overview

    --Amit Singhal

    Presented by Parin Sangoi

    February 15, 2005

  • Outline

    - Introduction
    - History
    - Models and Implementation
      - Vector Space Model
      - Probabilistic Models
      - Inference Network Model
      - Implementation
    - Evaluation

  • Outline (contd.)

    - Key Techniques
      - Term Weighting
      - Query Modification
    - Other Techniques and Applications
    - References

  • Introduction

    What is Information Retrieval?

    "An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request."
    -F.W. Lancaster

  • Introduction (contd.)

    [Figure: A typical IR system]

  • History

    - Vannevar Bush (1945)
      - "As We May Think" (gave birth to the idea of automatic access)
    - H.P. Luhn (1957)
      - Indexing units for documents, and measuring word overlap as a criterion for retrieval.
    - Gerard Salton and students
      - SMART system (improved search quality)
    - Cyril Cleverdon
      - Cranfield evaluations, whose methodology is still in use in IR systems.

  • History (contd.)

    - 1970s and 1980s
      - Many developments built on the advances of the 1960s.
      - Many retrieval models were developed.
      - These were shown to be effective on small text collections.
    - 1992
      - Text Retrieval Conference (TREC)
      - Aims at encouraging research in IR on large text collections.
      - Old techniques were modified and new techniques developed.

  • Models and Implementation

    Boolean systems
    - Queries combine terms with ANDs, ORs, and NOTs.
    - There is no ranking, and it is difficult for a user to form a good search request.
    - Many users still use Boolean systems because they feel more in control of the retrieval process.
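A minimal sketch of Boolean retrieval, assuming a toy three-document collection and a hypothetical helper `docs_with` (neither comes from the slides): AND, OR, and NOT map onto set intersection, union, and difference. The result is an unranked set of document ids, which is the limitation the slide points out.

```python
# Toy collection; the contents are illustrative only.
docs = {
    1: "ranking in information retrieval",
    2: "boolean retrieval systems",
    3: "ranking web pages",
}

def docs_with(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

all_ids = set(docs)
and_result = docs_with("ranking") & docs_with("retrieval")   # ranking AND retrieval
or_result = docs_with("ranking") | docs_with("boolean")      # ranking OR boolean
not_result = all_ids - docs_with("ranking")                  # NOT ranking
```

Every matching document is returned on equal footing; nothing in the sets says which match is better.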

  • Models and Implementation (contd.)

    Vector Space Model
    - Text is represented by a vector of terms.
    - If words are chosen as terms, every word in the vocabulary becomes an independent dimension of the vector space.
    - A text vector has a non-zero value in a term's dimension if that term occurs in the text.
    - Text vectors are very sparse, since a vocabulary can contain millions of terms while any one text uses only a few of them.
    - The similarity between the query vector and the document vector is used to assign a numeric score to a document for a query.

  • Vector Space Model (contd.)

    - The angle between the two vectors measures their divergence, and the cosine of that angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors).
    - Alternatively, the dot product of the two vectors can be used to measure similarity. If all the vectors are of unit length, the cosine of the angle equals the dot product.
    - If $V_q$ is the query vector and $V_d$ is the document vector, then the similarity of document D to query q is

      $Sim(V_d, V_q) = \sum_i W_{t_i}(V_q) \cdot W_{t_i}(V_d)$

      where $W_{t_i}(V_q)$ is the i-th component of the query vector $V_q$ and $W_{t_i}(V_d)$ is the i-th component of the document vector $V_d$.

    [Figure: the vectors V_d and V_q]
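The cosine and dot-product similarities can be sketched as follows. This is an illustrative sketch, not the presenter's code: storing a sparse vector as a `{term: weight}` dict is an assumption, and the names `dot`, `cosine`, `query`, `doc_a`, and `doc_b` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

query = {"ranking": 1.0, "retrieval": 1.0}
doc_a = {"ranking": 2.0, "retrieval": 2.0}   # same direction as the query
doc_b = {"speech": 1.0}                      # shares no term with the query
```

Only terms present in both dicts contribute to the sum, so the cost depends on the number of non-zero entries rather than on the vocabulary size.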

  • Probabilistic Models

    - One of the main principles of an IR system is that its output should be ranked. The probabilistic model ranks documents by decreasing probability of their relevance to a query.
    - Let $P(R|D)$ be the probability of relevance of document D. Because the ranking criterion is monotonic under the log-odds transformation, we can rank documents by $\log \frac{P(R|D)}{P(\bar{R}|D)}$, where $P(\bar{R}|D)$ is the probability that the document is non-relevant.
    - Applying Bayes' theorem to this ratio, we get

      $\log \frac{P(D|R) \, P(R)}{P(D|\bar{R}) \, P(\bar{R})}$
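The Bayes step can be written out explicitly. The only ingredient beyond the slide is the document prior $P(D)$, which appears in both the numerator and the denominator and cancels; $\bar{R}$ denotes non-relevance.

```latex
\[
\log\frac{P(R\mid D)}{P(\bar{R}\mid D)}
= \log\frac{P(D\mid R)\,P(R)/P(D)}{P(D\mid \bar{R})\,P(\bar{R})/P(D)}
= \log\frac{P(D\mid R)\,P(R)}{P(D\mid \bar{R})\,P(\bar{R})}
\]
```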

  • Probabilistic Model (contd.)

    - Assuming $P(R)$ is independent of the document under consideration and is thus constant, $P(R)$ and $P(\bar{R})$ are just scaling factors and can be eliminated. The formula above then simplifies to $\log \frac{P(D|R)}{P(D|\bar{R})}$.
    - Independence assumption
    - If $p_i$ denotes $P(t_i|R)$ and $q_i$ denotes $P(t_i|\bar{R})$, the log formula above reduces to:
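The formula for this slide did not survive extraction. As a hedged reconstruction following the derivation in Singhal's paper, and assuming terms occur independently of one another in both relevant and non-relevant documents, the log ratio splits into a sum over query terms present in document D and a sum over query terms absent from it. Dropping a part that depends only on the query Q leaves a sum over the shared terms (the symbol over the last equals sign means "gives the same ranking as"):

```latex
\[
\log\frac{P(D\mid R)}{P(D\mid \bar{R})}
= \sum_{t_i \in Q \cap D} \log\frac{p_i}{q_i}
+ \sum_{t_j \in Q \setminus D} \log\frac{1-p_j}{1-q_j}
\;\overset{\text{rank}}{=}\;
\sum_{t_i \in Q \cap D} \log\frac{p_i\,(1-q_i)}{q_i\,(1-p_i)}
\]
```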

  • Probabilistic Model (contd.)

    - Croft and Harper assume that $p_i$ is the same for every query term, so that $p_i / (1 - p_i)$ is a constant and can be ignored.
    - They also assume that almost all documents in a collection are non-relevant to a query (as collections are very large), and estimate $q_i$ by $n_i / N$, where N is the collection size and $n_i$ is the number of documents containing term i.
    - This gives a scoring function

      $\sum_{t_i \in Q \cap D} \log \frac{N - n_i}{n_i}$

      which is similar to the IDF function.
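A small sketch of this IDF-like score, summing log((N - n_i) / n_i) over the query terms that occur in a document. The function name `idf_like_score` and the document-frequency numbers are hypothetical; the sketch assumes 0 < n_i < N for every scored term.

```python
import math

def idf_like_score(query_terms, doc_terms, doc_freq, n_docs):
    """Sum log((N - n_i) / n_i) over query terms that occur in the document."""
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        n_i = doc_freq[term]                 # number of documents containing the term
        score += math.log((n_docs - n_i) / n_i)
    return score

# Hypothetical statistics for a collection of N = 1000 documents.
doc_freq = {"retrieval": 10, "the": 900}
doc = ["retrieval", "the"]
rare = idf_like_score(["retrieval"], doc, doc_freq, 1000)
common = idf_like_score(["the"], doc, doc_freq, 1000)
```

A rare term contributes a large positive amount, while a term found in more than half the collection contributes a negative one, which is the same intuition as IDF.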

  • Inference Network Model

    - In the simplest implementation, a document instantiates a term with a certain strength and, given a query, the credit from multiple terms is accumulated to compute the equivalent of a numeric score for the document.
    - If the strength is taken to be the weight of the term in the document, the ranking is similar to that of the Vector Space Model or the Probabilistic Model.
    - The model itself does not define the strength of instantiation; any formulation can be used.

  • Implementation

    - Inverted list data structure
      - Gives fast access to the list of documents that contain a term, along with some additional information (weight, relative position, etc.).
    - Inverted index
    - Stop words (ignored)
      - the, in, of, a, ...
    - Stemming
      - retrieval, retrieve, retrieved, retrieving, retriever
      - Poor stemming returns wrong documents.
      - Is it good enough?
    - Multi-word phrases
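The inverted list idea can be sketched as a dictionary from each term to the documents containing it, with term positions as the additional information. This is an illustrative sketch: the names `build_inverted_index` and `STOP_WORDS` and the two sample documents are assumptions, tokenization is a plain whitespace split, and stemming is left out.

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "of", "a"}   # the stop words listed on the slide

def build_inverted_index(docs):
    """Map each non-stop-word term to {doc_id: [positions]} (its postings)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            if term in STOP_WORDS:
                continue
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "the retrieval of documents",
    2: "ranking in retrieval",
}
index = build_inverted_index(docs)
```

Looking up `index["retrieval"]` touches only the documents that contain the term, which is what makes scoring fast on large collections.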

  • Evaluation

    Measurable quantities:
    1. The coverage of the collection, that is, the extent to which the system includes relevant matter;
    2. The time lag, that is, the average interval between the time the search request is made and the time an answer is given;
    3. The form of presentation of the output;
    4. The effort involved on the part of the user in obtaining answers to his search requests;
    5. The recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request;
    6. The precision of the system, that is, the proportion of retrieved material that is actually relevant.

    Of these, the last two measure the effectiveness of the retrieval system.

  • Precision and Recall

    Contingency table

  • Precision and Recall (contd.)

    - Recall is the proportion of relevant documents that are retrieved by the system.
    - Precision is the proportion of retrieved documents that are relevant.
    - Fallout is the proportion of non-relevant documents that are retrieved by the system.
    - A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents).
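The three measures can be computed directly from sets of document ids. In this sketch the function name `evaluate` and the ten-document example are hypothetical, and the retrieved, relevant, and non-relevant sets are all assumed to be non-empty.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, fallout) for sets of document ids."""
    hits = retrieved & relevant              # relevant documents that were retrieved
    non_relevant = collection - relevant
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    fallout = len(retrieved - relevant) / len(non_relevant)
    return precision, recall, fallout

collection = set(range(1, 11))   # 10 documents in total
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}
precision, recall, fallout = evaluate(retrieved, relevant, collection)
```

Here two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six non-relevant documents was retrieved (fallout 1/6).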

  • Precision and Recall (contd.)

    - Unfortunately, the two goals are quite contradictory: techniques that improve recall tend to hurt precision, and vice versa.
    - Average precision
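The slide gives no formula for average precision. One common definition, assumed here, is the non-interpolated form: take the precision at each rank where a relevant document appears and divide the sum by the total number of relevant documents. The function name and the sample ranking are hypothetical.

```python
def average_precision(ranking, relevant):
    """Sum precision at each rank holding a relevant document, then divide
    by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank     # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

ap = average_precision(["d1", "d7", "d2", "d9"], {"d1", "d2"})
```

In the example the relevant documents sit at ranks 1 and 3, so the value is (1/1 + 2/3) / 2. Relevant documents that are never retrieved add nothing to the sum but still count in the divisor, so the single number rewards both precision and recall.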

  • Key Techniques

    Term Weighting
    - Both the Probabilistic Model and the Vector Space Model need a weight function to determine the ranked relevance.
    - Three main factors affect the weight formulation:
      - Term frequency (tf): words that repeat multiple times in a document are considered salient.
      - Document frequency (idf): words that appear in many documents are considered common and are not very indicative of document content. A weighting method based on this is called inverse document frequency (idf) weighting.
      - Document length: when collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated for by normalizing for document length in the term weighting method.
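The three factors can be combined in a few lines. This sketch makes its own choices, none of which come from the slides: raw tf, idf computed as log(N / df), and cosine (unit-length) normalization as the document-length correction. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Cosine normalization removes the advantage of longer documents.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors

docs = [
    "ranking ranking web".split(),
    "web retrieval".split(),
    "web speech speech speech speech".split(),
]
vectors = tfidf_vectors(docs)
```

The word "web" occurs in every document, so its idf is log(1) = 0 and it carries no weight. After normalization the longest document's vector has the same length as the others.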

  • Term Weighting (contd.)

    - After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function) is a better weighting metric.
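One widely used logarithmic form is 1 + ln(tf). It is shown here as an assumed example of dampening rather than as the specific function the slide has in mind, and the name `dampened_tf` is hypothetical.

```python
import math

def dampened_tf(tf):
    """Logarithmic term frequency: 1 + ln(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

raw_counts = [1, 10, 100]
dampened = [dampened_tf(tf) for tf in raw_counts]
```

Going from 10 to 100 occurrences multiplies the raw count by ten but raises the dampened weight by less than a factor of two, so heavy repetition of one word cannot dominate a document's score.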

  • Query Modification

    - Synonyms
      - Earlier systems relied on a thesaurus.
      - Newer ones build their own thesauri by analyzing word co-occurrence.
    - Relevance feedback
      - Users are the best judges of whether a document is relevant or non-relevant to their query.
    - Pseudo-feedback
      - The top few retrieved documents are assumed to be relevant, and relevance feedback on them is used to generate a new query.
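A deliberately simplified sketch of pseudo-feedback: treat the top few ranked documents as relevant and add their most frequent new terms to the query. The function `expand_query`, its parameters, and the sample documents are hypothetical. Real systems typically also reweight terms (Rocchio-style feedback, for example) rather than only appending them.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Add the most frequent non-query terms from the top_k ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:          # assume these documents are relevant
        counts.update(t for t in doc if t not in query_terms)
    new_terms = [t for t, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

ranked_docs = [
    "jaguar car engine car".split(),
    "jaguar car dealer".split(),
    "jaguar cat habitat".split(),            # below the cutoff, ignored
]
new_query = expand_query(["jaguar"], ranked_docs, top_k=2, n_new_terms=1)
```

The expanded query is only as good as the initial ranking: if the top documents are off-topic, the added terms pull the query further off-topic.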

  • Other Techniques and Applications

    - Techniques
      - Cluster hypothesis
        - Documents that are very similar to each other will have a similar relevance profile for a given query.
        - Limited success
        - Aided in the development of browsing and searching interfaces
      - Natural language processing
        - Limited success
    - Applications
      - Information filtering
      - Topic Detection and Tracking (TDT)
      - Speech retrieval
      - Cross-language retrieval
      - Question answering
      - ...

  • References

    - Amit Singhal, "Modern Information Retrieval: A Brief Overview"
    - C.J. van Rijsbergen, "Information Retrieval"
    - Dr. Gautam Das's lecture notes