Information Retrieval - IIT Bombay (cs621-2011/IR_basics_lec28_oct_3_2011.pdf)

Information Retrieval

Yogesh Kakde

29th January, 2011

Outline

1 What is Information Retrieval

2 Basic Components in a Web-IR system

3 Theoretical Models Of IR
    Boolean Model
    Vector Model
    Probabilistic Model


What is Information Retrieval

Information Retrieval (IR) means searching for relevant documents and information within the contents of a specific data set such as the World Wide Web.


Basic Components in a Web-IR system

Crawling: The system browses the document collection and fetches documents.

Indexing: The system builds an index of the documents fetched during crawling.

Ranking: The system retrieves documents that are relevant to the query from the index and displays them to the user.

Relevance Feedback: The initial results returned from a given query may be used to refine the query itself.


Terminologies Used

Inverted Index: An index that maps back from terms to the documents where they occur.

D1 : The GDP increased 2 percent this quarter .

D2 : The Spring economic slowdown continued to spring downwards this quarter .
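The two example documents above can be turned into an inverted index in a few lines. This is a minimal sketch, assuming lowercasing and whitespace tokenization (the slides do not specify either); `build_inverted_index` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

docs = {
    "D1": "The GDP increased 2 percent this quarter .",
    "D2": "The Spring economic slowdown continued to spring downwards this quarter .",
}

def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token.isalnum():  # skip bare punctuation such as "."
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["quarter"])  # ['D1', 'D2']
print(index["gdp"])      # ['D1']
print(index["spring"])   # ['D2']
```

Each posting list is a set while it is being built, so "Spring" and "spring" in D2 contribute a single entry.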


A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Example: The GDP increased 2 percent this quarter.
Tokens: The, GDP, increased, 2, percent, this, quarter

A type is the class of all tokens containing the same character sequence.
Example: To be or not to be.
Types: to, be, or, not

A term is a type that is included in the IR system’s dictionary.
Example: The GDP increased 2 percent this quarter.
Terms: GDP, increased, 2, percent, quarter

Stop words: some extremely common words which are of little value in selecting documents matching a user need.
Example: The GDP increased 2 percent this quarter.
Stop words: The, this
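The distinction between tokens, types, and terms can be sketched in code. This is only an illustration under stated assumptions: types are case-folded to lowercase, the stop list contains just the two stop words from the example, and the function names are hypothetical.

```python
STOP_WORDS = {"the", "this"}  # assumption: tiny stop list taken from the example

def tokenize(text):
    """Split on whitespace after dropping periods; every occurrence is a token."""
    return text.replace(".", " ").split()

def to_types(tokens):
    """Collapse tokens with the same (case-folded) character sequence, keeping order."""
    return list(dict.fromkeys(tok.lower() for tok in tokens))

def to_terms(types_, stop_words=STOP_WORDS):
    """Keep only the types that the dictionary retains, i.e. drop stop words."""
    return [t for t in types_ if t not in stop_words]

sentence = "The GDP increased 2 percent this quarter."
print(tokenize(sentence))
# ['The', 'GDP', 'increased', '2', 'percent', 'this', 'quarter']
print(to_types(tokenize("To be or not to be.")))
# ['to', 'be', 'or', 'not']
print(to_terms(to_types(tokenize(sentence))))
# ['gdp', 'increased', '2', 'percent', 'quarter']
```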


Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers can be categorised as

Rule-based stemmers

Co-occurrence based stemmers

Lemma: the canonical form, dictionary form, or citation form of a set of words.

Lemmatization is the algorithmic process of determining the lemma for a given word with the use of a vocabulary and morphological analysis.
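To make the idea of a rule-based stemmer concrete, here is a toy suffix-stripping sketch. The rule list and the minimum stem length of 3 are assumptions chosen for illustration; this is not the Porter stemmer or any other published algorithm, and unlike a lemmatizer it uses no vocabulary.

```python
# Assumed, deliberately tiny rule set: (suffix, replacement), checked in order.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule that leaves a stem of at least 3 characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(toy_stem("increased"))   # increas
print(toy_stem("documents"))   # document
print(toy_stem("retrieving"))  # retriev
```

The output stems need not be dictionary words, which is the practical difference from a lemma.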


Logical View Of Document

[Figure: logical view of a document]


Measures Of Accuracy

[Figure: measures of accuracy]


A Formal Characterization of IR Models

An information retrieval model is a quadruple

\[ \{D, Q, F, R(q_i, d_j)\} \]

where

D is a set composed of logical views for the documents in the collection.

Q is a set composed of logical views for the user information needs.

F is a framework for modeling document representations, queries, and their relationships.

$R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i$ belonging to Q and a document representation $d_j$ belonging to D. Such a ranking defines an ordering among the documents with regard to the query $q_i$.


Boolean Model

The model is based on Boolean logic and classical set theory.

Documents and the query are conceived as sets of terms.

Retrieval is based on whether or not the documents contain the query terms.

The queries are specified as Boolean expressions, which have precise semantics.

Advantages:

Formalism and simplicity

Very precise

Disadvantages:

No notion of partial match

Boolean expressions have precise semantics (not simple from the user's point of view)

Not ranked

The model does not use term weights and frequency

Given a large set of documents, the Boolean model retrieves either too many documents or very few documents.
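Since documents and queries are sets of terms, Boolean retrieval maps directly onto set operations. The sketch below assumes two small hand-made term sets (loosely based on the earlier example sentences) and a hypothetical `postings` helper; it shows that a document is either in the answer set or not, with no ranking.

```python
# Assumed toy collection: each document is just a set of terms.
docs = {
    "D1": {"gdp", "increased", "percent", "quarter"},
    "D2": {"spring", "economic", "slowdown", "continued", "downwards", "quarter"},
}
all_docs = set(docs)

def postings(term):
    """Return the set of document ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# quarter AND NOT gdp: intersection with the complement
and_not = sorted(postings("quarter") & (all_docs - postings("gdp")))
# gdp OR slowdown: union
either = sorted(postings("gdp") | postings("slowdown"))
# gdp AND slowdown: intersection (no document matches both)
both = sorted(postings("gdp") & postings("slowdown"))

print(and_not)  # ['D2']
print(either)   # ['D1', 'D2']
print(both)     # []
```

The empty result for the last query illustrates the "no partial match" drawback: D1 and D2 each satisfy half of the conjunction, yet neither is returned.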


Vector Model

The use of binary weights is too limiting, so the vector model assigns non-binary weights to index terms in queries and documents.

Document vector $\bar{d}_j = \{w_{1j}, w_{2j}, \dots, w_{tj}\}$

Query vector $\bar{q}_i = \{w_{1i}, w_{2i}, \dots, w_{ti}\}$

where t is the total number of index terms in the collection of documents.

Term Frequency (tf): The raw term frequency $\mathrm{freq}_{i,j}$ is the count of term i in document j. It is normalized by the maximum frequency of any term in the document, which brings the term frequency into the range 0 to 1:

\[ tf_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l (\mathrm{freq}_{l,j})} \]


Inverse Document Frequency: This factor is motivated by the fact that if a term appears in all documents in a set, then it loses its distinguishing power. The inverse document frequency (idf) of a term i is given by:

\[ idf_i = \log \frac{N}{n_i} \]

where

$n_i$ is the number of documents in which the term i occurs

N is the total number of documents

The term weights can be approximated by the product of term frequency (tf) and inverse document frequency (idf). This is called the tf-idf measure.
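The tf-idf weight can be computed directly from the two definitions (tf normalized by the maximum term frequency in the document, idf as the log of N over the document frequency). The sketch below makes a few assumptions not stated in the slides: the natural logarithm is used, a term that occurs in no document gets idf 0, and the term lists are a hand-made toy collection.

```python
import math
from collections import Counter

# Assumed toy collection: documents as lists of already-extracted terms.
docs = {
    "D1": ["gdp", "increased", "2", "percent", "quarter"],
    "D2": ["spring", "economic", "slowdown", "continued", "spring", "downwards", "quarter"],
}

def tf(term, doc_terms):
    """Raw count of `term`, normalized by the largest count of any term in the document."""
    counts = Counter(doc_terms)
    return counts[term] / max(counts.values())

def idf(term, docs):
    """log(N / n_i); returns 0.0 for a term that occurs nowhere (an assumption)."""
    n_i = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / n_i) if n_i else 0.0

def tf_idf(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)

print(tf("spring", docs["D2"]))                   # 1.0
print(tf("quarter", docs["D2"]))                  # 0.5
print(idf("quarter", docs))                       # 0.0
print(round(tf_idf("spring", "D2", docs), 3))     # 0.693
```

"quarter" occurs in every document, so its idf (and therefore its weight) is zero, which is exactly the loss of distinguishing power that motivates the idf factor.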


Similarity measure between two vectors

Cosine Similarity: The cosine similarity between two vectors $\bar{d}_j$ (the document vector) and $\bar{q}_i$ (the query vector) is given by:

\[ similarity(\bar{d}_j, \bar{q}_i) = \cos\theta = \frac{\bar{d}_j \cdot \bar{q}_i}{|\bar{d}_j| \, |\bar{q}_i|} \]
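The cosine formula is short enough to write out in plain Python. In this sketch the vectors are assumed to be equal-length lists of weights, and returning 0.0 for a zero-length vector is a convention chosen here to avoid division by zero, not something the slides define.

```python
import math

def cosine_similarity(d, q):
    """Dot product of d and q divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_d * norm_q)

print(round(cosine_similarity([3, 4], [4, 3]), 2))  # 0.96
print(cosine_similarity([1, 0], [0, 1]))            # 0.0
print(cosine_similarity([2, 0], [5, 0]))            # 1.0
```

The last call shows that only direction matters: vectors of different lengths pointing the same way still score 1.0.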

Advantages

Because term weights are non-negative, the cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.

Ranking of the retrieved results according to the cosine similarity score is possible.

Disadvantages

Index terms are considered to be mutually independent. Thus, this model does not capture the semantics of the query or the document.

It cannot express the “clear logic view” that the Boolean model offers.


Probabilistic Model

The document is retrieved according to the probability of the document being relevant to the query. Mathematically, the scoring function is given by

\[ P(R = 1 \mid \bar{d}, \bar{q}) \]

The document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant:

\[ P(R = 1 \mid \bar{d}, \bar{q}) > P(R = 0 \mid \bar{d}, \bar{q}) \]


Binary Independence Model

Assumptions

Each index term is Boolean in nature (1 if present, 0 if absent).

There is no association between index terms.

The relevance of each document is calculated independently of the relevance of other documents.

Documents and queries are both represented as binary term incidence vectors over the M index terms.

\[ \bar{d}_j = (x_{1j}, x_{2j}, \dots, x_{Mj}) \]

[$x_{tj}$ is 1 if index term t is present in document $d_j$; otherwise it is 0.]

\[ \bar{q} = (q_1, q_2, \dots, q_M) \]


The odds of relevance is used as the ranking function because

it is monotonic with respect to the probability of relevance

it reduces the computation

\[ \text{odds of relevance} = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)} \]

Ranking function rf():

\[ rf() = \frac{P(R \mid \bar{d}_j)}{P(\bar{R} \mid \bar{d}_j)} \]

where

$P(R \mid \bar{d}_j)$ is the probability that document $d_j$ is relevant to q.

$P(\bar{R} \mid \bar{d}_j)$ is the probability that document $d_j$ is not relevant to q.


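By using Bayes' rule we get

\[ rf() = \frac{P(\bar{d}_j \mid R) \, P(R)}{P(\bar{d}_j \mid \bar{R}) \, P(\bar{R})} \]

Since $P(R)$ and $P(\bar{R})$ are the same for each document, eliminating them from the above equation won't change the ranking of the documents.

\[
\begin{aligned}
rf() &= \frac{P(\bar{d}_j \mid R)}{P(\bar{d}_j \mid \bar{R})} \\
&= \frac{P(x_1 \mid R) \, P(x_2 \mid x_1, R) \cdots P(x_M \mid x_1 \dots x_{M-1}, R)}{P(x_1 \mid \bar{R}) \, P(x_2 \mid x_1, \bar{R}) \cdots P(x_M \mid x_1 \dots x_{M-1}, \bar{R})} \\
&= \prod_{t=1}^{M} \frac{P(x_t \mid x_1 \dots x_{t-1}, R)}{P(x_t \mid x_1 \dots x_{t-1}, \bar{R})}
\end{aligned}
\tag{1}
\]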


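Applying the Naive Bayes assumption, which states that all index terms $x_t$ are independent, we get

\[ rf() = \prod_{t=1}^{M} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \]

Since each $x_t$ is either 0 or 1 under the Binary Independence Model, we can separate the terms to give

\[ rf() = \prod_{t : x_t = 1} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \cdot \prod_{t : x_t = 0} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \]

Now let $p_t = P(x_t = 1 \mid R)$ be the probability of a term $x_t$ appearing in a relevant document and $u_t = P(x_t = 1 \mid \bar{R})$ be the probability that a term $x_t$ appears in a non-relevant document. Then $P(x_t = 0 \mid R) = 1 - p_t$ and $P(x_t = 0 \mid \bar{R}) = 1 - u_t$, so

\[ rf() = \prod_{t : x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t : x_t = 0} \frac{1 - p_t}{1 - u_t} \]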


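Multiply the above equation by

\[ \prod_{t : x_t = 1} \frac{1 - p_t}{1 - u_t} \]

and its reciprocal

\[ \prod_{t : x_t = 1} \frac{1 - u_t}{1 - p_t} \]

Regrouping the factors gives

\[ rf() = \prod_{t : x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t : x_t = 1 \lor 0} \frac{1 - p_t}{1 - u_t} \]

The second product runs over all index terms, whether present or absent, so it is independent of the values $x_t$ and is the same for every document. Hence it can be ignored and we get

\[ rf() = \prod_{t : x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \]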


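Taking the logarithm of the equation we get

\[
\begin{aligned}
rf() &= \sum_{t : x_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \\
&= \sum_{t : x_t = 1} \left( \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} \right) \\
&= \sum_{t : x_t = 1} c_t
\end{aligned}
\tag{2}
\]

$c_t$ is the log odds ratio for the terms in the query.

\[ c_t = \log \frac{\text{odds of term being present in a relevant document}}{\text{odds of term being present in a non-relevant document}} \]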


Equation (2) gives the formal scoring function of the probabilistic information retrieval model. Now $c_t$ needs to be approximated.

Advantages

Documents are ranked in decreasing order of their probability of being relevant

Disadvantages

The need to guess the initial separation of documents into relevant and non-relevant sets.

All weights are binary

Index terms are assumed to be independent


Estimation of $c_t$

When there are no retrieved documents we can assume

$p_t$ is constant for all index terms $x_t$.

The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.

Thus,

\[ p_t = 0.5, \qquad u_t = \frac{n_t}{N} \]

where $n_t$ is the number of documents which contain the index term $x_t$ and N is the total number of documents in the collection. Putting in these values we get

\[ c_t = \log \frac{0.5}{1 - 0.5} + \log \frac{1 - \frac{n_t}{N}}{\frac{n_t}{N}} = \log \frac{N - n_t}{n_t} \tag{3} \]
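The initial ranking of equations (2) and (3) can be sketched as follows: each term contributes log((N - n) / n), where n is its document frequency, and a document's score is the sum over the terms it contains. Restricting the sum to query terms, the natural logarithm, and the document frequencies below are all assumptions made for the illustration; the function names are hypothetical.

```python
import math

def initial_ct(n_t, N):
    """Equation (3): log((N - n_t) / n_t). Undefined unless 0 < n_t < N."""
    if not 0 < n_t < N:
        raise ValueError("n_t must satisfy 0 < n_t < N")
    return math.log((N - n_t) / n_t)

def bim_score(doc_terms, query_terms, doc_freq, N):
    """Equation (2): sum c_t over query terms that are present in the document."""
    return sum(
        initial_ct(doc_freq[t], N)
        for t in query_terms
        if t in doc_terms and t in doc_freq
    )

# Made-up collection statistics: 1000 documents, "gdp" is rare, "quarter" is common.
N = 1000
doc_freq = {"gdp": 10, "quarter": 500}
score = bim_score({"gdp", "quarter", "percent"}, ["gdp", "quarter"], doc_freq, N)
print(round(score, 3))  # 4.595
```

With these numbers the whole score comes from "gdp": a term found in exactly half the collection has c_t = log(1) = 0 and does not affect the ranking.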


This gives the initial ranking. An improved ranking can be obtained by the following procedure. We assume

We can approximate $p_t$ by the distribution of the index term $x_t$ among the documents initially retrieved.

We can approximate $u_t$ by considering that the non-retrieved documents are not relevant.

We can write,

\[ p_t = \frac{V_t}{V}, \qquad u_t = \frac{n_t - V_t}{N - V} \tag{4} \]

where,

V is a subset of the documents initially retrieved and ranked by the probabilistic model.

$V_t$ is the subset of V containing index term $x_t$.

In equation (4), V and $V_t$ also stand for the number of documents in these sets.


For small values of $V_t$ and V these estimates create problems, so we add an adjustment factor.

\[ p_t = \frac{V_t + \frac{n_t}{N}}{V + 1}, \qquad u_t = \frac{n_t - V_t + \frac{n_t}{N}}{N - V + 1} \tag{5} \]
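The adjusted estimates of equation (5) can be plugged back into the term weight of equation (2) to re-rank after an initial retrieval. In this sketch, V is the number of documents retrieved, V_t the number of those containing the term, n_t the term's document frequency, and N the collection size; the sample numbers and function names are assumptions for illustration only.

```python
import math

def feedback_estimates(V_t, V, n_t, N):
    """Equation (5): smoothed estimates of p_t and u_t from the retrieved set."""
    p_t = (V_t + n_t / N) / (V + 1)
    u_t = (n_t - V_t + n_t / N) / (N - V + 1)
    return p_t, u_t

def term_weight(p_t, u_t):
    """c_t from equation (2): log odds in relevant docs plus inverse log odds in non-relevant docs."""
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)

# Made-up numbers: 10 documents retrieved, 5 of them contain the term,
# which occurs in 10 of the 1000 documents overall.
p_t, u_t = feedback_estimates(V_t=5, V=10, n_t=10, N=1000)
c_t = term_weight(p_t, u_t)
print(0 < p_t < 1 and 0 < u_t < 1)  # True

# With an empty retrieved set the estimates stay strictly between 0 and 1,
# so the logarithms remain defined.
p0, u0 = feedback_estimates(V_t=0, V=0, n_t=10, N=1000)
print(0 < p0 < 1 and 0 < u0 < 1)    # True
```

The smoothing term n_t / N is what keeps both estimates away from 0 and 1, where the logarithms in the term weight would be undefined.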