Information Retrieval
Yogesh Kakde
29th January, 2011
What is Information Retrieval Basic Components in an Web-IR system Theoretical Models Of IR
Outline
1 What is Information Retrieval
2 Basic Components in a Web-IR system
3 Theoretical Models Of IR
  Boolean Model
  Vector Model
  Probabilistic Model
What is Information Retrieval
Information Retrieval (IR) means searching for relevant documents and information within the contents of a specific data set such as the World Wide Web.
Basic Components in a Web-IR system
Crawling: The system browses the document collection and fetches documents.
Indexing: The system builds an index of the documents fetched during crawling.
Ranking: The system retrieves the documents that are relevant to the query from the index and displays them to the user.
Relevance Feedback: The initial results returned for a given query may be used to refine the query itself.
Terminologies Used
Inverted Index: An index that maps back from terms to the documents where they occur.
D1 : The GDP increased 2 percent this quarter .
D2 : The Spring economic slowdown continued to spring downwards this quarter .
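As a minimal sketch of the idea, an inverted index over D1 and D2 can be built as below. The lowercasing, the whitespace tokenization, and the function name `build_inverted_index` are assumptions made for illustration; they are not part of the original slides.

```python
from collections import defaultdict

# The two example documents D1 and D2 from the slide.
docs = {
    "D1": "The GDP increased 2 percent this quarter .",
    "D2": "The Spring economic slowdown continued to spring downwards this quarter .",
}

def build_inverted_index(collection):
    """Map each lowercased token to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for token in text.lower().split():
            if token.isalnum():  # skip bare punctuation such as "."
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted_index = build_inverted_index(docs)
print(inverted_index["quarter"])  # ['D1', 'D2']
print(inverted_index["gdp"])      # ['D1']
print(inverted_index["spring"])   # ['D2']
```

Each entry is the posting list of a term: looking up a query term returns the documents that contain it without scanning the whole collection.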
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Example: The GDP increased 2 percent this quarter.
Tokens: The, GDP, increased, 2, percent, this, quarter
A type is the class of all tokens containing the same character sequence.
Example: To be or not to be.
Types: to, be, or, not
A term is a type that is included in the IR system's dictionary.
Example: The GDP increased 2 percent this quarter.
Terms: GDP, increased, 2, percent, quarter
Stop words are extremely common words that are of little value in selecting documents matching a user need.
Example: The GDP increased 2 percent this quarter.
Stop words: The, this
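The distinctions above can be sketched in a few lines of Python. The regular expression, the two-word stop list, and the helper names are assumptions chosen only to reproduce the slide's examples; a real system would use a proper tokenizer and a much larger stop list.

```python
import re

# Illustrative stop list containing only the stop words from the example.
STOP_WORDS = {"the", "this"}

def tokenize(text):
    """Split text into tokens: maximal runs of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def types_of(tokens):
    """Distinct case-folded tokens, in order of first appearance."""
    seen = []
    for tok in tokens:
        low = tok.lower()
        if low not in seen:
            seen.append(low)
    return seen

def terms_of(tokens):
    """Tokens that survive stop-word removal (a simplification of 'types kept in the dictionary')."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = tokenize("The GDP increased 2 percent this quarter.")
types = types_of(tokenize("To be or not to be."))
terms = terms_of(tokens)

print(tokens)  # ['The', 'GDP', 'increased', '2', 'percent', 'this', 'quarter']
print(types)   # ['to', 'be', 'or', 'not']
print(terms)   # ['GDP', 'increased', '2', 'percent', 'quarter']
```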
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers can be categorised as:
  Rule-based stemmers
  Co-occurrence based stemmers
Lemma: the canonical form, dictionary form, or citation form of a set of words.
Lemmatization is the algorithmic process of determining the lemma for a given word with the use of a vocabulary and morphological analysis.
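To make the notion of a rule-based stemmer concrete, here is a deliberately tiny sketch. The suffix rules and the name `toy_stem` are hypothetical and invented for this illustration; this is not the Porter stemmer or any other published algorithm, and it shows that a stem need not be a dictionary word.

```python
# Hypothetical suffix rules, tried in order; real rule-based stemmers use far richer rule sets.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided at least min_stem_len characters remain."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(toy_stem("continued"))  # continu
print(toy_stem("increased"))  # increas
print(toy_stem("documents"))  # document
print(toy_stem("GDP"))        # gdp
```

A lemmatizer, by contrast, would map "continued" to the dictionary form "continue", which requires a vocabulary and morphological analysis rather than suffix rules alone.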
Logical View Of Document
[Figure: logical view of a document; image not available.]
Measures Of Accuracy
[Figure: measures of accuracy; image not available.]
A Formal Characterization of IR Models
An information retrieval model is a quadruple
\[ \{D, Q, F, R(q_i, d_j)\} \]
where
D is a set composed of logical views for the documents in the collection.
Q is a set composed of logical views for the user information needs.
F is a framework for modeling document representations, queries, and their relationships.
$R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. This ranking defines an ordering among the documents with regard to the query $q_i$.
Boolean Model
The model is based on Boolean logic and classical set theory.
Documents and the query are conceived as sets of terms.
Retrieval is based on whether or not the documents contain the query terms.
The queries are specified as Boolean expressions, which have precise semantics.
Advantages:
  Formalism and simplicity
  Very precise
Disadvantages:
  No notion of partial match
  Boolean expressions have precise semantics, but they are not simple to formulate from the user's point of view
  Results are not ranked
  The model does not use term weights or term frequency
  Given a large set of documents, the Boolean model retrieves either too many documents or very few documents
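A minimal sketch of Boolean retrieval over sets of terms follows. The two documents are the D1 and D2 examples used earlier, lowercased with punctuation dropped; the helper `containing` and the sample queries are assumptions made for illustration.

```python
# Example documents (D1 and D2 from the earlier slide, lowercased, punctuation removed).
docs = {
    "D1": "the gdp increased 2 percent this quarter",
    "D2": "the spring economic slowdown continued to spring downwards this quarter",
}

# Each document is viewed as a set of terms.
doc_terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def containing(term):
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in doc_terms.items() if term in terms}

# AND, OR and AND NOT map directly onto set intersection, union and difference.
and_result = containing("quarter") & containing("gdp")
or_result = containing("gdp") | containing("slowdown")
not_result = containing("quarter") - containing("gdp")

print(sorted(and_result))  # ['D1']
print(sorted(or_result))   # ['D1', 'D2']
print(sorted(not_result))  # ['D2']
```

The results are plain sets: a document either satisfies the expression or it does not, which is why the model offers neither partial matching nor ranking.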
Vector Model
The use of binary weights is too limiting, so the vector model assigns non-binary weights to index terms in queries and documents.
Document vector: $\bar{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
Query vector: $\bar{q}_i = (w_{1i}, w_{2i}, \ldots, w_{ti})$
where t is the total number of index terms in the collection of documents.
Term Frequency (tf): The raw term frequency is simply the count of term i in document j. It is normalized by the maximum frequency of any term in the document, which brings the term frequency into the range 0 to 1:
\[ tf_{i,j} = \frac{freq_{i,j}}{\max_l \, freq_{l,j}} \]
Inverse Document Frequency: This factor is motivated by the fact that if a term appears in all documents in a set, then it loses its distinguishing power. The inverse document frequency (idf) of a term i is given by:
\[ idf_i = \log \frac{N}{n_i} \]
where
$n_i$ is the number of documents in which the term i occurs
N is the total number of documents
The term weights can be approximated by the product of term frequency (tf) and inverse document frequency (idf). This is called the tf-idf measure.
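The tf and idf definitions above can be combined as in the following sketch. The three-document collection (D3 is invented here so that the idf values are not all trivial), the natural logarithm, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

# Toy collection; D3 is a made-up document added for this example.
docs = {
    "D1": "the gdp increased 2 percent this quarter".split(),
    "D2": "the spring economic slowdown continued to spring downwards this quarter".split(),
    "D3": "gdp growth and the economic outlook".split(),
}

def tf(doc_tokens):
    """tf_{i,j} = freq_{i,j} / max_l freq_{l,j} for one document."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def idf(collection):
    """idf_i = log(N / n_i), where n_i is the number of documents containing term i."""
    n_docs = len(collection)
    doc_freq = Counter(term for tokens in collection.values() for term in set(tokens))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

def tf_idf(collection):
    """Weight of each term in each document as the product tf * idf."""
    idf_values = idf(collection)
    return {
        doc_id: {term: w * idf_values[term] for term, w in tf(tokens).items()}
        for doc_id, tokens in collection.items()
    }

weights = tf_idf(docs)
print(round(weights["D2"]["spring"], 3))   # 1.099
print(round(weights["D2"]["quarter"], 3))  # 0.203
print(weights["D1"]["the"])                # 0.0
```

The term "the" occurs in every document, so its idf and therefore its weight is zero, which matches the motivation given for idf.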
Similarity measure between two vectors
Cosine Similarity: The cosine similarity between two vectors $\bar{d}_j$ (the document vector) and $\bar{q}_i$ (the query vector) is given by:
\[ similarity(\bar{d}_j, \bar{q}_i) = \cos\theta = \frac{\bar{d}_j \cdot \bar{q}_i}{|\bar{d}_j| \, |\bar{q}_i|} \]
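As a sketch, cosine similarity over sparse term-weight vectors can be computed as below. Representing vectors as dictionaries, returning 0 for an all-zero vector, and the sample weights are assumptions made for this example.

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (doc_norm * query_norm)

# Made-up weights for two documents and a one-term query.
d1 = {"gdp": 1.0, "quarter": 1.0}
d2 = {"spring": 2.0, "quarter": 1.0}
q = {"gdp": 1.0}

scores = {"D1": cosine_similarity(d1, q), "D2": cosine_similarity(d2, q)}
ranking = sorted(scores, key=scores.get, reverse=True)

print(round(scores["D1"], 3))  # 0.707
print(scores["D2"])            # 0.0
print(ranking)                 # ['D1', 'D2']
```

D1 shares only one of its two terms with the query and still receives a non-zero score, which is the partial matching that the Boolean model lacks.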
Advantages
The cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.
The retrieved results can be ranked according to their cosine similarity score.
Disadvantages
Index terms are considered to be mutually independent. Thus, this model does not capture the semantics of the query or the document.
Unlike the Boolean model, it cannot express a clear logical view of the query.
Probabilistic Model
Documents are retrieved according to the probability of the document being relevant to the query. Mathematically, the scoring function is given by
\[ P(R = 1 \mid \bar{d}, \bar{q}) \]
The document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant:
\[ P(R = 1 \mid \bar{d}, \bar{q}) > P(R = 0 \mid \bar{d}, \bar{q}) \]
Binary Independence Model
Assumptions
Each index term is Boolean in nature (1 if present, 0 if absent).
There is no association between index terms.
The relevance of each document is calculated independently of the relevance of other documents.
Documents and queries are both represented as binary term incidence vectors:
\[ \bar{d}_j = (x_{1j}, x_{2j}, \ldots, x_{Mj}) \]
where $x_{tj}$ is 1 if term t is present in document $d_j$ and 0 otherwise, and
\[ \bar{q} = (q_1, q_2, \ldots, q_M) \]
The odds of relevance is used as the ranking function because
it is monotonic with respect to the probability of relevance
it reduces the computation
\[ \text{odds of relevance} = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)} \]
Ranking function rf():
\[ rf() = \frac{P(R \mid \bar{d}_j)}{P(\bar{R} \mid \bar{d}_j)} \]
where
$P(R \mid \bar{d}_j)$ is the probability that document $d_j$ is relevant to q.
$P(\bar{R} \mid \bar{d}_j)$ is the probability that document $d_j$ is not relevant to q.
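By using Bayes' rule we get
\[ rf() = \frac{P(\bar{d}_j \mid R) \, P(R)}{P(\bar{d}_j \mid \bar{R}) \, P(\bar{R})} \]
Since $P(R)$ and $P(\bar{R})$ are the same for every document, eliminating them from the above equation won't change the ranking of the documents:
\[
\begin{aligned}
rf() &= \frac{P(\bar{d}_j \mid R)}{P(\bar{d}_j \mid \bar{R})} \\
&= \frac{P(x_1 \mid R) \, P(x_2 \mid x_1, R) \cdots P(x_M \mid x_1 \ldots x_{M-1}, R)}{P(x_1 \mid \bar{R}) \, P(x_2 \mid x_1, \bar{R}) \cdots P(x_M \mid x_1 \ldots x_{M-1}, \bar{R})} \\
&= \prod_{t=1}^{M} \frac{P(x_t \mid x_1 \ldots x_{t-1}, R)}{P(x_t \mid x_1 \ldots x_{t-1}, \bar{R})}
\end{aligned}
\tag{1}
\]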
Applying the Naive Bayes assumption, which states that all index terms $x_t$ are independent, we get
\[ rf() = \prod_{t=1}^{M} \frac{P(x_t \mid R)}{P(x_t \mid \bar{R})} \]
Since each $x_t$ is either 0 or 1 under the Binary Independence Model, we can separate the terms to give
\[ rf() = \prod_{t: x_t = 1} \frac{P(x_t = 1 \mid R)}{P(x_t = 1 \mid \bar{R})} \cdot \prod_{t: x_t = 0} \frac{P(x_t = 0 \mid R)}{P(x_t = 0 \mid \bar{R})} \]
Now let $p_t = P(x_t = 1 \mid R)$ be the probability of a term $x_t$ appearing in a relevant document and $u_t = P(x_t = 1 \mid \bar{R})$ be the probability that a term $x_t$ appears in a non-relevant document. Then
\[ rf() = \prod_{t: x_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: x_t = 0} \frac{1 - p_t}{1 - u_t} \]
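Multiply the above equation by
\[ \prod_{t: x_t = 1} \frac{1 - p_t}{1 - u_t} \]
and its reciprocal
\[ \prod_{t: x_t = 1} \frac{1 - u_t}{1 - p_t} \]
which leaves its value unchanged and gives
\[ rf() = \prod_{t: x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t=1}^{M} \frac{1 - p_t}{1 - u_t} \]
The second product now runs over all index terms ($x_t = 1$ or $x_t = 0$), so it does not depend on the values $x_t$ and is the same for every document. Hence it can be ignored and we get
\[ rf() = \prod_{t: x_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \]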
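Taking the logarithm of this equation, which is monotonic and therefore preserves the ranking, and continuing to write rf() for the resulting score, we get
\[
\begin{aligned}
rf() &= \sum_{t: x_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \\
&= \sum_{t: x_t = 1} \left( \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} \right) \\
&= \sum_{t: x_t = 1} c_t
\end{aligned}
\tag{2}
\]
$c_t$ is the log odds ratio for the terms in the query:
\[ c_t = \log \frac{\text{odds of the term being present in a relevant document}}{\text{odds of the term being present in a non-relevant document}} \]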
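The scoring function can be sketched as follows. Summing only over query terms that are present in the document follows the remark that $c_t$ is defined for the terms in the query; that reading, the function names, and the probability values are assumptions made for illustration.

```python
import math

def term_weight(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) + log((1 - u_t) / u_t)."""
    return math.log(p_t / (1.0 - p_t)) + math.log((1.0 - u_t) / u_t)

def bim_score(doc_terms, query_terms, p, u):
    """Sum c_t over the query terms that are present in the document."""
    return sum(term_weight(p[t], u[t]) for t in query_terms if t in doc_terms)

# Made-up probabilities: "gdp" is far more likely in relevant documents,
# while "quarter" is equally likely in relevant and non-relevant ones.
p = {"gdp": 0.8, "quarter": 0.5}
u = {"gdp": 0.2, "quarter": 0.5}
query = {"gdp", "quarter"}

score_d1 = bim_score({"gdp", "quarter", "percent"}, query, p, u)
score_d2 = bim_score({"spring", "quarter"}, query, p, u)

print(round(score_d1, 3))  # 2.773
print(score_d2)            # 0.0
```

A term that is equally likely in relevant and non-relevant documents has $c_t = 0$ and contributes nothing to the score.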
Equation (2) gives the formal scoring function of the probabilistic information retrieval model. Now $c_t$ needs to be approximated.
Advantages
Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
The need to guess the initial separation of documents into relevant and non-relevant sets.
All weights are binary.
Index terms are assumed to be independent.
Estimation of $c_t$
When there are no retrieved documents we can assume:
$p_t$ is constant for all index terms $x_t$.
The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.
Thus,
\[ p_t = 0.5, \qquad u_t = \frac{n_t}{N} \]
where $n_t$ is the number of documents which contain the index term $x_t$ and N is the total number of documents in the collection. Putting in these values we get
\[ c_t = \log \frac{0.5}{1 - 0.5} + \log \frac{1 - \frac{n_t}{N}}{\frac{n_t}{N}} = \log \frac{N - n_t}{n_t} \tag{3} \]
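Equation (3) can be evaluated directly. The sketch below uses a hypothetical collection of N = 1000 documents and made-up document frequencies; the function name is an assumption for illustration.

```python
import math

def initial_term_weight(n_t, n_docs):
    """c_t = log((N - n_t) / n_t), obtained from p_t = 0.5 and u_t = n_t / N."""
    return math.log((n_docs - n_t) / n_t)

N = 1000  # hypothetical collection size
rare = initial_term_weight(10, N)       # term occurring in 10 documents
half = initial_term_weight(500, N)      # term occurring in half of the documents
frequent = initial_term_weight(900, N)  # term occurring in 900 documents

print(round(rare, 3))      # 4.595
print(half)                # 0.0
print(round(frequent, 3))  # -2.197
```

The initial weight behaves like an idf: rare terms receive large weights, and a term that occurs in more than half of the documents receives a negative weight.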
This gives the initial ranking. An improved ranking can be obtained by the following procedure. We assume:
We can approximate $p_t$ by the distribution of the index term $x_t$ among the documents initially retrieved.
We can approximate $u_t$ by considering that the non-retrieved documents are not relevant.
We can write
\[ p_t = \frac{V_t}{V}, \qquad u_t = \frac{n_t - V_t}{N - V} \tag{4} \]
where
V is the subset of the documents initially retrieved and ranked by the probabilistic model.
$V_t$ is the subset of V containing the index term $x_t$.
In equation (4), V and $V_t$ also denote the number of documents in these sets.
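Small values of $V_t$ and V create problems (for example, $V_t = 0$ gives $p_t = 0$, whose logarithm is undefined), so we add an adjustment factor:
\[ p_t = \frac{V_t + \frac{n_t}{N}}{V + 1}, \qquad u_t = \frac{n_t - V_t + \frac{n_t}{N}}{N - V + 1} \tag{5} \]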
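A sketch of the adjusted re-estimation in equation (5), followed by the resulting term weight, is given below. The counts (N = 1000, V = 10, and so on) and the function names are made up for illustration.

```python
import math

def reestimate(v_t, v, n_t, n_docs):
    """Adjusted estimates of p_t and u_t from equation (5)."""
    p_t = (v_t + n_t / n_docs) / (v + 1)
    u_t = (n_t - v_t + n_t / n_docs) / (n_docs - v + 1)
    return p_t, u_t

def term_weight(p_t, u_t):
    """c_t = log(p_t / (1 - p_t)) + log((1 - u_t) / u_t)."""
    return math.log(p_t / (1.0 - p_t)) + math.log((1.0 - u_t) / u_t)

# Hypothetical counts: 1000 documents, 10 retrieved, term occurs in 10 documents.
N, V, n_t = 1000, 10, 10

# Term absent from every retrieved document: unadjusted p_t would be 0.
p_absent, u_absent = reestimate(0, V, n_t, N)
# Term present in 8 of the 10 retrieved documents.
p_present, u_present = reestimate(8, V, n_t, N)

c_absent = term_weight(p_absent, u_absent)
c_present = term_weight(p_present, u_present)

print(p_absent > 0)          # True
print(c_present > c_absent)  # True
```

With the adjustment, both estimates stay strictly between 0 and 1 for these counts, so $c_t$ is always defined, and a term concentrated in the retrieved documents ends up with the larger weight.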