Page 1: Basic IR: Modeling Basic IR Task: Match a subset of documents to the user’s query Slightly more complex: and rank the resulting documents by predicted

Basic IR: Modeling

Basic IR Task:
Match a subset of documents to the user’s query
Slightly more complex: and rank the resulting documents by predicted relevance

The derivation of relevance leads to different IR models.

Page 2:

Concepts: Term-Document Incidence

Imagine a matrix of terms X documents, with 1 when the term appears in the document and 0 otherwise.

Queries satisfied how? Problems?

       search  segment  select  semantic
MIR    1       0        1       1
AI     1       1        0       1
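
One way to picture how a query is satisfied against the incidence data above is the sketch below. It assumes that MIR and AI are the two documents (rows) and that the four words are the terms (columns); the names `TERMS`, `INCIDENCE`, and `docs_containing` are illustrative and not from the slides.

```python
# Term-document incidence for the slide's toy data.
# Assumption: rows are documents (MIR, AI) and columns are terms.
TERMS = ["search", "segment", "select", "semantic"]
INCIDENCE = {
    "MIR": [1, 0, 1, 1],
    "AI":  [1, 1, 0, 1],
}

def docs_containing(term):
    """Return the set of documents whose incidence bit for `term` is 1."""
    col = TERMS.index(term)
    return {doc for doc, row in INCIDENCE.items() if row[col] == 1}

# A conjunctive query is satisfied by intersecting the per-term document sets.
both = docs_containing("search") & docs_containing("select")
print(sorted(both))  # ['MIR']
```

The answer is a plain set: every matching document is returned with no ordering, which is one of the problems the slide asks about.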

Page 3:

Concepts: Term Frequency

To support document ranking, need more than just term incidence.
Term frequency records number of times a given term appears in each document.

Intuition: The more times a term appears in a document, the more central it is to the topic of the document.
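
A minimal sketch of recording term frequency for one document. The tokenization (lowercasing and splitting on whitespace) and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

tf = term_frequencies(
    "retrieval models rank documents and retrieval models match documents"
)
print(tf["retrieval"], tf["rank"])  # 2 1
```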

Page 4:

Concept: Term Weight

Weights represent the importance of a given term for characterizing a document.
wij is a weight for term i in document j.

Page 5:

Mapping Task and Document Type to Model

                        Index Terms   Full Text         Full Text + Structure
Searching (Retrieval)   Classic       Classic           Structured
Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext

Page 6:

IR Models (taxonomy figure, from MIR text)

User Task
  Retrieval: Adhoc, Filtering
    Classic Models: boolean, vector, probabilistic
      Set Theoretic: Fuzzy, Extended Boolean
      Inference Network, Belief Network
      Generalized Vector, Lat. Semantic Index, Neural Networks
    Structured Models: Non-Overlapping Lists, Proximal Nodes
  Browsing: Flat, Structure Guided, Hypertext

Page 7:

Classic Models: Basic Concepts

ki is an index term
dj is a document
t is the total number of index terms
K = (k1, k2, …, kt) is the set of all index terms
wij >= 0 is a weight associated with (ki,dj)
wij = 0 indicates that the term does not belong to the doc
vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with pair (ki,dj)

Page 8:

Classic: Boolean Model

Based on set theory: map queries with Boolean operations to set operations
Select documents from term-document incidence matrix
Pros:
Cons:
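
The mapping from Boolean operators to set operations can be sketched as follows. The postings data, the document ids, and the helper `docs` are hypothetical and exist only to make the example runnable.

```python
# Hypothetical postings: each term maps to the set of documents that contain it.
POSTINGS = {
    "search":   {"d1", "d2", "d3"},
    "semantic": {"d2", "d3", "d4"},
    "segment":  {"d3"},
}
ALL_DOCS = {"d1", "d2", "d3", "d4"}

def docs(term):
    """Documents containing `term`; an unknown term matches nothing."""
    return POSTINGS.get(term, set())

# AND -> intersection, OR -> union, NOT -> complement against the collection.
and_result = docs("search") & docs("semantic")               # search AND semantic
or_result = docs("search") | docs("segment")                 # search OR segment
not_result = docs("search") & (ALL_DOCS - docs("segment"))   # search AND NOT segment
```

Each result is an unordered set, so a document either matches exactly or is not returned at all.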

Page 9:

Exact Matching Ignores…
term frequency in document
term scarcity in corpus
size of document
ranking

Page 10:

Vector Model

Vector of term weights based on term frequency
Compute similarity between query and document where both are vectors
  vec(dj) = (w1j, w2j, ..., wtj)
  vec(q) = (w1q, w2q, ..., wtq)
Similarity is the cosine of the angle between the vectors.
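
A minimal sketch of the cosine of the angle between two weight vectors, using plain Python lists. Returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine(u, v):
    """Cosine of the angle between equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

Because the result depends only on direction, a long document and a short one with the same term proportions get the same score against a query.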

Page 11:

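Cosine Measure

$$sim(q, d_j) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1

from MIR notes
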
Page 12:

How to Set Wij Weights? TF-IDF

Within document: Term-Frequency
  tf measures term density within a document
Across document: Inverse Document Frequency
  idf measures informativeness or rarity of term across corpus.

$$idf_i = \log(n / df_i)$$

Page 13:

TF * IDF Computation

$$w_{i,d} = tf_{i,d} \times \log(n / df_i)$$

tf_{i,d} = frequency of term i in document d
n = total number of documents
df_i = the number of documents that contain term i

What happens as number of occurrences in a document increases?

What happens as term becomes more rare?
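
The two questions can be explored with a small sketch of the weight tf * log(n / df). The function name `tfidf_weight` is illustrative, and the natural logarithm is an assumption: the slide does not state a base, but the natural log reproduces the numbers in the deck's worked example.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """Weight of a term: tf * log(n / df), natural log; df must be >= 1."""
    return tf * math.log(n_docs / df)

# More occurrences in the document: the weight grows linearly with tf.
print(tfidf_weight(1, 7, 5) < tfidf_weight(2, 7, 5))  # True

# Rarer term (smaller df): the weight grows.
print(tfidf_weight(1, 7, 3) > tfidf_weight(1, 7, 5))  # True

# A term found in every document carries no weight at all.
print(tfidf_weight(3, 7, 7))  # 0.0
```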

Page 14:

TF * IDF

TF may be normalized.
  tf(i,d) = freq(i,d) / max(freq(l,d)), where the maximum is taken over all terms l in document d
IDF is computed
  normalized to size of corpus
  as log to make TF and IDF values comparable
IDF requires a static corpus.
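
A short sketch of the TF normalization, dividing each raw count by the largest count in the same document. The function name and the sample counts are illustrative; returning all zeros for an empty document is a choice made here to avoid dividing by zero.

```python
def normalized_tf(freqs):
    """Divide each raw count by the largest count in the same document."""
    peak = max(freqs.values(), default=0)
    if peak == 0:
        return {term: 0.0 for term in freqs}  # assumed handling of empty documents
    return {term: count / peak for term, count in freqs.items()}

print(normalized_tf({"k1": 2, "k2": 0, "k3": 1}))
# {'k1': 1.0, 'k2': 0.0, 'k3': 0.5}
```

After normalization the most frequent term in every document has tf = 1, so long documents no longer dominate purely through higher raw counts.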

Page 15:

How to Set Wi,q Weights?

1. Create Vector directly from query
2. Use modified tf-idf

$$w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log(n / df_i)$$
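
A sketch of the modified tf-idf for query terms, in the form the deck's worked example uses: (0.5 + 0.5 * freq / max freq) * log(n / df). The function name is illustrative, and the natural logarithm is an assumption that matches the example's numbers.

```python
import math

def query_weight(freq, max_freq, n_docs, df):
    """Query term weight: (0.5 + 0.5 * freq / max_freq) * log(n / df)."""
    return (0.5 + 0.5 * freq / max_freq) * math.log(n_docs / df)

# Query counts [1, 2, 3] over 7 documents, with document frequencies 5, 4, 3.
weights = [query_weight(f, 3, 7, df) for f, df in zip([1, 2, 3], [5, 4, 3])]
print([round(w, 2) for w in weights])  # [0.22, 0.47, 0.85]
```

The 0.5 offset keeps the tf factor between 0.5 and 1 for every term that occurs in the query, which suits short queries where most terms occur only once.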



Page 16:



The Vector Model: Example

     k1  k2  k3
d1   2   0   1
d2   1   0   0
d3   0   1   3
d4   2   0   0
d5   1   2   4
d6   1   2   0
d7   0   5   0

q    1   2   3

from MIR notes

Page 17:



The Vector Model: Example (cont.)

1. Compute Tf-IDF Vector for each document
For first document:
K1: ((2/2)*(log (7/5))) = .33
K2: (0*(log (7/4))) = 0
K3: ((1/2)*(log (7/3))) = .42

for rest: [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]

from MIR notes

Page 18:

The Vector Model: Example (cont.)

2. Compute the Tf-IDF for the query [1 2 3]:
K1: (.5 + ((.5 * 1)/3)) * (log (7/5))
K2: (.5 + ((.5 * 2)/3)) * (log (7/4))
K3: (.5 + ((.5 * 3)/3)) * (log (7/3))
which is: [.22 .47 .85]




Page 19:

The Vector Model: Example (cont.)

3. Compute the Sim for each document:
D1:
D1*q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
|D1| = sqrt((.33^2) + (.42^2)) = .53
|q| = sqrt((.22^2) + (.47^2) + (.85^2)) = 1.0
sim = .43 / (.53 * 1.0) = .81

D2: .23 D3: .93 D4: .23 D5: .97 D6: .51 D7: .47
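
The three steps of the example can be re-run end to end with the sketch below. It assumes natural logarithms (which reproduce the example's numbers), max-normalized tf for documents, and the 0.5-offset tf for the query; all function and variable names are illustrative. Note that d2 and d4 have identical weight vectors after normalization, so they necessarily receive the same score.

```python
import math

COUNTS = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
QUERY = [1, 2, 3]

def idf_vector(counts):
    """log(n / df_i) per term, where df_i counts documents containing term i."""
    n = len(counts)
    n_terms = len(next(iter(counts.values())))
    df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(n_terms)]
    return [math.log(n / d) for d in df]

def doc_vector(row, idf):
    """Max-normalized tf times idf."""
    peak = max(row)
    return [(c / peak) * w for c, w in zip(row, idf)]

def query_vector(row, idf):
    """(0.5 + 0.5 * normalized tf) times idf."""
    peak = max(row)
    return [(0.5 + 0.5 * c / peak) * w for c, w in zip(row, idf)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

IDF = idf_vector(COUNTS)
Q = query_vector(QUERY, IDF)
SIMS = {name: cosine(doc_vector(row, IDF), Q) for name, row in COUNTS.items()}
RANKING = sorted(SIMS, key=SIMS.get, reverse=True)

for name in RANKING:
    print(name, round(SIMS[name], 2))
```

Run as written, the ranking starts d5, d3, d1. Hand calculations that round intermediate values to two decimals can differ from this in the last digit.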







Page 20:

Vector Model Implementation Issues

Sparse Term X Document matrix
Store term count, term weight, or weighted by idf_i?
What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
How to efficiently compute Cosine for large index?
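
One possible response to the sparsity and storage questions is an inverted index that keeps only non-zero entries and stores raw counts. This is a sketch under that assumption, not the deck's prescribed design; `build_index` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(doc_counts):
    """Map term -> {doc_id: raw count}, storing only non-zero entries."""
    index = defaultdict(dict)
    for doc_id, counts in doc_counts.items():
        for term, count in counts.items():
            if count > 0:
                index[term][doc_id] = count
    return index

docs = {
    "d1": {"k1": 2, "k3": 1},
    "d3": {"k2": 1, "k3": 3},
    "d7": {"k2": 5},
}
index = build_index(docs)

# Document frequency is the length of a term's postings, so idf can be
# recomputed at query time when documents are added or removed.
df_k3 = len(index["k3"])
```

Storing counts rather than idf-weighted values means a changing corpus does not force every stored weight to be rewritten.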

Page 21:

Heuristics for Computing Cosine for Large Index

Select from only non-zero cosines
Focus on non-zero cosines for rare (high idf) words
Pre-compute document adjacency
  for each term, pre-compute k nearest docs
  for a t term query, compute cosines from query to union of t pre-computed lists, choose top k
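
The first two heuristics can be sketched with term-at-a-time scoring over postings lists: only documents sharing at least one query term are ever scored, and query terms below an idf threshold can be skipped. The function name, the `min_idf` parameter, and the data are hypothetical, and skipping terms makes the score an approximation of the true cosine.

```python
import heapq
import math

def top_k_cosine(query, postings, doc_norms, idf, k, min_idf=0.0):
    """Score only documents that share a sufficiently rare term with the query."""
    scores = {}
    for term, wq in query.items():
        if idf.get(term, 0.0) < min_idf:
            continue  # skip common (low idf) terms
        for doc_id, wd in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + wq * wd
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    ranked = ((s / (doc_norms[d] * q_norm), d) for d, s in scores.items())
    return heapq.nlargest(k, ranked)

# Hypothetical weighted postings, precomputed document norms, and idf values.
postings = {"a": {"d1": 1.0, "d2": 1.0}, "b": {"d2": 1.0}, "c": {"d3": 2.0}}
doc_norms = {"d1": 1.0, "d2": math.sqrt(2), "d3": 2.0}
idf = {"a": 0.1, "b": 1.0, "c": 1.5}
query = {"a": 1.0, "b": 1.0}

full = top_k_cosine(query, postings, doc_norms, idf, k=2)
rare_only = top_k_cosine(query, postings, doc_norms, idf, k=2, min_idf=0.5)
```

Here d3 is never touched because it shares no term with the query, and with `min_idf=0.5` only the postings for the rarer term are traversed.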

Page 22:

The TFIDF Vector Model: Pros/Cons

Pros:
  term-weighting improves quality
  cosine ranking formula sorts documents according to degree of similarity to the query
Cons:
  assumes independence of index terms
