Ranking and Relevance Feedback
Ray Larson & Marti Hearst
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
9/21/2000
Review
• Inverted files
• The Vector Space Model
• Term weighting
tf x idf

$$w_{ik} = tf_{ik} \times \log(N / n_k)$$

where:
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in collection $C$
$N$ = total number of documents in collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log\left(\frac{N}{n_k}\right)$
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10,000 documents:

$\log(10000 / 1) = 4$
$\log(10000 / 20) = 2.698$
$\log(10000 / 5000) = 0.301$
$\log(10000 / 10000) = 0$
Similarity Measures
Simple matching (coordination level match): $|Q \cap D|$

Dice's Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$

Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
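To make these formulas concrete, here is a minimal sketch (ours, not from the slides) that computes each coefficient for a query and a document represented as sets of terms:

```python
import math

def simple_matching(q: set, d: set) -> float:
    """Coordination level match: count of shared terms."""
    return len(q & d)

def dice(q: set, d: set) -> float:
    """Dice's coefficient: 2|Q n D| / (|Q| + |D|)."""
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q: set, d: set) -> float:
    """Jaccard's coefficient: |Q n D| / |Q u D|."""
    return len(q & d) / len(q | d)

def cosine(q: set, d: set) -> float:
    """Cosine coefficient: |Q n D| / (|Q|^0.5 * |D|^0.5)."""
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q: set, d: set) -> float:
    """Overlap coefficient: |Q n D| / min(|Q|, |D|)."""
    return len(q & d) / min(len(q), len(d))

q = {"information", "retrieval", "ranking"}
d = {"information", "retrieval", "systems", "evaluation"}
for f in (simple_matching, dice, jaccard, cosine, overlap):
    print(f.__name__, round(f(q, d), 3))
```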
Vector Space Visualization
Text Clustering

Clustering is "The art of finding groups in data." -- Kaufman and Rousseeuw

[Figure: document clusters plotted against axes Term 1 and Term 2]
tf x idf normalization

• Normalize the term weights (so longer documents are not unfairly given more weight)
  – normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive
$$w_{ik} = \frac{tf_{ik}\,\log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N / n_k)]^2}}$$
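A minimal sketch (ours) of this cosine-normalized tf x idf weighting for a single document; the counts are made up:

```python
import math

def tfidf_weights(tf: dict, df: dict, N: int) -> dict:
    """Cosine-normalized tf x idf weights for one document.

    tf: term -> raw frequency in this document
    df: term -> number of documents containing the term (n_k)
    N:  total number of documents in the collection
    """
    # Base-10 log, matching the slide's idf examples.
    raw = {t: f * math.log10(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

tf = {"ranking": 3, "feedback": 1, "the": 9}
df = {"ranking": 50, "feedback": 20, "the": 10000}
print(tfidf_weights(tf, df, N=10000))  # "the" gets weight 0 (idf = 0)
```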
Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

$$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik}\,w_{jk}$$

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
Vector Space Similarity Measure (combine tf x idf into a similarity measure)

$D_i = w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}$
$Q = w_{q_1}, w_{q_2}, \ldots, w_{q_t}$, where $w = 0$ if a term is absent

If the term weights are normalized:

$$sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j}\,w_{d_{ij}}$$

Otherwise, normalize in the similarity comparison:

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j}\,w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$
Vector Space with Term Weights and Cosine Matching
[Figure: Q, D1, and D2 plotted as vectors in a two-dimensional space with axes Term A and Term B]

$D_i = (d_{i1}, w_{d_{i1}};\; d_{i2}, w_{d_{i2}};\; \ldots;\; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\; q_{i2}, w_{q_{i2}};\; \ldots;\; q_{it}, w_{q_{it}})$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j}\,w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)   D1 = (0.8, 0.3)   D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} \approx 0.74$$
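A minimal sketch (ours) that reproduces the worked example with a generic cosine function:

```python
import math

def cosine_sim(q, d):
    """Cosine similarity between two weight vectors of equal length."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (math.sqrt(sum(w * w for w in q)) *
                  math.sqrt(sum(w * w for w in d)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine_sim(Q, D1), 2))  # 0.73 (the slide's 0.74 reflects rounding)
print(round(cosine_sim(Q, D2), 2))  # 0.98
```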
Term Weights in SMART
• In SMART weights are decomposed into three factors:
$$w_{kd} = \frac{freq_{kd} \times collect_k}{norm}$$

a term-frequency component ($freq_{kd}$), a collection component ($collect_k$), and a normalization component ($norm$), each chosen from the options on the next slides.
SMART Freq Components
Binary:    $freq_{kd} \in \{0, 1\}$
maxnorm:   $\dfrac{freq_{kd}}{\max(freq)}$
augmented: $\dfrac{1}{2} + \dfrac{freq_{kd}}{2\,\max(freq)}$
log:       $\ln(freq_{kd}) + 1$
Collection Weighting in SMART
Inverse:       $collect_k = \log \dfrac{NDoc}{Doc_k}$
squared:       $\left[ \log \dfrac{NDoc}{Doc_k} \right]^2$
probabilistic: $\log \dfrac{NDoc - Doc_k}{Doc_k}$
frequency:     $\dfrac{1}{Doc_k}$

where $NDoc$ is the number of documents in the collection and $Doc_k$ is the number of documents containing term $k$.
Term Normalization in SMART
sum:    $norm = \sum_{j \in vector} w_j$
cosine: $\sqrt{\sum_{j \in vector} w_j^2}$
fourth: $\sqrt[4]{\sum_{j \in vector} w_j^4}$
max:    $\max_{j \in vector} w_j$
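The three SMART factors can be treated as interchangeable components, as in this sketch (ours; the variant names mirror the slides, everything else is an assumption):

```python
import math

# Frequency (tf) components
freq_variants = {
    "binary":    lambda f, fmax: 1 if f > 0 else 0,
    "maxnorm":   lambda f, fmax: f / fmax,
    "augmented": lambda f, fmax: 0.5 + f / (2 * fmax),
    "log":       lambda f, fmax: math.log(f) + 1 if f > 0 else 0,
}

# Collection components
collect_variants = {
    "inverse":       lambda ndoc, doc_k: math.log(ndoc / doc_k),
    "squared":       lambda ndoc, doc_k: math.log(ndoc / doc_k) ** 2,
    "probabilistic": lambda ndoc, doc_k: math.log((ndoc - doc_k) / doc_k),
    "frequency":     lambda ndoc, doc_k: 1 / doc_k,
}

# Normalization components (applied to the vector of unnormalized weights)
norm_variants = {
    "sum":    lambda ws: sum(ws),
    "cosine": lambda ws: math.sqrt(sum(w * w for w in ws)),
    "fourth": lambda ws: sum(w ** 4 for w in ws) ** 0.25,
    "max":    lambda ws: max(ws),
}

# Example: log tf * inverse collection weight for one term, made-up counts;
# the final weight would then be divided by a norm over the whole vector.
f, fmax, ndoc, doc_k = 3, 7, 10000, 20
print(freq_variants["log"](f, fmax) * collect_variants["inverse"](ndoc, doc_k))
```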
Probabilistic Models: Logistic Regression attributes

$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$ -- Average Absolute Query Frequency
$X_2 = \sqrt{QL}$ -- Query Length
$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$ -- Average Absolute Document Frequency
$X_4 = \sqrt{DL}$ -- Document Length
$X_5 = \frac{1}{M} \sum_{j=1}^{M} IDF_{t_j}$ -- Average Inverse Document Frequency, where $IDF_t = \log \frac{N}{n_t}$ is the Inverse Document Frequency
$X_6 = \log M$ -- Number of terms in common between query and document -- logged
Probabilistic Models: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine the values of the coefficients. At retrieval, the probability estimate is obtained from the six X attribute measures shown previously:

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

(The probability itself follows by inverting the log odds.)
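A minimal sketch (ours) of the retrieval-time estimate; the coefficient values are placeholders, not the coefficients actually trained on TREC data:

```python
import math

def relevance_probability(x, c):
    """P(R|Q,D) via the logistic transform of log odds c0 + sum(ci * xi)."""
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Six attribute values X1..X6 and seven coefficients c0..c6 (made up):
X = [0.5, 2.0, 1.2, 9.5, 3.1, 1.1]
C = [-3.5, 0.4, -0.1, 0.3, -0.05, 0.6, 0.7]
print(relevance_probability(X, C))
```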
Probabilistic Models
Advantages:
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector

Disadvantages:
• Relevance information is required -- or is "guesstimated"
• Important indicators of relevance may not be terms -- though terms only are usually used
• Optimally requires on-going collection of relevance information
Vector and Probabilistic Models
• Support "natural language" queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates
Current use of Probabilistic Models
• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…
$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
Okapi BM25
$$\sum_{T \in Q} w^{(1)}\,\frac{(k_1 + 1)\,tf}{K + tf}\,\frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

• Where:
  – Q is a query containing terms T
  – K is $k_1((1 - b) + b \cdot dl / avdl)$
  – $k_1$, $b$ and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000 respectively
  – tf is the frequency of the term in a specific document
  – qtf is the frequency of the term in a topic from which Q was derived
  – dl and avdl are the document length and the average document length measured in some convenient unit
  – $w^{(1)}$ is the Robertson-Sparck Jones weight
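A minimal sketch (ours) of the BM25 score for one document, using the predictive $w^{(1)}$ with no relevance information ($r = R = 0$) and the parameter defaults from the slide:

```python
import math

def rsj_weight(n, N, r=0, R=0):
    """Robertson-Sparck Jones weight w(1), predictive (0.5-smoothed) form."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm25(query_tf, doc_tf, doc_df, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for a query.

    query_tf: term -> frequency in the query topic (qtf)
    doc_tf:   term -> frequency in the document (tf)
    doc_df:   term -> number of documents containing the term (n)
    k3 can be anywhere from 7 to 1000 per the slide; 7 is used here.
    """
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        score += (rsj_weight(doc_df[t], N)
                  * ((k1 + 1) * tf / (K + tf))
                  * ((k3 + 1) * qtf / (k3 + qtf)))
    return score

# Toy example with made-up counts:
print(bm25({"ranking": 2, "feedback": 1},
           {"ranking": 3, "relevance": 1},
           {"ranking": 50, "feedback": 20}, N=10000, dl=120, avdl=300))
```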
Today
• Logistic regression and Cheshire
• Relevance Feedback
Logistic Regression and Cheshire II
• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.
• Demo (?)
Querying in an IR System

[Diagram: an Information Storage and Retrieval System. Interest profiles & queries are formulated in terms of descriptors and stored (Store 1: profiles/search requests); documents & data pass down a storage line, are indexed (descriptive and subject), and are stored (Store 2: document representations). The "rules of the game" are the rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language. Comparison/matching of the two stores yields potentially relevant documents.]
Relevance Feedback in an IR System
[Diagram: the same Information Storage and Retrieval System as above, with one addition: selected relevant documents feed back from the retrieved set into the query formulation step.]
Query Modification

• Problem: how to reformulate the query?
  – Thesaurus expansion: suggest terms similar to query terms
  – Relevance feedback: suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
Relevance Feedback

• Main Idea:
  – Modify the existing query based on relevance judgments
    • Extract terms from relevant documents and add them to the query
    • and/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
      – Users/system select terms from an automatically-generated list
Relevance Feedback

• Usually do both:
  – expand the query with new terms
  – re-weight terms in the query
• There are many variations
  – usually positive weights for terms from relevant docs
  – sometimes negative weights for terms from non-relevant docs
  – remove terms that appear ONLY in non-relevant documents
Rocchio Method

$$Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$$

where:
$Q_0$ = the vector for the initial query
$R_i$ = the vector for relevant document $i$
$S_i$ = the vector for non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and non-relevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
Rocchio/Vector Illustration
[Figure: Q0, Q', Q", D1, and D2 plotted in a two-dimensional space with axes Information and Retrieval]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q' = ½ Q0 + ½ D1 = (0.45, 0.55)
Q" = ½ Q0 + ½ D2 = (0.80, 0.20)
Example Rocchio Calculation
Relevant docs:
R1 = (.030, .00, .00, .025, .025, .050, .00, .00, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)

Non-rel doc:
S1 = (.030, .010, .020, .00, .005, .025, .00, .020, .00)

Original query:
Q0 = (.00, .00, .00, .00, .500, .00, .450, .00, .950)

Constants: $\beta = 0.75$, $\gamma = 0.25$, $n_1 = 2$, $n_2 = 1$

Rocchio calculation:

$$Q_{new} = Q_0 + 0.75\,\frac{R_1 + R_2}{2} - 0.25\,\frac{S_1}{1}$$

Resulting feedback query:
Q_new = (.011, .000875, .002, .01, .527, .022, .488, .033, 1.04)
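A minimal sketch (ours) that implements the update and reproduces the feedback query above, up to the slide's rounding:

```python
def rocchio(q0, rel, nonrel, beta=0.75, gamma=0.25):
    """Rocchio update: Q1 = Q0 + beta*mean(rel) - gamma*mean(nonrel)."""
    n1, n2 = len(rel), len(nonrel)
    return [q + beta * sum(r[j] for r in rel) / n1
              - gamma * sum(s[j] for s in nonrel) / n2
            for j, q in enumerate(q0)]

Q0 = [.00, .00, .00, .00, .500, .00, .450, .00, .950]
R1 = [.030, .00, .00, .025, .025, .050, .00, .00, .120]
R2 = [.020, .009, .020, .002, .050, .025, .100, .100, .120]
S1 = [.030, .010, .020, .00, .005, .025, .00, .020, .00]
print([round(w, 4) for w in rocchio(Q0, [R1, R2], [S1])])
# -> [0.0113, 0.0009, 0.0025, 0.0101, 0.5269, 0.0219, 0.4875, 0.0325, 1.04]
```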
Rocchio Method
• Rocchio automatically
  – re-weights terms
  – adds in new terms (from relevant docs)
    • have to be careful when using negative terms
• Rocchio is not a machine learning algorithm
• Most methods perform similarly
  – results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio
Probabilistic Relevance Feedback (Robertson & Sparck Jones)

Given a query term t:

                          Document Relevance
                            +              -
Document      +             r            n - r              n
indexing      -           R - r      N - n - R + r        N - n
                            R            N - R              N

Where N is the number of documents seen.
Robertson-Sparck Jones Weights

• Retrospective formulation --

$$w_t^{new} = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$
Robertson-Sparck Jones Weights
• Predictive formulation --

$$w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$$
Using Relevance Feedback
• Known to improve results
  – in TREC-like conditions (no user involved)
• What about with a user in the loop?
  – How might you measure this?
Relevance Feedback Summary
• Iterative query modification can improve precision and recall for a standing query
• In at least one study, users were able to make good choices by seeing which terms were suggested for R.F. and selecting among them
Alternative Notions of Relevance Feedback
Alternative Notions of Relevance Feedback
• Find people whose taste is “similar” to yours. Will you like what they like?
• Follow a user's actions in the background. Can this be used to predict what the user will want to see next?
• Track what lots of people are doing. Does this implicitly indicate what they think is good and not good?
Alternative Notions of Relevance Feedback
• Several different criteria to consider:
  – Implicit vs. Explicit judgments
  – Individual vs. Group judgments
  – Standing vs. Dynamic topics
  – Similarity of the items being judged vs. similarity of the judges themselves
Collaborative Filtering (social filtering)
• If Pam liked the paper, I’ll like the paper
• If you liked Star Wars, you’ll like Independence Day
• Rating based on ratings of similar people
  – Ignores the text, so works on text, sound, pictures, etc.
  – But: initial users can bias ratings of future users
                   Sally  Bob  Chris  Lynn  Karen
Star Wars            7     7     3     4     7
Jurassic Park        6     4     7     4     4
Terminator II        3     4     7     6     3
Independence Day     7     7     2     2     ?
Ringo Collaborative Filtering (Shardanand & Maes 95)

• Users rate musical artists from like to dislike
  – 1 = detest, 7 = can't live without, 4 = ambivalent
  – There is a normal distribution around 4
  – However, what matters are the extremes
• Nearest Neighbors Strategy: find similar users and predict the (weighted) average of their ratings
• Pearson r algorithm: weight by degree of correlation between user U and user J
  – 1 means very similar, 0 means no correlation, -1 dissimilar
  – Works better to compare against the ambivalent rating (4), rather than the individual's average score

$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \sum (J - \bar{J})^2}}$$
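A minimal sketch (ours) of Pearson-r-weighted prediction applied to the movie table above to fill in Karen's missing Independence Day rating; the helper names are our own:

```python
import math

ratings = {  # rows from the table above; None = not yet rated
    "Sally": [7, 6, 3, 7],
    "Bob":   [7, 4, 4, 7],
    "Chris": [3, 7, 7, 2],
    "Lynn":  [4, 4, 6, 2],
    "Karen": [7, 4, 3, None],
}

def pearson(u, j):
    """Pearson r over the items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, j) if a is not None and b is not None]
    mu = sum(a for a, _ in pairs) / len(pairs)
    mj = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu) * (b - mj) for a, b in pairs)
    den = math.sqrt(sum((a - mu) ** 2 for a, _ in pairs) *
                    sum((b - mj) ** 2 for _, b in pairs))
    return num / den if den else 0.0

def predict(who, item):
    """My mean rating plus correlation-weighted deviations of the neighbors.
    (Ringo found comparing against the ambivalent rating 4 can work even
    better than each user's mean.)"""
    me = ratings[who]
    mine = [r for r in me if r is not None]
    num = den = 0.0
    for user, rs in ratings.items():
        if user == who or rs[item] is None:
            continue
        w = pearson(me, rs)
        others = [r for i, r in enumerate(rs) if i != item]
        num += w * (rs[item] - sum(others) / len(others))
        den += abs(w)
    return sum(mine) / len(mine) + num / den

print(round(predict("Karen", 3), 1))  # predicts a high rating (~7)
```

Negatively correlated users (like Chris here) contribute through their deviation from their own mean, so a low rating from a dissimilar user pushes the prediction up rather than down.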
Social Filtering

• Ignores the content, only looks at who judges things similarly
• Works well on data relating to "taste"
  – something that people are good at predicting about each other too
• Does it work for topic?
  – GroupLens results suggest otherwise (preliminary)
  – Perhaps for quality assessments
  – What about for assessing if a document is about a topic?
Learning Interface Agents

• Add agents in the UI, delegate tasks to them
• Use machine learning to improve performance
  – learn user behavior, preferences
• Useful when:
  – 1) past behavior is a useful predictor of the future
  – 2) there is a wide variety of behaviors amongst users
• Examples:
  – mail clerk: sort incoming messages into the right mailboxes
  – calendar manager: automatically schedule meeting times?
Summary
• Relevance feedback is an effective means for user-directed query modification.
• Modification can be done with either direct or indirect user input
• Modification can be done based on an individual’s or a group’s past input.