Latent Topic Feedback for Information Retrieval
David Andrzejewski, David Buttler
Juan Gabriel Romero
Universidad Nacional de Colombia
May 31, 2013
The Problem
Corpus:
Document metadata limited
Specialized domain
Large corpus, small user base
The user cannot formulate the "right" query
Solution
Obtaining user feedback at the latent topic level
Learn latent (unobserved) topics
Construct representations of these topics
Present potentially relevant topics to the user
Augment the original query
Latent Dirichlet Allocation
[Figure 1 (image not preserved): Blei, D., Sep. 2009, "Topic Models"]
Latent Dirichlet Allocation
[Figure 2 (image not preserved): Blei, D., Sep. 2009, "Topic Models"]
Latent Dirichlet Allocation
$$P(\mathbf{w}, \mathbf{z}, \phi, \theta \mid \alpha, \beta, \mathbf{d}) \;\propto\; \Bigl(\prod_{t}^{T} p(\phi_t \mid \beta)\Bigr)\Bigl(\prod_{j}^{D} p(\theta_j \mid \alpha)\Bigr)\Bigl(\prod_{i}^{N} \phi_{z_i}(w_i)\,\theta_{d_i}(z_i)\Bigr)$$
To infer $\mathbf{z}$, $\phi$, and $\theta$, run Markov chain Monte Carlo (Gibbs sampling); the resulting point estimates are

$$\phi_t(w) \propto n_{tw} + \beta, \qquad \theta_j(t) \propto n_{jt} + \alpha$$
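To make the update concrete, here is a minimal collapsed Gibbs sampler for LDA. This is an illustrative sketch, not the MALLET implementation the paper uses; the counts n_tw and n_jt mirror the notation above, and the symmetric hyperparameter defaults are assumptions.

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_tw = np.zeros((T, V))            # topic-word counts
    n_jt = np.zeros((len(docs), T))    # document-topic counts
    n_t = np.zeros(T)                  # total tokens assigned to each topic
    z = []                             # topic assignment for every token
    for j, doc in enumerate(docs):     # random initialization
        zj = rng.integers(T, size=len(doc))
        z.append(zj)
        for w, t in zip(doc, zj):
            n_tw[t, w] += 1; n_jt[j, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[j][i]            # remove the current assignment
                n_tw[t, w] -= 1; n_jt[j, t] -= 1; n_t[t] -= 1
                # P(z_i = t | rest) ∝ (n_jt + α)(n_tw + β)/(n_t + Vβ)
                p = (n_jt[j] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[j][i] = t            # record the new assignment
                n_tw[t, w] += 1; n_jt[j, t] += 1; n_t[t] += 1
    # point estimates matching the slide: φ_t(w) ∝ n_tw + β, θ_j(t) ∝ n_jt + α
    phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)
    theta = (n_jt + alpha) / (n_jt + alpha).sum(axis=1, keepdims=True)
    return phi, theta
```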
Topic representation
First, with $k = 10$, take the $k$ highest-probability words of the topic:

$$W_t = k\text{-}\operatorname{argmax}_{w}\, \phi_t(w)$$
Label generation (best topic word):

Description        Score
Word probability   $f_1(w) = P(w \mid z = t)$
Topic posterior    $f_2(w) = P(z = t \mid w)$
PMI                $f_3(w) = \sum_{w' \in W_t \setminus w} \mathrm{PMI}(w, w')$
Conditional 1      $f_4(w) = \sum_{w' \in W_t \setminus w} P(w \mid w')$
Conditional 2      $f_5(w) = \sum_{w' \in W_t \setminus w} P(w' \mid w)$
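A sketch of the five scores, assuming $P(w)$ and $P(w, w')$ are estimated from document-level co-occurrence (the slide does not give the paper's exact estimator, so that part is an assumption):

```python
import numpy as np

def best_topic_word(W_t, phi_t, p_t_given_w, p_w, p_ww, score="pmi"):
    """W_t: ids of the top-k topic words; p_w[w] = P(w); p_ww[w, w2] = P(w, w2)."""
    def pmi(w, w2):
        return np.log(p_ww[w, w2] / (p_w[w] * p_w[w2]))
    scores = {
        "word_prob": lambda w: phi_t[w],                                    # f1 = P(w | z=t)
        "topic_post": lambda w: p_t_given_w[w],                             # f2 = P(z=t | w)
        "pmi": lambda w: sum(pmi(w, u) for u in W_t if u != w),             # f3
        "cond1": lambda w: sum(p_ww[w, u] / p_w[u] for u in W_t if u != w), # f4 = Σ P(w | w')
        "cond2": lambda w: sum(p_ww[u, w] / p_w[w] for u in W_t if u != w), # f5 = Σ P(w' | w)
    }
    # the label is the highest-scoring word in W_t under the chosen score
    return max(W_t, key=scores[score])
```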
Topic representation
n-gram identification (Turbo Topics):
- Most significant trigram
- Two most significant bigrams
- Four most significant unigrams

Capitalization
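Turbo Topics itself (Blei and Lafferty's recursive permutation-test procedure) is more involved than a slide allows; the following is only a simplified stand-in that ranks n-grams among a topic's tokens by a PMI-style independence score. The minimum-count cutoff is an assumption.

```python
from collections import Counter
from math import log

def significant_ngrams(topic_tokens, n, top_k, min_count=5):
    """topic_tokens: corpus tokens assigned to topic t, in document order."""
    total = len(topic_tokens)
    unigram = Counter(topic_tokens)
    grams = Counter(tuple(topic_tokens[i:i + n]) for i in range(total - n + 1))
    def independence_score(g):
        # log P(gram) minus the log-probability it would have if its
        # words occurred independently
        return log(grams[g] / total) - sum(log(unigram[w] / total) for w in g)
    frequent = [g for g, c in grams.items() if c >= min_count]
    return sorted(frequent, key=independence_score, reverse=True)[:top_k]

def represent(topic_tokens):
    # one representation per the slide: the most significant trigram,
    # the two most significant bigrams, and four unigrams
    return (significant_ngrams(topic_tokens, 3, 1)
            + significant_ngrams(topic_tokens, 2, 2)
            + [(w,) for w, _ in Counter(topic_tokens).most_common(4)])
```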
Topic selection
The top 2 documents returned for the query are considered relevant; call this set $D_q$.
Enriched topics:

$$E = \bigcup_{d \in D_q} k\text{-}\operatorname{argmax}_{t}\, \theta_d(t)$$

Related topics:

$$R = \bigcup_{t \in E} k\text{-}\operatorname{argmax}_{t' \notin E}\, \Sigma(t, t')$$

Filter topics by coherence:

$$\mathrm{PMI}(t) = \frac{1}{k(k-1)} \sum_{(w, w') \in W_t} \mathrm{PMI}(w, w')$$
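Putting the three steps together, a sketch of the selection stage; the topic-similarity matrix Σ and the per-topic PMI coherence are taken as precomputed inputs, and the coherence threshold is an assumption:

```python
import numpy as np

def select_topics(theta, D_q, sim, topic_pmi, k=10, pmi_min=0.0):
    """theta: D x T document-topic matrix; D_q: ids of the top retrieved docs;
    sim[t, t2] = Σ(t, t2); topic_pmi[t] = average pairwise PMI of W_t."""
    # enriched topics E: the k most probable topics of each relevant document
    E = set()
    for d in D_q:
        E.update(np.argsort(theta[d])[-k:].tolist())
    # related topics R: for each t in E, the k most similar topics outside E
    R = set()
    for t in E:
        ranked = np.argsort(sim[t])[::-1]
        R.update([t2 for t2 in ranked.tolist() if t2 not in E][:k])
    # coherence filter: drop topics with low average pairwise word PMI
    return [t for t in sorted(E | R) if topic_pmi[t] >= pmi_min]
```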
Query expansion
Add the 10 most probable words of the selected topic, $W_t$, to the query, with $\gamma \in [0, 1]$ as a weight parameter.
For $N_q$ the number of words in the original query, each original word gets weight $(1 - \gamma)/N_q$.
The weight for each word from the selected topic is then $\gamma\, \tilde{\phi}_t(w)$, with $\tilde{\phi}_t$ the re-normalized topic-word probability:

$$\tilde{\phi}_t(w) = \frac{\phi_t(w)}{\sum_{w' \in W_t} \phi_t(w')}$$
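A sketch of the resulting term weights: the $N_q$ original query words share mass $(1 - \gamma)$ uniformly, and each of the top 10 topic words receives $\gamma$ times its re-normalized probability, so all weights sum to 1.

```python
import numpy as np

def expand_query(query_words, phi_t, vocab, gamma=0.25, k=10):
    """query_words: list of strings; phi_t: topic-word distribution; vocab: id -> word."""
    weights = {w: (1.0 - gamma) / len(query_words) for w in query_words}
    top = np.argsort(phi_t)[-k:]        # W_t: the k most probable topic words
    norm = phi_t[top].sum()             # denominator of the re-normalization
    for w_id in top:
        w = vocab[w_id]
        # γ · φ̃_t(w); if a topic word is also a query word, the masses add,
        # and the total weight over all terms is still (1 - γ) + γ = 1
        weights[w] = weights.get(w, 0.0) + gamma * phi_t[w_id] / norm
    return weights
```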
Experiments
Questions:
Can query expansion with latent topic feedback improve the results of actual queries?
Assuming helpful latent topics exist, will the topic selection described present them to the user?
If presented with a helpful topic, will a user actually select it? (Outside the scope of the paper)
Experimental setup
Data set from TREC
MALLET
Preparation: downcasing; removal of numbers and punctuation marks; stop-word removal; filtering of rarely occurring words (see the sketch after this list)
Vocabularies between 10,000 and 20,000 word types
Gibbs sampling run for 1,000 iterations, re-estimating α every 25 samples
500 topics
γ = 0.25
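A sketch of the preparation step; the stop list and the rarity cutoff (min_count) are assumptions, since the slide does not give them:

```python
import re
from collections import Counter

def preprocess(texts, stopwords, min_count=5):
    token_re = re.compile(r"[a-z]+")    # keeping letters only drops numbers/punctuation
    docs = [token_re.findall(t.lower()) for t in texts]   # downcase + tokenize
    counts = Counter(w for doc in docs for w in doc)
    keep = {w for w, c in counts.items()                  # filter rare words
            if c >= min_count and w not in stopwords}     # and stop words
    return [[w for w in doc if w in keep] for doc in docs]
```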
Results
Metrics: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain truncated at rank 15 (NDCG15)
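For reference, sketches of the two metrics: average precision over a ranked list with binary relevance (MAP is its mean over queries), and NDCG truncated at rank 15.

```python
import numpy as np

def average_precision(ranked_rel, n_relevant):
    """ranked_rel: 1/0 relevance of returned docs in rank order;
    n_relevant: total relevant docs for the query."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank        # precision at each relevant rank
    return total / n_relevant if n_relevant else 0.0

def ndcg_at_k(gains, k=15):
    """gains: graded relevance of returned docs in rank order."""
    g = np.asarray(gains[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(g) + 2))
    dcg = float((g * discounts).sum())
    ideal = float((np.sort(g)[::-1] * discounts).sum())  # best possible ordering
    return dcg / ideal if ideal > 0 else 0.0
```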
Results
For 40% of queries there exists a latent topic that can improve results
For 40% of these queries, the approach finds those relevant topics
Variations of the technique give worse results:
- Without filtering: increase in the number of topics presented, without a substantial increase in helpful topics retrieved
- Excluding related topics: decrease in both the number of topics and the helpful topics presented