Latent Topic Feedback for Information Retrieval
David Andrzejewski, David Buttler
Juan Gabriel Romero
Universidad Nacional de Colombia
May 31, 2013
The Problem
Corpus:
Document metadata limited
Specialized domain
Large corpus, small user base
The user cannot formulate the "right" query
Solution
Obtaining user feedback at the latent topic level
Learn latent (unobserved) topics
Construct representations of these topics
Present potentially relevant topics to the user
Augment the original query
Latent Dirichlet Allocation
[Figure 1 (image not preserved): Blei, D., Sep. 2009, "Topic Models"]
Latent Dirichlet Allocation
[Figure 2 (image not preserved): Blei, D., Sep. 2009, "Topic Models"]
Latent Dirichlet Allocation
$$P(\mathbf{w}, \mathbf{z}, \phi, \theta \mid \alpha, \beta, \mathbf{d}) \;\propto\; \Bigl(\prod_{t}^{T} p(\phi_t \mid \beta)\Bigr)\Bigl(\prod_{j}^{D} p(\theta_j \mid \alpha)\Bigr)\Bigl(\prod_{i}^{N} \phi_{z_i}(w_i)\,\theta_{d_i}(z_i)\Bigr)$$
To infer $\mathbf{z}$, $\phi$, and $\theta$, run Markov chain Monte Carlo (Gibbs sampling); the resulting point estimates are

$$\phi_t(w) \propto n_{tw} + \beta, \qquad \theta_j(t) \propto n_{jt} + \alpha$$
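To make the update concrete, here is a minimal collapsed Gibbs sampler for LDA. This is an illustrative sketch, not the MALLET implementation the paper uses; the counts n_tw and n_jt mirror the notation above, and the symmetric hyperparameter defaults are assumptions.

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=1000, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_tw = np.zeros((T, V))            # topic-word counts
    n_jt = np.zeros((len(docs), T))    # document-topic counts
    n_t = np.zeros(T)                  # total tokens assigned to each topic
    z = []                             # topic assignment for every token
    for j, doc in enumerate(docs):     # random initialization
        zj = rng.integers(T, size=len(doc))
        z.append(zj)
        for w, t in zip(doc, zj):
            n_tw[t, w] += 1; n_jt[j, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[j][i]            # remove the current assignment
                n_tw[t, w] -= 1; n_jt[j, t] -= 1; n_t[t] -= 1
                # P(z_i = t | rest) ∝ (n_jt + α)(n_tw + β)/(n_t + Vβ)
                p = (n_jt[j] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[j][i] = t            # record the new assignment
                n_tw[t, w] += 1; n_jt[j, t] += 1; n_t[t] += 1
    # point estimates matching the slide: φ_t(w) ∝ n_tw + β, θ_j(t) ∝ n_jt + α
    phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)
    theta = (n_jt + alpha) / (n_jt + alpha).sum(axis=1, keepdims=True)
    return phi, theta
```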
Topic representation
First, with $k = 10$, take the $k$ highest-probability words of the topic:

$$W_t = k\text{-}\operatorname{argmax}_{w}\, \phi_t(w)$$
Label generation (best topic word):

Description        Score
Word probability   $f_1(w) = P(w \mid z = t)$
Topic posterior    $f_2(w) = P(z = t \mid w)$
PMI                $f_3(w) = \sum_{w' \in W_t \setminus w} \mathrm{PMI}(w, w')$
Conditional 1      $f_4(w) = \sum_{w' \in W_t \setminus w} P(w \mid w')$
Conditional 2      $f_5(w) = \sum_{w' \in W_t \setminus w} P(w' \mid w)$
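A sketch of the five scores, assuming $P(w)$ and $P(w, w')$ are estimated from document-level co-occurrence (the slide does not give the paper's exact estimator, so that part is an assumption):

```python
import numpy as np

def best_topic_word(W_t, phi_t, p_t_given_w, p_w, p_ww, score="pmi"):
    """W_t: ids of the top-k topic words; p_w[w] = P(w); p_ww[w, w2] = P(w, w2)."""
    def pmi(w, w2):
        return np.log(p_ww[w, w2] / (p_w[w] * p_w[w2]))
    scores = {
        "word_prob": lambda w: phi_t[w],                                    # f1 = P(w | z=t)
        "topic_post": lambda w: p_t_given_w[w],                             # f2 = P(z=t | w)
        "pmi": lambda w: sum(pmi(w, u) for u in W_t if u != w),             # f3
        "cond1": lambda w: sum(p_ww[w, u] / p_w[u] for u in W_t if u != w), # f4 = Σ P(w | w')
        "cond2": lambda w: sum(p_ww[u, w] / p_w[w] for u in W_t if u != w), # f5 = Σ P(w' | w)
    }
    # the label is the highest-scoring word in W_t under the chosen score
    return max(W_t, key=scores[score])
```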
Topic representation
n-gram identification (Turbo Topics):
- Most significant trigram
- Two most significant bigrams
- Four most significant unigrams

Capitalization
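Turbo Topics itself (Blei and Lafferty's recursive permutation-test procedure) is more involved than a slide allows; the following is only a simplified stand-in that ranks n-grams among a topic's tokens by a PMI-style independence score. The minimum-count cutoff is an assumption.

```python
from collections import Counter
from math import log

def significant_ngrams(topic_tokens, n, top_k, min_count=5):
    """topic_tokens: corpus tokens assigned to topic t, in document order."""
    total = len(topic_tokens)
    unigram = Counter(topic_tokens)
    grams = Counter(tuple(topic_tokens[i:i + n]) for i in range(total - n + 1))
    def independence_score(g):
        # log P(gram) minus the log-probability it would have if its
        # words occurred independently
        return log(grams[g] / total) - sum(log(unigram[w] / total) for w in g)
    frequent = [g for g, c in grams.items() if c >= min_count]
    return sorted(frequent, key=independence_score, reverse=True)[:top_k]

def represent(topic_tokens):
    # one representation per the slide: the most significant trigram,
    # the two most significant bigrams, and four unigrams
    return (significant_ngrams(topic_tokens, 3, 1)
            + significant_ngrams(topic_tokens, 2, 2)
            + [(w,) for w, _ in Counter(topic_tokens).most_common(4)])
```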
Topic selection
The top 2 documents returned for the query are considered relevant; call this set $D_q$.
Enriched topics:

$$E = \bigcup_{d \in D_q} k\text{-}\operatorname{argmax}_{t}\, \theta_d(t)$$

Related topics:

$$R = \bigcup_{t \in E} k\text{-}\operatorname{argmax}_{t' \notin E}\, \Sigma(t, t')$$

Filter topics by coherence:

$$\mathrm{PMI}(t) = \frac{1}{k(k-1)} \sum_{(w, w') \in W_t} \mathrm{PMI}(w, w')$$
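Putting the three steps together, a sketch of the selection stage; the topic-similarity matrix Σ and the per-topic PMI coherence are taken as precomputed inputs, and the coherence threshold is an assumption:

```python
import numpy as np

def select_topics(theta, D_q, sim, topic_pmi, k=10, pmi_min=0.0):
    """theta: D x T document-topic matrix; D_q: ids of the top retrieved docs;
    sim[t, t2] = Σ(t, t2); topic_pmi[t] = average pairwise PMI of W_t."""
    # enriched topics E: the k most probable topics of each relevant document
    E = set()
    for d in D_q:
        E.update(np.argsort(theta[d])[-k:].tolist())
    # related topics R: for each t in E, the k most similar topics outside E
    R = set()
    for t in E:
        ranked = np.argsort(sim[t])[::-1]
        R.update([t2 for t2 in ranked.tolist() if t2 not in E][:k])
    # coherence filter: drop topics with low average pairwise word PMI
    return [t for t in sorted(E | R) if topic_pmi[t] >= pmi_min]
```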
Query expansion
Add the 10 most probable words of the selected topic, $W_t$, to the query, with $\gamma \in [0, 1]$ as a weight parameter.
For $N_q$ the number of words in the original query, each original word gets weight $(1 - \gamma)/N_q$.
The weight for each word from the selected topic is then $\gamma\, \tilde{\phi}_t(w)$, with $\tilde{\phi}_t$ the re-normalized topic-word probability:

$$\tilde{\phi}_t(w) = \frac{\phi_t(w)}{\sum_{w' \in W_t} \phi_t(w')}$$
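A sketch of the resulting term weights: the $N_q$ original query words share mass $(1 - \gamma)$ uniformly, and each of the top 10 topic words receives $\gamma$ times its re-normalized probability, so all weights sum to 1.

```python
import numpy as np

def expand_query(query_words, phi_t, vocab, gamma=0.25, k=10):
    """query_words: list of strings; phi_t: topic-word distribution; vocab: id -> word."""
    weights = {w: (1.0 - gamma) / len(query_words) for w in query_words}
    top = np.argsort(phi_t)[-k:]        # W_t: the k most probable topic words
    norm = phi_t[top].sum()             # denominator of the re-normalization
    for w_id in top:
        w = vocab[w_id]
        # γ · φ̃_t(w); if a topic word is also a query word, the masses add,
        # and the total weight over all terms is still (1 - γ) + γ = 1
        weights[w] = weights.get(w, 0.0) + gamma * phi_t[w_id] / norm
    return weights
```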
Experiments
Questions:
Can query expansion with latent topic feedback improve the results of actual queries?
Assuming helpful latent topics exist, will the topic selection described present them to the user?
If presented with a helpful topic, will a user actually select it? (Outside the scope of the paper)
Experimental setup
Data set from TREC
MALLET
Preparation: downcasing; removal of numbers and punctuation marks; stop-word removal; filtering of rarely occurring words (see the sketch after this list)
Vocabularies between 10,000 and 20,000 word types
Gibbs sampling run for 1,000 iterations, re-estimating α every 25 samples
500 topics
γ = 0.25
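A sketch of the preparation step; the stop list and the rarity cutoff (min_count) are assumptions, since the slide does not give them:

```python
import re
from collections import Counter

def preprocess(texts, stopwords, min_count=5):
    token_re = re.compile(r"[a-z]+")    # keeping letters only drops numbers/punctuation
    docs = [token_re.findall(t.lower()) for t in texts]   # downcase + tokenize
    counts = Counter(w for doc in docs for w in doc)
    keep = {w for w, c in counts.items()                  # filter rare words
            if c >= min_count and w not in stopwords}     # and stop words
    return [[w for w in doc if w in keep] for doc in docs]
```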
Results
Metrics: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain truncated at rank 15 (NDCG15)
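For reference, sketches of the two metrics: average precision over a ranked list with binary relevance (MAP is its mean over queries), and NDCG truncated at rank 15.

```python
import numpy as np

def average_precision(ranked_rel, n_relevant):
    """ranked_rel: 1/0 relevance of returned docs in rank order;
    n_relevant: total relevant docs for the query."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank        # precision at each relevant rank
    return total / n_relevant if n_relevant else 0.0

def ndcg_at_k(gains, k=15):
    """gains: graded relevance of returned docs in rank order."""
    g = np.asarray(gains[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(g) + 2))
    dcg = float((g * discounts).sum())
    ideal = float((np.sort(g)[::-1] * discounts).sum())  # best possible ordering
    return dcg / ideal if ideal > 0 else 0.0
```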
Results
For 40% of queries there exists a latent topic that can improve results
For 40% of these queries, the approach finds those relevant topics
Variations of the technique give worse results:
- Without filtering: increase in the number of topics presented, without a substantial increase in helpful topics retrieved
- Excluding related topics: decrease in both the number of topics and the helpful topics presented