
Latent Topic Feedback for Information Retrieval
David Andrzejewski, David Buttler

Juan Gabriel Romero

Universidad Nacional de Colombia

May 31, 2013


The Problem

Corpus:

Document metadata limited

Specialized domain

Large corpus, small user base

The user cannot formulate the "right" query


Solution

Obtaining user feedback at the latent topic level

Learn latent (unobserved) topics

Construct representations of these topics

Present potentially relevant topics to the user

Augment the original query


Latent Dirichlet Allocation

[Figure 1: Blei, D., Sep 2009, Topic Models]


Latent Dirichlet Allocation

[Figure 2: Blei, D., Sep 2009, Topic Models]


Latent Dirichlet Allocation

$$P(\mathbf{w}, \mathbf{z}, \phi, \theta \mid \alpha, \beta, d) \;\propto\; \left( \prod_{t}^{T} p(\phi_t \mid \beta) \right) \prod_{j}^{D} \left( p(\theta_j \mid \alpha) \prod_{i}^{N_j} \phi_{z_i}(w_i)\, \theta_j(z_i) \right)$$

To infer $z$, $\phi$, and $\theta$, run Markov Chain Monte Carlo (Gibbs sampling); the resulting point estimates are

$$\phi_t(w) \propto n_{tw} + \beta \qquad\qquad \theta_j(t) \propto n_{jt} + \alpha$$

where $n_{tw}$ is the number of times word $w$ is assigned to topic $t$, and $n_{jt}$ is the number of times topic $t$ is assigned within document $j$.
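A minimal numpy sketch of these point estimates, assuming hypothetical count matrices accumulated by a Gibbs sampler; the sizes and hyperparameter values are stand-ins, not the paper's settings:

```python
import numpy as np

# Hypothetical count matrices from a Gibbs sampler over z:
#   n_tw[t, w] = number of times word w is assigned to topic t
#   n_jt[j, t] = number of times a token in document j gets topic t
T, W, D = 500, 10_000, 1_000
n_tw = np.random.randint(0, 5, size=(T, W))   # stand-ins for real counts
n_jt = np.random.randint(0, 5, size=(D, T))
alpha, beta = 0.1, 0.01                       # assumed hyperparameter values

# phi_t(w) proportional to n_tw + beta: per-topic word distributions
phi = (n_tw + beta) / (n_tw + beta).sum(axis=1, keepdims=True)

# theta_j(t) proportional to n_jt + alpha: per-document topic distributions
theta = (n_jt + alpha) / (n_jt + alpha).sum(axis=1, keepdims=True)
```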


Topic representation

First, take the $k = 10$ most probable words of each topic:

$$W_t = k\text{-}\operatorname{argmax}_{w} \; \phi_t(w)$$

Label generation (best topic word):

| Description      | Score                                                               |
|------------------|---------------------------------------------------------------------|
| Word probability | $f_1(w) = P(w \mid z = t)$                                           |
| Topic posterior  | $f_2(w) = P(z = t \mid w)$                                           |
| PMI              | $f_3(w) = \sum_{w' \in W_t \setminus w} \operatorname{PMI}(w, w')$   |
| Conditional 1    | $f_4(w) = \sum_{w' \in W_t \setminus w} P(w \mid w')$                |
| Conditional 2    | $f_5(w) = \sum_{w' \in W_t \setminus w} P(w' \mid w)$                |
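A hedged sketch of $W_t$ and the five scores. The topic-word distributions, the co-occurrence counts, and the uniform topic prior used to invert $P(w \mid z=t)$ into $P(z=t \mid w)$ are all stand-in assumptions:

```python
import numpy as np

phi = np.random.dirichlet(np.ones(2000), size=500)  # stand-in topic-word dists
k, t = 10, 42                        # top-k words of an arbitrary topic index
Wt = np.argsort(phi[t])[::-1][:k]    # W_t = k-argmax_w phi_t(w)

# Hypothetical co-occurrence counts over the k candidate words, which would
# really be estimated from document-level co-occurrence in the corpus.
cooc = np.random.randint(1, 100, size=(k, k))
cooc = cooc + cooc.T                 # make counts symmetric
count = cooc.sum(axis=1)             # marginal count per candidate
N = cooc.sum()

def label_scores(i):
    """The five label scores for candidate word Wt[i]."""
    w = Wt[i]
    others = np.array([j for j in range(k) if j != i])
    p_joint = cooc[i, others] / N
    pmi = np.log(p_joint / ((count[i] / N) * (count[others] / N)))
    f1 = phi[t, w]                                   # P(w | z=t)
    f2 = phi[t, w] / phi[:, w].sum()                 # P(z=t | w), uniform P(z)
    f3 = pmi.sum()                                   # sum_w' PMI(w, w')
    f4 = (cooc[i, others] / count[others]).sum()     # sum_w' P(w | w')
    f5 = (cooc[i, others] / count[i]).sum()          # sum_w' P(w' | w)
    return f1, f2, f3, f4, f5

best = max(range(k), key=lambda i: label_scores(i)[2])  # e.g. label by f3
```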


Topic representation

n-gram identification (Turbo Topics); a simplified ranking sketch follows below:

- The most significant trigram
- The two most significant bigrams
- The four most significant unigrams

Capitalization
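Turbo Topics itself uses a recursive permutation-based significance test; as a rough, simplified stand-in, this sketch ranks adjacent bigrams by PMI instead (the function name and the toy token list are hypothetical):

```python
import math
from collections import Counter

def top_bigrams(tokens, n=2, min_count=2):
    """Rank adjacent bigrams by PMI -- a rough stand-in for the recursive
    significance testing that Turbo Topics actually performs."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    def pmi(pair):
        w1, w2 = pair
        return math.log((bi[pair] / (N - 1)) / ((uni[w1] / N) * (uni[w2] / N)))
    frequent = [p for p in bi if bi[p] >= min_count]
    return sorted(frequent, key=pmi, reverse=True)[:n]

tokens = "latent topic model latent topic feedback latent topic model".split()
print(top_bigrams(tokens))   # e.g. [('latent', 'topic'), ('topic', 'model')]
```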


Topic selection

The top 2 retrieved documents are considered relevant

Enriched topics:

$$E = \bigcup_{d \in D_q} k\text{-}\operatorname{argmax}_{t} \; \theta_d(t)$$

Related topics:

$$R = \bigcup_{t \in E} k\text{-}\operatorname{argmax}_{t' \notin E} \; \Sigma(t, t')$$

Filter topics:

$$\operatorname{PMI}(t) = \frac{1}{k(k-1)} \sum_{(w, w') \in W_t} \operatorname{PMI}(w, w')$$
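A sketch of the three selection steps under stated assumptions: $\Sigma(t, t')$ is taken here as cosine similarity of topic-word distributions, and the coherence values and filter threshold are stand-ins, since the slides define neither:

```python
import numpy as np

def k_argmax(values, k):
    """Indices of the k largest entries of a 1-D array."""
    return set(np.argsort(values)[::-1][:k])

k = 5                                  # illustrative; not the paper's value
D_q = [0, 3]                           # hypothetical indices of the top-2 docs
theta = np.random.dirichlet(np.ones(500), size=100)  # stand-in doc-topic dists
phi = np.random.dirichlet(np.ones(2000), size=500)   # stand-in topic-word dists

# Enriched topics E: the k strongest topics of each top-ranked document.
E = set().union(*(k_argmax(theta[d], k) for d in D_q))

# Related topics R: for each enriched topic, the k most similar topics
# outside E. Sigma(t, t') is assumed to be cosine similarity here.
norms = np.linalg.norm(phi, axis=1)
sim = (phi @ phi.T) / np.outer(norms, norms)
R = set()
for t in E:
    s = sim[t].copy()
    s[list(E)] = -np.inf               # never re-select an enriched topic
    R |= k_argmax(s, k)

# Filtering: keep only coherent topics, scored by average pairwise PMI of
# their top-k words; avg_pmi and the threshold are stand-in values.
avg_pmi = np.random.randn(500)
candidates = {t for t in (E | R) if avg_pmi[t] > 0.0}
```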


Query expansion

Add the 10 most probable words of the topic ($W_t$) to the query

With $\gamma \in [0, 1]$ as a weight parameter and $N_q$ the number of words in the original query, each original query word receives weight $(1-\gamma)/N_q$.

The weight for each word from the selected topic is then $\gamma \tilde{\phi}_t(w)$, with $\tilde{\phi}_t$ the re-normalized topic-word probability:

$$\tilde{\phi}_t(w) = \frac{\phi_t(w)}{\sum_{w' \in W_t} \phi_t(w')}$$
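A small sketch of this weighting scheme, directly following the formulas above; the example words and probabilities are hypothetical:

```python
def expand_query(query_terms, Wt, phi_t, gamma=0.25):
    """Weighted expanded query: original terms share (1 - gamma) equally,
    topic words share gamma in proportion to the re-normalized phi_t."""
    Nq = len(query_terms)
    weights = {w: (1.0 - gamma) / Nq for w in query_terms}
    norm = sum(phi_t[w] for w in Wt)             # renormalize over W_t only
    for w in Wt:
        weights[w] = weights.get(w, 0.0) + gamma * phi_t[w] / norm
    return weights

# Hypothetical topic words and probabilities:
phi_t = {"reactor": 0.04, "fuel": 0.03, "uranium": 0.02}
print(expand_query(["nuclear", "waste"], list(phi_t), phi_t))
```

With $\gamma = 0.25$, the value used in the experiments, the original query words keep 75% of the total weight.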


Experiments

Questions:

Can query expansion with latent topic feedback improve the results of actual queries?

Assuming helpful latent topics exist, will the topic selection described above present them to the user?

If presented with a helpful topic, will a user actually select it? (Outside the scope of this work.)


Experimental setup

Data set from TREC

MALLET (MAchine Learning for LanguagE Toolkit)

Preparation: downcasing; removal of numbers and punctuation marks; stop-word removal; filtering of rarely occurring words (see the sketch after this list)

Vocabulary sizes between 10,000 and 20,000 words

Gibbs sampling run for 1,000 iterations, re-estimating α every 25 samples

500 topics

γ = 0.25
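A minimal preprocessing sketch mirroring the preparation steps listed above; the stop-word list and the rare-word threshold are assumptions, since the slides give no exact values (the actual pipeline used MALLET):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # abbreviated

def preprocess(docs, min_count=5):
    """Downcase, strip numbers and punctuation, remove stop words, and
    filter rarely occurring words (min_count is an assumed threshold)."""
    token_lists = [
        [w for w in re.findall(r"[a-z]+", d.lower()) if w not in STOPWORDS]
        for d in docs
    ]
    counts = Counter(w for toks in token_lists for w in toks)
    return [[w for w in toks if counts[w] >= min_count] for toks in token_lists]
```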


Results

Evaluation metrics: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain at rank 15 (NDCG15)
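For reference, a short sketch of NDCG@k in its classic formulation; the graded relevance values in the usage line are made up:

```python
import numpy as np

def dcg(relevances, k):
    """Discounted cumulative gain over the first k ranks (classic form)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    return (rel / np.log2(np.arange(2, rel.size + 2))).sum()

def ndcg(relevances, k=15):
    """NDCG@k: DCG of the ranking over DCG of the ideal reordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))   # graded relevance of results, in ranked order
```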


Results

For 40% of queries there exists a latent topic that can enhance results

For 40% of these queries, the approach finds the relevant topics

Variations of the technique give worse results:

- Without filtering: the number of topics presented increases without a substantial increase in helpful topics retrieved
- Excluding related topics: both the number of topics and the number of helpful topics presented decrease
