Diversifying Search Results
Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong
Search Labs, Microsoft Research
WSDM, February 10, 2009
Ambiguity and Diversification
• Many queries are ambiguous
  – “Barcelona” (City? Football team? Movie?)
  – “Michael Jordan” (which one? Michael I. Jordan or Michael J. Jordan?)
How best to answer ambiguous queries?
• Use context, make suggestions, …
• Under the premise of returning a single (ordered) set of results, how best to diversify the search results so that a user will find something useful?
Intuition behind Our Approach
• Analyze click logs for classifying queries and docs
• Maximize the probability that the average user will find a relevant document in the retrieved results
• Use the analogy of marginal utility to determine whether to include more results from an already covered category
Outline
• Problem formulation
• Theoretical analysis
• Metrics to measure diversity
• Experiments
Assumptions
• A taxonomy (categorization of intents) C
  – For each query q, P(c | q) denotes the distribution of intents
  – $\sum_{c \in C} P(c \mid q) = 1$
• Quality assessment of documents at intent level
  – For each doc d, V(d | q, c) denotes the probability of the doc satisfying the intent
  – Conditional independence: $1 - V(d_1, d_2 \mid q, c) = (1 - V(d_1 \mid q, c))(1 - V(d_2 \mid q, c))$
• Users are interested in finding at least one satisfying document
Problem Statement
DIVERSIFY(K)
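• Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | q, c), and integer k
• Find a set of docs S ⊆ D with |S| = k that maximizes
$$P(S \mid q) = \sum_{c} P(c \mid q) \Big(1 - \prod_{d \in S} \big(1 - V(d \mid q, c)\big)\Big)$$
interpreted as the probability that the set S is relevant to the query over all possible intentions
  – The sum over c accounts for multiple intents
  – The term $1 - \prod_{d \in S}(1 - V(d \mid q, c))$ is the probability of finding at least one relevant doc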
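The objective can be evaluated directly from the two inputs named in the problem statement. The following is a minimal sketch, not the authors' implementation: the function name `prob_satisfied`, the dictionary layout, and the three documents are assumptions made for illustration, and a document is taken to have quality 0 for any intent it is not listed under.

```python
from math import prod

def prob_satisfied(selected, intent_probs, quality):
    """P(S | q) = sum_c P(c | q) * (1 - prod_{d in S} (1 - V(d | q, c)))."""
    total = 0.0
    for intent, p_intent in intent_probs.items():
        # Probability that none of the selected documents satisfies this intent.
        p_all_fail = prod(1.0 - quality[d].get(intent, 0.0) for d in selected)
        total += p_intent * (1.0 - p_all_fail)
    return total

# Hypothetical data: two intents and three documents, each tied to one intent.
intent_probs = {"R": 0.8, "B": 0.2}
quality = {
    "r1": {"R": 0.9},
    "b1": {"B": 0.4},
    "b2": {"B": 0.4},
}

# 0.8 * (1 - 0.1) + 0.2 * (1 - 0.6 * 0.6) = 0.72 + 0.128
print(prob_satisfied(["r1", "b1", "b2"], intent_probs, quality))
```

The second document for intent B adds less than the first one did (0.048 versus 0.08), because the product already accounts for users satisfied by the first.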
Discussion of Objective
• Makes explicit use of taxonomy
  – In contrast, similarity-based: [CG98], [CK06], [RKJ08]
• Captures both diversification and doc relevance
  – In contrast, coverage-based: [Z+05], [C+08], [V+08]
• Specific form of “loss minimization” [Z02], [ZL06]
• “Diminishing returns” for docs w/ the same intent
• Objective is order-independent
  – Assumes that all users read k results
  – May want to optimize $\sum_k P(k)\, P(S \mid q)$
Properties of the Objective
• DIVERSIFY(K) is NP-Hard
  – Reduction from Max-Cover
• No single ordering that will optimize for all k
• Can we make use of “diminishing returns”?
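The claim that no single ordering is optimal for every k can be checked by brute force on a tiny instance. The sketch below uses hypothetical data (two equally likely intents `c1` and `c2`, three documents `a`, `b`, `c`) and an exhaustive search that is only practical for very small D; the helper names are assumptions, not part of the talk.

```python
from itertools import combinations
from math import prod

def prob_satisfied(selected, intent_probs, quality):
    """P(S | q) = sum_c P(c | q) * (1 - prod_{d in S} (1 - V(d | q, c)))."""
    return sum(
        p * (1.0 - prod(1.0 - quality[d].get(c, 0.0) for d in selected))
        for c, p in intent_probs.items()
    )

def best_set(k, intent_probs, quality):
    """Exhaustively search all k-subsets of the documents (exponential time)."""
    return max(
        combinations(sorted(quality), k),
        key=lambda s: prob_satisfied(s, intent_probs, quality),
    )

# Hypothetical instance: "a" is decent for both intents,
# "b" and "c" are each perfect for exactly one intent.
intent_probs = {"c1": 0.5, "c2": 0.5}
quality = {
    "a": {"c1": 0.6, "c2": 0.6},
    "b": {"c1": 1.0},
    "c": {"c2": 1.0},
}

print(best_set(1, intent_probs, quality))  # ('a',)      with P = 0.6
print(best_set(2, intent_probs, quality))  # ('b', 'c')  with P = 1.0
```

Here the best single result is `a`, but the best pair excludes `a`, so any ranking that is optimal at k = 1 is suboptimal at k = 2.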
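A Greedy Algorithm
• Intent distribution: P(R | q) = 0.8, P(B | q) = 0.2
[Figure: animated worked example. Five candidate documents in D with V(d | q, c) = 0.4, 0.9, 0.5, 0.4, 0.4 are scored as g(d | q, c) = U(c | q) × V(d | q, c), starting from U(R | q) = 0.8 and U(B | q) = 0.2, which gives initial scores 0.08, 0.72, 0.40, 0.32, 0.08. After each pick the U value of the covered intent is reduced and the remaining scores are recomputed. The selected set S ends up holding the documents with V = 0.9, 0.4, and 0.4.]
• Actually produces an ordered set of results
• Results not proportional to intent distribution
• Results not according to (raw) quality
• Better results ⇒ less needed to be shown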
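The greedy procedure in this example can be sketched in a few lines. This is a hedged reconstruction from the numbers on the slide, not the authors' code: it assumes each candidate is scored by summing U(c | q) × V(d | q, c) over intents, and that after a pick U(c | q) is multiplied by (1 − V(d | q, c)). The function name, document ids, and the assignment of documents to intents R and B are illustrative assumptions.

```python
def greedy_diversify(k, intent_probs, quality):
    """Pick k documents one at a time by largest marginal gain."""
    # U(c | q): probability mass of intent c not yet satisfied by earlier picks.
    unsatisfied = dict(intent_probs)
    remaining = sorted(quality)  # sorted so that ties break deterministically
    selected = []
    while remaining and len(selected) < k:
        def marginal(d):
            # g(d | q, c) summed over intents, using the current U values.
            return sum(u * quality[d].get(c, 0.0) for c, u in unsatisfied.items())
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
        # Discount every intent the chosen document may already have satisfied.
        for c in unsatisfied:
            unsatisfied[c] *= 1.0 - quality[best].get(c, 0.0)
    return selected

# Assumed assignment of the slide's quality values to intents R and B.
intent_probs = {"R": 0.8, "B": 0.2}
quality = {
    "r1": {"R": 0.9}, "r2": {"R": 0.5}, "r3": {"R": 0.4},
    "b1": {"B": 0.4}, "b2": {"B": 0.4},
}

print(greedy_diversify(3, intent_probs, quality))  # ['r1', 'b1', 'b2']
```

With these inputs a single strong R document is enough to drop U(R | q) from 0.8 to 0.08, so the next two picks go to intent B even though `r2` has higher raw quality. This matches the observations that results are neither proportional to the intent distribution nor ordered by raw quality.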
Formal Claims
Lemma 1 P(S | q) is submodular.
  – Same intuition as diminishing returns
  – For sets of documents where S ⊆ T, and a document d,
$$P(S \cup \{d\} \mid q) - P(S \mid q) \ge P(T \cup \{d\} \mid q) - P(T \mid q)$$
Theorem 1 Solution is a (1 – 1/e) approximation of the optimum.
  – Consequence of Lemma 1 and [NWF78]
Theorem 2 Solution is optimal when each document can only satisfy one category.
  – Relative quality of docs does not change
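The diminishing-returns intuition behind Lemma 1 can be made explicit by expanding the marginal gain of adding a document d ∉ T to a set S ⊆ T, using only the objective from the problem statement. This is a sketch of the argument, not the proof given by the authors.

$$
\begin{aligned}
P(S \cup \{d\} \mid q) - P(S \mid q)
&= \sum_{c} P(c \mid q) \Big[ \prod_{d' \in S} \big(1 - V(d' \mid q, c)\big) - \big(1 - V(d \mid q, c)\big) \prod_{d' \in S} \big(1 - V(d' \mid q, c)\big) \Big] \\
&= \sum_{c} P(c \mid q)\, V(d \mid q, c) \prod_{d' \in S} \big(1 - V(d' \mid q, c)\big)
\end{aligned}
$$

Every factor $1 - V(d' \mid q, c)$ lies in $[0, 1]$, so enlarging S to T can only shrink the product for each intent c. The gain from adding d to T is therefore no larger than the gain from adding it to S, which is the inequality in Lemma 1.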
How to Measure Success?
• Many metrics for relevance
  – Normalized discounted cumulative gain at k (NDCG@k) [JK00]
  – Mean average precision at k (MAP@k)
  – Mean reciprocal rank (MRR)
• Some metrics for diversity
  – Maximal marginal relevance (MMR) [CG98]
  – Nugget-based instantiation of NDCG [C+08]
• Want a metric that can take into account both relevance and diversity
Generalizing Relevance Metrics
• Take expectation over distribution of intents
  – Interpretation: how will the average user feel?
• Consider NDCG@k
  – Classic:
$$\mathrm{DCG}(S; k) = \sum_{j=1}^{k} \frac{f(\mathrm{relevance}(S_j))}{\mathrm{discount}(j)}, \qquad \mathrm{NDCG}(S; k) = \frac{\mathrm{DCG}(S; k)}{\max_{R} \mathrm{DCG}(R; k)}$$
  – NDCG-IA depends on intent distribution and intent-specific NDCG:
$$\text{NDCG-IA}(S; k) = \sum_{c} P(c \mid q)\, \mathrm{NDCG}(S; k \mid c)$$
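The intent-aware metric is a weighted average of ordinary per-intent NDCG values, which the following sketch computes. It uses the gain f(rel) = 2^rel and discount(j) = 1 + log2(j) stated later in the experiment details. The function names, the dictionary layout, the two-document example, and the choice to take the ideal ranking over the graded documents of each intent are assumptions for illustration.

```python
import math

def dcg(grades, k):
    """DCG with gain 2^rel and discount 1 + log2(j), for positions j = 1..k."""
    return sum(2 ** g / (1 + math.log2(j)) for j, g in enumerate(grades[:k], start=1))

def ndcg(ranking, grades, k):
    """NDCG of `ranking` under one intent; `grades` maps doc id -> relevance grade."""
    # Ideal DCG: the graded documents sorted by decreasing relevance.
    ideal = dcg(sorted(grades.values(), reverse=True), k)
    actual = dcg([grades.get(d, 0) for d in ranking], k)
    return actual / ideal if ideal > 0 else 0.0

def ndcg_ia(ranking, intent_probs, grades_by_intent, k):
    """NDCG-IA(S; k) = sum_c P(c | q) * NDCG(S; k | c)."""
    return sum(p * ndcg(ranking, grades_by_intent[c], k) for c, p in intent_probs.items())

# Hypothetical example: document "a" is relevant under R, "b" under B.
intent_probs = {"R": 0.8, "B": 0.2}
grades_by_intent = {"R": {"a": 1, "b": 0}, "B": {"a": 0, "b": 1}}

# NDCG is 1.0 under R and 2.0 / 2.5 = 0.8 under B, so NDCG-IA = 0.8 + 0.16.
print(ndcg_ia(["a", "b"], intent_probs, grades_by_intent, k=2))
```

Because the ideal ranking differs by intent, a single ranking generally cannot reach 1.0 under every intent at once.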
Setup
• 10,000 queries randomly sampled from logs
  – Queries classified according to ODP (level 2) [F+08]
  – Keep only queries with at least two intents (~900)
• Top 50 results from Live, Google, and Yahoo!
• Documents are rated on a 5-pt scale
  – >90% docs have ratings
  – Docs w/o ratings are assigned random grade according to the distribution of rated documents
[Figure: histogram of query count (axis 0 to 600) by category count (2 to 10)]
Experiment Detail
• Documents are classified using a Rocchio classifier
  – Assumes that each doc belongs to only one category
• Quality scores of documents are estimated based on textual and link features of the webpage
  – Our approach is agnostic of how quality is determined
  – Can be interpreted as a re-ordering of search results that takes into account ambiguities in queries
• Evaluation using generalized NDCG, MAP, and MRR
  – f(relevance(d)) = 2^rel(d); discount(j) = 1 + log2(j)
  – Take P(c | q) as ground truth
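The same expectation over intents can be applied to the other metrics. The slides do not spell out the generalized MRR, so the following is only a sketch under stated assumptions: relevance is treated as binary per intent, the reciprocal rank is computed for a single query (the "mean" would be taken over queries), and all names and data are hypothetical.

```python
def reciprocal_rank(ranking, relevant, k):
    """1 / position of the first relevant document in the top k, or 0 if none."""
    for j, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            return 1.0 / j
    return 0.0

def rr_ia(ranking, intent_probs, relevant_by_intent, k):
    """Intent-aware reciprocal rank: sum_c P(c | q) * RR(S; k | c)."""
    return sum(
        p * reciprocal_rank(ranking, relevant_by_intent.get(c, set()), k)
        for c, p in intent_probs.items()
    )

# Hypothetical example: the top result only serves the minority intent B.
intent_probs = {"R": 0.8, "B": 0.2}
relevant_by_intent = {"R": {"b"}, "B": {"a"}}

# 0.8 * (1 / 2) + 0.2 * (1 / 1) = 0.6
print(rr_ia(["a", "b"], intent_probs, relevant_by_intent, k=2))
```

Swapping the two results would raise the value to 0.8 × 1 + 0.2 × 0.5 = 0.9, which reflects that the majority intent should be served first.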
NDCG-IA
[Figure: bar chart of NDCG-IA value (axis 0.00 to 0.30) at NDCG-IA@3, NDCG-IA@5, and NDCG-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
MAP-IA and MRR-IA
[Figure: bar chart of MAP-IA value (axis 0.00 to 0.70) at MAP-IA@3, MAP-IA@5, and MAP-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
[Figure: bar chart of MRR-IA value (axis 0.00 to 0.70) at MRR-IA@3, MRR-IA@5, and MRR-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
Evaluation using Mechanical Turk
• Created two types of HITs on Mechanical Turk
  – Query classification: workers are asked to choose among three interpretations
  – Document rating (under the given interpretation)
• Two additional evaluations
  – MT classification + current ratings
  – MT classification + MT document ratings
Evaluation using Mechanical Turk
[Figure: bar chart of MAP-IA value (axis 0.00 to 0.60) at MAP-IA@3, MAP-IA@5, and MAP-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
[Figure: bar chart of NDCG-IA value (axis 0.00 to 0.25) at NDCG-IA@3, NDCG-IA@5, and NDCG-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
[Figure: bar chart of MRR-IA value (axis 0.00 to 0.60) at MRR-IA@3, MRR-IA@5, and MRR-IA@10 for Diverse, Engine 1, Engine 2, and Engine 3]
Concluding Remarks
• Theoretical approach to diversification supported by empirical evaluation
• What to show is a function of both intent distribution and quality of documents
  – Less is needed when quality is high
• There are additional flexibilities in our approach
  – Not tied to any taxonomy
  – Can make use of context as well
Future Work
• When is it right to diversify?
  – Users have certain expectations about the workings of a search engine
• What is the best way to diversify?
  – Evaluate approaches beyond diversifying the retrieved results
• Metrics that capture both relevance and diversity
  – Some preliminary work suggests that there will be certain trade-offs to make
Thanks
{rakesha, sreenig, alanhal, sieong}@microsoft.com