Diversifying Search Results

Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong

Search Labs, Microsoft Research

WSDM, February 10, 2009

Ambiguity and Diversification

• Many queries are ambiguous
  – “Barcelona” (City? Football team? Movie?)
  – “Michael Jordan” (Which one: Michael I. Jordan or Michael J. Jordan?)

How best to answer ambiguous queries?

• Use context, make suggestions, …

• Under the premise of returning a single (ordered) set of results, how best to diversify the search results so that a user will find something useful?

Intuition behind Our Approach

• Analyze click logs for classifying queries and docs

• Maximize the probability that the average user will find a relevant document in the retrieved results

• Use the analogy of marginal utility to determine whether to include more results from an already covered category

Outline

• Problem formulation

• Theoretical analysis

• Metrics to measure diversity

• Experiments

Assumptions

• A taxonomy (categorization of intents) C
  – For each query q, P(c | q) denotes the distribution of intents
  – $\sum_{c \in C} P(c \mid q) = 1$

• Quality assessment of documents at the intent level
  – For each doc d, V(d | q, c) denotes the probability that the doc satisfies intent c
  – Conditional independence: $1 - V(d_1, d_2 \mid q, c) = (1 - V(d_1 \mid q, c))(1 - V(d_2 \mid q, c))$

• Users are interested in finding at least one satisfying document

Problem Statement

DIVERSIFY(K)

• Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | q, c), and integer k

• Find a set of docs S ⊆ D with |S| = k that maximizes

$$P(S \mid q) = \sum_{c} P(c \mid q) \Bigl(1 - \prod_{d \in S} \bigl(1 - V(d \mid q, c)\bigr)\Bigr)$$

interpreted as the probability that the set S is relevant to the query over all possible intentions
  – Multiple intents: the sum over c, weighted by P(c | q)
  – Find at least one relevant doc: the term $1 - \prod_{d \in S} (1 - V(d \mid q, c))$
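The objective can be evaluated directly for a small candidate set. The sketch below is a minimal illustration, assuming documents and intents are plain string keys and that a missing quality entry means V(d | q, c) = 0; the function name `prob_satisfied` and the toy data are hypothetical, not from the talk.

```python
from math import prod


def prob_satisfied(selected, intent_probs, quality):
    """Return P(S | q) for the documents in `selected`.

    intent_probs: dict intent -> P(c | q)
    quality:      dict doc -> dict intent -> V(d | q, c); missing means 0
    """
    total = 0.0
    for intent, p_intent in intent_probs.items():
        # Probability that every selected document fails this intent.
        p_all_fail = prod(1.0 - quality[doc].get(intent, 0.0) for doc in selected)
        total += p_intent * (1.0 - p_all_fail)
    return total


# Hypothetical toy data for an ambiguous query.
intent_probs = {"city": 0.6, "football": 0.4}
quality = {
    "d1": {"city": 0.9},
    "d2": {"city": 0.8},
    "d3": {"football": 0.5},
}

same_intent = prob_satisfied(["d1", "d2"], intent_probs, quality)
mixed = prob_satisfied(["d1", "d3"], intent_probs, quality)
print(same_intent, mixed)
```

On this data the mixed set scores higher than the two strongest same-intent documents, because the second "city" document adds little once the first has nearly covered that intent.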

Discussion of Objective

• Makes explicit use of taxonomy
  – In contrast, similarity-based: [CG98], [CK06], [RKJ08]

• Captures both diversification and doc relevance
  – In contrast, coverage-based: [Z+05], [C+08], [V+08]

• Specific form of “loss minimization” [Z02], [ZL06]

• “Diminishing returns” for docs w/ the same intent

• Objective is order-independent
  – Assumes that all users read k results
  – May want to optimize $\sum_k P(k)\, P(S \mid q)$

Properties of the Objective

• DIVERSIFY(K) is NP-Hard
  – Reduction from Max-Cover

• No single ordering will optimize for all k

• Can we make use of “diminishing returns”?
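The claim that no single ordering is optimal for every k can be checked by brute force on a toy instance. The sketch below uses a made-up example (documents A, B, C and intents c1, c2 are hypothetical, not from the talk): A partly covers both intents, while B and C each fully cover one.

```python
from itertools import combinations
from math import prod


def prob_satisfied(selected, intent_probs, quality):
    """P(S | q) under the objective defined in the problem statement."""
    return sum(
        p * (1.0 - prod(1.0 - quality[d].get(c, 0.0) for d in selected))
        for c, p in intent_probs.items()
    )


# Hypothetical instance: two equally likely intents.
intent_probs = {"c1": 0.5, "c2": 0.5}
quality = {
    "A": {"c1": 0.6, "c2": 0.6},
    "B": {"c1": 1.0},
    "C": {"c2": 1.0},
}


def best_set(k):
    """Exhaustively find the size-k set with the highest P(S | q)."""
    return max(
        combinations(sorted(quality), k),
        key=lambda s: prob_satisfied(s, intent_probs, quality),
    )


best_1 = best_set(1)
best_2 = best_set(2)
print(best_1, best_2)
```

Here the best single result is A, but the best pair is B and C, so any ranking that starts with A is suboptimal at k = 2.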

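A Greedy Algorithm

• Intent distribution: P(R | q) = 0.8, P(B | q) = 0.2.

• Start with U(c | q) = P(c | q). At each step, pick the doc with the largest g(d | q, c) = U(c | q) × V(d | q, c), then scale U(c | q) for its category by (1 − V(d | q, c))

D (category)    V(d | q, c)    g, step 1    g, step 2    g, step 3
R               0.9            0.72         picked
R               0.5            0.40         0.04         0.04
R               0.4            0.32         0.03         0.03
B               0.4            0.08         0.08         picked
B               0.4            0.08         0.08         0.05

U(R | q)                       0.8          0.08         0.08
U(B | q)                       0.2          0.2          0.12

S after three steps: R (0.9), B (0.4), B (0.4); U(B | q) then drops to 0.07

• Actually produces an ordered set of results

• Results not proportional to intent distribution

• Results not according to (raw) quality

• Better results ⇒ less needed to be shown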
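The greedy selection on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `greedy_diversify` and the document ids r1..b2 are hypothetical, and the data is the five-document example with P(R | q) = 0.8 and P(B | q) = 0.2. The gain sums over intents, so it also covers documents that satisfy more than one category.

```python
def greedy_diversify(quality, intent_probs, k):
    """Greedily pick k documents by marginal utility.

    quality:      dict doc -> dict intent -> V(d | q, c)
    intent_probs: dict intent -> P(c | q)
    Returns the ordered selection and the final U(c | q) values.
    """
    utility = dict(intent_probs)   # U(c | q) starts at P(c | q)
    remaining = list(quality)      # list keeps tie-breaking deterministic
    selected = []
    while remaining and len(selected) < k:
        # g(d) = sum over intents of U(c | q) * V(d | q, c)
        best = max(
            remaining,
            key=lambda d: sum(utility[c] * v for c, v in quality[d].items()),
        )
        remaining.remove(best)
        selected.append(best)
        # The chosen document reduces the unmet need of each intent it serves.
        for c, v in quality[best].items():
            utility[c] *= 1.0 - v
    return selected, utility


# Hypothetical ids for the five documents of the slide example.
quality = {
    "r1": {"R": 0.9},
    "r2": {"R": 0.5},
    "r3": {"R": 0.4},
    "b1": {"B": 0.4},
    "b2": {"B": 0.4},
}
selected, utility = greedy_diversify(quality, {"R": 0.8, "B": 0.2}, k=3)
print(selected, utility)
```

After r1 is picked, U(R | q) falls from 0.8 to 0.08, so both B documents outrank the remaining R documents even though R is the more likely intent.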

Formal Claims

Lemma 1 P(S | q) is submodular.
  – Same intuition as diminishing returns
  – For sets of documents where S ⊆ T, and a document d,

$$P(S \cup \{d\} \mid q) - P(S \mid q) \ge P(T \cup \{d\} \mid q) - P(T \mid q)$$

Theorem 1 Solution is a (1 − 1/e)-approximation of the optimum.
  – Consequence of Lemma 1 and [NWF78]

Theorem 2 Solution is optimal when each document can only satisfy one category.
  – Relative quality of docs does not change
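The inequality in Lemma 1 can be seen by writing out the marginal gain of adding a document. The short derivation below is a sketch that assumes d is not already in T and that every V(d | q, c) lies in [0, 1]; it only expands the objective P(S | q) as defined in the problem statement.

```latex
\begin{aligned}
P(S \cup \{d\} \mid q) - P(S \mid q)
  &= \sum_{c} P(c \mid q)\, V(d \mid q, c) \prod_{d' \in S} \bigl(1 - V(d' \mid q, c)\bigr) \\
  &\ge \sum_{c} P(c \mid q)\, V(d \mid q, c) \prod_{d' \in T} \bigl(1 - V(d' \mid q, c)\bigr) \\
  &= P(T \cup \{d\} \mid q) - P(T \mid q)
\end{aligned}
```

The middle step holds because S ⊆ T, so the product over T contains every factor of the product over S plus additional factors that are each at most 1.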

How to Measure Success?

• Many metrics for relevance
  – Normalized discounted cumulative gain at k (NDCG@k) [JK00]
  – Mean average precision at k (MAP@k)
  – Mean reciprocal rank (MRR)

• Some metrics for diversity
  – Maximal marginal relevance (MMR) [CG98]
  – Nugget-based instantiation of NDCG [C+08]

• Want a metric that can take into account both relevance and diversity

Generalizing Relevance Metrics

• Take expectation over distribution of intents
  – Interpretation: how will the average user feel?

• Consider NDCG@k
  – Classic:

$$\mathrm{DCG}(S; k) = \sum_{j=1}^{k} \frac{f(\mathrm{relevance}(S_j))}{\mathrm{discount}(j)}$$

$$\mathrm{NDCG}(S; k) = \frac{\mathrm{DCG}(S; k)}{\max_{R} \mathrm{DCG}(R; k)}$$

  – NDCG-IA depends on intent distribution and intent-specific NDCG:

$$\text{NDCG-IA}(S; k) = \sum_{c} P(c \mid q)\, \mathrm{NDCG}(S; k \mid c)$$
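The intent-aware metric can be computed by evaluating NDCG once per intent and averaging with P(c | q). The sketch below is one possible reading, with several assumptions: the gain is 2^rel and the discount is 1 + log2(j), as stated for the experiments later in the talk; the ideal ranking is taken to be the judged pool sorted by grade; and all function names and grades are hypothetical.

```python
from math import log2


def dcg(grades, k):
    """DCG@k with gain 2^rel and discount 1 + log2(j), j starting at 1."""
    return sum(2 ** g / (1 + log2(j)) for j, g in enumerate(grades[:k], start=1))


def ndcg(ranked_grades, pool_grades, k):
    """NDCG@k: DCG of the ranking divided by DCG of the best ordering of the pool."""
    ideal = dcg(sorted(pool_grades, reverse=True), k)
    return dcg(ranked_grades, k) / ideal if ideal > 0 else 0.0


def ndcg_ia(grades_by_intent, pool_by_intent, intent_probs, k):
    """Expected NDCG@k over the intent distribution P(c | q)."""
    return sum(
        p * ndcg(grades_by_intent[c], pool_by_intent[c], k)
        for c, p in intent_probs.items()
    )


# Hypothetical grades for one ranked list of three documents,
# judged separately under each intent.
intent_probs = {"R": 0.8, "B": 0.2}
grades_by_intent = {"R": [3, 0, 0], "B": [0, 2, 0]}
pool_by_intent = {"R": [3, 0, 0], "B": [0, 2, 0]}

score = ndcg_ia(grades_by_intent, pool_by_intent, intent_probs, k=3)
print(round(score, 4))
```

In this example the ranking is ideal for intent R but places the only useful B document second, so the B term pulls the overall score below 1.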

Setup

• 10,000 queries randomly sampled from logs
  – Queries classified according to ODP (level 2) [F+08]
  – Keep only queries with at least two intents (~900)

• Top 50 results from Live, Google, and Yahoo!

• Documents are rated on a 5-pt scale
  – >90% docs have ratings
  – Docs w/o ratings are assigned a random grade according to the distribution of rated documents

[Figure: histogram of query count (y-axis, 0 to 600) by category count (x-axis, 2 to 10)]

Experiment Detail

• Documents are classified using a Rocchio classifier
  – Assumes that each doc belongs to only one category

• Quality scores of documents are estimated based on textual and link features of the webpage
  – Our approach is agnostic of how quality is determined
  – Can be interpreted as a re-ordering of search results that takes into account ambiguities in queries

• Evaluation using generalized NDCG, MAP, and MRR
  – $f(\mathrm{relevance}(d)) = 2^{\mathrm{rel}(d)}$; $\mathrm{discount}(j) = 1 + \log_2 j$
  – Take P(c | q) as ground truth
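The same expectation over intents applies to reciprocal rank. The sketch below is a hedged illustration for a single query: the talk does not say which grades count as relevant, so the `min_grade` threshold is an assumption, as are the function names and the toy grades. Averaging the result over queries would give MRR-IA.

```python
def reciprocal_rank(grades, k, min_grade):
    """1 / rank of the first result in the top k graded at least min_grade, else 0."""
    for j, grade in enumerate(grades[:k], start=1):
        if grade >= min_grade:
            return 1.0 / j
    return 0.0


def rr_ia(grades_by_intent, intent_probs, k, min_grade=1):
    """Intent-aware reciprocal rank for one query, weighted by P(c | q)."""
    return sum(
        p * reciprocal_rank(grades_by_intent[c], k, min_grade)
        for c, p in intent_probs.items()
    )


# Hypothetical grades for one ranked list, judged under each intent.
intent_probs = {"R": 0.8, "B": 0.2}
grades_by_intent = {"R": [3, 0, 0], "B": [0, 2, 0]}

rr_score = rr_ia(grades_by_intent, intent_probs, k=3)
print(rr_score)
```

Users with intent R are satisfied at rank 1 and users with intent B at rank 2, so the per-query value is 0.8 × 1 + 0.2 × 0.5.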

NDCG-IA

[Figure: bar chart of NDCG-IA value (y-axis, 0.00 to 0.30) at NDCG-IA@3, NDCG-IA@5, and NDCG-IA@10. Legend: Diverse, Engine 1, Engine 2, Engine 3]

MAP-IA and MRR-IA

[Figure: bar chart of MAP-IA value (y-axis, 0.00 to 0.70) at MAP-IA@3, MAP-IA@5, and MAP-IA@10. Legend: Diverse, Engine 1, Engine 2, Engine 3]

[Figure: bar chart of MRR-IA value (y-axis, 0.00 to 0.60) at MRR-IA@3, MRR-IA@5, and MRR-IA@10]

Evaluation using Mechanical Turk

• Created two types of HITs on Mechanical Turk
  – Query classification: workers are asked to choose among three interpretations
  – Document rating (under the given interpretation)

• Two additional evaluations
  – MT classification + current ratings
  – MT classification + MT document ratings

Evaluation using Mechanical Turk

[Figure: bar chart of MAP-IA value (y-axis, 0.00 to 0.50) at MAP-IA@3, MAP-IA@5, and MAP-IA@10]

[Figure: bar chart of NDCG-IA value (y-axis, 0.00 to 0.20) at NDCG-IA@3, NDCG-IA@5, and NDCG-IA@10]

[Figure: bar chart of MRR-IA value (y-axis, 0.00 to 0.60) at MRR-IA@3, MRR-IA@5, and MRR-IA@10. Legend: Diverse, Engine 1, Engine 2, Engine 3]

Concluding Remarks

• Theoretical approach to diversification supported by empirical evaluation

• What to show is a function of both intent distribution and quality of documents
  – Less is needed when quality is high

• There are additional flexibilities in our approach
  – Not tied to any taxonomy
  – Can make use of context as well

Future Work

• When is it right to diversify?
  – Users have certain expectations about the workings of a search engine

• What is the best way to diversify?
  – Evaluate approaches beyond diversifying the retrieved results

• Metrics that capture both relevance and diversity
  – Some preliminary work suggests that there will be certain trade-offs to make

Thanks

{rakesha, sreenig, alanhal, sieong}@microsoft.com