DivQ: Diversification for Keyword Search over Structured Databases

Elena Demidova, Peter Fankhauser, Xuan Zhou and Wolfgang Nejdl
L3S Research Center, Hannover, Germany
Fraunhofer IPSI, Darmstadt, Germany
CSIRO ICT Centre, Australia

SIGIR 2010

2010. 12. 17.
Jaehui Park




Copyright 2010 by CEBT

INTRODUCTION

Keyword search over structured data
  – No single interpretation of a keyword query can satisfy all users.
  – Multiple interpretations may yield overlapping results.

Diversification
  – Minimizing the risk of user dissatisfaction by balancing the relevance and novelty of search results

An example
  Query: "London"
  – location: the capital of the UK
  – name: a book written by Jack London
  The occurrences can be viewed as keyword interpretations with different semantics offering complementary results.


INTRODUCTION

Motivation
  Taking advantage of the structure of the databases
  – Query interpretation in terms of the underlying database
  – To deliver more diverse and orthogonal representations of query results
    ex) attribute

Contributions
  DivQ
  – A probabilistic query disambiguation model
  – A diversification scheme for generating top-k query interpretations
  Evaluation metrics for structured data
  – α-nDCG-W
  – WS-recall


The Diversification Scheme

Query interpretations
  a keyword query -> a set of structured queries

Ranking the query interpretations
  Providing a quick overview of the available classes of results
  Faceted search: navigate and choose

Q: CONSIDERATION CHRISTOPHER GUEST

Top-3 interpretations: ranking
  Relevance  Interpretation
  0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
  0.8        An actor CHRISTOPHER GUEST in a movie CONSIDERATION
  0.5        A director CHRISTOPHER GUEST

Top-3 interpretations: diversification (increasing novelty)
  Relevance  Interpretation
  0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
  0.4        An actor CHRISTOPHER GUEST
  0.2        A plot containing CHRISTOPHER GUEST of a movie


The Diversification Scheme

Bringing Keywords into Structure
  Keyword interpretations Ai:ki
  – Mapping each keyword ki to an element Ai of an algebraic expression
  – (Predefined) query template T joining the keyword interpretations:
    a structural pattern that is frequently used to query the database
  – An example
    Keyword query (K): CONSIDERATION CHRISTOPHER GUEST
    Keyword interpretations: director:CHRISTOPHER director:GUEST movie:CONSIDERATION
    T: A director X of a movie Y
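
The mapping on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions, not the authors' implementation: the class names `KeywordInterpretation` and `QueryInterpretation`, the `render` method, and the use of a format string with `{director}` and `{movie}` slots for the template are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordInterpretation:
    """A keyword k_i mapped to an attribute A_i, written A_i:k_i (hypothetical class)."""
    attribute: str
    keyword: str

    def __str__(self) -> str:
        return f"{self.attribute}:{self.keyword}"


@dataclass(frozen=True)
class QueryInterpretation:
    """A query template T plus the keyword interpretations it joins (hypothetical class)."""
    template: str           # assumption: a format string with one slot per attribute
    interpretations: tuple  # keyword interpretations, kept in keyword order

    def render(self) -> str:
        # Group the keywords by attribute, then fill the template slots.
        values = {}
        for ki in self.interpretations:
            values.setdefault(ki.attribute, []).append(ki.keyword)
        return self.template.format(**{a: " ".join(ws) for a, ws in values.items()})


query = QueryInterpretation(
    template="A director {director} of a movie {movie}",
    interpretations=(
        KeywordInterpretation("director", "CHRISTOPHER"),
        KeywordInterpretation("director", "GUEST"),
        KeywordInterpretation("movie", "CONSIDERATION"),
    ),
)
print(query.render())
```

Keeping the interpretations as a tuple preserves the keyword order of the original query; converting them to a set is enough when only membership matters, for example when comparing two interpretations.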


The Diversification Scheme

Estimating Query Relevance
  Relevance of a query interpretation Q to informational needs K
  – P(Q|K) = P(I,T|K)
    T: query template, I: a set of keyword interpretations
  – Assumptions
    Each keyword has one particular interpretation.
    The probability of a keyword interpretation is independent of the part of the query interpretation the keyword is not interpreted to.
  – Attribute-specific term frequency (e.g., the average number of co-occurrences)
    ex) rank higher: a first name and a last name of a person mapped to the attribute "name"

  Formula annotations:
  – the probability that, given that Aj is a part of a query interpretation, keyword interpretation Aj are also a part of the query interpretation
  – smoothing factor
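
The slide's formula is not preserved here, so the following is only a rough sketch of how the two assumptions could turn into a score. It assumes the score is proportional to a template prior multiplied by one probability per keyword, and that each keyword probability comes from an attribute-specific term frequency with additive smoothing. The function names, the parameter `smoothing`, and all counts are hypothetical.

```python
def keyword_probability(term_count, attribute_total, vocabulary_size, smoothing=1.0):
    """Additively smoothed estimate of a keyword's likelihood under one attribute.

    Assumption: additive smoothing stands in for the slide's "smoothing factor".
    """
    return (term_count + smoothing) / (attribute_total + smoothing * vocabulary_size)


def relevance_score(template_prior, keyword_stats, smoothing=1.0):
    """Unnormalised score: template prior times the product of keyword probabilities.

    The product form relies on the independence assumption listed on the slide.
    keyword_stats holds one (term_count, attribute_total, vocabulary_size) per keyword.
    """
    score = template_prior
    for term_count, attribute_total, vocabulary_size in keyword_stats:
        score *= keyword_probability(term_count, attribute_total, vocabulary_size, smoothing)
    return score


# Invented counts: the keywords are frequent in "director", rare in "plot".
as_director = relevance_score(0.3, [(40, 1000, 500), (35, 1000, 500)])
as_plot = relevance_score(0.3, [(1, 1000, 500), (0, 1000, 500)])
print(as_director > as_plot)
```

The smoothing term keeps a keyword that never occurs in an attribute from forcing the whole product to zero.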


The Diversification Scheme

Estimating Query Similarity
  The Jaccard coefficient between the sets of keyword interpretations I contained in Q1 and Q2

Combining Relevance and Similarity
  1. Select the most relevant interpretation as the first interpretation presented to the user.
  2. Each of the following interpretations is selected based on both its relevance and its novelty with respect to the selected query interpretation set.
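
A minimal sketch of the two ingredients follows. The Jaccard coefficient is as stated on the slide; the way relevance and similarity are combined is an assumption, since the slide's formula is not preserved. The sketch uses a linear trade-off between relevance and the average similarity to the already selected interpretations, and the names `diversified_score` and `trade_off` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of keyword interpretations."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversified_score(relevance, candidate, selected, trade_off=0.5):
    """Assumed combination: reward relevance, penalise average similarity to `selected`."""
    if not selected:
        return trade_off * relevance
    avg_sim = sum(jaccard(candidate, s) for s in selected) / len(selected)
    return trade_off * relevance - (1 - trade_off) * avg_sim


q1 = {"director:CHRISTOPHER", "director:GUEST", "movie:CONSIDERATION"}
q2 = {"director:CHRISTOPHER", "director:GUEST"}
q3 = {"actor:CHRISTOPHER", "actor:GUEST"}

print(jaccard(q1, q2))  # two shared interpretations out of three in the union
print(jaccard(q1, q3))  # no shared interpretations
```

With `q1` already selected, the less relevant but disjoint `q3` can outscore the more relevant `q2`, which is the balancing effect the slide describes.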


The Diversification Scheme

The Diversification algorithm
  Materializing the top-k relevant query interpretations
  The worst case: O(l*r)
  – l: the number of query interpretations in L
  – r: the number of query interpretations in the result list R
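
The greedy selection can be sketched as follows. This is an illustration under assumptions, not the algorithm as published: the scoring rule (a linear trade-off between relevance and average Jaccard similarity to the selected set), the function name `diversify`, and the parameter `trade_off` are hypothetical. Running similarity sums are kept per candidate so that each of the r rounds needs only one pass over the l candidates.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of keyword interpretations."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(candidates, k, trade_off=0.5):
    """Greedily pick k of the (relevance, interpretation_set) candidates."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    if not remaining or k <= 0:
        return []
    selected = [remaining.pop(0)]      # step 1: the most relevant interpretation
    sim_sums = [0.0] * len(remaining)  # summed similarity to everything selected so far
    while remaining and len(selected) < k:
        last = selected[-1][1]
        best_index, best_score = 0, float("-inf")
        for i, (relevance, interp) in enumerate(remaining):
            sim_sums[i] += jaccard(interp, last)  # account for the newest selection
            score = trade_off * relevance - (1 - trade_off) * sim_sums[i] / len(selected)
            if score > best_score:
                best_index, best_score = i, score
        selected.append(remaining.pop(best_index))
        sim_sums.pop(best_index)
    return selected


candidates = [
    (0.9, frozenset({"director:CHRISTOPHER", "director:GUEST", "movie:CONSIDERATION"})),
    (0.8, frozenset({"actor:CHRISTOPHER", "actor:GUEST", "movie:CONSIDERATION"})),
    (0.5, frozenset({"director:CHRISTOPHER", "director:GUEST"})),
    (0.4, frozenset({"actor:CHRISTOPHER", "actor:GUEST"})),
    (0.2, frozenset({"plot:CHRISTOPHER", "plot:GUEST"})),
]
print([rel for rel, _ in diversify(candidates, 3, trade_off=1.0)])
print([rel for rel, _ in diversify(candidates, 3, trade_off=0.1)])
```

With `trade_off=1.0` the selection reduces to plain relevance ranking; a small `trade_off` weights novelty heavily. The interpretation sets above are hypothetical encodings of the example query, chosen only to exercise the code.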


EVALUATION METRICS

α-nDCG-W
  Example relevance grades
    D1  D2  D3  D4  D5  D6
    3   2   3   0   1   2
  CGn (Cumulative Gain)
  – ex) 3+2+3+0+1+2 = 11
  DCGi (Discounted Cumulative Gain)
  – ex) DCG1 = 3, DCG2 = 3 + 2/log2(2) = 5, DCG3 = 3 + (2/log2(2) + 3/log2(3)) = 6.893
  nDCGi = DCGi / ideal DCGi
  α-nDCG
  – Views a document as a set of information nuggets n
    Counts how many documents containing n were seen before and discounts the gain of this document accordingly
  – if α = 0, it is the standard nDCG
  – with increasing α, novelty is rewarded with more credit
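
The worked numbers on this slide can be checked with a few lines of Python. The sketch follows the convention used in the slide's example, where the first gain is not discounted and the gain at rank i > 1 is divided by log2(i). The function names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: first gain as-is, later gains divided by log2(rank)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total


def ndcg(gains):
    """DCG normalised by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


gains = [3, 2, 3, 0, 1, 2]           # grades of D1..D6 from the slide
print(sum(gains))                    # cumulative gain
print(dcg(gains[:2]))                # DCG at rank 2
print(round(dcg(gains[:3]), 3))      # DCG at rank 3
print(ndcg(gains))                   # normalised DCG over all six documents
```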


EVALUATION METRICS

α-nDCG-W
  In databases
  – an information nugget n corresponds to a primary key pki
  The gain
  The overlap (overlap factor)
  – For each primary key pki in the result of Qk
    Count how many query interpretations with pki were seen before, and aggregate the counts
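
The gain formula itself is not preserved on this slide, so the sketch below is one plausible reading rather than the metric as defined by the authors. It assumes that the aggregated count of previously seen primary keys acts as an exponent on (1 - α), in the spirit of α-nDCG, and that this discount multiplies the relevance of the interpretation. The function name `overlap_adjusted_gains` and the example data are hypothetical.

```python
from collections import Counter


def overlap_adjusted_gains(relevances, results, alpha=0.5):
    """Gain per ranked interpretation, discounted by overlap with earlier results.

    relevances: relevance of each interpretation, in rank order.
    results: set of primary keys returned by each interpretation, in rank order.
    Assumption: gain = relevance * (1 - alpha) ** overlap.
    """
    seen = Counter()  # primary key -> number of earlier interpretations returning it
    gains = []
    for relevance, keys in zip(relevances, results):
        overlap = sum(seen[pk] for pk in keys)  # aggregate the per-key counts
        gains.append(relevance * (1 - alpha) ** overlap)
        seen.update(keys)
    return gains


relevances = [0.9, 0.8, 0.2]
results = [{1, 2, 3}, {2, 3}, {4}]   # invented primary keys
print(overlap_adjusted_gains(relevances, results, alpha=0.5))
print(overlap_adjusted_gains(relevances, results, alpha=0.0))
```

With α = 0 the discount disappears and the gains equal the plain relevances, matching the statement that α = 0 gives the standard nDCG.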


EVALUATION METRICS

Weighted S-Recall
  S-recall
  – Instance recall at rank k when search results are related to several subtopics
  – The number of unique subtopics covered by the first k results, divided by the total number of subtopics
  – A primary key corresponds to a subtopic in S-recall
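
Plain S-recall with primary keys as subtopics can be sketched as below. The weighting used by WS-recall is not shown on the slide and is therefore left out. The function name `s_recall_at_k` is illustrative, and taking the union of all supplied result sets as the total set of subtopics is an assumption made to keep the example self-contained.

```python
def s_recall_at_k(results, k):
    """Fraction of all primary keys covered by the first k result sets.

    results: one set of primary keys per ranked query interpretation.
    Assumption: the total set of subtopics is the union of all supplied results.
    """
    all_keys = set().union(*results)
    if not all_keys:
        return 0.0
    covered = set().union(*results[:max(k, 0)])
    return len(covered) / len(all_keys)


results = [{1, 2, 3}, {2, 3}, {4}]   # invented primary keys
for k in range(1, 4):
    print(k, s_recall_at_k(results, k))
```

The second interpretation adds no new primary keys, so recall stays flat between k = 1 and k = 2, which is the kind of redundancy diversification is meant to avoid.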


EXPERIMENTS

IMDB
  10,000,000 records
Lyrics
  400,000 records
Query logs
  MSN, AOL
  200 most frequent queries (single query)
  100 queries (complex queries)


EXPERIMENTS

User Study
  16 participants were asked to assess the relevance of query interpretations on a two-point Likert scale
  – top-25 interpretations


EXPERIMENTS

α-nDCG-W
  α = 0, 0.5, and 0.99


EXPERIMENTS WS-recall

Balancing Relevance and Novelty


CONCLUSION

We present an approach to search result diversification over structured data.
  – a probabilistic query disambiguation model
  – a query similarity measure
  – a greedy algorithm
Adaptations of established evaluation metrics are proposed.
  – α-nDCG-W and WS-recall
Evaluation results demonstrate the quality of the proposed model and show that, using our algorithms, the novelty of keyword search results over structured data can be substantially improved.
