
How Do Gain and Discount Functions Affect the Correlation between DCG and User Satisfaction?


Julián Urbano · Mónica Marrero

ECIR 2015

Vienna, March 30th

Discount d(i; k)              Gain g(r)
Zipfian:  1/i                 Linear:  r
Linear:   (k − i + 1)/k       Exp(2):  2^r − 1
Constant: 1                   Exp(3):  3^r − 1
Log(2):   1/log₂(i + 1)       Exp(5):  5^r − 1
Log(3):   1/log₃(i + 2)       Bin(1):  I(r ≥ 1)
Log(5):   1/log₅(i + 4)       Bin(2):  I(r ≥ 2)
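The two families above are simple to state in code. The sketch below (not the authors' implementation) evaluates each discount at a 1-based rank i with cutoff k, and each gain on the 3-point relevance scale r ∈ {0, 1, 2}; the string keys ("zipfian", "log2", "exp3", "bin2", …) are just illustrative names.

```python
# Sketch of the discount d(i; k) and gain g(r) families from the table above.
# Ranks i are 1-based; relevance r is on the 3-point scale {0, 1, 2}.
import math

def discount(name, i, k=5):
    """Discount d(i; k) at rank i for cutoff k."""
    if name == "zipfian":
        return 1 / i
    if name == "linear":
        return (k - i + 1) / k
    if name == "constant":
        return 1.0
    if name.startswith("log"):
        b = int(name[3:])                 # "log2", "log3", "log5"
        return 1 / math.log(i + b - 1, b)
    raise ValueError(name)

def gain(name, r):
    """Gain g(r) for relevance level r."""
    if name == "linear":
        return r
    if name.startswith("exp"):
        b = int(name[3:])                 # "exp2", "exp3", "exp5"
        return b ** r - 1
    if name.startswith("bin"):
        t = int(name[3:])                 # "bin1", "bin2"
        return 1 if r >= t else 0
    raise ValueError(name)
```

Note that every log discount equals 1 at rank 1 (the base b and the shift b − 1 are chosen so that log_b(1 + b − 1) = 1), which is why all discount curves start at the same point.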

[Figure: Discount functions — d(i) vs. rank i = 1…5 for Zipfian, Linear, Constant, Log(2), Log(3) and Log(5). Gain functions — g(r) vs. relevance r ∈ {0, 1, 2} for Linear, Exp(2), Exp(3), Exp(5), Bin(1) and Bin(2).]

[Diagram: the Cranfield paradigm. Real world: an information need, documents and an IR system, assessed by live observation (time to complete task, idle time, success rate, frustration, satisfaction, ease of use, ease of learning…). Cranfield abstraction: a topic, relevance judgments, documents and an IR system; the test collection is the static component, and effectiveness measures (Precision, Average Precision, Reciprocal Rank, Q-measure, Discounted Cumulative Gain, Rank-Biased Precision, Time-Biased Gain, GAP, ERR…) are the dynamic component.]

Which Gain and Discount for DCG are better at predicting user satisfaction?
• First, let's normalize DCG scores (this is not nDCG!)

• One system with DCG = φ. What does it mean?
• Intuition: φ·100% of users will be satisfied, i.e. P(Sat|DCG=φ) = φ

• Two systems with ΔDCG = Δφ. What does it mean?
• Intuition: users will prefer the (supposedly) better one, i.e. P(Pref|ΔDCG=Δφ) = 1

P(Sat) and P(Pref) depend on the systems, not on how we evaluate them. Yet, there are many different ways to compute effectiveness.
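The normalization above can be sketched as follows. A minimal sketch, assuming the score is divided by that of a hypothetical ranking where every document has the maximum relevance level (consistent with "this is not nDCG", which would instead reorder the actual documents); linear gain and the Log(2) discount are used here purely for illustration.

```python
# Sketch of the [0,1]-normalized DCG: divide by the score of a hypothetical
# ranking in which every document has the maximum relevance r_max, so that a
# score of φ can be read as "φ·100% of users satisfied".
# Linear gain and Log(2) discount are one of the studied combinations.
import math

def norm_dcg(relevances, r_max=2):
    k = len(relevances)
    d = lambda i: 1 / math.log2(i + 1)   # Log(2) discount
    g = lambda r: r                      # linear gain
    actual = sum(g(r) * d(i) for i, r in enumerate(relevances, start=1))
    ceiling = sum(g(r_max) * d(i) for i in range(1, k + 1))
    return actual / ceiling

print(norm_dcg([2, 2, 2]))   # all maximally relevant -> 1.0
print(norm_dcg([0, 0, 0]))   # all non-relevant -> 0.0
```

Under this normalization, every system scores in [0, 1] regardless of gain and discount, which is what makes the mapping onto P(Sat) and P(Pref) comparable across the different function choices.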

Experiment
• Collect user preferences between two systems
• Map DCG onto P(Sat)
• Map ΔDCG onto P(Pref)

• Music recommendation task
  • Ad-hoc, informational, enjoyable by assessors
  • Preferences less confounded by interface effects
• All data from MIREX (TREC-like for Music IR)
  • Datasets from 2007–2012
  • 3-point relevance scale: {0, 1, 2}
• 4115 examples
  • Uniformly covering the [0, 1] range of ΔDCG
  • 432 unique queries
  • 5636 unique documents
• Crowdsourced with Crowdflower
  • Trap examples for quality control
  • 113 subjects

[Figure: P(Pref|ΔDCG) — probability that users agree vs. difference in DCG, one panel per discount (Zipfian, Linear, Log(2), Log(3), Log(5), Constant), with one curve per gain (Linear, Exp(2), Exp(3), Exp(5), Bin(1), Bin(2)); both axes span 0.0–1.0.]

[Figure: P(Sat|DCG) — probability of user satisfaction vs. DCG, one panel per discount (Zipfian, Linear, Log(2), Log(3), Log(5), Constant), with one curve per gain (Linear, Exp(2), Exp(3), Exp(5), Bin(1), Bin(2)); both axes span 0.0–1.0.]

Results (1 system): DCG predicting user satisfaction
Results (2 systems): ΔDCG predicting user preference

Results: bias of Gain and Discount functions

• Diagonal: how far is P(Sat|DCG) from the ideal diagonal?
  • Intuitiveness of DCG scores
• Endpoint: how far is P(Sat|DCG) from the ideal 0% and 100%?
  • User disagreement and goodness of the DCG user model
• Top: how far is P(Pref|ΔDCG) from the ideal 100%?
  • Discriminative power
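The three bias summaries above can be sketched numerically. This is a hypothetical sketch, not the paper's exact estimators: given an estimated curve p_sat(φ) for P(Sat|DCG=φ) and p_pref(δ) for P(Pref|ΔDCG=δ), each bias measures the distance from the corresponding ideal behaviour.

```python
# Hypothetical sketch of the three bias summaries (the paper's exact
# estimators may differ): distance of the estimated mappings from the
# ideal diagonal, ideal endpoints, and ideal 100% agreement at the top.

def diagonal_bias(p_sat, grid):
    """Mean |P(Sat|DCG=phi) - phi| over a grid of phi values in [0, 1]."""
    return sum(abs(p_sat(phi) - phi) for phi in grid) / len(grid)

def endpoint_bias(p_sat):
    """How far the curve is from the ideal 0% at phi=0 and 100% at phi=1."""
    return (abs(p_sat(0.0) - 0.0) + abs(p_sat(1.0) - 1.0)) / 2

def top_bias(p_pref):
    """How far P(Pref|dDCG=1) is from the ideal 100%."""
    return 1.0 - p_pref(1.0)

# Toy curve that under-predicts satisfaction by a constant factor:
grid = [i / 10 for i in range(11)]
curve = lambda phi: 0.8 * phi            # hypothetical estimated mapping
print(diagonal_bias(curve, grid))        # about 0.1 for this toy curve
```

An ideal measure would score zero on all three; the bar charts below show that no gain/discount combination does, and that the Top bias in particular stays far from zero for every combination.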

Summary and Implications

• New method to map system effectiveness onto user satisfaction
  • Sample application to DCG for a music recommendation task
• Gain functions that emphasize highly relevant documents underestimate user satisfaction; linear gain is better than exponential
• All discount functions bias the prediction of user satisfaction
  • This task might be too enjoyable to observe a discount effect
• Size (of the DCG difference) does matter
  • Non-parametric statistics (e.g. the sign test or Wilcoxon test) and just looking at the ranking of systems (e.g. Kendall τ) oversimplify the evaluation problem
  • Zero-point null hypothesis testing (i.e. H₀: ΔDCG = 0) is not reasonable
• Future work will investigate this method for Text IR
  • Provide a common framework, based on P(Sat) and P(Pref), to evaluate with informational and navigational queries using appropriate measures

Data and code available online

DCG@k = [ Σ_{i=1}^{k} g(r_i)·d(i; k) ] / [ Σ_{i=1}^{k} g(r_max)·d(i; k) ]

[Figure: bar charts of the three biases, one bar per gain (Bin(2), Bin(1), Exp(5), Exp(3), Exp(2), Linear) and per discount (Zipfian, Linear, Log(2), Log(3), Log(5), Constant). Diagonal bias ranges over ≈0.06–0.14, Endpoint bias over ≈0.14–0.22, and Top bias over ≈0.46–0.54.]