
Click Model-Based Information Retrieval Metrics


Slides from the SIGIR 2013 talk. The full paper can be found here: http://staff.science.uva.nl/~mdr/Publications/sigir2013-metrics.pdf

ABSTRACT: In recent years many models have been proposed that are aimed at predicting clicks of web search users. In addition, some information retrieval evaluation metrics have been built on top of a user model. In this paper we bring these two directions together and propose a common approach to converting any click model into an evaluation metric. We then put the resulting model-based metrics as well as traditional metrics (like DCG or Precision) into a common evaluation framework and compare them along a number of dimensions. One of the dimensions we are particularly interested in is the agreement between offline and online experimental outcomes. It is widely believed, especially in an industrial setting, that online A/B-testing and interleaving experiments are generally better at capturing system quality than offline measurements. We show that offline metrics that are based on click models are more strongly correlated with online experimental outcomes than traditional offline metrics, especially in situations when we have incomplete relevance judgements.


Page 1: Click Model-Based Information Retrieval Metrics

Click Model-Based Information Retrieval Metrics

Aleksandr Chuklin*¹,², Pavel Serdyukov¹, Maarten de Rijke²

¹ Yandex, Moscow, Russia

² ISLA, University of Amsterdam, The Netherlands

SIGIR 2013, Dublin, Ireland

* Now at Google Switzerland

1 / 24

Page 2: Click Model-Based Information Retrieval Metrics

§ IR Metrics Overview

§ Click Model-Based Metrics

§ Analysis of the New Metrics

2 / 24


Page 5: Click Model-Based Information Retrieval Metrics

Classification of IR evaluation techniques

Offline Metrics

§ Traditional: Precision, nDCG, DCG, MAP

§ Click Model-Based: uSDBN, ERR (Chapelle et al., 2009), EBU (Yilmaz et al., 2010), rrDBN, uDCM, rrDCM, uUBM

Online Experiments

§ Absolute Metrics: MaxRR, MinRR, MeanRR, UCTR, QCTR, PLC

§ Interleaving: Team-Draft Interleaving, Balanced Interleaving

3 / 24

Page 6: Click Model-Based Information Retrieval Metrics

Offline metrics

§ Fixed set of queries Q

§ Documents are assessed by human judges using graded relevance R \in \{0, 1, \ldots, R_{max}\}

SystemQuality = \frac{1}{|Q|} \sum_{q \in Q} Utility(q)

§ Where Utility usually has the following form:

Utility(q) = \sum_{i=1}^{N} decay_i \cdot R(doc_i)
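As a concrete illustration, here is a minimal sketch (not from the slides) that instantiates decay_i as the DCG decay 1 / log2(i + 1); the relevance grades in the example are made-up numbers.

```python
import math

def dcg_utility(grades, N=10):
    """Utility(q) = sum_i decay_i * R(doc_i), with the DCG decay 1 / log2(i + 1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(grades[:N], start=1))

def system_quality(grades_per_query):
    """SystemQuality = average per-query utility over the fixed query set Q."""
    return sum(dcg_utility(g) for g in grades_per_query) / len(grades_per_query)

# Hypothetical graded judgements (0..Rmax) for two queries, listed in ranked order:
print(system_quality([[3, 2, 0, 1], [0, 1, 1, 0]]))
```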

4 / 24

Page 7: Click Model-Based Information Retrieval Metrics

Click metrics: DBN

Example: DBN click model (Chapelle and Zhang, 2009)

[Slide shows an excerpt from Chapelle and Zhang (2009), WWW 2009, including Figure 1: the Dynamic Bayesian Network used for click modeling, in which C_i is the only observed variable.]

§ C_i — user clicked the i-th document

§ E_i — user examined the i-th document

§ A_i — user was attracted by the i-th document

§ S_i — user was satisfied by the i-th document

C_k = 1 \Leftrightarrow A_k = 1 \text{ and } E_k = 1

P(A_k = 1) = a_q(u_k)

P(S_k = 1 \mid C_k = 0) = 0

P(S_k = 1 \mid C_k = 1) = s_q(u_k)

E_{k+1} = 1 \Leftrightarrow E_k = 1 \text{ and } S_k = 0
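A minimal sketch of how these equations yield per-rank click and satisfaction probabilities, under the simplified case above where the user always continues unless satisfied; the attractiveness and satisfaction parameters are invented for illustration.

```python
def dbn_probabilities(attractiveness, satisfaction):
    """Per-rank P(C_k = 1) and P(S_k = 1) under the simplified DBN equations above."""
    p_exam = 1.0                      # P(E_1 = 1): the top result is always examined
    p_click, p_sat = [], []
    for a_k, s_k in zip(attractiveness, satisfaction):
        c = p_exam * a_k              # C_k = 1  <=>  E_k = 1 and A_k = 1
        s = c * s_k                   # satisfaction is only possible after a click
        p_click.append(c)
        p_sat.append(s)
        p_exam *= 1.0 - a_k * s_k     # E_{k+1} = 1  <=>  E_k = 1 and S_k = 0
    return p_click, p_sat

# Hypothetical parameters for a three-document result page:
print(dbn_probabilities([0.8, 0.5, 0.4], [0.6, 0.3, 0.2]))
```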

5 / 24

Page 8: Click Model-Based Information Retrieval Metrics

Converting click model into metric

§ a_q(u_k) \to a_q(R_k), \quad s_q(u_k) \to s_q(R_k)

§ Compute the click probabilities P(C_k = 1) and satisfaction probabilities P(S_k = 1)

§ Use the following equations for utility-based and effort-based (reciprocal rank) metrics (similar to (Carterette, 2011)):

uMetric = \sum_{k=1}^{N} P(C_k = 1) \cdot R_k \quad (utility-based)

rrMetric = \sum_{k=1}^{N} P(S_k = 1) \cdot \frac{1}{k} \quad (effort-based)

Implementation: https://github.com/varepsilon/clickmodels
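The authors' implementation lives in the repository above; the following is only an illustrative sketch of the recipe, assuming the ERR-style mapping a(R) = s(R) = (2^R - 1) / 2^Rmax from relevance grades to probabilities (the mapping choice is an assumption, not taken from the slides).

```python
def err_mapping(grade, max_grade=4):
    """Assumed mapping from a relevance grade R in {0..max_grade} to a probability."""
    return (2 ** grade - 1) / 2 ** max_grade

def model_based_metrics(grades, max_grade=4):
    """uMetric = sum_k P(C_k=1) * R_k and rrMetric = sum_k P(S_k=1) / k,
    under a simplified DBN with attractiveness = satisfaction = err_mapping(R)."""
    p_exam, u_metric, rr_metric = 1.0, 0.0, 0.0
    for k, grade in enumerate(grades, start=1):
        a = s = err_mapping(grade, max_grade)
        p_click = p_exam * a          # P(C_k = 1)
        p_sat = p_click * s           # P(S_k = 1)
        u_metric += p_click * grade   # utility-based
        rr_metric += p_sat / k        # effort-based (reciprocal rank)
        p_exam *= 1.0 - a * s         # the user continues only if not satisfied
    return u_metric, rr_metric

# Hypothetical ranked list of relevance grades:
print(model_based_metrics([4, 2, 0, 1]))
```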

6 / 24

Page 9: Click Model-Based Information Retrieval Metrics

Click model-based metrics and their underlying models

                                     Derived metric
Underlying click model               Utility-based   Effort-based
SDBN (Chapelle and Zhang, 2009)      uSDBN           ERR
DBN (Chapelle and Zhang, 2009)       EBU             rrDBN
DCM (Guo et al., 2009)               uDCM            rrDCM
UBM (Dupret and Piwowarski, 2008)    uUBM            –

Previous work:

§ ERR, uSDBN (Chapelle et al., 2009)

§ EBU (Yilmaz et al., 2010)

7 / 24

Page 10: Click Model-Based Information Retrieval Metrics

Evaluating the metrics

§ Correlation with other metrics

§ Correlation with click metrics

§ Correlation with interleaving

Hypothesis

Model-based metrics should be better correlated with online user metrics.

8 / 24

Page 11: Click Model-Based Information Retrieval Metrics

Aspect one: comparison to other metrics

Table: TREC 2011 runs, Kendall tau correlation. Values higher than 0.9 are marked with an asterisk.

            Precision2   DCG     ERR     uSDBN   EBU     rrDBN   uDCM    rrDCM   uUBM
Precision   0.649        0.841   0.597   0.730   0.568   0.397   0.562   0.442   0.537
Precision2  –            0.785   0.663   0.780   0.675   0.526   0.693   0.551   0.681
DCG         –            –       0.740   0.857   0.711   0.530   0.704   0.592   0.685
ERR         –            –       –       0.807   0.919*  0.754   0.902*  0.826   0.888
uSDBN       –            –       –       –       0.792   0.585   0.794   0.638   0.754
EBU         –            –       –       –       –       0.788   0.970*  0.822   0.930*
rrDBN       –            –       –       –       –       –       0.786   0.917*  0.807
uDCM        –            –       –       –       –       –       –       0.813   0.947*
rrDCM       –            –       –       –       –       –       –       –       0.841

9 / 24

Page 12: Click Model-Based Information Retrieval Metrics

Model-based metrics

Hypothesis

Model-based metrics should be better correlated with online user metrics.

10 / 24

Page 13: Click Model-Based Information Retrieval Metrics

Aspect two: absolute online metrics

Table: Pearson correlation between offline and absolute click metrics. Superscripts show statistically significant difference from ERR and EBU.

            MaxRR      MinRR      MeanRR     UCTR       PLC
Precision   −0.117     −0.163     −0.155      0.042     −0.027
Precision2   0.026      0.093      0.075      0.092      0.094
DCG          0.178      0.243      0.237      0.163      0.245
ERR          0.378      0.471      0.469      0.199      0.399
EBU          0.374      0.467      0.464      0.198      0.397
rrDBN        0.384▲▲    0.475▲▲    0.473▲▲    0.194▽▽    0.399−▲
rrDCM        0.387▲▲    0.478▲▲    0.476▲▲    0.194▽▽    0.400−▲
uSDBN        0.322▽▽    0.412▽▽    0.407▽▽    0.206▲▲    0.370▽▽
uDCM         0.374▽▽    0.466▽▽    0.463▽▽    0.198−−    0.396▽▽
uUBM         0.377−▲    0.469▽▲    0.467▽▲    0.198−−    0.398−▲

11 / 24

Page 14: Click Model-Based Information Retrieval Metrics

Aspect three: interleaving

[Slide shows an excerpt from "Large Scale Validation and Analysis of Interleaved Search Evaluation" (Chapelle et al.), including Fig. 1: examples illustrating how Balanced and Team-Draft Interleaving combine input rankings A and B over different randomizations, with superscripts in the Team-Draft interleavings indicating team membership.]

12 / 24

Page 15: Click Model-Based Information Retrieval Metrics

Interleaving vs. offline metrics

§ 10 Team-Draft Interleaving experiments \Delta_i^{AB}

§ For each experiment compute TdiSignal = \frac{Win_B}{Win_A + Win_B} - \frac{1}{2}

§ Judged query-document pairs are matched against the click log, giving a set of queries Q (|Q| \sim 10^2 \ldots 10^3); some documents may be unjudged (up to #unjudged docs per query)

§ For each metric compute:

MetricSignal = \frac{1}{|Q'|} \sum_{q \in Q'} \left( Metric_B(q) - Metric_A(q) \right),

where Q' = \{ q \in Q \mid Metric_B(q) \neq Metric_A(q) \}

§ Compare MetricSignal to TdiSignal using Pearson correlation (similar to (Radlinski and Craswell, 2010))
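A sketch of this comparison protocol; the data layout and names are hypothetical, and per-experiment win counts and per-query metric values are assumed to be available.

```python
from statistics import mean, correlation  # statistics.correlation needs Python 3.10+

def tdi_signal(win_a, win_b):
    """TdiSignal = Win_B / (Win_A + Win_B) - 1/2 for one interleaving experiment."""
    return win_b / (win_a + win_b) - 0.5

def metric_signal(metric_a, metric_b):
    """MetricSignal: mean of Metric_B(q) - Metric_A(q) over queries where they differ.
    metric_a, metric_b: dicts mapping query -> metric value."""
    diffs = [metric_b[q] - metric_a[q] for q in metric_a if metric_b[q] != metric_a[q]]
    return mean(diffs) if diffs else 0.0

# Over the 10 experiments, collect one (MetricSignal, TdiSignal) pair each and compare:
# pearson = correlation(metric_signals, tdi_signals)
```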

13 / 24

Page 16: Click Model-Based Information Retrieval Metrics

Interleaving vs. offline metrics

[Plot: "Simple Metrics": correlation with interleaving (y-axis, −0.4 to 1.0) against #unjudged (x-axis, 0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM and uUBM.]

Figure: Unjudged documents considered irrelevant

14 / 24

Page 17: Click Model-Based Information Retrieval Metrics

Making use of unjudged documents

[Plot: "Condensed Metrics": correlation with interleaving against #unjudged (0 to 10) for the same set of metrics.]

Figure: Method by Sakai, T., "Alternatives to Bpref", SIGIR 2007: unjudged documents are skipped (the result page is condensed)
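A minimal sketch of the condensation step (document ids and judgements are hypothetical): the ranked list is filtered down to judged documents before any offline metric is computed.

```python
def condense(ranked_docs, judgements):
    """Keep only judged documents, preserving their order; returns their grades."""
    return [judgements[d] for d in ranked_docs if d in judgements]

# d7 is unjudged, so it is skipped instead of being treated as irrelevant:
print(condense(["d1", "d7", "d3"], {"d1": 3, "d3": 1}))  # -> [3, 1]
```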

15 / 24

Page 18: Click Model-Based Information Retrieval Metrics

Thresholds

§ Modify the offline metric usage protocol. Introduce a threshold δ:

MetricSignal = \frac{1}{|Q_\delta|} \sum_{q \in Q_\delta} \left( Metric_B(q) - Metric_A(q) \right),

where Q_\delta = \{ q \in Q \mid |Metric_B(q) - Metric_A(q)| > \delta \}

§ Choose the threshold to maximize correlation with interleaving

§ Use 5 experiments to tune the threshold and 5 experiments to test. Repeat for each possible 5/5 split (C_{10}^{5} = 252 splits in total)
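A sketch of this tuning protocol; the data layout is hypothetical, with each experiment providing per-query metric values for both systems and its interleaving signal.

```python
from itertools import combinations
from statistics import mean, correlation  # statistics.correlation needs Python 3.10+

def thresholded_signal(metric_a, metric_b, delta):
    """MetricSignal restricted to queries with |Metric_B(q) - Metric_A(q)| > delta."""
    diffs = [metric_b[q] - metric_a[q] for q in metric_a
             if abs(metric_b[q] - metric_a[q]) > delta]
    return mean(diffs) if diffs else 0.0

def tuned_correlation(experiments, deltas):
    """experiments: list of dicts with 'metric_a', 'metric_b' (query -> value) and
    'tdi' (TdiSignal).  For every 5/5 split, pick the delta maximising correlation
    on the tuning half, evaluate it on the test half, and average the results."""
    def corr(ids, delta):
        xs = [thresholded_signal(experiments[i]['metric_a'],
                                 experiments[i]['metric_b'], delta) for i in ids]
        ys = [experiments[i]['tdi'] for i in ids]
        return correlation(xs, ys)

    scores = []
    for tune in combinations(range(len(experiments)), 5):
        test = [i for i in range(len(experiments)) if i not in tune]
        best_delta = max(deltas, key=lambda d: corr(tune, d))
        scores.append(corr(test, best_delta))
    return mean(scores)  # averaged over all C(10, 5) = 252 splits
```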

16 / 24

Page 19: Click Model-Based Information Retrieval Metrics

Thresholds

[Plot: "Thresholded Metrics": correlation with interleaving against #unjudged (0 to 10) for the same set of metrics.]

17 / 24

Page 20: Click Model-Based Information Retrieval Metrics

Thresholds+condensation

[Plot: "Thresholded Condensed Metrics": correlation with interleaving against #unjudged (0 to 10) for the same set of metrics.]

18 / 24

Page 21: Click Model-Based Information Retrieval Metrics

All in one

[Four plots side by side: "Simple Metrics", "Condensed Metrics", "Thresholded Metrics" and "Thresholded Condensed Metrics": correlation with interleaving against #unjudged (0 to 10) for Precision, Precision2, DCG, uSDBN, ERR, EBU, rrDBN, uDCM, rrDCM and uUBM.]

19 / 24

Page 22: Click Model-Based Information Retrieval Metrics

Summary

§ A recipe for turning a click model into a metric

§ Two families of metrics: utility-based and effort-based

§ Multi-aspect analysis of the metrics

20 / 24

Page 23: Click Model-Based Information Retrieval Metrics

Key results

§ Effort-based metrics are substantially different from utility-based ones, even when based on the same user model

§ Model-based metrics show better agreement with interleaving and deal better with unjudged documents

§ Using techniques such as condensation and thresholding we can improve agreement with interleaving

21 / 24

Page 24: Click Model-Based Information Retrieval Metrics

What’s next?

§ Judging snippets. Drop the assumption that snippet attractiveness is a function of document relevance, as was assumed by the click model-based metrics

§ Good abandonments. Modify any evaluation metric by adding additional gain from snippets that contain an answer to the user's information need

22 / 24

Page 25: Click Model-Based Information Retrieval Metrics

23 / 24

Page 26: Click Model-Based Information Retrieval Metrics

Bibliography

B. Carterette. System effectiveness, user models, and user utility: a conceptual framework for investigation. In SIGIR. ACM, 2011.

O. Chapelle and Y. Zhang. A dynamic Bayesian network click model for web search ranking. In WWW. ACM, 2009.

O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM. ACM, 2009.

G. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR. ACM, 2008.

F. Guo, C. Liu, and Y. Wang. Efficient multiple-click models in web search. In WSDM. ACM, 2009.

F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. In SIGIR. ACM, 2010.

E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM. ACM, 2010.

24 / 24