
Page 1: Online Learning to Rank


Online Learning to Rank

by Edward W Huang (ewhuang3) and Yerzhan Suleimenov (suleime1)

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

Page 2: Online Learning to Rank


Introduction

Page 3: Online Learning to Rank

What is learning to rank?

• Many information retrieval problems are ranking problems

• Also known as machine-learned ranking

– Uses machine learning techniques to create ranking models

• Training data: queries and documents matched with relevance judgements

– Model sorts objects by relevance, preference, or importance

– Finds an optimal combination of features (see the sketch below)
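
Below is a minimal sketch of this idea in Python, assuming a simple linear scoring model; the documents, feature values, and weights are invented for illustration, and a real learning-to-rank method would fit the weights from relevance judgements rather than fixing them by hand.

import numpy as np

# Hypothetical query-document feature vectors: [text match score, title match, link-based score]
documents = {
    "doc_a": np.array([2.1, 1.0, 0.3]),
    "doc_b": np.array([1.4, 0.0, 0.9]),
    "doc_c": np.array([2.8, 1.0, 0.1]),
}

# Feature weights a learning-to-rank method would normally learn from training data.
weights = np.array([0.6, 0.3, 0.1])

# The ranking model scores each document with the weighted feature combination
# and sorts by descending score.
scores = {doc_id: float(weights @ features) for doc_id, features in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc_c', 'doc_a', 'doc_b']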

Page 4: Online Learning to Rank

Applications of learning to rank

• Ranking problems in information retrieval

– Document retrieval

– Sentiment analysis

– Product rating

– Anti-spam measures

– Search engines

• Many more applications, not just in information retrieval!

– Machine translation

– Computational biology

Page 5: Online Learning to Rank

Online vs. offline learning to rank

• Training set is produced by human assessors (offline)

– Time-consuming and expensive to produce

– Not always in line with actual user preferences

• Data from users interacting with the system (online)

– Users leave a trace of interaction data: query reformulations, mouse movements, clicks, etc.

– Clicks especially valuable when interpreted as preferences

Page 6: Online Learning to Rank

Big issue with online learning to rank

• Exploration-exploitation dilemma

– Have to obtain feedback to improve the system, while also utilizing past models to optimize result quality

– Discuss solutions later

Page 7: Online Learning to Rank


Creating Ranking Models

Page 8: Online Learning to Rank

Ranking model training framework

• Discriminative training attributes

– Input space

– Output space

– Hypothesis space

– Loss function

• The ranking model learns to predict the ground-truth labels in the training set, with accuracy measured by the loss function

• Test phase: when a new query arrives, the trained ranking model sorts documents according to their relevance to the query (a toy sketch of these components follows)
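
The sketch below walks through the four attributes on a toy pointwise setup: feature vectors form the input space, graded relevance labels the output space, linear scoring functions the hypothesis space, and squared error the loss function. All numbers are invented, and real systems use richer hypothesis spaces and losses.

import numpy as np

# Input space: query-document feature vectors. Output space: graded relevance labels.
X = np.array([[2.1, 1.0], [1.4, 0.0], [2.8, 1.0], [0.5, 0.0]])
y = np.array([2.0, 0.0, 3.0, 0.0])

# Hypothesis space: linear scoring functions f(x) = w . x
# Loss function: mean squared error between predicted scores and ground-truth labels.
w = np.zeros(2)
learning_rate = 0.01
for _ in range(500):                             # plain gradient descent on the loss
    gradient = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * gradient

# Test phase: when a new query arrives, score its candidate documents and sort them.
X_new = np.array([[1.0, 1.0], [2.0, 0.0]])
print(np.argsort(-(X_new @ w)))                  # document indices, best first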

Page 9: Online Learning to Rank

Algorithms for learning-to-rank problems

• Categorized into three groups by their framework (input representation and loss function)

– Pointwise

– Pairwise

– Listwise

T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3): 225–331, 2009.

Page 10: Online Learning to Rank

Limitations of the pointwise approach

• Does not consider interdependency among documents

• Does not make use of the fact that some documents are associated with the same query

• Most IR evaluation measures are query-level and position-based

Page 11: Online Learning to Rank

Pairwise and listwise

• Potential solutions to the previously mentioned exploration-exploitation dilemma

• Pairwise approach

– Input: pairs of documents with labels identifying which one is preferred

– Learns a classifier to predict these labels

• Listwise approach

– Input: entire document list associated with a certain query

– Directly optimizes evaluation measures, e.g., Normalized Discounted Cumulative Gain (NDCG; sketched below)

Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. "Balancing Exploration and Exploitation in Listwise and Pairwise Online Learning to Rank for Information Retrieval." Information Retrieval 16.1 (2012): 63-90.
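
As a concrete example of the kind of listwise evaluation measure mentioned above, here is a short sketch of DCG and NDCG on invented relevance grades; production implementations differ mainly in gain and cutoff conventions.

import math

def dcg(relevances):
    # Discounted Cumulative Gain: gains decay with the log of the rank position.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of documents in the order a ranker returned them (invented data).
print(ndcg([3, 2, 0, 1]))  # close to 1.0, since only the last two documents are swapped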

Page 12: Online Learning to Rank

Absolute and relative feedback approaches

• Use feedback to learn personalized rankings

• Absolute feedback: contextual bandit learning

• Relative feedback: gradient methods and inferred preferences between complete result rankings

• Relative is usually better

– Robust to noisy feedback

– Deals with larger document spaces

Chen, Yiwei, and Katja Hofmann. "Online Learning to Rank: Absolute vs. Relative." Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion (2015).

Page 13: Online Learning to Rank


State of the Art Learning

Page 14: Online Learning to Rank

Improving learning performance

• Search engine clicks are useful, but might be biased

– Bias might come from attractive titles, snippets, or captions

• Method to detect and compensate for caption bias

– Enables reweighting of clicks based on how likely each click is to be caption-biased

– Clicks on attractive captions are given less credit as relevance signals

K. Hofmann, F. Behr, and F. Radlinski. On caption bias in interleaving experiments. In Proc. of CIKM, 2012.

Page 15: Online Learning to Rank

Handling caption bias

• Allow weighting of clicks based on the likelihood that each click is caption-biased

• Model click probability as a function of position, relevance, and caption bias

– Visual characteristics of individual documents

– Pairwise features capture relationships with neighboring documents

• Learn model weights from past user behavior

• Remove caption bias to obtain an evaluation that better reflects relevance (rough sketch below)
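
The sketch below illustrates the reweighting idea with a simple logistic model over position, relevance, and caption-attractiveness features; the feature values and weights are invented, and the feature set and estimation procedure in the cited work are considerably richer.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-click features: [rank position, estimated relevance, caption attractiveness] (invented).
clicks = np.array([
    [1, 0.9, 0.2],
    [2, 0.1, 0.9],   # likely caption-biased: low relevance, very attractive caption
    [3, 0.7, 0.3],
])

# Weights such a model would learn from past user behavior; fixed here for illustration.
w_bias = np.array([0.0, -1.5, 2.5])

# Estimated probability that each click was driven by the caption rather than relevance;
# clicks are down-weighted accordingly before being used as relevance feedback.
p_caption = sigmoid(clicks @ w_bias)
click_weights = 1.0 - p_caption
print(click_weights)   # the second (caption-biased) click receives the lowest weight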

Page 16: Online Learning to Rank

Improving learning speed

• Search engine clicks can be interpreted using interleaved comparison methods (two learning methods that build on them are contrasted below)

– Reliably infer preferences between pairs of rankers

• Dueling bandit gradient descent learns from these comparisons (sketched below)

– Requires a separate user interaction to compare the current ranker with each exploratory ranker

• Multileave gradient descent learns from comparisons of multiple rankers at once

– Uses a single user interaction

– Fast

Schuth, Anne, Harrie Oosterhuis, Shimon Whiteson, and Maarten De Rijke. "Multileave Gradient Descent for Fast Online Learning to Rank." Proceedings of the Ninth ACM International Conference on Web Search and Data Mining - WSDM '16 (2016).
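
Below is a compact sketch of the dueling bandit gradient descent loop over linear ranker weights. The interleaved_comparison argument is a stand-in for a real interleaving experiment with users; here it is simulated against a hidden "true" ranker, and the step sizes are arbitrary choices for this toy example.

import numpy as np

def dbgd(interleaved_comparison, dim, steps=2000, delta=0.1, alpha=0.01, seed=0):
    # interleaved_comparison(w_current, w_candidate) should return True when
    # interleaved clicks prefer the candidate ranker over the current one.
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)                              # current ranker weights
    for _ in range(steps):
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)                     # random exploratory direction
        w_candidate = w + delta * u                # exploratory ranker
        if interleaved_comparison(w, w_candidate):
            w = w + alpha * u                      # small step towards the winner
    return w

# Simulated user feedback: prefer the candidate when it is closer to a hidden ideal ranker.
w_true = np.array([0.6, 0.3, 0.1])
prefers_candidate = lambda w_cur, w_cand: np.linalg.norm(w_cand - w_true) < np.linalg.norm(w_cur - w_true)
print(dbgd(prefers_candidate, dim=3))              # ends up near the hidden w_true

Multileave gradient descent follows the same pattern but compares several candidate directions in a single multileaved result list, which is what makes it faster per user interaction.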

Page 17: Online Learning to Rank


Evaluating Rankers

Page 18: Online Learning to Rank

How to evaluate rankers?

• After training a ranker, we need to find out how effective it is

• Offline evaluation methods

– Dependent on explicit expert judgements

– Not feasible in practice

• Online evaluation methods

– Leverage online data that can reflect ranker quality

– Click-based ranker evaluation (discussed next)

• State-of-the-art software: Lerot

– Evaluates different algorithms

– Can simulate user clicking behaviour with user models

Schuth, Anne, Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. "Lerot." Proceedings of the 2013 Workshop on Living Labs for Information Retrieval Evaluation - LivingLab '13 (2013).

Page 19: Online Learning to Rank

Click-based ranker evaluation

• Online evaluation strategy based on clickthrough data

• Independent of expert judgments, unlike conventional evaluation methods

– The measure reflects the interest of an actual user rather than that of an expert providing relevance judgements

Page 20: Online Learning to Rank

Challenges of using clickthrough data

• Handling presentation bias

– Design user interface with three features

• Blind test: the hypothesis test and its randomization stay hidden from the user

• Click to preference: a user's click should reflect their actual relevance judgment

• Low usability impact: the interface should remain interactive and user-friendly

• Identifying the superior of two rankers

– Unified user interface that sends user query to both rankers

– Mix two ranking results (discussed next)

– Show the combined ranking to the user and record which results are clicked

T. Joachims, Evaluating Retrieval Performance Using Clickthrough Data, in J. Franke and G. Nakhaeizadeh and I. Renz, "Text Mining", Physica/Springer Verlag, pp 79-96, 2003.

Page 21: Online Learning to Rank

Mixing two ranking results

• Also known as interleaving

• The key is to mix results so that both rankers are fairly represented in the top n listings

• Algorithms vary in mixing strategy

– Balanced Interleaving

– Team-Draft Interleaving (sketched below)
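
Here is a small sketch of one common formulation of Team-Draft Interleaving: in each round the two rankers take turns, in random order, contributing their highest-ranked document not yet in the combined list, and later clicks are credited to the team that contributed the clicked document. The rankings are invented.

import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    combined, team_a, team_b = [], set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(combined) < len(all_docs):
        # Randomize which ranker picks first in this round.
        teams = [(ranking_a, team_a), (ranking_b, team_b)]
        rng.shuffle(teams)
        for ranking, team in teams:
            pick = next((d for d in ranking if d not in combined), None)
            if pick is not None:
                combined.append(pick)
                team.add(pick)
    return combined, team_a, team_b

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d5", "d2"]
combined, team_a, team_b = team_draft_interleave(ranking_a, ranking_b)
print(combined)          # the mixed list shown to the user
print(team_a, team_b)    # clicked documents count as votes for the team that picked them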

Page 22: Online Learning to Rank

Leveraging click responses from mixed rankings

• Each click represents the user's preference for the ranker that contributed the clicked link

• Thus, how clicks are leveraged is critical

– The aggregation scheme acts as the test statistic

– Essential for reliable evaluation of rankers

• One basic approach is to assign equal weights to all clicks

– Suboptimal since not all clicks are equally significant

– Caption bias!

• More advanced test statistics are discussed next

Page 23: Online Learning to Rank

Test statistics for evaluation

• Learn click weights to maximize the mean score difference between the better and the worse ranker

• Optimize the statistical power of a z-test by maximizing the z-score (equivalently, minimizing the p-value); see the z-test sketch below

– Removes the assumption of equal variance of weights

• Learns to invert Wilcoxon Signed-Rank Test

– Produces scoring function to optimize Wilcoxon test

• Max mean difference performs the worst

• Inverse z-test performs the best

Yisong Yue, Yue Gao, O. Chapelle, Ya Zhang, T. Joachims, Learning more powerful test statistics for click-based retrieval evaluation, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2010.
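
As a rough illustration of the statistical test being optimized (not of the learning procedure itself), the sketch below computes a z-score over per-query click-credit differences between two interleaved rankers; the data are invented, and every click is given equal weight here.

import math

# Per-query difference in click credit (clicks on ranker B's documents minus clicks on
# ranker A's documents) across a set of interleaved queries; invented numbers.
deltas = [1, 0, 2, -1, 1, 1, 0, 2, 1, -1, 1, 0]

n = len(deltas)
mean = sum(deltas) / n
variance = sum((d - mean) ** 2 for d in deltas) / (n - 1)
z = mean / math.sqrt(variance / n)      # z-score of the mean credit difference

print(f"mean difference = {mean:.2f}, z = {z:.2f}")
# A large positive z indicates that users consistently preferred ranker B's documents;
# the learned test statistics above reweight the clicks to make this score more sensitive.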

Page 24: Online Learning to Rank

How good are interleaving methods?

• Interleaving methods are compared against a baseline: conventional evaluation methods based on absolute metrics

• Conventional evaluation methods based on absolute metrics

– Absolute usage statistics are expected to monotonically change with respect to ranker quality

• Interleaving methods

– More user clicks are expected on the better ranker's documents

Page 25: Online Learning to Rank

Relative performance of interleaving methods

• Experimental results on two rankers whose relative quality is known by construction

• Conventional evaluation methods based on absolute metrics

– Did not reliably identify the better ranker

– Absolute usage statistics did not monotonically change with respect to ranker quality

• Balanced Interleaving and Team-Draft Interleaving algorithms

– Reliably identified the better ranker

– The number of preferences for the better ranker is significantly larger

F. Radlinski, M. Kurup, T. Joachims, How Does Clickthrough Data Reflect Retrieval Quality?, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2008.

Page 26: Online Learning to Rank

How reliable are interleaving methods, and why choose them?

• Interleaving results agree with conventional evaluation methods

• Achieves statistically reliable preferences, in contrast to absolute metrics

• Economical: the statistical power of 10 interleaved clicks is approximately equal to that of 1 manually judged query

• Not sensitive to different click aggregation schemes

• Can complement or even replace standard evaluation methods based on manual judgments or absolute metrics

O. Chapelle, T. Joachims, F. Radlinski, Yisong Yue, Large-Scale Validation and Analysis of Interleaved Search Evaluation, ACM Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.

Page 27: Online Learning to Rank

Future directions

• Extend current linear learning approaches with online learning to rank algorithms that can effectively learn more complex models

• Design and experiment with more complex models of click behavior to better understand various click biases

• Learn distinctive properties, such as click dwell time and use of the back button, to filter raw clicks

• Understand the range of domains in which interleaving methods are highly effective

• Improve gradient-descent-based rankers by covering more search directions to speed up learning

Page 28: Online Learning to Rank

References

1. Chen, Yiwei, and Katja Hofmann. "Online Learning to Rank: Absolute vs. Relative." Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion (2015).

2. F. Radlinski, M. Kurup, T. Joachims, How Does Clickthrough Data Reflect Retrieval Quality?, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2008.

3. Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. "Balancing Exploration and Exploitation in Listwise and Pairwise Online Learning to Rank for Information Retrieval." Information Retrieval 16.1 (2012): 63-90.

4. T. Joachims, Evaluating Retrieval Performance Using Clickthrough Data, in J. Franke and G. Nakhaeizadeh and I. Renz, "Text Mining", Physica/Springer Verlag, pp 79-96, 2003.

5. K. Hofmann, F. Behr, and F. Radlinski. On caption bias in interleaving experiments. In Proc. of CIKM, 2012.

6. O. Chapelle, T. Joachims, F. Radlinski, Yisong Yue, Large-Scale Validation and Analysis of Interleaved Search Evaluation, ACM Transactions on Information Systems (TOIS), 30(1):6.1-6.41, 2012.

7. Schuth, Anne, Harrie Oosterhuis, Shimon Whiteson, and Maarten De Rijke. "Multileave Gradient Descent for Fast Online Learning to Rank." Proceedings of the Ninth ACM International Conference on Web Search and Data Mining - WSDM '16 (2016).

8. Schuth, Anne, Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. "Lerot." Proceedings of the 2013 Workshop on Living Labs for Information Retrieval Evaluation - LivingLab '13 (2013).

9. T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3): 225–331, 2009.

10. Yisong Yue, Yue Gao, O. Chapelle, Ya Zhang, T. Joachims, Learning more powerful test statistics for click-based retrieval evaluation, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2010.