Learning to Rank
Masrour Zoghi

Learning to Rank - UvA


Page 1

Learning to Rank

Masrour Zoghi

Page 2

What is learning to rank?

A user comes to you with their query q. You look through your index and find 5 trillion matching documents: bear in mind that the average query length is 3 words.

You don't want to put the page the user is looking for on the billionth page.

So, you need to learn how to rank your documents in order of relevance to the user.

On the other hand, you can't read minds, so you take a statistical approach and try to learn from past experiences, hence the need for (machine) learning.

Page 3

So, how do you go about doing this? Well, here are two ways:

Offline learning (today): Get lots of pairs (q, d) with relevance information (absolute or relative) and train a regression or classification model.

Online learning (later this week): Learn users' preferences while interacting with them.

Important remark: Offline evaluation often goes together with offline learning, but not necessarily.

Page 4

Pros and Cons:

Offline learning:
- Better studied, so more ML tools.
- Not easily adaptable as things change.
- Often (but not always) the method requires explicit annotations.

Online learning:
- Better for a dynamic environment.
- More sample efficient, since the algorithm plays an active role.
- Can't (easily) go back and retrain.

Page 5

Learning to rank methods:

I. Pointwise: Try to learn a relevance scoring function.
- Nice, simple idea, with lots of applicable tools.
- Pretty hopeless because it's trying to do too much: all we need is rankings, not scores.

II. Pairwise: Learn a way of deciding pairwise preferences for documents given a query.
- A more sensible idea than the pointwise scheme.
- Simple classification problem.
- Can get lots of data from log files (more later).
- Doesn't easily fit the usual ML scheme.
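As a toy illustration of the pointwise scheme, here is a minimal sketch that fits a linear scoring function by least squares; the feature vectors and relevance labels below are made up for the example.

```python
import numpy as np

# A toy sketch of the pointwise scheme: fit a linear scoring function
# s(q, d) = w . phi(q, d) by least squares. The features and labels
# are hypothetical.
X = np.array([[0.9, 0.3],   # phi(q, d) for four candidate documents,
              [0.2, 0.1],   # e.g. (BM25 score, query-title overlap)
              [0.7, 0.8],
              [0.1, 0.4]])
y = np.array([2.0, 0.0, 3.0, 1.0])  # graded relevance labels

w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank the documents for the query by sorting on the learned scores.
scores = X @ w
ranking = np.argsort(-scores)
print(ranking.tolist())  # -> [2, 0, 3, 1], best document first
```

Note the criticism on the slide: least squares works hard to reproduce the exact scores, even though only the induced ordering matters.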

Page 6

II. Pairwise (Cont'd): Two types of training data:

1. Manually annotated triples (q, d, d'), with d more relevant to q than d'.
- Clean data.
- Hard to get much of it.

2. Relevance information deduced from search logs or online user interaction: if, in response to q, document d was listed after d', with d clicked and d' not, then d is more relevant to q than d'.
- Can get a ton of data.
- Noisy and one-sided data.
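The log-based rule in item 2 can be sketched in a few lines of code; the log format here (a displayed ranking plus the set of clicked document ids) is a hypothetical simplification.

```python
# Deduce pairwise preferences from a click log, following the rule above:
# if d' was shown above a clicked d and d' was not clicked, then d is
# preferred to d' for that query.

def preferences_from_log(ranking, clicked):
    """ranking: list of doc ids as shown; clicked: set of clicked doc ids."""
    prefs = []
    for i, d in enumerate(ranking):
        if d not in clicked:
            continue
        # every unclicked document shown above the clicked one loses to it
        for d_prime in ranking[:i]:
            if d_prime not in clicked:
                prefs.append((d, d_prime))  # d preferred over d_prime
    return prefs

# Example: docs shown as [a, b, c, d]; the user clicked b and d.
print(preferences_from_log(["a", "b", "c", "d"], {"b", "d"}))
# -> [('b', 'a'), ('d', 'a'), ('d', 'c')]
```

This also shows the one-sidedness mentioned above: nothing is learned about documents ranked below the last click.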

Page 7

Learning to rank methods (Cont'd)

III. Listwise: Learn to produce ranked lists that resemble the ideal ranking as closely as possible.
- Amenable to offline and online learning and evaluation.
- Can use a simple regression model for it, as we'll see.
- Seems to have the best performance so far.
- Can't make use of logged click data so easily.
- Best measure of similarity to the ideal ranking not clear.
- Not always clear how to modify the ranker to get better results.

Page 8

III. Listwise (Cont'd): Two schemes for learning:

1. Offline: Get a ton of annotated relevance judgements (q, d, s), where s is a score indicating the relevance of d to q. Use these to learn rankings that maximize NDCG or MAP or whatever. We'll see one example today.

2. Online: Interleave lists produced by different rankers and see which one the user likes better, based on which documents in the list they click on. Anne will tell you all about this in two weeks.
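Since the offline scheme needs a listwise quality measure, here is a minimal sketch of one of those named above, NDCG, computed from graded relevance judgements; it uses the common 2^s - 1 gain and log2 position discount.

```python
import math

def dcg(grades):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** s - 1) / math.log2(i + 2) for i, s in enumerate(grades))

def ndcg(ranked_grades):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0

# Relevance grades s of the documents, in the order our ranker returned them:
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # -> 0.957
```

The ideal ranking scores exactly 1.0, and any swap that moves a lower-graded document above a higher-graded one reduces the value, which is what makes it usable as a listwise training target.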

Page 9

Moral of the Story: Pointwise < Pairwise < Listwise

Except that the state of the art is a mixture of pairwise and listwise methods, i.e. LambdaMART (later this week).

Page 10

Algorithms

We'll discuss 3 algorithms:

I. Ranking SVM: Pairwise algorithm based on SVM.
T. Joachims, "Optimizing search engines using clickthrough data," in KDD 2002.

II. RankBoost: Pairwise algorithm based on AdaBoost.
Y. Freund, R. Iyer, R. E. Schapire and Y. Singer, "An efficient boosting algorithm for combining preferences," Journal of Machine Learning Research, vol. 4, 2003.

III. AdaRank: Listwise algorithm based on AdaBoost.
J. Xu and H. Li, "AdaRank: A boosting algorithm for information retrieval," in SIGIR 2007, pp. 391-398.

So, we will spend some time trying to understand Support Vector Machines (SVM) and AdaBoost.

Page 11

Support Vector Machines

Used for classification (and regression). The goal is to separate two classes of samples using a hyperplane.

So, we look for a weight vector w and bias b such that the hyperplane

    y(x) = w^T x + b = 0

separates the two classes.

Page 12

SVM in more detail

The main idea is to set w and b such that y(x) = ±1 for the "support vectors."

Done by minimizing the squared norm (1/2)||w||^2, subject to the constraints

    t_n (w^T x_n + b) >= 1   for all n.

Here t = ±1 indicates the class of the samples. Decreasing the norm of w has the effect of pushing the margin boundaries (the blue lines in the figure) away from each other.

Page 13

SVM for grown-ups

Life is not always sunshine and lollipops, so the picture is usually more complicated: the two classes typically overlap. As a solution, we introduce slack variables xi_n, and now the optimization problem looks as follows:

    minimize (1/2)||w||^2 + C * sum_n xi_n

subject to the constraints

    t_n (w^T x_n + b) >= 1 - xi_n,   xi_n >= 0   for all n.
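As a rough illustration of the soft-margin idea, here is a sketch that minimizes the equivalent hinge-loss formulation by subgradient descent; the tiny dataset is made up, and this toy loop stands in for a real SVM solver.

```python
import numpy as np

# Soft-margin SVM by subgradient descent on the hinge-loss form of the
# objective:  0.5 * ||w||^2 + C * sum_n max(0, 1 - t_n (w.x_n + b)).
# The two-class data below is a hypothetical toy set.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])  # class labels t = +/-1

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.1
for _ in range(200):
    margins = t * (X @ w + b)
    viol = margins < 1          # samples violating the margin constraint
    grad_w = w - C * (t[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * t[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

print(np.sign(X @ w + b))  # should recover the training labels
```

The two terms of the gradient mirror the two forces in the picture: the w term shrinks the weights (widening the margin), while the violation term pulls the separating hyperplane toward misclassified or margin-violating samples.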

Page 14

SVM: further comments

- Even though it might seem limited in scope, it can deal with nonlinear problems if you have good features.
- It is a relatively straightforward optimization problem.
- It turns out that the Euclidean norm is not the best choice: the L1 norm is better, but harder to optimize.
- Quality of the features matters.

Page 15

Ranking SVM

First Idea: Use logged click data to deduce pairwise preferences: wherever you see an unclicked document d' before a clicked one d, that means that d is more relevant than d' for that query.

Second Idea: Train an SVM on the "difference" between d and d', i.e. the difference between the feature vectors.

You can get a lot of data this way: the message is, try to use logs if you can. This is just one way: get creative!

The amount of info you extract from the logs is a small fraction of all logs: anything after the last click is wasted. Maybe we need to log more than just clicks?
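The "difference" trick can be sketched as follows; for brevity a simple perceptron stands in for the SVM solver, and the feature vectors and preference pairs are hypothetical.

```python
import numpy as np

# Sketch of the Ranking SVM idea: turn each preference pair (d preferred
# over d') into a training example on the difference of feature vectors.
# A perceptron stands in for the SVM solver here.

# phi[i] is the feature vector of document i for some fixed query.
phi = np.array([[1.0, 0.2], [0.4, 0.9], [0.1, 0.1]])
# Preference pairs (i, j): document i is preferred over document j.
pairs = [(0, 1), (0, 2), (1, 2)]

# Each pair demands w . (phi[i] - phi[j]) > 0, i.e. a binary
# classification problem on difference vectors.
w = np.zeros(2)
for _ in range(50):
    for i, j in pairs:
        diff = phi[i] - phi[j]
        if w @ diff <= 0:        # pair ranked wrongly: perceptron update
            w += diff

# Ranking induced by the learned scoring function s(d) = w . phi(d):
print(np.argsort(-(phi @ w)).tolist())  # -> [0, 1, 2]
```

The payoff is that the learned w doubles as a pointwise scoring function at query time: sort documents by w . phi(d), no pairwise comparisons needed.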

Page 16

AdaBoost

A weighted committee voting scheme:
- You have a bunch of pathetic classifiers. But, collectively they might do better.
- One way to put them together is by giving each one a vote that is counted with a weight.
- The main idea is to foster diversity: if two classifiers do the same job, you only need one of them. BUT, if you have multiple classifiers that use complementary criteria to make their decisions, then you can get much better performance by letting them vote.
- Picks weak classifiers one by one, based on the failures of the previous ones.

Page 17

AdaBoost: What it looks like

[Pages 17-22 step through a sequence of figures illustrating successive rounds of AdaBoost; the figures themselves are not preserved in this transcript.]

Page 23

AdaBoost: the algorithm, part 1

Page 24

AdaBoost: the algorithm, part 2
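The algorithm listing on these two slides is not preserved in this transcript, but the standard AdaBoost procedure can be sketched as follows, with one-dimensional threshold classifiers ("decision stumps") as the weak learners on a toy dataset.

```python
import numpy as np

# Standard AdaBoost with decision stumps h(x) = sign * sgn(x - thr)
# as the weak learners, on hypothetical 1-D toy data.
X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([1, 1, -1, -1, 1, 1])

D = np.ones(len(X)) / len(X)   # sample weights, initially uniform
stumps = []                     # chosen weak learners: (thr, sign, alpha)

for _ in range(3):
    # Pick the stump minimizing the weighted training error under D.
    best = None
    for thr in X - 0.05:
        for sign in (1, -1):
            pred = sign * np.where(X > thr, 1, -1)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, sign, pred)
    err, thr, sign, pred = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # voting weight
    stumps.append((thr, sign, alpha))
    # Re-weight: boost the samples this stump got wrong, shrink the rest.
    D *= np.exp(-alpha * y * pred)
    D /= D.sum()

def committee(x):
    """Weighted vote of all chosen stumps."""
    return np.sign(sum(a * s * np.sign(x - t) for t, s, a in stumps))

print([committee(x) for x in X])  # should match y on this toy set
```

The re-weighting line is what fosters the diversity mentioned earlier: each new stump is selected on a distribution concentrated on the samples the committee so far handles badly.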

Page 25

RankBoost and AdaRank

- RankBoost applies AdaBoost to pairs of documents.
- AdaRank uses the AdaBoost scheme with listwise measures such as MAP or NDCG.

Page 26

AdaRank