Learning to Rank: From Pairwise Approach to Listwise Approach
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Hang Li
Microsoft Research Asia, Beijing (2007)
Presented by Christian Kümmerle
December 2, 2014
Christian Kümmerle (University of Virginia, TU Munich) Learning to Rank: A Listwise Approach
Content
1 Framework: Learning to Rank
2 The Listwise Approach
3 Loss function based on probability model
4 ListNet algorithm
5 Experiments and Conclusion
Framework: Learning to Rank

What is Learning to Rank?

Classical IR ranking task: Given a query, rank documents into a list.

- Query-dependent ranking functions: vector space model, BM25, language model
- Query-independent features of documents, e.g.:
  - PageRank
  - URL depth, e.g. http://sifaka.cs.uiuc.edu/~wang296/Course/IR_Fall/lectures.html has a depth of 4

→ How can we combine all these "features" in order to get a better ranking function?
Framework: Learning to Rank

What is Learning to Rank?

Idea: Learn the best way to combine the features from given training data, consisting of queries and corresponding labelled documents.

Supervised learning, as set up in the authors' paper:

- Input space: X = {x^{(1)}, x^{(2)}, ...}, where x^{(i)} is the list of feature representations of the documents for query q_i ← listwise approach
- Output space: Y = {y^{(1)}, y^{(2)}, ...}, where y^{(i)} is the list of judgements of the relevance degree of the documents for q_i ← listwise approach
- Hypothesis space ← neural network
- Loss function: probability model on the space of permutations
The Listwise Approach

- Queries: Q = {q^{(1)}, q^{(2)}, ..., q^{(m)}}, a set of m queries.
- List of documents: for query q^{(i)}, there are n_i documents d^{(i)} = (d^{(i)}_1, d^{(i)}_2, ..., d^{(i)}_{n_i}).
- Feature representation in input space: x^{(i)} = (x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_{n_i}) with x^{(i)}_j = \Psi(q^{(i)}, d^{(i)}_j), e.g.

    x^{(i)}_j = (BM25(q^{(i)}, d^{(i)}_j), LM(q^{(i)}, d^{(i)}_j), TFIDF(q^{(i)}, d^{(i)}_j), PageRank(d^{(i)}_j), URLdepth(d^{(i)}_j)) \in \mathbb{R}^5

- List of judgement scores in output space: y^{(i)} = (y^{(i)}_1, y^{(i)}_2, ..., y^{(i)}_{n_i}), with implicitly or explicitly given judgement scores y^{(i)}_j for all documents corresponding to query q^{(i)}.

→ Training data set T = {(x^{(i)}, y^{(i)})}_{i=1}^{m}
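As a concrete illustration of the feature map Ψ, the 5-dimensional feature vector above might be assembled as follows. This is only a hypothetical sketch: the BM25, language-model, TF-IDF and PageRank scores are assumed to be precomputed elsewhere, and only the URL-depth feature is actually computed here.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments in the URL: a query-independent feature."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def psi(query, doc):
    """Hypothetical feature map Psi(q, d) -> R^5.  The first four entries
    are stand-ins for real scorers, assumed precomputed per (q, d)."""
    return [
        doc["bm25"],            # query-dependent: BM25 score
        doc["lm"],              # query-dependent: language-model score
        doc["tfidf"],           # query-dependent: TF-IDF score
        doc["pagerank"],        # query-independent
        url_depth(doc["url"]),  # query-independent
    ]

# The slide's example URL indeed has depth 4:
assert url_depth("http://sifaka.cs.uiuc.edu/~wang296/Course/IR_Fall/lectures.html") == 4
```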
What is a meaningful loss function?

We want: find a function f : X → Y such that the f(x^{(i)}) are "not very different" from the y^{(i)}. → The loss function penalizes differences that are too big.

Idea: Just take NDCG! The perfectly ordered list can be derived from the given judgements y^{(i)}.

Problem: Discontinuity of NDCG with respect to the ranking scores, since NDCG is position based:

Example

Training query with NDCG = 1:
    f(x^{(i)}):  1.2   0.7   3.110   3.109
    y^{(i)}:     2     1     4       3

Training query with NDCG = 0.86:
    f(x^{(i)}):  1.2   0.7   3.110   3.111
    y^{(i)}:     2     1     4       3
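The discontinuity can be reproduced in a few lines. This is a sketch using one common NDCG convention (gain 2^label − 1, log2 position discount); the exact value 0.86 on the slide depends on the convention the authors used, but the jump itself appears under any of them.

```python
import math

def ndcg(scores, labels):
    """NDCG of the ranking induced by `scores`, with exponential gain
    2^label - 1 and log2 position discount (one common convention)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    dcg = sum((2 ** labels[j] - 1) / math.log2(pos + 2)
              for pos, j in enumerate(order))
    ideal = sum((2 ** g - 1) / math.log2(pos + 2)
                for pos, g in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

labels = [2, 1, 4, 3]
ndcg([1.2, 0.7, 3.110, 3.109], labels)  # 1.0: induced order matches the labels
ndcg([1.2, 0.7, 3.110, 3.111], labels)  # < 1.0, although one score moved by only 0.002
```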
Loss function based on a probability model on permutations

Solution: Define probability distributions P_{y^{(i)}} and P_{z^{(i)}} (for z^{(i)} := (f(x^{(i)}_1), ..., f(x^{(i)}_{n_i}))) on the set of permutations π of {1, ..., n_i}, and take the cross entropy as loss function, which differs from the KL divergence KL(P_{y^{(i)}} || P_{z^{(i)}}) only by an additive constant (the entropy of P_{y^{(i)}}):

    L(y^{(i)}, z^{(i)}) := -\sum_{\pi} P_{y^{(i)}}(\pi) \log P_{z^{(i)}}(\pi)

How to define the probability distribution? E.g. for the set of permutations of {1, 2, 3}, the scores (y_1, y_2, y_3) and the permutation π := (1, 3, 2):

    P_y(\pi) := \frac{e^{y_1}}{e^{y_1} + e^{y_2} + e^{y_3}} \cdot \frac{e^{y_3}}{e^{y_2} + e^{y_3}} \cdot \frac{e^{y_2}}{e^{y_2}}

Definition: If π is a permutation on {1, ..., n}, its probability, given the list of scores y of length n, is

    P_y(\pi) = \prod_{j=1}^{n} \frac{\exp(y_{\pi^{-1}(j)})}{\sum_{l=j}^{n} \exp(y_{\pi^{-1}(l)})}

For easier calculation, the algorithm instead uses the top-k probability, with k fixed:

    P_y(\pi) = \prod_{j=1}^{k} \frac{\exp(y_{\pi^{-1}(j)})}{\sum_{l=j}^{n} \exp(y_{\pi^{-1}(l)})}
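The permutation probability above (a Plackett-Luce model) and the resulting cross-entropy loss can be sketched in plain Python; the function names are ours, not the paper's. Passing a full permutation gives P_y(π) from the Definition, while passing a length-k prefix gives the top-k probability.

```python
import math
from itertools import permutations

def pl_prob(scores, prefix):
    """Probability that the ranking begins with `prefix` (object indices in
    rank order): a full permutation gives P_y(pi) from the Definition, a
    length-k prefix gives the top-k probability."""
    rest = set(range(len(scores)))
    p = 1.0
    for idx in prefix:
        p *= math.exp(scores[idx]) / sum(math.exp(scores[i]) for i in rest)
        rest.remove(idx)
    return p

def listwise_loss(y_scores, z_scores):
    """Cross entropy L(y, z) between the two permutation distributions."""
    n = len(y_scores)
    return -sum(pl_prob(y_scores, pi) * math.log(pl_prob(z_scores, pi))
                for pi in permutations(range(n)))

# Sanity checks on the slide's example pi = (1, 3, 2) (0-indexed: 0, 2, 1):
y = [2.0, 0.5, 1.0]
e = [math.exp(v) for v in y]
assert abs(pl_prob(y, (0, 2, 1))
           - e[0] / (e[0] + e[1] + e[2]) * e[2] / (e[1] + e[2])) < 1e-12
# The probabilities of all 3! permutations sum to 1:
assert abs(sum(pl_prob(y, pi) for pi in permutations(range(3))) - 1.0) < 1e-12
```

Enumerating all n! permutations is only feasible for tiny lists, which is exactly why the paper works with the top-k probability (and k = 1 in the experiments).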
The ListNet algorithm

Advantage: the loss function is differentiable with respect to the score vectors! → We use functions f_ω from a neural network model as hypothesis space.

→ Learning task:  \min_{\omega} \sum_{i=1}^{m} L(y^{(i)}, f_\omega(x^{(i)}))

Implementation, based on gradient descent:
Learning to Rank: From Pairwise Approach to Listwise Approach (excerpt, ICML 2007, p. 133)

[...] invariant or translation invariant with a carefully designed function \phi(\cdot). We omit the details here.

Given two lists of scores, we can define the metric between the corresponding top k probability distributions as the listwise loss function. For example, when we use Cross Entropy as metric, the listwise loss function in Eq. (1) becomes

    L(y^{(i)}, z^{(i)}) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}}(g)    (3)

5. Learning Method: ListNet

We propose a new learning method for optimizing the listwise loss function based on top k probability, with Neural Network as model and Gradient Descent as optimization algorithm. We refer to the method as ListNet.

Again, let us take document retrieval as example. We denote the ranking function based on the Neural Network model \omega as f_\omega. Given a feature vector x^{(i)}_j, f_\omega(x^{(i)}_j) assigns a score to it. We define \phi in Definition 1 as an exponential function, which is translation invariant as shown in Theorem 5. We then rewrite the top k probability in Theorem 8 as

    P_s(\mathcal{G}_k(j_1, j_2, \dots, j_k)) = \prod_{t=1}^{k} \frac{\exp(s_{j_t})}{\sum_{l=t}^{n^{(i)}} \exp(s_{j_l})}.

Given query q^{(i)}, the ranking function f_\omega can generate a score list z^{(i)}(f_\omega) = (f_\omega(x^{(i)}_1), f_\omega(x^{(i)}_2), \dots, f_\omega(x^{(i)}_{n^{(i)}})). Then the top k probability of documents (d^{(i)}_{j_1}, d^{(i)}_{j_2}, \dots, d^{(i)}_{j_k}) is calculated as

    P_{z^{(i)}(f_\omega)}(\mathcal{G}_k(j_1, j_2, \dots, j_k)) = \prod_{t=1}^{k} \frac{\exp(f_\omega(x^{(i)}_{j_t}))}{\sum_{l=t}^{n^{(i)}} \exp(f_\omega(x^{(i)}_{j_l}))}.

With Cross Entropy as metric, the loss for query q^{(i)} becomes

    L(y^{(i)}, z^{(i)}(f_\omega)) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}(f_\omega)}(g)    (4)

The gradient of L(y^{(i)}, z^{(i)}(f_\omega)) with respect to parameter \omega can be calculated as follows:

    \Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega} = -\sum_{\forall g \in \mathcal{G}_k} \frac{P_{y^{(i)}}(g)}{P_{z^{(i)}(f_\omega)}(g)} \, \frac{\partial P_{z^{(i)}(f_\omega)}(g)}{\partial \omega}    (5)

Eq. (5) is then used in Gradient Descent. Algorithm 1 shows the learning algorithm of ListNet.

Algorithm 1: Learning Algorithm of ListNet
    Input: training data {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})}
    Parameter: number of iterations T and learning rate \eta
    Initialize parameter \omega
    for t = 1 to T do
        for i = 1 to m do
            Input x^{(i)} of query q^{(i)} to Neural Network and compute score list z^{(i)}(f_\omega) with current \omega
            Compute gradient \Delta\omega using Eq. (5)
            Update \omega = \omega - \eta \times \Delta\omega
        end for
    end for
    Output: Neural Network model \omega

Notice that ListNet is similar to RankNet. The only major difference lies in that the former uses document lists as instances while the latter uses document pairs as instances; the former utilizes a listwise loss function while the latter utilizes a pairwise loss function. Interestingly, when there are only two documents for each query, i.e., n^{(i)} = 2, the listwise loss function in ListNet becomes equivalent to the pairwise loss function in RankNet.

In our experiments in this paper, we implemented ListNet with k = 1. With some derivation (Cao et al., 2007), we can see that for k = 1 we have

    \Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega}
                 = -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x^{(i)}_j) \frac{\partial f_\omega(x^{(i)}_j)}{\partial \omega}
                   + \frac{1}{\sum_{j=1}^{n^{(i)}} \exp(f_\omega(x^{(i)}_j))} \sum_{j=1}^{n^{(i)}} \exp(f_\omega(x^{(i)}_j)) \frac{\partial f_\omega(x^{(i)}_j)}{\partial \omega}    (6)

For simplicity, we use a linear Neural Network model and omit the constant b in the model¹:

    f_\omega(x^{(i)}_j) = \langle \omega, x^{(i)}_j \rangle,

where \langle \cdot, \cdot \rangle denotes an inner product.

6. Experimental Results

We compared the ranking accuracies of ListNet with those of three baseline methods: RankNet, Ranking SVM, and RankBoost, using three data sets.

6.1. Data Collections

We used three data sets in the experiments: TREC, a data set obtained from the web track of TREC 2003 (Craswell et al., 2003); OHSUMED, a benchmark data set for document retrieval (Hersh et al., 1994); and CSearch, a data set from a commercial search engine.

TREC consists of web pages crawled from the .gov domain in early 2002. There are in total 1,053,110 pages.

¹ Note that Eq. (5) and Algorithm 1 can be applied to any continuous ranking function.
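For k = 1 the top-1 probabilities reduce to a softmax over the score list, and Eq. (6) with the linear model f_ω(x) = ⟨ω, x⟩ takes a particularly simple form: the gradient is Σ_j (P_z(j) − P_y(j)) x_j. The sketch below follows Algorithm 1 under these choices; it is our plain-Python reading of the paper, not the authors' code.

```python
import math

def softmax(scores):
    """Top-1 probabilities P(j) = exp(s_j) / sum_l exp(s_l).  Subtracting
    the max does not change the result and avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def listnet_gradient(omega, xs, ys):
    """Eq. (6) for k = 1 and a linear model f_w(x) = <w, x>:
    grad = sum_j (P_z(j) - P_y(j)) * x_j."""
    zs = [sum(w * v for w, v in zip(omega, x)) for x in xs]
    p_y, p_z = softmax(ys), softmax(zs)
    return [sum((p_z[j] - p_y[j]) * xs[j][d] for j in range(len(xs)))
            for d in range(len(omega))]

def train_listnet(queries, dim, T=100, eta=0.1):
    """Algorithm 1: gradient descent over queries, each a pair
    (feature vectors xs, judgement scores ys)."""
    omega = [0.0] * dim
    for _ in range(T):
        for xs, ys in queries:
            grad = listnet_gradient(omega, xs, ys)
            omega = [w - eta * g for w, g in zip(omega, grad)]
    return omega
```

On a toy query whose single feature already orders the documents correctly (e.g. xs = [[1.0], [0.0], [3.0], [2.0]] with ys = [2, 1, 4, 3]), the learned weight comes out positive, i.e. the model reproduces the desired ranking.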
Experiments and Conclusion

The authors compared the ranking accuracy of ListNet with that of other LtR algorithms on three large-scale data sets (TREC, OHSUMED and CSearch; number of features: 20, 30 and 600).

Procedure: divide the data into a training subset and a testing subset, and use traditional evaluation metrics (NDCG, MAP) on the testing set.

Conclusions:
- ListNet outperforms algorithms based on the pairwise approach (RankNet, Ranking SVM, RankBoost).
- Drawback: high training complexity (O(n^k) for list length n and parameter k).
For Further Reading
Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to Rank: From Pairwise Approach to Listwise Approach. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 129–136 (2007)
Tie-Yan Liu.Learning to rank for information retrieval.Springer, 2011.
Thank you for your attention!