Learning to Rank: From Pairwise Approach to Listwise Approach
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Hang Li
Microsoft Research Asia, Beijing (2007)
Presented by Christian Kümmerle
December 2, 2014
Christian Kümmerle (University of Virginia, TU Munich) Learning to Rank: A Listwise Approach
Content
1 Framework: Learning to Rank
2 The Listwise Approach
3 Loss function based on probability model
4 ListNet algorithm
5 Experiments and Conclusion
Framework: Learning to Rank

What is Learning to Rank?

Classical IR ranking task: Given a query, rank documents into a list.

- Query-dependent ranking functions: vector space model, BM25, language model
- Query-independent features of documents, e.g.:
  - PageRank
  - URL depth, e.g. http://sifaka.cs.uiuc.edu/~wang296/Course/IR_Fall/lectures.html has a depth of 4

→ How can we combine all these "features" in order to get a better ranking function?
Framework: Learning to Rank

What is Learning to Rank?

Idea: Learn the best way to combine the features from given training data, consisting of queries and corresponding labelled documents.

Supervised learning, as set up in the authors' paper:

- Input space: X = {x^{(1)}, x^{(2)}, ...}, where x^{(i)} is the list of feature representations of the documents for query q_i ← listwise approach
- Output space: Y = {y^{(1)}, y^{(2)}, ...}, where y^{(i)} is the list of judgements of the relevance degree of the documents for q_i ← listwise approach
- Hypothesis space ← neural network
- Loss function: probability model on the space of permutations
The Listwise Approach

- Queries: Q = {q^{(1)}, q^{(2)}, ..., q^{(m)}}, a set of m queries.
- List of documents: for query q^{(i)}, there are n_i documents d^{(i)} = (d^{(i)}_1, d^{(i)}_2, ..., d^{(i)}_{n_i}).
- Feature representation in input space: x^{(i)} = (x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_{n_i}) with x^{(i)}_j = \Psi(q^{(i)}, d^{(i)}_j), e.g.

    x^{(i)}_j = (BM25(q^{(i)}, d^{(i)}_j), LM(q^{(i)}, d^{(i)}_j), TFIDF(q^{(i)}, d^{(i)}_j), PageRank(d^{(i)}_j), URLdepth(d^{(i)}_j)) \in \mathbb{R}^5

- List of judgement scores in output space: y^{(i)} = (y^{(i)}_1, y^{(i)}_2, ..., y^{(i)}_{n_i}), with implicitly or explicitly given judgement scores y^{(i)}_j for all documents corresponding to query q^{(i)}.

→ Training data set T = {(x^{(i)}, y^{(i)})}_{i=1}^{m}
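As a concrete illustration of the feature map Ψ, the 5-dimensional feature vector above might be assembled as follows. This is only a hypothetical sketch: the BM25, language-model, TF-IDF and PageRank scores are assumed to be precomputed elsewhere, and only the URL-depth feature is actually computed here.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments in the URL: a query-independent feature."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def psi(query, doc):
    """Hypothetical feature map Psi(q, d) -> R^5.  The first four entries
    are stand-ins for real scorers, assumed precomputed per (q, d)."""
    return [
        doc["bm25"],            # query-dependent: BM25 score
        doc["lm"],              # query-dependent: language-model score
        doc["tfidf"],           # query-dependent: TF-IDF score
        doc["pagerank"],        # query-independent
        url_depth(doc["url"]),  # query-independent
    ]

# The slide's example URL indeed has depth 4:
assert url_depth("http://sifaka.cs.uiuc.edu/~wang296/Course/IR_Fall/lectures.html") == 4
```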
What is a meaningful loss function?

We want: find a function f : X → Y such that the f(x^{(i)}) are "not very different" from the y^{(i)}. → The loss function penalizes differences that are too big.

Idea: Just take NDCG! The perfectly ordered list can be derived from the given judgements y^{(i)}.

Problem: Discontinuity of NDCG with respect to the ranking scores, since NDCG is position based:

Example

Training query with NDCG = 1:
    f(x^{(i)}):  1.2   0.7   3.110   3.109
    y^{(i)}:     2     1     4       3

Training query with NDCG = 0.86:
    f(x^{(i)}):  1.2   0.7   3.110   3.111
    y^{(i)}:     2     1     4       3
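The discontinuity can be reproduced in a few lines. This is a sketch using one common NDCG convention (gain 2^label − 1, log2 position discount); the exact value 0.86 on the slide depends on the convention the authors used, but the jump itself appears under any of them.

```python
import math

def ndcg(scores, labels):
    """NDCG of the ranking induced by `scores`, with exponential gain
    2^label - 1 and log2 position discount (one common convention)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    dcg = sum((2 ** labels[j] - 1) / math.log2(pos + 2)
              for pos, j in enumerate(order))
    ideal = sum((2 ** g - 1) / math.log2(pos + 2)
                for pos, g in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

labels = [2, 1, 4, 3]
ndcg([1.2, 0.7, 3.110, 3.109], labels)  # 1.0: induced order matches the labels
ndcg([1.2, 0.7, 3.110, 3.111], labels)  # < 1.0, although one score moved by only 0.002
```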
Loss function based on a probability model on permutations

Solution: Define probability distributions P_{y^{(i)}} and P_{z^{(i)}} (for z^{(i)} := (f(x^{(i)}_1), ..., f(x^{(i)}_{n_i}))) on the set of permutations π of {1, ..., n_i}, and take the cross entropy as loss function, which differs from the KL divergence KL(P_{y^{(i)}} || P_{z^{(i)}}) only by an additive constant (the entropy of P_{y^{(i)}}):

    L(y^{(i)}, z^{(i)}) := -\sum_{\pi} P_{y^{(i)}}(\pi) \log P_{z^{(i)}}(\pi)

How to define the probability distribution? E.g. for the set of permutations of {1, 2, 3}, the scores (y_1, y_2, y_3) and the permutation π := (1, 3, 2):

    P_y(\pi) := \frac{e^{y_1}}{e^{y_1} + e^{y_2} + e^{y_3}} \cdot \frac{e^{y_3}}{e^{y_2} + e^{y_3}} \cdot \frac{e^{y_2}}{e^{y_2}}

Definition: If π is a permutation on {1, ..., n}, its probability, given the list of scores y of length n, is

    P_y(\pi) = \prod_{j=1}^{n} \frac{\exp(y_{\pi^{-1}(j)})}{\sum_{l=j}^{n} \exp(y_{\pi^{-1}(l)})}

For easier calculation, the algorithm instead uses the top-k probability, with k fixed:

    P_y(\pi) = \prod_{j=1}^{k} \frac{\exp(y_{\pi^{-1}(j)})}{\sum_{l=j}^{n} \exp(y_{\pi^{-1}(l)})}
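The permutation probability above (a Plackett-Luce model) and the resulting cross-entropy loss can be sketched in plain Python; the function names are ours, not the paper's. Passing a full permutation gives P_y(π) from the Definition, while passing a length-k prefix gives the top-k probability.

```python
import math
from itertools import permutations

def pl_prob(scores, prefix):
    """Probability that the ranking begins with `prefix` (object indices in
    rank order): a full permutation gives P_y(pi) from the Definition, a
    length-k prefix gives the top-k probability."""
    rest = set(range(len(scores)))
    p = 1.0
    for idx in prefix:
        p *= math.exp(scores[idx]) / sum(math.exp(scores[i]) for i in rest)
        rest.remove(idx)
    return p

def listwise_loss(y_scores, z_scores):
    """Cross entropy L(y, z) between the two permutation distributions."""
    n = len(y_scores)
    return -sum(pl_prob(y_scores, pi) * math.log(pl_prob(z_scores, pi))
                for pi in permutations(range(n)))

# Sanity checks on the slide's example pi = (1, 3, 2) (0-indexed: 0, 2, 1):
y = [2.0, 0.5, 1.0]
e = [math.exp(v) for v in y]
assert abs(pl_prob(y, (0, 2, 1))
           - e[0] / (e[0] + e[1] + e[2]) * e[2] / (e[1] + e[2])) < 1e-12
# The probabilities of all 3! permutations sum to 1:
assert abs(sum(pl_prob(y, pi) for pi in permutations(range(3))) - 1.0) < 1e-12
```

Enumerating all n! permutations is only feasible for tiny lists, which is exactly why the paper works with the top-k probability (and k = 1 in the experiments).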
The ListNet algorithm

Advantage: the loss function is differentiable with respect to the score vectors! → We use functions f_ω from a neural network model as hypothesis space.

→ Learning task:  \min_{\omega} \sum_{i=1}^{m} L(y^{(i)}, f_\omega(x^{(i)}))

Implementation, based on gradient descent:
Learning to Rank: From Pairwise Approach to Listwise Approach (excerpt, ICML 2007, p. 133)

[...] invariant or translation invariant with a carefully designed function \phi(\cdot). We omit the details here.

Given two lists of scores, we can define the metric between the corresponding top k probability distributions as the listwise loss function. For example, when we use Cross Entropy as metric, the listwise loss function in Eq. (1) becomes

    L(y^{(i)}, z^{(i)}) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}}(g)    (3)

5. Learning Method: ListNet

We propose a new learning method for optimizing the listwise loss function based on top k probability, with Neural Network as model and Gradient Descent as optimization algorithm. We refer to the method as ListNet.

Again, let us take document retrieval as example. We denote the ranking function based on the Neural Network model \omega as f_\omega. Given a feature vector x^{(i)}_j, f_\omega(x^{(i)}_j) assigns a score to it. We define \phi in Definition 1 as an exponential function, which is translation invariant as shown in Theorem 5. We then rewrite the top k probability in Theorem 8 as

    P_s(\mathcal{G}_k(j_1, j_2, \dots, j_k)) = \prod_{t=1}^{k} \frac{\exp(s_{j_t})}{\sum_{l=t}^{n^{(i)}} \exp(s_{j_l})}.

Given query q^{(i)}, the ranking function f_\omega can generate a score list z^{(i)}(f_\omega) = (f_\omega(x^{(i)}_1), f_\omega(x^{(i)}_2), \dots, f_\omega(x^{(i)}_{n^{(i)}})). Then the top k probability of documents (d^{(i)}_{j_1}, d^{(i)}_{j_2}, \dots, d^{(i)}_{j_k}) is calculated as

    P_{z^{(i)}(f_\omega)}(\mathcal{G}_k(j_1, j_2, \dots, j_k)) = \prod_{t=1}^{k} \frac{\exp(f_\omega(x^{(i)}_{j_t}))}{\sum_{l=t}^{n^{(i)}} \exp(f_\omega(x^{(i)}_{j_l}))}.

With Cross Entropy as metric, the loss for query q^{(i)} becomes

    L(y^{(i)}, z^{(i)}(f_\omega)) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}(f_\omega)}(g)    (4)

The gradient of L(y^{(i)}, z^{(i)}(f_\omega)) with respect to parameter \omega can be calculated as follows:

    \Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega} = -\sum_{\forall g \in \mathcal{G}_k} \frac{P_{y^{(i)}}(g)}{P_{z^{(i)}(f_\omega)}(g)} \, \frac{\partial P_{z^{(i)}(f_\omega)}(g)}{\partial \omega}    (5)

Eq. (5) is then used in Gradient Descent. Algorithm 1 shows the learning algorithm of ListNet.

Algorithm 1: Learning Algorithm of ListNet
    Input: training data {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})}
    Parameter: number of iterations T and learning rate \eta
    Initialize parameter \omega
    for t = 1 to T do
        for i = 1 to m do
            Input x^{(i)} of query q^{(i)} to Neural Network and compute score list z^{(i)}(f_\omega) with current \omega
            Compute gradient \Delta\omega using Eq. (5)
            Update \omega = \omega - \eta \times \Delta\omega
        end for
    end for
    Output: Neural Network model \omega

Notice that ListNet is similar to RankNet. The only major difference lies in that the former uses document lists as instances while the latter uses document pairs as instances; the former utilizes a listwise loss function while the latter utilizes a pairwise loss function. Interestingly, when there are only two documents for each query, i.e., n^{(i)} = 2, the listwise loss function in ListNet becomes equivalent to the pairwise loss function in RankNet.

In our experiments in this paper, we implemented ListNet with k = 1. With some derivation (Cao et al., 2007), we can see that for k = 1 we have

    \Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega}
                 = -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x^{(i)}_j) \frac{\partial f_\omega(x^{(i)}_j)}{\partial \omega}
                   + \frac{1}{\sum_{j=1}^{n^{(i)}} \exp(f_\omega(x^{(i)}_j))} \sum_{j=1}^{n^{(i)}} \exp(f_\omega(x^{(i)}_j)) \frac{\partial f_\omega(x^{(i)}_j)}{\partial \omega}    (6)

For simplicity, we use a linear Neural Network model and omit the constant b in the model¹:

    f_\omega(x^{(i)}_j) = \langle \omega, x^{(i)}_j \rangle,

where \langle \cdot, \cdot \rangle denotes an inner product.

6. Experimental Results

We compared the ranking accuracies of ListNet with those of three baseline methods: RankNet, Ranking SVM, and RankBoost, using three data sets.

6.1. Data Collections

We used three data sets in the experiments: TREC, a data set obtained from the web track of TREC 2003 (Craswell et al., 2003); OHSUMED, a benchmark data set for document retrieval (Hersh et al., 1994); and CSearch, a data set from a commercial search engine.

TREC consists of web pages crawled from the .gov domain in early 2002. There are in total 1,053,110 pages.

¹ Note that Eq. (5) and Algorithm 1 can be applied to any continuous ranking function.
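For k = 1 the top-1 probabilities reduce to a softmax over the score list, and Eq. (6) with the linear model f_ω(x) = ⟨ω, x⟩ takes a particularly simple form: the gradient is Σ_j (P_z(j) − P_y(j)) x_j. The sketch below follows Algorithm 1 under these choices; it is our plain-Python reading of the paper, not the authors' code.

```python
import math

def softmax(scores):
    """Top-1 probabilities P(j) = exp(s_j) / sum_l exp(s_l).  Subtracting
    the max does not change the result and avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def listnet_gradient(omega, xs, ys):
    """Eq. (6) for k = 1 and a linear model f_w(x) = <w, x>:
    grad = sum_j (P_z(j) - P_y(j)) * x_j."""
    zs = [sum(w * v for w, v in zip(omega, x)) for x in xs]
    p_y, p_z = softmax(ys), softmax(zs)
    return [sum((p_z[j] - p_y[j]) * xs[j][d] for j in range(len(xs)))
            for d in range(len(omega))]

def train_listnet(queries, dim, T=100, eta=0.1):
    """Algorithm 1: gradient descent over queries, each a pair
    (feature vectors xs, judgement scores ys)."""
    omega = [0.0] * dim
    for _ in range(T):
        for xs, ys in queries:
            grad = listnet_gradient(omega, xs, ys)
            omega = [w - eta * g for w, g in zip(omega, grad)]
    return omega
```

On a toy query whose single feature already orders the documents correctly (e.g. xs = [[1.0], [0.0], [3.0], [2.0]] with ys = [2, 1, 4, 3]), the learned weight comes out positive, i.e. the model reproduces the desired ranking.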
Experiments and Conclusion

The authors compared the ranking accuracy of ListNet with that of other LtR algorithms on three large-scale data sets (TREC, OHSUMED and CSearch; number of features: 20, 30 and 600).

Procedure: divide the data into a training subset and a testing subset, and use traditional evaluation metrics (NDCG, MAP) on the testing set.

Conclusions:
- ListNet outperforms algorithms based on the pairwise approach (RankNet, Ranking SVM, RankBoost).
- Drawback: high training complexity (O(n^k) for list length n and parameter k).
For Further Reading
Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to Rank: From Pairwise Approach to Listwise Approach. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 129–136 (2007)
Tie-Yan Liu.Learning to rank for information retrieval.Springer, 2011.
Thank you for your attention!