Learning to rank as supervised ML · Theory for learning to rank · Ongoing work · Summary
Statistical learning theory for ranking problems
Ambuj Tewari
University of Michigan
September 3, 2015 / MLSIG Seminar @ IISc
Ambuj Tewari Learning to Rank
Outline
1. Learning to rank as supervised ML
   - Features
   - Label and Output Spaces
   - Performance Measures
   - Ranking functions
2. Theory for learning to rank
   - Regularized empirical loss minimization
   - Generalization error bounds
   - Consistency
3. Ongoing work
   - Online learning to rank
   - Top-K feedback
Ranking problems are everywhere
- Web search: rank webpages in order of their relevance to a query
- News stories: rank stories according to user interests at a particular time (Google News, Facebook News Feed)
- Recommender systems: rank movies (Netflix, IMDB), music (Pandora), or consumer goods (Amazon) according to a user's preferences
- Social networks: rank people in order of their likelihood of becoming friends with a given user
- Bioinformatics: rank genes in order of their biological role in a given disease
Ranking problems are everywhere
“What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

– Herbert Simon (1916–2001), Nobel Laureate (Economics) and ACM Turing Award winner
Learning to rank objects
- Creating a ranking manually requires a lot of effort
- Can we have computers learn to rank objects?
- We will still need some human supervision
- Pose learning to rank as a supervised ML problem
Ranking as a canonical learning problem
- NIPS Workshops: 2002, 2005, 2009, 2014
- SIGIR Workshops: 2007, 2008, 2009
- Journal of Information Retrieval Special Issue in 2010
- Yahoo! Learning to Rank Challenge: 2010
  http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
- LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007-2010
  http://research.microsoft.com/en-us/um/beijing/projects/letor/
Learning to rank as supervised ML
Supervised Machine Learning: construct a mapping
f : I → O
based on input, output pairs (training examples)
  (X_1, Y_1), ..., (X_N, Y_N)   (the training set)

where X_n ∈ I and Y_n ∈ O.

- We want f(X) to be a good predictor of Y for unseen X
- In ranking problems, O is the set of all possible rankings of a given set of objects (webpages, movies, etc.)
A learning-to-rank training example
Query: “learning to rank”

Candidate webpages and relevance scores:

- en.wikipedia.org/wiki/Learning_to_rank: 2 or “good”
- research.microsoft.com/.../hangli/l2r.pdf: 2 or “good”
- code.google.com/p/jforests/: 1 or “fair”
- www.vmathlive.com: 0 or “bad”

Think: example X = (query, (URL1, URL2, ..., URL4)) and label R = (2, 2, 1, 0)
Typical ML cycle
1. Feature construction: find a way to map (query, webpage) pairs into R^p
   - Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}
2. Obtain labels: get R_n ∈ {0, 1, 2, ..., R_max}^m from human judgements
3. Train ranker: get f : I → O = S_m, where S_m is the set of permutations of degree m
   - Training is usually done by solving an optimization problem on the training set
4. Evaluate performance: using some performance measure, evaluate the ranker on a “test set”
Joint query, document features
Number of query terms that occur in document
Document length
Sum/min/max/mean/variance of term frequencies
Sum/min/max/mean/variance of tf-idf
Cosine similarity between query and document
Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)
Number of slashes in URL, length of URL, PageRank, site-level PageRank
Query-URL click count, user-URL click count
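A few of these features are easy to sketch in code. The `features` helper below is a hypothetical illustration, not from the slides: it computes query-term overlap, document length, and cosine similarity on raw term counts; real systems would use tf-idf weighting, BM25, and the link/click features above as well.

```python
import math
from collections import Counter

def features(query, doc):
    """Three toy features for a (query, document) pair: query-term overlap,
    document length, and cosine similarity of raw term-count vectors."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum(1 for t in q if t in d)          # query terms present in doc
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return (overlap, sum(d.values()), dot / norm)  # a point in R^3

print(features("learning to rank", "learning to rank as supervised learning"))
```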
A slightly more general supervised ML setting
- For a future query and webpage list, we want to rank the webpages
- We don't necessarily want to predict the relevance scores
- Let f (the ranking function) take values in the space of rankings
- So now, training examples (X, Y) ∈ I × L, where L is the label space
- The function f to be learned maps I to O (note L ≠ O)
- E.g., (1, 2, 2, 0, 0) ∈ L and (d1 → 3, d2 → 1, d3 → 2, d4 → 4, d5 → 5) ∈ O
Notation for rankings
- We'll think of m (the number of documents) as fixed, but in reality it varies over examples
- We'll denote a ranking by a permutation σ ∈ S_m (the set of all m! permutations)
- σ(i) is the rank/position of document i
- σ^{-1}(j) is the document at rank/position j
Notation for rankings
Suppose we have 3 documents d1, d2, d3 and we want them to be ranked thus:

  d2
  d3
  d1

- So the ranks are d1 : 3, d2 : 1, d3 : 2
- This means σ(1) = 3, σ(2) = 1, σ(3) = 2
- Also, σ^{-1}(1) = 2, σ^{-1}(2) = 3, σ^{-1}(3) = 1
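The σ / σ^{-1} bookkeeping is easy to get wrong, so here is a small Python sketch of the example above (1-indexed ranks, as on the slide):

```python
# sigma[i-1] is the rank of document i; invert() recovers sigma^{-1},
# i.e., which document sits at each rank (all 1-indexed, as on the slide).
def invert(sigma):
    inv = [0] * len(sigma)
    for doc, rank in enumerate(sigma, start=1):
        inv[rank - 1] = doc
    return tuple(inv)

sigma = (3, 1, 2)     # sigma(1) = 3, sigma(2) = 1, sigma(3) = 2
print(invert(sigma))  # (2, 3, 1): ranks 1, 2, 3 hold d2, d3, d1
```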
Notation for relevance scores/labels
- Relevance labels can be binary or multi-graded
- In the binary case, documents are either relevant or irrelevant (1 or 0)
- In the multi-graded case, relevance labels are from {0, 1, 2, ..., R_max}
- Typically R_max ≤ 4 (at most 5 levels of relevance)
- A relevance score/label vector R is an m-vector
- R(i) gives the relevance of document i
Performance measures in ranking
- Performance measures can be gains (to be maximized) or losses (to be minimized)
- They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
- Some performance measures are defined only for binary relevance
- Others can handle more general multi-graded relevance
Performance measure: RR
- Reciprocal rank (RR) is defined for binary relevance only
- RR is the reciprocal of the rank of the first relevant document according to the ranking:

  RR(σ, R) = 1 / min{σ(i) : R(i) = 1}

- Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2), so RR = 1/2

  original     ranked
  d1 : 0       d1 : 0
  d2 : 1       d4 : 1  ← RR = 1/2
  d3 : 0       d2 : 1
  d4 : 1       d3 : 0
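A direct Python transcription of the definition, checked on the slide's example:

```python
def reciprocal_rank(sigma, R):
    """RR = 1 / (rank of the first relevant document); R is binary."""
    return 1.0 / min(rank for rank, r in zip(sigma, R) if r == 1)

# Slide example: R = (0, 1, 0, 1), sigma = (1, 3, 4, 2)
print(reciprocal_rank((1, 3, 4, 2), (0, 1, 0, 1)))  # 0.5
```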
Performance measure: Precision@K
- Precision@K is for binary relevance only
- Precision@K is the fraction of relevant documents in the top K:

  Prec@K(σ, R) = (1/K) · Σ_{j=1}^{K} R(σ^{-1}(j))

  original     ranked     Prec@K
  d1 : 0       d1 : 0     0/1
  d2 : 1       d4 : 1     1/2
  d3 : 0       d2 : 1     2/3
  d4 : 1       d3 : 0     2/4
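The same computation in Python, reproducing the table above:

```python
def precision_at_k(sigma, R, K):
    """Fraction of relevant documents among the top K ranks (binary R)."""
    inv = {rank: doc for doc, rank in enumerate(sigma)}  # rank -> doc index (0-based)
    return sum(R[inv[j]] for j in range(1, K + 1)) / K

sigma, R = (1, 3, 4, 2), (0, 1, 0, 1)
print([precision_at_k(sigma, R, K) for K in (1, 2, 4)])  # [0.0, 0.5, 0.5]
```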
Performance measure: AP
- Average Precision (AP) is also for binary relevance
- It is the average of Prec@j over positions j of relevant documents only:

  AP(σ, R) = (1 / Σ_{i=1}^{m} R(i)) · Σ_{j : R(σ^{-1}(j))=1} Prec@j

  original     ranked     Prec@K
  d1 : 0       d1 : 0     0/1
  d2 : 1       d4 : 1     1/2  ←
  d3 : 0       d2 : 1     2/3  ←   AP = (1/2 + 2/3) / 2
  d4 : 1       d3 : 0     2/4
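AP in Python, again matching the worked example:

```python
def average_precision(sigma, R):
    """Average of Prec@j over ranks j that hold a relevant document (binary R)."""
    inv = {rank: doc for doc, rank in enumerate(sigma)}  # rank -> doc index
    hits, precs = 0, []
    for j in range(1, len(R) + 1):
        hits += R[inv[j]]
        if R[inv[j]] == 1:
            precs.append(hits / j)
    return sum(precs) / sum(R)

# Slide example: AP = (1/2 + 2/3) / 2
print(average_precision((1, 3, 4, 2), (0, 1, 0, 1)))
```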
Performance measure: DCG
- Discounted Cumulative Gain (DCG) is for multi-graded relevance
- Contributions of relevant documents further down the ranking are discounted more:

  DCG(σ, R) = Σ_{i=1}^{m} F(R(i)) / G(σ(i))

- F, G are increasing functions
- Standard choice: F(r) = 2^r − 1 and G(j) = log_2(1 + j)
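DCG with the standard F and G is a one-liner; the relevance vector (2, 0, 1) below is a made-up example for illustration:

```python
import math

def dcg(sigma, R):
    """DCG(sigma, R) = sum_i F(R(i)) / G(sigma(i)) with the standard F, G."""
    F = lambda r: 2 ** r - 1          # gain
    G = lambda j: math.log2(1 + j)    # discount by rank
    return sum(F(r) / G(rank) for r, rank in zip(R, sigma))

# Toy example: relevances (2, 0, 1), documents ranked 1st, 3rd, 2nd
print(round(dcg((1, 3, 2), (2, 0, 1)), 4))  # 3.6309
```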
Performance measure: NDCG
- Normalized DCG (NDCG), like DCG, is for multi-graded relevance
- As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1:

  NDCG(σ, R) = DCG(σ, R) / Z(R)

- The normalization constant Z(R) = max_σ DCG(σ, R) can be computed quickly by sorting the relevance vector
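A short sketch of NDCG; as noted above, the normalizer Z(R) is obtained by sorting the relevance vector in decreasing order:

```python
import math

def dcg(sigma, R):
    F = lambda r: 2 ** r - 1
    return sum(F(r) / math.log2(1 + rank) for r, rank in zip(R, sigma))

def ndcg(sigma, R):
    """DCG normalized by Z(R) = max_sigma DCG(sigma, R); the max is attained
    by the ranking that sorts documents by decreasing relevance."""
    order = sorted(range(len(R)), key=lambda i: -R[i])  # docs, most relevant first
    ideal = [0] * len(R)
    for rank, doc in enumerate(order, start=1):
        ideal[doc] = rank
    return dcg(sigma, R) / dcg(ideal, R)

print(ndcg((1, 3, 2), (2, 0, 1)))  # 1.0: this sigma already sorts R descending
```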
Ranking functions
A ranking function maps m query-document feature vectors (each in R^p) to a permutation:

  X = (φ(q, d1)^T; ...; φ(q, dm)^T), an m × p matrix
      −→  a ranking such as d5 at the top (σ(5) = 1), ..., d12 at the bottom (σ(12) = m)

So mathematically it is a map from R^{m×p} to S_m.
Ranking functions from scoring functions
- However, it is hard to learn functions that map into combinatorial spaces
- In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
- In ranking, we do the same but replace thresholding by sorting
- For example, let f : R^p → R be a scoring function:

  (φ(q, d1)^T, φ(q, d2)^T, φ(q, d3)^T)  −f→  (−2.3, 4.2, 1.9)  −sort→  d2 ⇔ σ(2) = 1, d3 ⇔ σ(3) = 2, d1 ⇔ σ(1) = 3
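The score-then-sort step in Python, on the toy scores from the example above:

```python
def rank_by_scores(scores):
    """Turn a score vector into a permutation sigma by sorting documents in
    decreasing order of score; sigma[i] is the rank of document i+1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    sigma = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        sigma[doc] = rank
    return tuple(sigma)

# Scores from the example: f gives (-2.3, 4.2, 1.9) for d1, d2, d3
print(rank_by_scores((-2.3, 4.2, 1.9)))  # (3, 1, 2): d2 first, then d3, then d1
```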
Recap
- Training data consists of query-document features and relevance scores:

  X = (φ(q, d1)^T; ...; φ(q, dm)^T), an m × p matrix,   R = (R(1), ..., R(m))

- We will use the training data to learn a scoring function f that will be used to rank documents for a new query:

  s_i = f(φ(q^new, d_i^new)),   i = 1, ..., m

- The scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, or NDCG
Many nice, innovative methods have been developed, e.g.:

AdaRank, BoltzRank, Cranking, FRank, Infinite Push, LambdaMART, LambdaRank, ListMLE, ListNet, McRank, MPRank, p-norm push, OWPC, PermuRank, PRank, RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank, SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank
Support Vector Machine
SVMs solve the following problem (with ‖w‖²₂ := w^T w):

  min_w  ‖w‖²₂ + C Σ_{i=1}^{n} ξ_i

where ‖w‖²₂ is the L2 regularization term and C is a tuning parameter, subject to

  ξ_i ≥ 0
  Y_i w^T X_i ≥ 1 − ξ_i

Here Y_i w^T X_i is the margin and ξ_i is a slack variable.
Ranking SVM
- Consider a query q_n and documents d_{n,i} and d_{n,j} such that R_n(i) > R_n(j)
- We would like to have

  w^T φ(q_n, d_{n,i}) > w^T φ(q_n, d_{n,j})

- As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

  ξ_{n,i,j} ≥ 0
  w^T φ(q_n, d_{n,i}) ≥ w^T φ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM
Ranking SVM solves the following problem:

  min_w  ‖w‖²₂ + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ξ_{n,i,j}

subject to the constraints

  ξ_{n,i,j} ≥ 0
  w^T φ(q_n, d_{n,i}) ≥ w^T φ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM as regularized loss minimization
- Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is max{t, 0} =: [t]₊
- This observation lets us write the ranking SVM optimization as

  min_w  ‖w‖²₂ + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} [1 + w^T φ(q_n, d_{n,j}) − w^T φ(q_n, d_{n,i})]₊

  where the first term is the regularizer and the inner sum is the loss on example n
Regularized loss minimization
- Note that

  X_n w = (φ(q_n, d_{n,1})^T w, ..., φ(q_n, d_{n,m})^T w)

- Ranking SVM problem (where λ = 1/C):

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- The loss in ranking SVM is

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊
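The pairwise hinge loss ℓ(s, R) above is a few lines of Python:

```python
def ranking_svm_loss(s, R):
    """Pairwise hinge surrogate: sum over pairs with R(i) > R(j) of [1 + s_j - s_i]_+."""
    m = len(s)
    return sum(max(0.0, 1 + s[j] - s[i])
               for i in range(m) for j in range(m) if R[i] > R[j])

# Scores that respect R = (2, 1, 0) with margin >= 1 incur zero loss:
print(ranking_svm_loss((2.0, 1.0, 0.0), (2, 1, 0)))  # 0.0
print(ranking_svm_loss((0.0, 1.0, 0.0), (2, 1, 0)))  # 3.0: two pairs violated
```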
Why not directly optimize performance measures?
- Let RL be some actual loss you want to minimize (e.g., 1 − NDCG)
- Then why not solve the following?

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} RL(argsort(X_n w), R_n)

- argsort(t) is a ranking σ that puts the elements of t in decreasing sorted order:

  t_{σ^{-1}(1)} ≥ t_{σ^{-1}(2)} ≥ ... ≥ t_{σ^{-1}(m)}
Why not directly optimize performance measures?
- The function w ↦ RL(argsort(Xw), R) is not a nice one from an optimization perspective!
- In the neighborhood of any point, it is either flat or discontinuous
- It can take only one of at most m! different values
- Minimizing a sum of such functions is computationally intractable
Convex surrogates
- The ranking SVM problem is convex, hence easy to optimize
- But it has no direct relation to a performance measure
- Directly optimizing a performance measure leads to non-convex problems
- Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?
Theory: still a long way to go
- An amazing amount of applied work exists
- Yet a lot of theoretical work remains to be done

“However, sometimes benchmark experiments are not as reliable as expected due to the small scales of the training and test data. In this situation, a theory is needed to guarantee the performance of an algorithm on infinite unseen data.”

– O. Chapelle, Y. Chang, and T.-Y. Liu, “Future Directions in Learning to Rank” (2011)
Statistical Ranking Theory: Challenges
- Combinatorial nature of the output space
  - The size of the prediction space is m! (m = number of things being ranked)
- No canonical performance measure
  - Many choices: ERR, AP, (N)DCG
  - Even with a fixed choice, how do we come up with “good” convex surrogates?
- Dealing with diverse types of feedback models
  - Relevance scores, pairwise feedback, click-through rates, preference DAGs
- Large-scale challenges
  - Guarantees for parallel, distributed, and online/streaming approaches
Our focus
- Human feedback in the form of relevance scores
- Regularized (convex) loss minimization based ranking methods:

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- A few commonly used performance measures: NDCG, AP
- Generalization error bounds and consistency
Standard Setup
- Standard assumption in theory: the examples

  (X_1, R_1), ..., (X_N, R_N)

  are drawn iid from an underlying (unknown) distribution
- Consider the constrained form of loss minimization:

  ŵ = argmin_{‖w‖₂ ≤ B}  (1/N) Σ_{n=1}^{N} ℓ(X_n w, R_n)
Risk
- Risk of w: the performance of w on “infinite unseen data”

  R(w) = E_{X,R}[ℓ(Xw, R)]

  Note: this cannot be computed without knowledge of the underlying distribution
- Empirical risk of w: the performance of w on the training set

  R̂(w) = (1/N) Σ_{n=1}^{N} ℓ(X_n w, R_n)

  Note: this can be computed using just the training data
Generalization error bound
- We say that ŵ is an empirical risk minimizer (ERM):

  ŵ = argmin_{‖w‖₂ ≤ B} R̂(w)

- We are typically interested in bounds on the generalization error R(ŵ)
- For instance, we expect the generalization error to be larger than the empirical risk R̂(ŵ), but by how much?
Generalization error bound
We might therefore want a generalization error bound of the form

  R(ŵ) ≤ R̂(ŵ) + stuff

[Figure: the risk curve R(w) and the empirical risk curve R̂(w) plotted over w, with the gap R(ŵ) − R̂(ŵ) marked at the empirical minimizer ŵ.]
Rademacher complexity
- Let ε_1, ..., ε_N be iid Rademacher (symmetric Bernoulli ±1) random variables
- Then, with probability at least 1 − δ,

  R(ŵ) ≤ R̂(ŵ) + E_{ε_1,...,ε_N}[ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^{N} ε_n ℓ(X_n w, R_n) | ] + √(log(1/δ)/N)

- The quantity

  E_{ε_1,...,ε_N}[ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^{N} ε_n ℓ(X_n w, R_n) | ]

  is called the Rademacher complexity
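The Rademacher complexity can be illustrated with a Monte Carlo sketch. Everything below is a toy: the supremum over the ball {w : ‖w‖₂ ≤ B} is approximated by a small finite candidate set, and the data and loss are synthetic; only the shape of the quantity (an expectation over random signs of a supremum of an empirical average) matches the definition above.

```python
import random

def rademacher_estimate(loss, data, candidates, trials=200, seed=0):
    """Monte Carlo estimate of E_eps[ max_w | (1/N) sum_n eps_n * loss_n | ],
    with the max taken over a finite candidate set standing in for the ball."""
    rng = random.Random(seed)
    N = len(data)
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(N)]
        total += max(abs(sum(e * loss(w, x, r) for e, (x, r) in zip(eps, data))) / N
                     for w in candidates)
    return total / trials

# Synthetic 1-d losses bounded in [0, 1]; candidates stand in for {|w| <= 1}.
loss = lambda w, x, r: min(1.0, abs(w * x - r))
data = [(0.1 * i, i % 2) for i in range(20)]
print(rademacher_estimate(loss, data, [-1.0, 0.0, 1.0]))
```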
How does Rademacher complexity scale?

The best previous bound (Chapelle and Wu, 2010) on the Rademacher complexity depended on:

- The loss function ℓ: a more “complicated” loss means bigger complexity
- The size B of the w's: larger B means bigger complexity
- The size of the features, say ‖φ(q, d_i)‖₂ ≤ R: larger R means bigger complexity
- The sample size N: larger N means smaller complexity
- The number m of documents per query: larger m means larger complexity

Roughly speaking, it scaled as:

  Lip_{‖·‖₂}(ℓ) · B · R · √m · √(log(1/δ)/N)
Lipschitz constant
- Fix an arbitrary relevance vector R
- How much does ℓ(s, R) change as we change s?
- The Lipschitz constant Lip_{‖·‖}(ℓ) is the smallest L such that

  ∀R,  |ℓ(s, R) − ℓ(s′, R)| ≤ L · ‖s − s′‖

- The previous proof technique needs to use Lip_{‖·‖₂}(ℓ)
- We get a better bound using Lip_{‖·‖∞}(ℓ) ≤ √m · Lip_{‖·‖₂}(ℓ)
Improved bound on Rademacher complexity

Our (Chaudhuri and T., 2015) improved bound on the Rademacher complexity depends on:

- The loss function ℓ: the Lipschitz constant is measured w.r.t. the ℓ∞ norm
- The size B of the w's: larger B means bigger complexity
- The size of the features, say ‖φ(q, d_i)‖₂ ≤ R: larger R means bigger complexity
- The sample size N: larger N means smaller complexity
- It does not depend on the number m of documents per query

Roughly speaking, it scales as:

  Lip_{‖·‖∞}(ℓ) · B · R · √(log(1/δ)/N)
Another type of generalization bound
- So far we have seen bounds of the form

  R(ŵ) ≤ R̂(ŵ) + stuff

- However, R̂(ŵ) ≤ R̂(w*), and R̂(w*) concentrates around R(w*)
- Therefore, it is also easy to derive bounds of the form

  R(ŵ) ≤ R(w*) + stuff
Generalization bounds: recap
- The risk R(w) is the expected loss of w on infinite, unseen data
- The empirical risk R̂(w) is the average loss on the training data
- A ŵ that minimizes the loss on the training data is called an empirical risk minimizer (ERM)
- We saw two types of generalization error bounds:

  R(ŵ) ≤ R̂(ŵ) + stuff
  R(ŵ) ≤ R(w*) + stuff   (R(w*) is the minimum possible risk)

- The Rademacher complexity need not scale with the length of the document lists
Richer function classes for more data
- As we see more data, it is natural to enlarge the function class over which the ERM is computed:

  f̂_N = argmin_{f ∈ F_N}  (1/N) Σ_{n=1}^{N} ℓ(f(X_n), R_n) = argmin_{f ∈ F_N} R̂(f)

  where f(X_n) = (f(φ(q_n, d_{n,1})), ..., f(φ(q_n, d_{n,m})))

- E.g., as N grows, F_N can be larger decision trees, deeper neural networks, or SVMs with less regularization
Estimation error vs approximation error
- Let f*(F_N) be the best function in F_N and f* be the best overall function:

  f*(F_N) = argmin_{f ∈ F_N} R(f),   f* = argmin_f R(f)

- We can decompose the excess risk:

  R(f̂_N) − R(f*) = [R(f̂_N) − R(f*(F_N))] + [R(f*(F_N)) − R(f*)]

  where the first bracket is the estimation error and the second is the approximation error
Consistency
- If F_N grows too quickly, the estimation error will not go to zero
- If F_N grows too slowly, the approximation error will not go to zero
- By judiciously growing F_N, we can guarantee that both errors go to zero
- This results in (strong) universal consistency: for all data distributions,

  R(f̂_N) → R(f*) a.s. as N → ∞
What did we achieve?
- We said that we will use a nice (e.g., convex) loss ℓ as a surrogate and train a scoring function on finite data using ℓ:

  f̂_N = argmin_{f ∈ F_N}  (1/N) Σ_{n=1}^{N} ℓ(f(X_n), R_n)

- Enlarging F_N neither too quickly nor too slowly with N will give us

  R(f̂_N) → R(f*) as N → ∞

- Great, but how good is f̂_N according to ranking performance measures (NDCG, ERR, AP)?
Risk under ranking loss
- Recall that R(f) is the risk of f under the surrogate ℓ:

  R(f) = E_{X,R}[ℓ(f(X), R)]

- Similarly, there is L(f), the risk of f under a target ranking loss RL:

  L(f) = E_{X,R}[RL(argsort(f(X)), R)]

- The argsort is required because the first argument of RL has to be a ranking, not a score vector
Calibration
- Fix a target loss RL (e.g., 1 − NDCG) and a surrogate ℓ (e.g., the ranking SVM loss)
- Suppose we know that

  R(f̂_N) → min_f R(f) as N → ∞

- Does this automatically guarantee the following?

  L(f̂_N) → min_f L(f) as N → ∞

- If yes, then we say that the surrogate ℓ is calibrated w.r.t. the target loss RL
Calibration in binary classification
- Consider the simpler setting of binary classification, where X_i ∈ R^p and Y_i ∈ {+1, −1}
- We learn real-valued functions f : R^p → R and take the sign to output a class label
- Given a surrogate ℓ (hinge loss, logistic loss), we can define

  R(f) = E_{X,Y}[ℓ(f(X), Y)]

- The target loss is often just the 0-1 loss:

  L(f) = E_{X,Y}[1(Y f(X) ≤ 0)]
Calibration in binary classification
- Suppose ℓ(f(X), Y) = φ(Y f(X)) (a margin loss)
- For a margin-based convex surrogate ℓ, when does

  R(f̂_N) → min_f R(f)

  imply

  L(f̂_N) → min_f L(f) ?

- Surprisingly simple answer (Bartlett, Jordan, McAuliffe, 2006): φ′(0) < 0
Examples of calibrated surrogates
[Figure: the 0-1 loss together with the hinge, logistic, and squared losses, plotted as functions of the margin Y f(X).]
Calibration w.r.t. DCG
- Recall that DCG is defined as:

  DCG(σ, R) = Σ_{i=1}^{m} F(R(i)) / G(σ(i))

- Let η be a distribution over relevance vectors R
- The maximizer σ(η) of

  σ ↦ E_{R∼η}[DCG(σ, R)]

  is argsort(E_{R∼η}[F(R)]), with F applied coordinatewise
Calibration w.r.t. (N)DCG
- Consider the score vector s(η) that minimizes

  s ↦ E_{R∼η}[ℓ(s, R)]

- Necessary and sufficient condition for DCG calibration: for every η, s(η) has the same sorted order as E_{R∼η}[F(R)]
- We (Ravikumar, T., Yang, 2011) derived a sufficient (and almost necessary) condition for calibration w.r.t. NDCG by taking the normalization into account:

  ℓ(s, R) = Bregman(normalized(R), order-preserving(s))
Calibration w.r.t. DCG: Example
- Consider the regression-based surrogate

  ℓ(s, R) = Σ_{i=1}^{m} (s_i − F(R(i)))²

- It is easy to see that s(η) is exactly E_{R∼η}[F(R)]
- Thus, this regression-based surrogate is DCG calibrated
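This calibration claim can be checked numerically on a toy example. The two-point distribution η below is assumed for illustration only: the coordinatewise mean E[F(R)] (the surrogate minimizer) sorts into the same order as the brute-force DCG-optimal ranking.

```python
import itertools
import math

F = lambda r: 2 ** r - 1
# Assumed toy distribution eta: mass 0.5 on each of two relevance vectors.
eta = [((2, 0, 1), 0.5), ((0, 0, 2), 0.5)]
m = 3

# The regression surrogate's minimizer is the coordinatewise mean E[F(R)]:
s_eta = [sum(p * F(R[i]) for R, p in eta) for i in range(m)]

def dcg(sigma, R):
    return sum(F(R[i]) / math.log2(1 + sigma[i]) for i in range(m))

# Brute-force the DCG-optimal ranking under eta and compare sorted orders.
best = max(itertools.permutations(range(1, m + 1)),
           key=lambda sig: sum(p * dcg(sig, R) for R, p in eta))
by_scores = sorted(range(m), key=lambda i: -s_eta[i])  # docs by decreasing s_eta
by_best = sorted(range(m), key=lambda i: best[i])      # docs by rank under best
print(s_eta, best, by_scores == by_best)
```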
Calibration w.r.t. other measures
- The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
- Actually, the situation is more complex!
- It is known that there do not exist any convex calibrated surrogates for ERR and AP
- Caveat: this non-existence result assumes that we're using argsort to move from scores to rankings
Calibration w.r.t. AP
- Suppose we don't restrict ourselves to learning R^m-valued scores (followed by argsort)
- Then we (Ramaswamy, Agarwal, T., 2013) constructed a convex surrogate in O(m²) dimensions that is calibrated w.r.t. AP
- Caveat: the mapping from scores to permutations involves solving an NP-hard problem
- Open: is there a convex surrogate in poly(m) dimensions such that going from scores to a permutation takes poly(m) time?
Consistency: recap
- Consistency is an asymptotic (sample size N → ∞) property of learning algorithms
- A consistent algorithm's performance reaches that of the “best” ranker as it sees more and more data
- A learning algorithm that minimizes a surrogate ℓ has to balance the estimation error / approximation error trade-off
- If the trade-off is managed well, the algorithm will be consistent in the sense of ℓ
- If this also implies consistency w.r.t. a target loss RL, we say ℓ is calibrated w.r.t. RL
- Calibration in ranking is trickier than in classification: easy in some cases (e.g., NDCG), not so easy in others (e.g., ERR, AP)
The online protocol
Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting:

For t = 1, ..., N:
- The ranker gets feature vectors X_t ∈ R^{m×p}
- The ranker outputs a scoring function f_t : R^p → R
- The relevance vector R_t is revealed
- The ranker suffers loss ℓ(f_t(X_t), R_t)
Regret
- In online settings, we often look at the regret:

  Σ_{t=1}^{N} ℓ(f_t(X_t), R_t)  −  min_{f ∈ F} Σ_{t=1}^{N} ℓ(f(X_t), R_t)

  i.e., the algorithm's loss minus the best loss in hindsight
- If ℓ is convex and we're using linear scoring functions f(x) = w^T x, then we can use tools from online convex optimization (OCO)
- Great OCO introduction here: http://ocobook.cs.princeton.edu/
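A minimal OGD sketch under this protocol, assuming linear scores s = Xw and the pairwise hinge surrogate from earlier slides; the data below is synthetic, and the lr/√t step size is the standard schedule behind O(√N) regret bounds.

```python
import math

def hinge_subgradient(X, w, R):
    """Subgradient in w of sum_{R(i)>R(j)} [1 + s_j - s_i]_+ with s = X w."""
    s = [sum(xk * wk for xk, wk in zip(row, w)) for row in X]
    g = [0.0] * len(w)
    for i in range(len(X)):
        for j in range(len(X)):
            if R[i] > R[j] and 1 + s[j] - s[i] > 0:  # violated pair
                for k in range(len(w)):
                    g[k] += X[j][k] - X[i][k]
    return g

def ogd(stream, p, lr=1.0):
    """One pass of online gradient descent over (X_t, R_t) pairs."""
    w = [0.0] * p
    for t, (X, R) in enumerate(stream, start=1):
        g = hinge_subgradient(X, w, R)
        w = [wk - (lr / math.sqrt(t)) * gk for wk, gk in zip(w, g)]
    return w

# Synthetic stream: two documents, d1 relevant, d2 not.
X, R = [[1.0, 0.0], [0.0, 1.0]], (1, 0)
w = ogd([(X, R)] * 5, p=2)
print(w[0] > w[1])  # True: learned w scores the relevant document higher
```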
Perceptron-like algorithm for LeToR
- Standard OGD (online gradient descent) using the ranking SVM surrogate

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊

  gives an O(√N) regret bound
- γ-separability: ∃w*, ‖w*‖₂ = 1, such that whenever d_i is more relevant than d_j for query q,

  (w*)^T φ(q, d_i) ≥ (w*)^T φ(q, d_j) + γ

- Under γ-separability, we (Chaudhuri, T., 2015) show

  cumulative NDCG/AP loss ≤ poly(m) · R² / γ²
Top-K feedback protocol
For t = 1, ..., N:
- The ranker gets feature vectors X_t ∈ R^{m×p}
- The ranker outputs a scoring function f_t : R^p → R
- Only the top-K relevance scores are revealed:

  R_t(σ_t^{-1}(1)), ..., R_t(σ_t^{-1}(K)),   where σ_t = argsort(f_t(X_t))

- The ranker suffers loss ℓ(f_t(X_t), R_t)
Algorithm for Top-K LeToR
- We can't run the OGD algorithm with the ranking SVM surrogate

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊

  since neither the loss nor its gradient is observed
- However, by using the right amount of exploration, we can get an unbiased estimate of the gradient with controlled variance
- We (Chaudhuri, T., 2015) show an O(N^{2/3}) regret bound in this challenging top-K feedback setting
Summary
- Ranking problems arise in a variety of contexts
- Learning to rank can be, and has been, viewed as a supervised ML problem
- We gave new, improved generalization error bounds for ranking methods
- We investigated consistency of convex-surrogate-based ranking algorithms for NDCG and AP
- We considered a novel feedback model (top-K feedback) and gave online regret bounds
Collaborators
Eunho Harish Pradeep Shivani Sougata
Papers
- Generalization error bounds: Chaudhuri, Tewari. Generalization error bounds for learning to rank: Does the length of document lists matter? ICML 2015
- NDCG calibration: Ravikumar, Tewari, Yang. On NDCG consistency of listwise ranking methods. AISTATS 2011
- Partial progress on AP calibration: Ramaswamy, Agarwal, Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. NIPS 2013
- Top-K feedback (fixed set of objects): Chaudhuri, Tewari. Online ranking with top-1 feedback. AISTATS 2015 (Honorable Mention, Best Student Paper Award)
Thanks
Thank you for your attention!