
Statistical learning theory for ranking problems

Ambuj Tewari

University of Michigan

September 3, 2015 / MLSIG Seminar @ IISc


Outline

1 Learning to rank as supervised ML
  Features
  Label and Output Spaces
  Performance Measures
  Ranking functions

2 Theory for learning to rank
  Regularized empirical loss minimization
  Generalization error bounds
  Consistency

3 Ongoing work
  Online learning to rank
  Top-K feedback


1 Learning to rank as supervised ML


Ranking problems are everywhere

Web search: rank webpages in order of their relevance to a query
News stories: rank stories according to user interests at a particular time (Google News, Facebook News Feed)
Recommender systems: rank movies (Netflix, IMDB), music (Pandora), or consumer goods (Amazon) according to a user's preferences
Social networks: rank people in order of their likelihood of becoming friends with a given user
Bioinformatics: rank genes in order of their biological role in a given disease


Ranking problems are everywhere

“What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."

–Herbert Simon (1916–2001)
Nobel Laureate (Economics), ACM Turing Award


Learning to rank objects

Creating a ranking manually requires a lot of effort
Can we have computers learn to rank objects?
We will still need some human supervision
Pose learning to rank as a supervised ML problem


Ranking as a canonical learning problem


Ranking as a canonical learning problem

NIPS Workshops: 2002, 2005, 2009, 2014
SIGIR Workshops: 2007, 2008, 2009
Journal of Information Retrieval Special Issue in 2010
Yahoo! Learning to Rank Challenge: 2010
http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf

LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007–2010
http://research.microsoft.com/en-us/um/beijing/projects/letor/


Learning to rank as supervised ML

Supervised machine learning: construct a mapping

$$f : I \to O$$

based on input-output pairs (the training set)

$$(X_1, Y_1), \ldots, (X_N, Y_N)$$

where X_n ∈ I and Y_n ∈ O
We want f(X) to be a good predictor of Y for unseen X
In ranking problems, O is the set of all possible rankings of a given set of objects (webpages, movies, etc.)


A learning-to-rank training example

Query: “Learning to rank”
Candidate webpages and relevance scores:

en.wikipedia.org/wiki/Learning_to_rank : 2 or “good”
research.microsoft.com/.../hangli/l2r.pdf : 2 or “good”
code.google.com/p/jforests/ : 1 or “fair”
www.vmathlive.com : 0 or “bad”

Think: example X = (query, (URL1, URL2, ..., URL4)) and label R = (2, 2, 1, 0)


Typical ML cycle

1 Feature construction: find a way to map (query, webpage) into R^p. Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}

2 Obtain labels: get R_n ∈ {0, 1, 2, ..., R_max}^m from human judgements

3 Train ranker: get f : I → O = S_m, where S_m is the set of permutations of degree m. Training is usually done by solving an optimization problem on the training set

4 Evaluate performance: using some performance measure, evaluate the ranker on a "test set"


Joint query, document features

Number of query terms that occur in document

Document length

Sum/min/max/mean/variance of term frequencies

Sum/min/max/mean/variance of tf-idf

Cosine similarity between query and document

Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)

Number of slashes in URL, length of URL, PageRank, site-level PageRank

Query-URL click count, user-URL click count
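
As an illustration (ours, not from the talk), here is a minimal Python sketch of a few of these features for one (query, document) pair, assuming whitespace tokenization and raw term counts:

```python
import math
from collections import Counter

def features(query: str, doc: str) -> list:
    """Toy feature map for one (query, document) pair: a few of the
    features listed above, using whitespace tokenization and raw counts."""
    q_terms = query.lower().split()
    tf = Counter(doc.lower().split())
    q_tf = [tf[t] for t in q_terms]          # term frequency of each query term

    # cosine similarity between raw term-count vectors
    qc = Counter(q_terms)
    dot = sum(qc[t] * tf[t] for t in qc)
    q_norm = math.sqrt(sum(v * v for v in qc.values()))
    d_norm = math.sqrt(sum(v * v for v in tf.values()))
    cosine = dot / (q_norm * d_norm) if q_norm * d_norm > 0 else 0.0

    return [
        sum(1 for t in set(q_terms) if tf[t] > 0),  # query terms occurring in doc
        float(sum(tf.values())),                    # document length
        float(sum(q_tf)),                           # sum of term frequencies
        float(min(q_tf, default=0)),                # min of term frequencies
        float(max(q_tf, default=0)),                # max of term frequencies
        cosine,                                     # query-document cosine similarity
    ]

# An m-document query then maps to an m x p feature matrix:
# X = [features(q, d) for d in docs]
```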


A slightly more general supervised ML setting

For a future query and webpage list, we want to rank the webpages
We don't necessarily want to predict the relevance scores themselves
Let f (the ranking function) take values in the space of rankings
So now, training examples (X, Y) ∈ I × L, where L = label space
The function f to be learned maps I to O (note L ≠ O)
E.g. (1, 2, 2, 0, 0) ∈ L and (d_1 → 3, d_2 → 1, d_3 → 2, d_4 → 4, d_5 → 5) ∈ O


Notation for rankings

We'll think of m (the number of documents) as fixed, but in reality it varies across examples
We'll denote a ranking by a permutation σ ∈ S_m (the set of all m! permutations)
σ(i) is the rank/position of document i
σ^{-1}(j) is the document at rank/position j


Notation for rankings

Suppose we have 3 documents d_1, d_2, d_3 and we want them to be ranked thus: d_2, d_3, d_1

So the ranks are d_1 : 3, d_2 : 1, d_3 : 2
This means σ(1) = 3, σ(2) = 1, σ(3) = 2
Also, σ^{-1}(1) = 2, σ^{-1}(2) = 3, σ^{-1}(3) = 1
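
The same example in code, a quick sketch (ours) of the convention, storing σ as a list of 1-indexed ranks:

```python
def invert(sigma: list) -> list:
    """Given sigma with sigma[i] = rank of document i+1 (1-indexed ranks),
    return sigma_inv with sigma_inv[j-1] = document at rank j."""
    sigma_inv = [0] * len(sigma)
    for doc, rank in enumerate(sigma, start=1):
        sigma_inv[rank - 1] = doc
    return sigma_inv

sigma = [3, 1, 2]      # sigma(1)=3, sigma(2)=1, sigma(3)=2
print(invert(sigma))   # [2, 3, 1]: sigma^{-1}(1)=2, sigma^{-1}(2)=3, sigma^{-1}(3)=1
```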


Notation for relevance scores/labels

Relevance labels can be binary or multi-graded
In the binary case, documents are either relevant or irrelevant (1 or 0)
In the multi-graded case, relevance labels are from {0, 1, 2, ..., R_max}
Typically, R_max ≤ 4 (at most 5 levels of relevance)
A relevance score/label vector R is an m-vector
R(i) gives the relevance of document i


Performance measures in ranking

Performance measures can be gains (to be maximized) or losses (to be minimized)
They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
Some performance measures are defined only for binary relevance
Others can handle more general multi-graded relevance


Performance measure: RR

Reciprocal rank (RR) is defined for binary relevance only
RR is the reciprocal of the rank of the first relevant document according to the ranking:

$$\mathrm{RR}(\sigma, R) = \frac{1}{\min\{\sigma(i) : R(i) = 1\}}$$

Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2); then RR = 1/2

original    ranked
d_1 : 0     d_1 : 0
d_2 : 1     d_4 : 1  ← RR = 1/2
d_3 : 0     d_2 : 1
d_4 : 1     d_3 : 0
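
A minimal sketch of RR in code (ours; returning 0 when no document is relevant is our convention, not the slide's):

```python
def reciprocal_rank(sigma: list, R: list) -> float:
    """RR for binary relevance. sigma[i] is the (1-indexed) rank of document i+1."""
    relevant_ranks = [rank for rank, rel in zip(sigma, R) if rel == 1]
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0

print(reciprocal_rank([1, 3, 4, 2], [0, 1, 0, 1]))  # 0.5
```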


Performance measure: Precision@K

Precision@K is for binary relevance only
Precision@K is the fraction of relevant documents in the top K:

$$\mathrm{Prec@}K(\sigma, R) = \frac{1}{K} \sum_{j=1}^{K} R(\sigma^{-1}(j))$$

original    ranked    Prec@K
d_1 : 0     d_1 : 0   0/1
d_2 : 1     d_4 : 1   1/2
d_3 : 0     d_2 : 1   2/3
d_4 : 1     d_3 : 0   2/4
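
A sketch of Prec@K (ours), taking the ranked list σ^{-1} directly:

```python
def precision_at_k(sigma_inv: list, R: list, k: int) -> float:
    """Fraction of relevant docs among the top k. sigma_inv[j-1] is the
    (1-indexed) document at rank j."""
    return sum(R[doc - 1] for doc in sigma_inv[:k]) / k

sigma_inv = [1, 4, 2, 3]                           # ranked order: d1, d4, d2, d3
print(precision_at_k(sigma_inv, [0, 1, 0, 1], 2))  # 0.5
```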


Performance measure: AP

Average Precision (AP) is also for binary relevance only
It is the average of Prec@K over the positions of relevant documents only:

$$\mathrm{AP}(\sigma, R) = \frac{1}{\sum_{i=1}^{m} R(i)} \cdot \sum_{j : R(\sigma^{-1}(j)) = 1} \mathrm{Prec@}j$$

original    ranked    Prec@K
d_1 : 0     d_1 : 0   0/1
d_2 : 1     d_4 : 1   1/2 ←
d_3 : 0     d_2 : 1   2/3 ←
d_4 : 1     d_3 : 0   2/4

AP = (1/2 + 2/3) / 2
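
A sketch of AP (ours), reusing the ranked-list representation:

```python
def average_precision(sigma_inv: list, R: list) -> float:
    """Mean of Prec@j over ranks j that hold a relevant document."""
    precisions, hits = [], 0
    for j, doc in enumerate(sigma_inv, start=1):
        if R[doc - 1] == 1:
            hits += 1
            precisions.append(hits / j)
    return sum(precisions) / hits if hits else 0.0

print(average_precision([1, 4, 2, 3], [0, 1, 0, 1]))  # (1/2 + 2/3) / 2 ≈ 0.583
```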


Performance measure: DCG

Discounted Cumulative Gain (DCG) is for multi-graded relevance
Contributions of relevant documents further down the ranking are discounted more:

$$\mathrm{DCG}(\sigma, R) = \sum_{i=1}^{m} \frac{F(R(i))}{G(\sigma(i))}$$

F and G are increasing functions
Standard choice: F(r) = 2^r − 1 and G(j) = log_2(1 + j)
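
A sketch of DCG with the standard F and G (ours):

```python
import math

def dcg(sigma: list, R: list) -> float:
    """DCG with the standard choice F(r) = 2^r - 1 and G(j) = log2(1 + j).
    sigma[i] is the rank of document i+1; R[i] its relevance grade."""
    return sum((2 ** r - 1) / math.log2(1 + rank) for rank, r in zip(sigma, R))

print(dcg([1, 3, 4, 2], [0, 1, 0, 1]))  # 1/log2(3) + 1/log2(4) ≈ 1.13
```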


Performance measure: NDCG

Normalized DCG (NDCG), like DCG, is for multi-graded relevance
As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1:

$$\mathrm{NDCG}(\sigma, R) = \frac{\mathrm{DCG}(\sigma, R)}{Z(R)}$$

The normalization constant

$$Z(R) = \max_{\sigma} \mathrm{DCG}(\sigma, R)$$

can be computed quickly by sorting the relevance vector
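
Continuing the sketch above, NDCG divides by the DCG of the best ranking, obtained by sorting the relevance vector in decreasing order:

```python
def ndcg(sigma: list, R: list) -> float:
    """NDCG = DCG / max-DCG; the maximizer ranks docs by decreasing relevance."""
    best = sorted(R, reverse=True)
    ideal_sigma = list(range(1, len(R) + 1))   # ranks 1..m for the sorted docs
    z = dcg(ideal_sigma, best)
    return dcg(sigma, R) / z if z > 0 else 0.0

print(ndcg([1, 3, 4, 2], [0, 1, 0, 1]))  # ≈ 0.69
```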


Ranking functions

A ranking function maps m query-document p-dimensional feature vectors to a permutation:

$$X = \underbrace{\begin{pmatrix} \phi(q, d_1)^\top \\ \vdots \\ \phi(q, d_m)^\top \end{pmatrix}}_{m \times p \text{ matrix}} \;\longrightarrow\; \begin{matrix} d_5 \Leftrightarrow \sigma(5) = 1 \\ \vdots \\ d_{12} \Leftrightarrow \sigma(12) = m \end{matrix}$$

So mathematically it is a map from R^{m×p} to S_m


Ranking functions from scoring functions

However, it is hard to learn functions that map into combinatorial spaces
In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
In ranking, we do the same but replace thresholding by sorting
For example, let f : R^p → R be a scoring function:

$$\begin{pmatrix} \phi(q, d_1)^\top \\ \phi(q, d_2)^\top \\ \phi(q, d_3)^\top \end{pmatrix} \overset{f}{\Longrightarrow} \begin{pmatrix} -2.3 \\ 4.2 \\ 1.9 \end{pmatrix} \overset{\text{sort}}{\Longrightarrow} \begin{matrix} d_2 \Leftrightarrow \sigma(2) = 1 \\ d_3 \Leftrightarrow \sigma(3) = 2 \\ d_1 \Leftrightarrow \sigma(1) = 3 \end{matrix}$$
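
A sketch (ours) of the sort step, turning a score vector into the permutation σ:

```python
def scores_to_ranks(scores: list) -> list:
    """Return sigma with sigma[i] = rank of document i+1 (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sigma = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        sigma[doc] = rank
    return sigma

print(scores_to_ranks([-2.3, 4.2, 1.9]))  # [3, 1, 2]: d2 first, d3 second, d1 last
```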


Recap

Training data consists of query-document features and relevance scores:

$$X = \underbrace{\begin{pmatrix} \phi(q, d_1)^\top \\ \vdots \\ \phi(q, d_m)^\top \end{pmatrix}}_{m \times p \text{ matrix}}, \qquad R = \begin{pmatrix} R(1) \\ \vdots \\ R(m) \end{pmatrix}$$

We will use the training data to learn a scoring function f that will be used to rank documents for a new query:

$$\begin{pmatrix} s_1 \\ \vdots \\ s_m \end{pmatrix} = \begin{pmatrix} f(\phi(q^{\mathrm{new}}, d^{\mathrm{new}}_1)) \\ \vdots \\ f(\phi(q^{\mathrm{new}}, d^{\mathrm{new}}_m)) \end{pmatrix}$$

The scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, or NDCG


2 Theory for learning to rank


Many nice, innovative methods have been developed, e.g.:

AdaRank, BoltzRank, Cranking, FRank, Infinite Push, LambdaMART, LambdaRank, ListMLE, ListNet, McRank, MPRank, p-norm push, OWPC, PermuRank, PRank, RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank, SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank


Support Vector Machine

SVMs solve the following problem (where $\|w\|_2^2 := w^\top w$):

$$\min_{w} \; \underbrace{\|w\|_2^2}_{\text{L2 regularization}} + \underbrace{C}_{\text{tuning param.}} \sum_{i=1}^{n} \xi_i$$

subject to

$$\xi_i \ge 0, \qquad \underbrace{Y_i\, w^\top X_i}_{\text{margin}} \;\ge\; 1 - \underbrace{\xi_i}_{\text{slack variable}}$$


Ranking SVM

Consider a query q_n and documents d_{n,i} and d_{n,j} such that

$$R_n(i) > R_n(j)$$

We would like to have

$$w^\top \phi(q_n, d_{n,i}) > w^\top \phi(q_n, d_{n,j})$$

As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

$$\xi_{n,i,j} \ge 0, \qquad w^\top \phi(q_n, d_{n,i}) \ge w^\top \phi(q_n, d_{n,j}) + 1 - \xi_{n,i,j}$$


Ranking SVM

Ranking SVM solves the following problem:

$$\min_{w} \; \|w\|_2^2 + C \sum_{n=1}^{N} \sum_{R_n(i) > R_n(j)} \xi_{n,i,j}$$

subject to the constraints

$$\xi_{n,i,j} \ge 0, \qquad w^\top \phi(q_n, d_{n,i}) \ge w^\top \phi(q_n, d_{n,j}) + 1 - \xi_{n,i,j}$$


Ranking SVM as regularized loss minimization

Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is

$$\max\{t, 0\} =: [t]_+$$

This observation lets us write the ranking SVM optimization as

$$\min_{w} \; \underbrace{\|w\|_2^2}_{\text{regularizer}} + C \sum_{n=1}^{N} \underbrace{\sum_{R_n(i) > R_n(j)} \big[1 + w^\top \phi(q_n, d_{n,j}) - w^\top \phi(q_n, d_{n,i})\big]_+}_{\text{loss on example } n}$$


Regularized loss minimization

Note that

$$X_n w = \begin{pmatrix} \phi(q_n, d_{n,1})^\top w \\ \vdots \\ \phi(q_n, d_{n,m})^\top w \end{pmatrix}$$

The ranking SVM problem (where λ = 1/C) is then

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} \ell(X_n w, R_n)$$

The loss in ranking SVM is

$$\ell(s, R) = \sum_{R(i) > R(j)} [1 + s_j - s_i]_+$$
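
A minimal sketch (ours, plain NumPy with batch subgradient descent; the step size, regularization strength, and epoch count are arbitrary illustrative choices) of minimizing this regularized objective:

```python
import numpy as np

def ranking_svm_loss(s, R):
    """Pairwise hinge loss sum over R(i) > R(j) of [1 + s_j - s_i]_+ for one query."""
    loss = 0.0
    for i in range(len(R)):
        for j in range(len(R)):
            if R[i] > R[j]:
                loss += max(0.0, 1.0 + s[j] - s[i])
    return loss

def fit(Xs, Rs, lam=0.1, lr=0.01, epochs=100):
    """Subgradient descent on lam * ||w||^2 + sum_n loss(X_n w, R_n).
    Xs: list of (m, p) feature matrices; Rs: list of length-m relevance vectors."""
    w = np.zeros(Xs[0].shape[1])
    for _ in range(epochs):
        g = 2 * lam * w                        # gradient of the regularizer
        for X, R in zip(Xs, Rs):
            s = X @ w
            for i in range(len(R)):
                for j in range(len(R)):
                    if R[i] > R[j] and 1.0 + s[j] - s[i] > 0:
                        g += X[j] - X[i]       # subgradient of the active hinge term
        w -= lr * g
    return w
```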


Why not directly optimize performance measures?

Let RL be some actual loss you want to minimize (e.g., 1 − NDCG)
Then why not solve the following?

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} \mathrm{RL}(\mathrm{argsort}(X_n w), R_n)$$

Here argsort(t) is a ranking σ that puts the elements of t in decreasing sorted order:

$$t_{\sigma^{-1}(1)} \ge t_{\sigma^{-1}(2)} \ge \ldots \ge t_{\sigma^{-1}(m)}$$


Why not directly optimize performance measures?

The function

$$w \mapsto \mathrm{RL}(\mathrm{argsort}(Xw), R)$$

is not a nice one (from an optimization perspective)!
In the neighborhood of any point, it is either flat or discontinuous
It can take only one of at most m! different values
Minimizing a sum of such functions is computationally intractable


Convex surrogates

The ranking SVM problem is convex, hence easy to optimize
But it has no direct relation to a performance measure
Directly optimizing a performance measure → non-convex problems
Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?


Theory: still a long way to go

An amazing amount of applied work exists
Yet, a lot of theoretical work remains to be done

“However, sometimes benchmark experiments are not as reliable as expected due to the small scales of the training and test data. In this situation, a theory is needed to guarantee the performance of an algorithm on infinite unseen data.”

–O. Chapelle, Y. Chang, and T.-Y. Liu
“Future Directions in Learning to Rank” (2011)


Statistical Ranking Theory: Challenges

Combinatorial nature of the output space
  Size of the prediction space is m! (m = number of things being ranked)

No canonical performance measure
  Many choices: ERR, AP, (N)DCG
  Even with a fixed choice, how to come up with "good" convex surrogates?

Dealing with diverse types of feedback models
  relevance scores, pairwise feedback, click-through rates, preference DAGs

Large-scale challenges
  guarantees for parallel, distributed, online/streaming approaches


Our focus

Human feedback in the form of relevance scores
Regularized (convex) loss minimization based ranking methods:

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} \ell(X_n w, R_n)$$

A few commonly used performance measures: NDCG, AP
Generalization error bounds and consistency


Standard Setup

Standard assumption in theory: the examples

$$(X_1, R_1), \ldots, (X_N, R_N)$$

are drawn iid from an underlying (unknown) distribution
Consider the constrained form of loss minimization:

$$\widehat{w} = \underset{\|w\|_2 \le B}{\mathrm{argmin}} \; \frac{1}{N} \sum_{n=1}^{N} \ell(X_n w, R_n)$$


Risk

Risk of w: the performance of w on "infinite unseen data"

$$R(w) = \mathbb{E}_{X,R}\,[\ell(Xw, R)]$$

Note: cannot be computed without knowledge of the underlying distribution
Empirical risk of w: the performance of w on the training set

$$\widehat{R}(w) = \frac{1}{N} \sum_{n=1}^{N} \ell(X_n w, R_n)$$

Note: can be computed using just the training data


Generalization error bound

We say that $\widehat{w}$ is an empirical risk minimizer (ERM):

$$\widehat{w} = \underset{\|w\|_2 \le B}{\mathrm{argmin}} \; \widehat{R}(w)$$

We're typically interested in bounds on the generalization error $R(\widehat{w})$
For instance, we expect the generalization error to be larger than the empirical risk $\widehat{R}(\widehat{w})$, but by how much?


Generalization error bound

We might therefore want a generalization error bound of the form

$$R(\widehat{w}) \le \widehat{R}(\widehat{w}) + \text{stuff}$$

[Figure: the risk R(w) and empirical risk R̂(w) plotted as curves over w, marking ŵ and the gap R(ŵ) − R̂(ŵ)]


Rademacher complexity

Let ε_1, ..., ε_N be iid Rademacher (symmetric Bernoulli) random variables
Then, with probability at least 1 − δ,

$$R(\widehat{w}) \;\le\; \widehat{R}(\widehat{w}) \;+\; \mathbb{E}_{\epsilon_1,\ldots,\epsilon_N}\left[\, \max_{\|w\|_2 \le B} \left| \frac{1}{N} \sum_{n=1}^{N} \epsilon_n\, \ell(X_n w, R_n) \right| \,\right] \;+\; \sqrt{\frac{\log(1/\delta)}{N}}$$

The quantity

$$\mathbb{E}_{\epsilon_1,\ldots,\epsilon_N}\left[\, \max_{\|w\|_2 \le B} \left| \frac{1}{N} \sum_{n=1}^{N} \epsilon_n\, \ell(X_n w, R_n) \right| \,\right]$$

is called the Rademacher complexity
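
To build intuition, here is a Monte Carlo sketch (ours, a simplification: it estimates the Rademacher complexity of the plain linear class, without the loss composition, where the inner maximum has the closed form B · ‖(1/N) Σ_n ε_n x_n‖₂):

```python
import numpy as np

def rademacher_complexity_linear(X, B=1.0, draws=1000, seed=0):
    """Monte Carlo estimate of E_eps[ max_{||w|| <= B} |(1/N) sum_n eps_n <w, x_n>| ]
    for the linear class; the inner max equals B * ||(1/N) sum_n eps_n x_n||_2."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    vals = []
    for _ in range(draws):
        eps = rng.choice([-1.0, 1.0], size=N)   # Rademacher signs
        vals.append(B * np.linalg.norm(eps @ X / N))
    return float(np.mean(vals))

X = np.random.default_rng(1).normal(size=(200, 5))
print(rademacher_complexity_linear(X))  # shrinks roughly like 1/sqrt(N)
```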


How does Rademacher complexity scale?

The best previous bound (Chapelle and Wu, 2010) on Rademacher complexity depended on:

Loss function ℓ: more "complicated" loss =⇒ bigger complexity
Size B of the w's: larger B =⇒ bigger complexity
Size of features, say ‖φ(q, d_i)‖₂ ≤ R: larger R =⇒ bigger complexity
Sample size N: larger N =⇒ smaller complexity
Number m of documents per query: larger m =⇒ larger complexity

Roughly speaking, it scaled as

$$\mathrm{Lip}_{\|\cdot\|_2}(\ell) \cdot B \cdot R \cdot \sqrt{m} \cdot \sqrt{\frac{\log(1/\delta)}{N}}$$


Lipschitz constant

Fix an arbitrary relevance vector R
How much does ℓ(s, R) change as we change s?
The Lipschitz constant $\mathrm{Lip}_{\|\cdot\|}(\ell)$ is the smallest L such that

$$\forall R, \quad |\ell(s, R) - \ell(s', R)| \le L \cdot \|s - s'\|$$

The previous proof technique needs to use $\mathrm{Lip}_{\|\cdot\|_2}(\ell)$
We get a better bound using $\mathrm{Lip}_{\|\cdot\|_\infty}(\ell) \le \sqrt{m} \cdot \mathrm{Lip}_{\|\cdot\|_2}(\ell)$


Improved bound on Rademacher complexity

Our (Chaudhuri and T., 2015) improved bound on Rademacher complexity depends on:

Loss function ℓ: the Lipschitz constant is measured w.r.t. the ℓ∞ norm
Size B of the w's: larger B =⇒ bigger complexity
Size of features, say ‖φ(q, d_i)‖₂ ≤ R: larger R =⇒ bigger complexity
Sample size N: larger N =⇒ smaller complexity
It does not depend on the number m of documents per query

Roughly speaking, it scales as

$$\mathrm{Lip}_{\|\cdot\|_\infty}(\ell) \cdot B \cdot R \cdot \sqrt{\frac{\log(1/\delta)}{N}}$$


Another type of generalization bound

So far we have seen bounds of the form

$$R(\widehat{w}) \le \widehat{R}(\widehat{w}) + \text{stuff}$$

However, $\widehat{R}(\widehat{w}) \le \widehat{R}(w^\star)$ and $\widehat{R}(w^\star)$ concentrates around $R(w^\star)$
Therefore, it is also easy to derive bounds of the form

$$R(\widehat{w}) \le R(w^\star) + \text{stuff}$$


Generalization bounds: recap

Risk $R(w)$ is the expected loss of w on infinite, unseen data

Empirical risk $\widehat{R}(w)$ is the average loss on the training data

A $\widehat{w}$ that minimizes the loss on the training data is called an empirical risk minimizer (ERM)

We saw two types of generalization error bounds:

$$R(\widehat{w}) \le \widehat{R}(\widehat{w}) + \text{stuff}, \qquad R(\widehat{w}) \le \underbrace{R(w^\star)}_{\text{minimum possible risk}} + \text{stuff}$$

Rademacher complexity need not scale with the length of document lists


Richer function classes for more data

As we see more data, it is natural to enlarge the function class over which the ERM is computed:

$$\widehat{f}_n = \underset{f \in F_n}{\mathrm{argmin}} \; \frac{1}{N} \sum_{n=1}^{N} \ell(f(X_n), R_n) = \underset{f \in F_n}{\mathrm{argmin}} \; \widehat{R}(f)$$

where

$$f(X_n) = \begin{pmatrix} f(\phi(q_n, d_{n,1})) \\ \vdots \\ f(\phi(q_n, d_{n,m})) \end{pmatrix}$$

E.g., as n grows, F_n = larger decision trees, deeper neural networks, SVMs with less regularization


Estimation error vs approximation error

Let f⋆(F_n) be the best function in F_n and f⋆ be the best overall function:

$$f^\star(F_n) = \underset{f \in F_n}{\mathrm{argmin}} \; R(f), \qquad f^\star = \underset{f}{\mathrm{argmin}} \; R(f)$$

We can decompose the excess risk:

$$R(\widehat{f}_n) - R(f^\star) = \underbrace{R(\widehat{f}_n) - R(f^\star(F_n))}_{\text{estimation error}} + \underbrace{R(f^\star(F_n)) - R(f^\star)}_{\text{approximation error}}$$


Consistency

If F_n grows too quickly, the estimation error will not go to zero
If F_n grows too slowly, the approximation error will not go to zero
By judiciously growing F_n, we can guarantee that both errors go to zero
This results in (strong) universal consistency: for all data distributions,

$$R(\widehat{f}_n) \to R(f^\star) \quad \text{a.s. as } n \to \infty$$


What did we achieve?

We said that we will use a nice (e.g., convex) loss ℓ as a surrogate and train a scoring function $\widehat{f}_n$ on finite data using ℓ:

$$\widehat{f}_n = \underset{f \in F_n}{\mathrm{argmin}} \; \frac{1}{N} \sum_{n=1}^{N} \ell(f(X_n), R_n)$$

Enlarging F_n neither too quickly nor too slowly with n will give us

$$R(\widehat{f}_n) \xrightarrow{\; n \to \infty \;} R(f^\star)$$

Great, but how good is $\widehat{f}_n$ according to ranking performance measures (NDCG, ERR, AP)?


Risk under ranking loss

Recall that R(f) is the risk of f under the surrogate ℓ:

$$R(f) = \mathbb{E}_{X,R}\,[\ell(f(X), R)]$$

Similarly, there is L(f), the risk of f under a target ranking loss RL:

$$L(f) = \mathbb{E}_{X,R}\,[\mathrm{RL}(\mathrm{argsort}(f(X)), R)]$$

The argsort is required because the first argument of RL has to be a ranking, not a score vector


Calibration

Fix a target loss RL (e.g., 1 − NDCG) and a surrogate ℓ (e.g., the ranking SVM loss)
Suppose we know

$$R(\widehat{f}_n) \xrightarrow{\; n \to \infty \;} \min_f R(f)$$

Does this automatically guarantee the following?

$$L(\widehat{f}_n) \xrightarrow{\; n \to \infty \;} \min_f L(f)$$

If yes, then we say that the surrogate ℓ is calibrated w.r.t. the target loss RL


Calibration in binary classification

Consider the simpler setting of binary classification, where X_i ∈ R^p and Y_i ∈ {+1, −1}
We learn real-valued functions f : R^p → R and take the sign to output a class label
Given a surrogate ℓ (hinge loss, logistic loss), we can define

$$R(f) = \mathbb{E}_{X,Y}\,[\ell(f(X), Y)]$$

The target loss is often just the 0-1 loss:

$$L(f) = \mathbb{E}_{X,Y}\,[\mathbf{1}(Y f(X) \le 0)]$$


Calibration in binary classification

Suppose ℓ(Y, f(X)) = φ(Y f(X)) (a margin loss)
For a margin-based convex surrogate ℓ, when does

$$R(\widehat{f}_n) \xrightarrow{\; n \to \infty \;} \min_f R(f)$$

imply

$$L(\widehat{f}_n) \xrightarrow{\; n \to \infty \;} \min_f L(f)\,?$$

Surprisingly simple answer (Bartlett, Jordan, McAuliffe, 2006): φ′(0) < 0


Examples of calibrated surrogates

[Figure: the 0-1 loss together with the hinge, logistic, and squared surrogates, plotted against the margin Yf(X)]


Calibration w.r.t. DCG

Recall that DCG is defined as

$$\mathrm{DCG}(\sigma, R) = \sum_{i=1}^{m} \frac{F(R(i))}{G(\sigma(i))}$$

Let η be a distribution over relevance vectors R
The maximizer σ(η) of

$$\sigma \mapsto \mathbb{E}_{R \sim \eta}\,[\mathrm{DCG}(\sigma, R)]$$

is argsort(E_{R∼η}[F(R)])


Calibration w.r.t. (N)DCG

Consider the score vector s(η) that minimizes

$$s \mapsto \mathbb{E}_{R \sim \eta}\,[\ell(s, R)]$$

Necessary and sufficient condition for DCG calibration: for every η, s(η) has the same sorted order as E_{R∼η}[F(R)]

We (Ravikumar, T., Yang, 2011) derived a sufficient (and almost necessary) condition for calibration w.r.t. NDCG by taking the normalization into account:

$$\ell(s, R) = \mathrm{Bregman}\big(\mathrm{normalized}(R),\ \text{order-preserving}(s)\big)$$


Calibration w.r.t. DCG: Example

Consider the regression-based surrogate

$$\ell(s, R) = \sum_{i=1}^{m} \big(s_i - F(R(i))\big)^2$$

It is easy to see that s(η) is exactly E_{R∼η}[F(R)]

Thus, this regression-based surrogate is DCG calibrated
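
A sketch (ours) of this surrogate with the standard gain F(r) = 2^r − 1 from the DCG slide:

```python
def regression_surrogate(s: list, R: list) -> float:
    """Squared-error surrogate sum_i (s_i - F(R(i)))^2 with gain F(r) = 2^r - 1."""
    return sum((si - (2 ** r - 1)) ** 2 for si, r in zip(s, R))

# Scores matching the gains exactly incur zero loss:
print(regression_surrogate([0.0, 1.0, 3.0], [0, 1, 2]))  # 0.0
```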


Calibration w.r.t. other measures

The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
Actually, the situation is more complex!
It is known that there do not exist any convex calibrated surrogates for ERR and AP
Caveat: the non-existence result assumes that we're using argsort to move from scores to rankings


Calibration w.r.t. AP

Suppose we don't restrict ourselves to learning R^m-valued scores (followed by argsort)
Then we (Ramaswamy, Agarwal, T., 2013) constructed a convex surrogate in O(m²) dimensions that is calibrated w.r.t. AP
Caveat: the mapping from scores to permutations involves solving an NP-hard problem
Open: is there a convex surrogate in poly(m) dimensions such that going from scores to permutations takes poly(m) time?


Consistency: recap

Consistency is an asymptotic (sample size N → ∞) property of learning algorithms

A consistent algorithm's performance reaches that of the "best" ranker as it sees more and more data

A learning algorithm that minimizes a surrogate ℓ has to balance the estimation error / approximation error trade-off

If the trade-off is managed well, it will be consistent in the sense of ℓ

If this also implies consistency w.r.t. a target loss RL, we say ℓ is calibrated w.r.t. RL

Calibration in ranking is trickier than in classification: easy in some cases (e.g., NDCG), not so easy in others (e.g., ERR, AP)


3 Ongoing work


The online protocol

Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting:

For t = 1, ..., N
  Ranker receives feature matrix X_t ∈ R^{m×p}
  Ranker outputs a scoring function f_t : R^p → R
  Relevance vector R_t is revealed
  Ranker suffers loss ℓ(f_t(X_t), R_t)


Regret

In online settings, we often look at the regret:

$$\sum_{t=1}^{N} \underbrace{\ell(f_t(X_t), R_t)}_{\text{algorithm's loss}} \;-\; \min_{f \in F} \sum_{t=1}^{N} \underbrace{\ell(f(X_t), R_t)}_{\text{best loss in hindsight}}$$

If ℓ is convex and we're using linear scoring functions f(x) = w^⊤x, then we can use tools from online convex optimization (OCO)
A great OCO introduction is available at http://ocobook.cs.princeton.edu/


Perceptron-like algorithm for LeToR

The standard OGD (online gradient descent) algorithm using the ranking SVM surrogate

$$\ell(s, R) = \sum_{R(i) > R(j)} [1 + s_j - s_i]_+$$

gives an O(√N) regret bound

γ-separability: there exists w⋆ with ‖w⋆‖₂ = 1 such that whenever d_i is more relevant than d_j for query q,

$$(w^\star)^\top \phi(q, d_i) \ge (w^\star)^\top \phi(q, d_j) + \gamma$$

Under γ-separability, we (Chaudhuri, T., 2015) show

$$\text{cumulative NDCG/AP loss} \;\le\; \mathrm{poly}(m) \cdot \frac{R^2}{\gamma^2}$$
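
A minimal online gradient descent sketch (ours) for this protocol, reusing the pairwise hinge subgradient from earlier; the ball radius B and step size eta are illustrative choices:

```python
import numpy as np

def ogd_ranker(stream, B=10.0, eta=0.1):
    """Online gradient descent on the ranking SVM surrogate.
    stream yields (X_t, R_t) pairs; w is projected back onto the ball ||w|| <= B."""
    w = None
    for X, R in stream:
        if w is None:
            w = np.zeros(X.shape[1])
        s = X @ w                      # scores used to rank at round t
        g = np.zeros_like(w)
        for i in range(len(R)):
            for j in range(len(R)):
                if R[i] > R[j] and 1.0 + s[j] - s[i] > 0:
                    g += X[j] - X[i]   # subgradient of the active hinge term
        w -= eta * g
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm              # projection step
        yield w
```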


Top-K feedback protocol

For t = 1, ..., N
  Ranker receives feature matrix X_t ∈ R^{m×p}
  Ranker outputs a scoring function f_t : R^p → R
  Only the top-K relevance scores are revealed:

  $$R_t(\sigma_t^{-1}(1)), \ldots, R_t(\sigma_t^{-1}(K)), \quad \text{where } \sigma_t = \mathrm{argsort}(f_t(X_t))$$

  Ranker suffers loss ℓ(f_t(X_t), R_t)


Algorithm for Top-K LeToR

We can't run the OGD algorithm using the ranking SVM surrogate

$$\ell(s, R) = \sum_{R(i) > R(j)} [1 + s_j - s_i]_+$$

since neither the loss nor its gradient is observed
However, by using the right amount of exploration, we can get an unbiased estimate of the gradient with controlled variance
We (Chaudhuri, T., 2015) show an O(N^{2/3}) regret bound in this challenging top-K feedback setting


Summary

Ranking problems arise in a variety of contexts
Learning to rank can be, and has been, viewed as a supervised ML problem
We gave new, improved generalization error bounds for ranking methods
We investigated the consistency of convex surrogate based ranking algorithms for NDCG and AP
We considered a novel feedback model (top-K feedback) and gave online regret bounds


Collaborators

Eunho, Harish, Pradeep, Shivani, Sougata


Papers

Generalization error bounds: Chaudhuri and Tewari. "Generalization error bounds for learning to rank: Does the length of document lists matter?", ICML 2015.

NDCG calibration: Ravikumar, Tewari, and Yang. "On NDCG consistency of listwise ranking methods", AISTATS 2011.

Partial progress on AP calibration: Ramaswamy, Agarwal, and Tewari. "Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses", NIPS 2013.

Top-K feedback (fixed set of objects): Chaudhuri and Tewari. "Online ranking with top-1 feedback", AISTATS 2015. Honorable Mention, Best Student Paper Award.


Thanks

Thank you for your attention!
