Learning to rank as supervised ML · Theory for learning to rank · Ongoing work · Summary
Statistical learning theory for ranking problems
Ambuj Tewari
University of Michigan
September 3, 2015 / MLSIG Seminar @ IISc
Ambuj Tewari Learning to Rank
Outline
1. Learning to rank as supervised ML
   - Features
   - Label and Output Spaces
   - Performance Measures
   - Ranking functions
2. Theory for learning to rank
   - Regularized empirical loss minimization
   - Generalization error bounds
   - Consistency
3. Ongoing work
   - Online learning to rank
   - Top-K feedback
Ranking problems are everywhere
- Web search: rank webpages in order of their relevance to a query
- News stories: rank stories according to user interests at a particular time (Google News, Facebook News Feed)
- Recommender systems: rank movies (Netflix, IMDB), music (Pandora), or consumer goods (Amazon) according to a user's preferences
- Social networks: rank people in order of their likelihood of becoming friends with a given user
- Bioinformatics: rank genes in order of their biological role in a given disease
Ranking problems are everywhere
“What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

– Herbert Simon (1916–2001), Nobel Laureate (Economics) and ACM Turing Award winner
Learning to rank objects
- Creating a ranking manually requires a lot of effort
- Can we have computers learn to rank objects?
- We will still need some human supervision
- Pose learning to rank as a supervised ML problem
Ranking as a canonical learning problem
- NIPS Workshops: 2002, 2005, 2009, 2014
- SIGIR Workshops: 2007, 2008, 2009
- Journal of Information Retrieval Special Issue in 2010
- Yahoo! Learning to Rank Challenge: 2010
  http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
- LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007-2010
  http://research.microsoft.com/en-us/um/beijing/projects/letor/
Learning to rank as supervised ML
Supervised Machine Learning: construct a mapping
f : I → O
based on input, output pairs (training examples)
  (X_1, Y_1), ..., (X_N, Y_N)   (the training set)

where X_n ∈ I and Y_n ∈ O.

- We want f(X) to be a good predictor of Y for unseen X
- In ranking problems, O is the set of all possible rankings of a given set of objects (webpages, movies, etc.)
A learning-to-rank training example
Query: “learning to rank”

Candidate webpages and relevance scores:

- en.wikipedia.org/wiki/Learning_to_rank: 2 or “good”
- research.microsoft.com/.../hangli/l2r.pdf: 2 or “good”
- code.google.com/p/jforests/: 1 or “fair”
- www.vmathlive.com: 0 or “bad”

Think: example X = (query, (URL1, URL2, ..., URL4)) and label R = (2, 2, 1, 0)
Typical ML cycle
1. Feature construction: find a way to map (query, webpage) pairs into R^p
   - Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}
2. Obtain labels: get R_n ∈ {0, 1, 2, ..., R_max}^m from human judgements
3. Train ranker: get f : I → O = S_m, where S_m is the set of permutations of degree m
   - Training is usually done by solving an optimization problem on the training set
4. Evaluate performance: using some performance measure, evaluate the ranker on a “test set”
Joint query, document features
Number of query terms that occur in document
Document length
Sum/min/max/mean/variance of term frequencies
Sum/min/max/mean/variance of tf-idf
Cosine similarity between query and document
Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)
Number of slashes in URL, length of URL, PageRank, site-level PageRank
Query-URL click count, user-URL click count
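A few of these features are easy to sketch in code. The `features` helper below is a hypothetical illustration, not from the slides: it computes query-term overlap, document length, and cosine similarity on raw term counts; real systems would use tf-idf weighting, BM25, and the link/click features above as well.

```python
import math
from collections import Counter

def features(query, doc):
    """Three toy features for a (query, document) pair: query-term overlap,
    document length, and cosine similarity of raw term-count vectors."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum(1 for t in q if t in d)          # query terms present in doc
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return (overlap, sum(d.values()), dot / norm)  # a point in R^3

print(features("learning to rank", "learning to rank as supervised learning"))
```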
A slightly more general supervised ML setting
- For a future query and webpage list, we want to rank the webpages
- We don't necessarily want to predict the relevance scores
- Let f (the ranking function) take values in the space of rankings
- So now, training examples (X, Y) ∈ I × L, where L is the label space
- The function f to be learned maps I to O (note L ≠ O)
- E.g., (1, 2, 2, 0, 0) ∈ L and (d1 → 3, d2 → 1, d3 → 2, d4 → 4, d5 → 5) ∈ O
Notation for rankings
- We'll think of m (the number of documents) as fixed, but in reality it varies over examples
- We'll denote a ranking by a permutation σ ∈ S_m (the set of all m! permutations)
- σ(i) is the rank/position of document i
- σ^{-1}(j) is the document at rank/position j
Notation for rankings
Suppose we have 3 documents d1, d2, d3 and we want them to be ranked thus:

  d2
  d3
  d1

- So the ranks are d1 : 3, d2 : 1, d3 : 2
- This means σ(1) = 3, σ(2) = 1, σ(3) = 2
- Also, σ^{-1}(1) = 2, σ^{-1}(2) = 3, σ^{-1}(3) = 1
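The σ / σ^{-1} bookkeeping is easy to get wrong, so here is a small Python sketch of the example above (1-indexed ranks, as on the slide):

```python
# sigma[i-1] is the rank of document i; invert() recovers sigma^{-1},
# i.e., which document sits at each rank (all 1-indexed, as on the slide).
def invert(sigma):
    inv = [0] * len(sigma)
    for doc, rank in enumerate(sigma, start=1):
        inv[rank - 1] = doc
    return tuple(inv)

sigma = (3, 1, 2)     # sigma(1) = 3, sigma(2) = 1, sigma(3) = 2
print(invert(sigma))  # (2, 3, 1): ranks 1, 2, 3 hold d2, d3, d1
```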
Notation for relevance scores/labels
- Relevance labels can be binary or multi-graded
- In the binary case, documents are either relevant or irrelevant (1 or 0)
- In the multi-graded case, relevance labels are from {0, 1, 2, ..., R_max}
- Typically R_max ≤ 4 (at most 5 levels of relevance)
- A relevance score/label vector R is an m-vector
- R(i) gives the relevance of document i
Performance measures in ranking
- Performance measures can be gains (to be maximized) or losses (to be minimized)
- They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
- Some performance measures are defined only for binary relevance
- Others can handle more general multi-graded relevance
Performance measure: RR
- Reciprocal rank (RR) is defined for binary relevance only
- RR is the reciprocal of the rank of the first relevant document according to the ranking:

  RR(σ, R) = 1 / min{σ(i) : R(i) = 1}

- Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2), so RR = 1/2

  original     ranked
  d1 : 0       d1 : 0
  d2 : 1       d4 : 1  ← RR = 1/2
  d3 : 0       d2 : 1
  d4 : 1       d3 : 0
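A direct Python transcription of the definition, checked on the slide's example:

```python
def reciprocal_rank(sigma, R):
    """RR = 1 / (rank of the first relevant document); R is binary."""
    return 1.0 / min(rank for rank, r in zip(sigma, R) if r == 1)

# Slide example: R = (0, 1, 0, 1), sigma = (1, 3, 4, 2)
print(reciprocal_rank((1, 3, 4, 2), (0, 1, 0, 1)))  # 0.5
```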
Performance measure: Precision@K
- Precision@K is for binary relevance only
- Precision@K is the fraction of relevant documents in the top K:

  Prec@K(σ, R) = (1/K) · Σ_{j=1}^{K} R(σ^{-1}(j))

  original     ranked     Prec@K
  d1 : 0       d1 : 0     0/1
  d2 : 1       d4 : 1     1/2
  d3 : 0       d2 : 1     2/3
  d4 : 1       d3 : 0     2/4
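The same computation in Python, reproducing the table above:

```python
def precision_at_k(sigma, R, K):
    """Fraction of relevant documents among the top K ranks (binary R)."""
    inv = {rank: doc for doc, rank in enumerate(sigma)}  # rank -> doc index (0-based)
    return sum(R[inv[j]] for j in range(1, K + 1)) / K

sigma, R = (1, 3, 4, 2), (0, 1, 0, 1)
print([precision_at_k(sigma, R, K) for K in (1, 2, 4)])  # [0.0, 0.5, 0.5]
```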
Performance measure: AP
- Average Precision (AP) is also for binary relevance
- It is the average of Prec@j over positions j of relevant documents only:

  AP(σ, R) = (1 / Σ_{i=1}^{m} R(i)) · Σ_{j : R(σ^{-1}(j))=1} Prec@j

  original     ranked     Prec@K
  d1 : 0       d1 : 0     0/1
  d2 : 1       d4 : 1     1/2  ←
  d3 : 0       d2 : 1     2/3  ←   AP = (1/2 + 2/3) / 2
  d4 : 1       d3 : 0     2/4
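AP in Python, again matching the worked example:

```python
def average_precision(sigma, R):
    """Average of Prec@j over ranks j that hold a relevant document (binary R)."""
    inv = {rank: doc for doc, rank in enumerate(sigma)}  # rank -> doc index
    hits, precs = 0, []
    for j in range(1, len(R) + 1):
        hits += R[inv[j]]
        if R[inv[j]] == 1:
            precs.append(hits / j)
    return sum(precs) / sum(R)

# Slide example: AP = (1/2 + 2/3) / 2
print(average_precision((1, 3, 4, 2), (0, 1, 0, 1)))
```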
Performance measure: DCG
- Discounted Cumulative Gain (DCG) is for multi-graded relevance
- Contributions of relevant documents further down the ranking are discounted more:

  DCG(σ, R) = Σ_{i=1}^{m} F(R(i)) / G(σ(i))

- F, G are increasing functions
- Standard choice: F(r) = 2^r − 1 and G(j) = log_2(1 + j)
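DCG with the standard F and G is a one-liner; the relevance vector (2, 0, 1) below is a made-up example for illustration:

```python
import math

def dcg(sigma, R):
    """DCG(sigma, R) = sum_i F(R(i)) / G(sigma(i)) with the standard F, G."""
    F = lambda r: 2 ** r - 1          # gain
    G = lambda j: math.log2(1 + j)    # discount by rank
    return sum(F(r) / G(rank) for r, rank in zip(R, sigma))

# Toy example: relevances (2, 0, 1), documents ranked 1st, 3rd, 2nd
print(round(dcg((1, 3, 2), (2, 0, 1)), 4))  # 3.6309
```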
Performance measure: NDCG
- Normalized DCG (NDCG), like DCG, is for multi-graded relevance
- As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1:

  NDCG(σ, R) = DCG(σ, R) / Z(R)

- The normalization constant Z(R) = max_σ DCG(σ, R) can be computed quickly by sorting the relevance vector
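A short sketch of NDCG; as noted above, the normalizer Z(R) is obtained by sorting the relevance vector in decreasing order:

```python
import math

def dcg(sigma, R):
    F = lambda r: 2 ** r - 1
    return sum(F(r) / math.log2(1 + rank) for r, rank in zip(R, sigma))

def ndcg(sigma, R):
    """DCG normalized by Z(R) = max_sigma DCG(sigma, R); the max is attained
    by the ranking that sorts documents by decreasing relevance."""
    order = sorted(range(len(R)), key=lambda i: -R[i])  # docs, most relevant first
    ideal = [0] * len(R)
    for rank, doc in enumerate(order, start=1):
        ideal[doc] = rank
    return dcg(sigma, R) / dcg(ideal, R)

print(ndcg((1, 3, 2), (2, 0, 1)))  # 1.0: this sigma already sorts R descending
```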
Ranking functions
A ranking function maps m query-document feature vectors (each in R^p) to a permutation:

  X = (φ(q, d1)^T; ...; φ(q, dm)^T), an m × p matrix
      −→  a ranking such as d5 at the top (σ(5) = 1), ..., d12 at the bottom (σ(12) = m)

So mathematically it is a map from R^{m×p} to S_m.
Ranking functions from scoring functions
- However, it is hard to learn functions that map into combinatorial spaces
- In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
- In ranking, we do the same but replace thresholding by sorting
- For example, let f : R^p → R be a scoring function:

  (φ(q, d1)^T, φ(q, d2)^T, φ(q, d3)^T)  −f→  (−2.3, 4.2, 1.9)  −sort→  d2 ⇔ σ(2) = 1, d3 ⇔ σ(3) = 2, d1 ⇔ σ(1) = 3
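The score-then-sort step in Python, on the toy scores from the example above:

```python
def rank_by_scores(scores):
    """Turn a score vector into a permutation sigma by sorting documents in
    decreasing order of score; sigma[i] is the rank of document i+1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    sigma = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        sigma[doc] = rank
    return tuple(sigma)

# Scores from the example: f gives (-2.3, 4.2, 1.9) for d1, d2, d3
print(rank_by_scores((-2.3, 4.2, 1.9)))  # (3, 1, 2): d2 first, then d3, then d1
```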
Recap
- Training data consists of query-document features and relevance scores:

  X = (φ(q, d1)^T; ...; φ(q, dm)^T), an m × p matrix,   R = (R(1), ..., R(m))

- We will use the training data to learn a scoring function f that will be used to rank documents for a new query:

  s_i = f(φ(q^new, d_i^new)),   i = 1, ..., m

- The scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, or NDCG
Many nice, innovative methods have been developed, e.g.:

AdaRank, BoltzRank, Cranking, FRank, Infinite Push, LambdaMART, LambdaRank, ListMLE, ListNet, McRank, MPRank, p-norm push, OWPC, PermuRank, PRank, RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank, SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank
Support Vector Machine
SVMs solve the following problem (with ‖w‖²₂ := w^T w):

  min_w  ‖w‖²₂ + C Σ_{i=1}^{n} ξ_i

where ‖w‖²₂ is the L2 regularization term and C is a tuning parameter, subject to

  ξ_i ≥ 0
  Y_i w^T X_i ≥ 1 − ξ_i

Here Y_i w^T X_i is the margin and ξ_i is a slack variable.
Ranking SVM
- Consider a query q_n and documents d_{n,i} and d_{n,j} such that R_n(i) > R_n(j)
- We would like to have

  w^T φ(q_n, d_{n,i}) > w^T φ(q_n, d_{n,j})

- As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

  ξ_{n,i,j} ≥ 0
  w^T φ(q_n, d_{n,i}) ≥ w^T φ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM
Ranking SVM solves the following problem:

  min_w  ‖w‖²₂ + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ξ_{n,i,j}

subject to the constraints

  ξ_{n,i,j} ≥ 0
  w^T φ(q_n, d_{n,i}) ≥ w^T φ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM as regularized loss minimization
- Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is max{t, 0} =: [t]₊
- This observation lets us write the ranking SVM optimization as

  min_w  ‖w‖²₂ + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} [1 + w^T φ(q_n, d_{n,j}) − w^T φ(q_n, d_{n,i})]₊

  where the first term is the regularizer and the inner sum is the loss on example n
Regularized loss minimization
- Note that

  X_n w = (φ(q_n, d_{n,1})^T w, ..., φ(q_n, d_{n,m})^T w)

- Ranking SVM problem (where λ = 1/C):

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- The loss in ranking SVM is

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊
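The pairwise hinge loss ℓ(s, R) above is a few lines of Python:

```python
def ranking_svm_loss(s, R):
    """Pairwise hinge surrogate: sum over pairs with R(i) > R(j) of [1 + s_j - s_i]_+."""
    m = len(s)
    return sum(max(0.0, 1 + s[j] - s[i])
               for i in range(m) for j in range(m) if R[i] > R[j])

# Scores that respect R = (2, 1, 0) with margin >= 1 incur zero loss:
print(ranking_svm_loss((2.0, 1.0, 0.0), (2, 1, 0)))  # 0.0
print(ranking_svm_loss((0.0, 1.0, 0.0), (2, 1, 0)))  # 3.0: two pairs violated
```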
Why not directly optimize performance measures?
- Let RL be some actual loss you want to minimize (e.g., 1 − NDCG)
- Then why not solve the following?

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} RL(argsort(X_n w), R_n)

- argsort(t) is a ranking σ that puts the elements of t in decreasing sorted order:

  t_{σ^{-1}(1)} ≥ t_{σ^{-1}(2)} ≥ ... ≥ t_{σ^{-1}(m)}
Why not directly optimize performance measures?
- The function w ↦ RL(argsort(Xw), R) is not a nice one from an optimization perspective!
- In the neighborhood of any point, it is either flat or discontinuous
- It can take only one of at most m! different values
- Minimizing a sum of such functions is computationally intractable
Convex surrogates
- The ranking SVM problem is convex, hence easy to optimize
- But it has no direct relation to a performance measure
- Directly optimizing a performance measure leads to non-convex problems
- Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?
Theory: still a long way to go
- An amazing amount of applied work exists
- Yet a lot of theoretical work remains to be done

“However, sometimes benchmark experiments are not as reliable as expected due to the small scales of the training and test data. In this situation, a theory is needed to guarantee the performance of an algorithm on infinite unseen data.”

– O. Chapelle, Y. Chang, and T.-Y. Liu, “Future Directions in Learning to Rank” (2011)
Statistical Ranking Theory: Challenges
- Combinatorial nature of the output space
  - The size of the prediction space is m! (m = number of things being ranked)
- No canonical performance measure
  - Many choices: ERR, AP, (N)DCG
  - Even with a fixed choice, how do we come up with “good” convex surrogates?
- Dealing with diverse types of feedback models
  - Relevance scores, pairwise feedback, click-through rates, preference DAGs
- Large-scale challenges
  - Guarantees for parallel, distributed, and online/streaming approaches
Our focus
- Human feedback in the form of relevance scores
- Regularized (convex) loss minimization based ranking methods:

  min_w  λ‖w‖²₂ + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- A few commonly used performance measures: NDCG, AP
- Generalization error bounds and consistency
Standard Setup
- Standard assumption in theory: the examples

  (X_1, R_1), ..., (X_N, R_N)

  are drawn iid from an underlying (unknown) distribution
- Consider the constrained form of loss minimization:

  ŵ = argmin_{‖w‖₂ ≤ B}  (1/N) Σ_{n=1}^{N} ℓ(X_n w, R_n)
Risk
- Risk of w: the performance of w on “infinite unseen data”

  R(w) = E_{X,R}[ℓ(Xw, R)]

  Note: this cannot be computed without knowledge of the underlying distribution
- Empirical risk of w: the performance of w on the training set

  R̂(w) = (1/N) Σ_{n=1}^{N} ℓ(X_n w, R_n)

  Note: this can be computed using just the training data
Generalization error bound
- We say that ŵ is an empirical risk minimizer (ERM):

  ŵ = argmin_{‖w‖₂ ≤ B} R̂(w)

- We are typically interested in bounds on the generalization error R(ŵ)
- For instance, we expect the generalization error to be larger than the empirical risk R̂(ŵ), but by how much?
Generalization error bound
We might therefore want a generalization error bound of the form

  R(ŵ) ≤ R̂(ŵ) + stuff

[Figure: the risk curve R(w) and the empirical risk curve R̂(w) plotted over w, with the gap R(ŵ) − R̂(ŵ) marked at the empirical minimizer ŵ.]
Rademacher complexity
- Let ε_1, ..., ε_N be iid Rademacher (symmetric Bernoulli ±1) random variables
- Then, with probability at least 1 − δ,

  R(ŵ) ≤ R̂(ŵ) + E_{ε_1,...,ε_N}[ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^{N} ε_n ℓ(X_n w, R_n) | ] + √(log(1/δ)/N)

- The quantity

  E_{ε_1,...,ε_N}[ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^{N} ε_n ℓ(X_n w, R_n) | ]

  is called the Rademacher complexity
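The Rademacher complexity can be illustrated with a Monte Carlo sketch. Everything below is a toy: the supremum over the ball {w : ‖w‖₂ ≤ B} is approximated by a small finite candidate set, and the data and loss are synthetic; only the shape of the quantity (an expectation over random signs of a supremum of an empirical average) matches the definition above.

```python
import random

def rademacher_estimate(loss, data, candidates, trials=200, seed=0):
    """Monte Carlo estimate of E_eps[ max_w | (1/N) sum_n eps_n * loss_n | ],
    with the max taken over a finite candidate set standing in for the ball."""
    rng = random.Random(seed)
    N = len(data)
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(N)]
        total += max(abs(sum(e * loss(w, x, r) for e, (x, r) in zip(eps, data))) / N
                     for w in candidates)
    return total / trials

# Synthetic 1-d losses bounded in [0, 1]; candidates stand in for {|w| <= 1}.
loss = lambda w, x, r: min(1.0, abs(w * x - r))
data = [(0.1 * i, i % 2) for i in range(20)]
print(rademacher_estimate(loss, data, [-1.0, 0.0, 1.0]))
```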
How does Rademacher complexity scale?

The best previous bound (Chapelle and Wu, 2010) on the Rademacher complexity depended on:

- The loss function ℓ: a more “complicated” loss means bigger complexity
- The size B of the w's: larger B means bigger complexity
- The size of the features, say ‖φ(q, d_i)‖₂ ≤ R: larger R means bigger complexity
- The sample size N: larger N means smaller complexity
- The number m of documents per query: larger m means larger complexity

Roughly speaking, it scaled as:

  Lip_{‖·‖₂}(ℓ) · B · R · √m · √(log(1/δ)/N)
Lipschitz constant
- Fix an arbitrary relevance vector R
- How much does ℓ(s, R) change as we change s?
- The Lipschitz constant Lip_{‖·‖}(ℓ) is the smallest L such that

  ∀R,  |ℓ(s, R) − ℓ(s′, R)| ≤ L · ‖s − s′‖

- The previous proof technique needs to use Lip_{‖·‖₂}(ℓ)
- We get a better bound using Lip_{‖·‖∞}(ℓ) ≤ √m · Lip_{‖·‖₂}(ℓ)
Improved bound on Rademacher complexity

Our (Chaudhuri and T., 2015) improved bound on the Rademacher complexity depends on:

- The loss function ℓ: the Lipschitz constant is measured w.r.t. the ℓ∞ norm
- The size B of the w's: larger B means bigger complexity
- The size of the features, say ‖φ(q, d_i)‖₂ ≤ R: larger R means bigger complexity
- The sample size N: larger N means smaller complexity
- It does not depend on the number m of documents per query

Roughly speaking, it scales as:

  Lip_{‖·‖∞}(ℓ) · B · R · √(log(1/δ)/N)
Another type of generalization bound
- So far we have seen bounds of the form

  R(ŵ) ≤ R̂(ŵ) + stuff

- However, R̂(ŵ) ≤ R̂(w*), and R̂(w*) concentrates around R(w*)
- Therefore, it is also easy to derive bounds of the form

  R(ŵ) ≤ R(w*) + stuff
Generalization bounds: recap
- The risk R(w) is the expected loss of w on infinite, unseen data
- The empirical risk R̂(w) is the average loss on the training data
- A ŵ that minimizes the loss on the training data is called an empirical risk minimizer (ERM)
- We saw two types of generalization error bounds:

  R(ŵ) ≤ R̂(ŵ) + stuff
  R(ŵ) ≤ R(w*) + stuff   (R(w*) is the minimum possible risk)

- The Rademacher complexity need not scale with the length of the document lists
Richer function classes for more data
- As we see more data, it is natural to enlarge the function class over which the ERM is computed:

  f̂_N = argmin_{f ∈ F_N}  (1/N) Σ_{n=1}^{N} ℓ(f(X_n), R_n) = argmin_{f ∈ F_N} R̂(f)

  where f(X_n) = (f(φ(q_n, d_{n,1})), ..., f(φ(q_n, d_{n,m})))

- E.g., as N grows, F_N can be larger decision trees, deeper neural networks, or SVMs with less regularization
Estimation error vs approximation error
- Let f*(F_N) be the best function in F_N and f* be the best overall function:

  f*(F_N) = argmin_{f ∈ F_N} R(f),   f* = argmin_f R(f)

- We can decompose the excess risk:

  R(f̂_N) − R(f*) = [R(f̂_N) − R(f*(F_N))] + [R(f*(F_N)) − R(f*)]

  where the first bracket is the estimation error and the second is the approximation error
Consistency
- If F_N grows too quickly, the estimation error will not go to zero
- If F_N grows too slowly, the approximation error will not go to zero
- By judiciously growing F_N, we can guarantee that both errors go to zero
- This results in (strong) universal consistency: for all data distributions,

  R(f̂_N) → R(f*) a.s. as N → ∞
What did we achieve?
- We said that we will use a nice (e.g., convex) loss ℓ as a surrogate and train a scoring function on finite data using ℓ:

  f̂_N = argmin_{f ∈ F_N}  (1/N) Σ_{n=1}^{N} ℓ(f(X_n), R_n)

- Enlarging F_N neither too quickly nor too slowly with N will give us

  R(f̂_N) → R(f*) as N → ∞

- Great, but how good is f̂_N according to ranking performance measures (NDCG, ERR, AP)?
Risk under ranking loss
- Recall that R(f) is the risk of f under the surrogate ℓ:

  R(f) = E_{X,R}[ℓ(f(X), R)]

- Similarly, there is L(f), the risk of f under a target ranking loss RL:

  L(f) = E_{X,R}[RL(argsort(f(X)), R)]

- The argsort is required because the first argument of RL has to be a ranking, not a score vector
Calibration
- Fix a target loss RL (e.g., 1 − NDCG) and a surrogate ℓ (e.g., the ranking SVM loss)
- Suppose we know that

  R(f̂_N) → min_f R(f) as N → ∞

- Does this automatically guarantee the following?

  L(f̂_N) → min_f L(f) as N → ∞

- If yes, then we say that the surrogate ℓ is calibrated w.r.t. the target loss RL
Calibration in binary classification
- Consider the simpler setting of binary classification, where X_i ∈ R^p and Y_i ∈ {+1, −1}
- We learn real-valued functions f : R^p → R and take the sign to output a class label
- Given a surrogate ℓ (hinge loss, logistic loss), we can define

  R(f) = E_{X,Y}[ℓ(f(X), Y)]

- The target loss is often just the 0-1 loss:

  L(f) = E_{X,Y}[1(Y f(X) ≤ 0)]
Calibration in binary classification
- Suppose ℓ(f(X), Y) = φ(Y f(X)) (a margin loss)
- For a margin-based convex surrogate ℓ, when does

  R(f̂_N) → min_f R(f)

  imply

  L(f̂_N) → min_f L(f) ?

- Surprisingly simple answer (Bartlett, Jordan, McAuliffe, 2006): φ′(0) < 0
Examples of calibrated surrogates
[Figure: the 0-1 loss together with the hinge, logistic, and squared losses, plotted as functions of the margin Y f(X).]
Calibration w.r.t. DCG
- Recall that DCG is defined as:

  DCG(σ, R) = Σ_{i=1}^{m} F(R(i)) / G(σ(i))

- Let η be a distribution over relevance vectors R
- The maximizer σ(η) of

  σ ↦ E_{R∼η}[DCG(σ, R)]

  is argsort(E_{R∼η}[F(R)]), with F applied coordinatewise
Calibration w.r.t. (N)DCG
- Consider the score vector s(η) that minimizes

  s ↦ E_{R∼η}[ℓ(s, R)]

- Necessary and sufficient condition for DCG calibration: for every η, s(η) has the same sorted order as E_{R∼η}[F(R)]
- We (Ravikumar, T., Yang, 2011) derived a sufficient (and almost necessary) condition for calibration w.r.t. NDCG by taking the normalization into account:

  ℓ(s, R) = Bregman(normalized(R), order-preserving(s))
Calibration w.r.t. DCG: Example
- Consider the regression-based surrogate

  ℓ(s, R) = Σ_{i=1}^{m} (s_i − F(R(i)))²

- It is easy to see that s(η) is exactly E_{R∼η}[F(R)]
- Thus, this regression-based surrogate is DCG calibrated
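This calibration claim can be checked numerically on a toy example. The two-point distribution η below is assumed for illustration only: the coordinatewise mean E[F(R)] (the surrogate minimizer) sorts into the same order as the brute-force DCG-optimal ranking.

```python
import itertools
import math

F = lambda r: 2 ** r - 1
# Assumed toy distribution eta: mass 0.5 on each of two relevance vectors.
eta = [((2, 0, 1), 0.5), ((0, 0, 2), 0.5)]
m = 3

# The regression surrogate's minimizer is the coordinatewise mean E[F(R)]:
s_eta = [sum(p * F(R[i]) for R, p in eta) for i in range(m)]

def dcg(sigma, R):
    return sum(F(R[i]) / math.log2(1 + sigma[i]) for i in range(m))

# Brute-force the DCG-optimal ranking under eta and compare sorted orders.
best = max(itertools.permutations(range(1, m + 1)),
           key=lambda sig: sum(p * dcg(sig, R) for R, p in eta))
by_scores = sorted(range(m), key=lambda i: -s_eta[i])  # docs by decreasing s_eta
by_best = sorted(range(m), key=lambda i: best[i])      # docs by rank under best
print(s_eta, best, by_scores == by_best)
```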
Calibration w.r.t. other measures
- The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
- Actually, the situation is more complex!
- It is known that there do not exist any convex calibrated surrogates for ERR and AP
- Caveat: this non-existence result assumes that we're using argsort to move from scores to rankings
Calibration w.r.t. AP
- Suppose we don't restrict ourselves to learning R^m-valued scores (followed by argsort)
- Then we (Ramaswamy, Agarwal, T., 2013) constructed a convex surrogate in O(m²) dimensions that is calibrated w.r.t. AP
- Caveat: the mapping from scores to permutations involves solving an NP-hard problem
- Open: is there a convex surrogate in poly(m) dimensions such that going from scores to a permutation takes poly(m) time?
Consistency: recap
- Consistency is an asymptotic (sample size N → ∞) property of learning algorithms
- A consistent algorithm's performance reaches that of the “best” ranker as it sees more and more data
- A learning algorithm that minimizes a surrogate ℓ has to balance the estimation error / approximation error trade-off
- If the trade-off is managed well, the algorithm will be consistent in the sense of ℓ
- If this also implies consistency w.r.t. a target loss RL, we say ℓ is calibrated w.r.t. RL
- Calibration in ranking is trickier than in classification: easy in some cases (e.g., NDCG), not so easy in others (e.g., ERR, AP)
The online protocol
Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting:

For t = 1, ..., N:
- The ranker gets feature vectors X_t ∈ R^{m×p}
- The ranker outputs a scoring function f_t : R^p → R
- The relevance vector R_t is revealed
- The ranker suffers loss ℓ(f_t(X_t), R_t)
Regret
- In online settings, we often look at the regret:

  Σ_{t=1}^{N} ℓ(f_t(X_t), R_t)  −  min_{f ∈ F} Σ_{t=1}^{N} ℓ(f(X_t), R_t)

  i.e., the algorithm's loss minus the best loss in hindsight
- If ℓ is convex and we're using linear scoring functions f(x) = w^T x, then we can use tools from online convex optimization (OCO)
- Great OCO introduction here: http://ocobook.cs.princeton.edu/
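A minimal OGD sketch under this protocol, assuming linear scores s = Xw and the pairwise hinge surrogate from earlier slides; the data below is synthetic, and the lr/√t step size is the standard schedule behind O(√N) regret bounds.

```python
import math

def hinge_subgradient(X, w, R):
    """Subgradient in w of sum_{R(i)>R(j)} [1 + s_j - s_i]_+ with s = X w."""
    s = [sum(xk * wk for xk, wk in zip(row, w)) for row in X]
    g = [0.0] * len(w)
    for i in range(len(X)):
        for j in range(len(X)):
            if R[i] > R[j] and 1 + s[j] - s[i] > 0:  # violated pair
                for k in range(len(w)):
                    g[k] += X[j][k] - X[i][k]
    return g

def ogd(stream, p, lr=1.0):
    """One pass of online gradient descent over (X_t, R_t) pairs."""
    w = [0.0] * p
    for t, (X, R) in enumerate(stream, start=1):
        g = hinge_subgradient(X, w, R)
        w = [wk - (lr / math.sqrt(t)) * gk for wk, gk in zip(w, g)]
    return w

# Synthetic stream: two documents, d1 relevant, d2 not.
X, R = [[1.0, 0.0], [0.0, 1.0]], (1, 0)
w = ogd([(X, R)] * 5, p=2)
print(w[0] > w[1])  # True: learned w scores the relevant document higher
```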
Perceptron-like algorithm for LeToR
- Standard OGD (online gradient descent) using the ranking SVM surrogate

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊

  gives an O(√N) regret bound
- γ-separability: ∃w*, ‖w*‖₂ = 1, such that whenever d_i is more relevant than d_j for query q,

  (w*)^T φ(q, d_i) ≥ (w*)^T φ(q, d_j) + γ

- Under γ-separability, we (Chaudhuri, T., 2015) show

  cumulative NDCG/AP loss ≤ poly(m) · R² / γ²
Top-K feedback protocol
For t = 1, ..., N:
- The ranker gets feature vectors X_t ∈ R^{m×p}
- The ranker outputs a scoring function f_t : R^p → R
- Only the top-K relevance scores are revealed:

  R_t(σ_t^{-1}(1)), ..., R_t(σ_t^{-1}(K)),   where σ_t = argsort(f_t(X_t))

- The ranker suffers loss ℓ(f_t(X_t), R_t)
Algorithm for Top-K LeToR
- We can't run the OGD algorithm with the ranking SVM surrogate

  ℓ(s, R) = Σ_{R(i) > R(j)} [1 + s_j − s_i]₊

  since neither the loss nor its gradient is observed
- However, by using the right amount of exploration, we can get an unbiased estimate of the gradient with controlled variance
- We (Chaudhuri, T., 2015) show an O(N^{2/3}) regret bound in this challenging top-K feedback setting
Summary
- Ranking problems arise in a variety of contexts
- Learning to rank can be, and has been, viewed as a supervised ML problem
- We gave new, improved generalization error bounds for ranking methods
- We investigated consistency of convex-surrogate-based ranking algorithms for NDCG and AP
- We considered a novel feedback model (top-K feedback) and gave online regret bounds
Collaborators
Eunho Harish Pradeep Shivani Sougata
Papers
- Generalization error bounds: Chaudhuri, Tewari. Generalization error bounds for learning to rank: Does the length of document lists matter? ICML 2015
- NDCG calibration: Ravikumar, Tewari, Yang. On NDCG consistency of listwise ranking methods. AISTATS 2011
- Partial progress on AP calibration: Ramaswamy, Agarwal, Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. NIPS 2013
- Top-K feedback (fixed set of objects): Chaudhuri, Tewari. Online ranking with top-1 feedback. AISTATS 2015 (Honorable Mention, Best Student Paper Award)
Thanks
Thank you for your attention!