Beat the Mean Bandit


ICML 2011

Yisong Yue Carnegie Mellon University

Joint work with Thorsten Joachims (Cornell University)

Optimizing Information Retrieval Systems

• Increasingly reliant on user feedback
  – E.g., clicks on search results
• Online learning is a popular modeling tool
  – Especially partial-information (bandit) settings
• Our focus: learning from relative preferences
  – Motivated by recent work on interleaved retrieval evaluation (example following)

Team Draft Interleaving (Comparison Oracle for Search)

Ranking A
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking (interleaving of A and B)
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)

[Radlinski et al. 2008]
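The interleaving step above can be sketched in Python as follows. This is a minimal illustration of the team-draft idea, not the exact procedure from [Radlinski et al. 2008]: the function names `team_draft_interleave` and `credit_clicks` are hypothetical, results are assumed to be identified by a unique string such as a URL, and the rule that the other team picks when the preferred team has nothing left is a simplifying assumption.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved, teams).

    teams[i] is "A" or "B", the ranking that contributed interleaved[i].
    """
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, teams, seen = [], [], set()

    def next_unseen(team):
        # Highest-ranked result of this team that is not yet shown.
        for doc in rankings[team]:
            if doc not in seen:
                return doc
        return None

    while True:
        # The team with fewer picks goes first; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            first = "A" if picks["A"] < picks["B"] else "B"
        else:
            first = rng.choice(["A", "B"])
        second = "B" if first == "A" else "A"
        for team in (first, second):
            doc = next_unseen(team)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)
                seen.add(doc)
                picks[team] += 1
                break
        else:
            # Neither ranking has an unseen result left.
            return interleaved, teams


def credit_clicks(teams, clicked_positions):
    """Return "A", "B", or "tie" given 0-based clicked positions."""
    a = sum(1 for i in clicked_positions if teams[i] == "A")
    b = sum(1 for i in clicked_positions if teams[i] == "B")
    return "A" if a > b else "B" if b > a else "tie"
```

Each click is credited to the ranking that contributed the clicked result, and the ranking with more credited clicks wins the comparison.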

Clicks on the presented ranking are credited to the ranking that contributed the clicked result. In this example: B wins! [Radlinski et al. 2008]

Team Draft Interleaving (Comparison Oracle for Search)

Interleave A vs B (A wins):

             A  B  C  Total wins  Total losses
A wins vs…   0  1  0  1           0
B wins vs…   0  0  0  0           1
C wins vs…   0  0  0  0           0

Interleave A vs C (C wins):

             A  B  C  Total wins  Total losses
A wins vs…   0  1  0  1           1
B wins vs…   0  0  0  0           1
C wins vs…   1  0  0  1           0

Interleave B vs C
…
Interleave A vs B
…

Outline

• Learning Formulation
  – Dueling Bandits Problem [Yue et al. 2009]
• Modeling transitivity violation
  – E.g., (A >> B) AND (B >> C) IMPLIES (A >> C) ??
  – Not done in previous work
• Algorithm: Beat-the-Mean
• Empirical Validation

Dueling Bandits Problem

• Given K bandits b1, …, bK
• Each iteration: compare (duel) two bandits
  – E.g., interleaving two retrieval functions
• Cost function (regret):

  $$R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$$

  – (bt, bt') are the two bandits chosen
  – b* is the overall best one
  – (% users who prefer best bandit over chosen ones)

[Yue et al. 2009]

Example Pairwise Preferences

     A      B      C      D      E      F
A    0      0.05   0.05   0.04   0.11   0.11
B   -0.05   0      0.05   0.06   0.08   0.10
C   -0.05  -0.05   0      0.04   0.01   0.06
D   -0.04  -0.06  -0.04   0      0.04   0.00
E   -0.11  -0.08  -0.01  -0.04   0      0.01
F   -0.11  -0.10  -0.06  -0.00  -0.01   0

• Values are Pr(row > col) – 0.5
• Derived from interleaving experiments on http://arXiv.org

Regret incurred by a single comparison, with A as the best bandit b*:

$$R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$$

• Compare E & F: P(A > E) = 0.61, P(A > F) = 0.61, Incurred Regret = 0.22
• Compare B & C: P(A > B) = 0.55, P(A > C) = 0.55, Incurred Regret = 0.10
• Compare A & A: P(A > A) = 0.50, P(A > A) = 0.50, Incurred Regret = 0.00

Interleaving shows ranking produced by A.
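The three regret examples can be reproduced with a short Python sketch. It assumes A is the best bandit and uses only the A row of the example preference values; the names `EPS_BEST`, `duel_regret`, and `total_regret` are illustrative and not from the original work.

```python
# epsilon values from the A row of the example: Pr(A > col) - 0.5
EPS_BEST = {"A": 0.0, "B": 0.05, "C": 0.05, "D": 0.04, "E": 0.11, "F": 0.11}


def duel_regret(b, b_prime):
    """Regret of one duel: P(b* > b) + P(b* > b') - 1."""
    p_b = 0.5 + EPS_BEST[b]
    p_b_prime = 0.5 + EPS_BEST[b_prime]
    return p_b + p_b_prime - 1.0


def total_regret(duels):
    """Sum of per-duel regret over a sequence of (b, b') pairs."""
    return sum(duel_regret(b, b_prime) for b, b_prime in duels)


if __name__ == "__main__":
    for pair in [("E", "F"), ("B", "C"), ("A", "A")]:
        print(pair, round(duel_regret(*pair), 2))
```

Regret is zero only when the best bandit is compared against itself, which is why an algorithm must eventually stop exploring the weaker bandits.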

Violation in internal consistency! For strong stochastic transitivity, the example preference values would need:

• A > D should be at least 0.06 (observed: 0.04)
• C > E should be at least 0.04 (observed: 0.01)
• D > F should be at least 0.04 (observed: 0.00)

Modeling Assumptions

• P(bi > bj) = ½ + εij
• Let b1 be the best overall bandit
• Relaxed Stochastic Transitivity
  – For three bandits b1 > bj > bk: $\gamma\,\epsilon_{1k} \ge \max\{\epsilon_{1j}, \epsilon_{jk}\}$
  – γ ≥ 1 (γ = 1 for strong transitivity **)
  – Relaxed internal consistency property
• Stochastic Triangle Inequality
  – For three bandits b1 > bj > bk: $\epsilon_{1k} \le \epsilon_{1j} + \epsilon_{jk}$
  – Diminishing returns property

(** γ = 1 required in previous work, and required to apply for all bandit triplets)
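To make the transitivity conditions concrete, the following Python sketch checks them on the example pairwise preference values from the arXiv.org interleaving experiments. It is an illustration under stated assumptions: the bandits are taken to be ordered A > B > … > F as listed in the example, strong stochastic transitivity is checked as ε_ik ≥ max(ε_ij, ε_jk) over all ordered triplets, and the helper names (`eps`, `sst_violations`, `min_gamma`) are hypothetical.

```python
from itertools import combinations

ORDER = "ABCDEF"  # assumed preference order, best first

# Upper triangle of the example: Pr(row > col) - 0.5
UPPER = {
    ("A", "B"): 0.05, ("A", "C"): 0.05, ("A", "D"): 0.04, ("A", "E"): 0.11, ("A", "F"): 0.11,
    ("B", "C"): 0.05, ("B", "D"): 0.06, ("B", "E"): 0.08, ("B", "F"): 0.10,
    ("C", "D"): 0.04, ("C", "E"): 0.01, ("C", "F"): 0.06,
    ("D", "E"): 0.04, ("D", "F"): 0.00,
    ("E", "F"): 0.01,
}


def eps(i, j):
    """epsilon_ij, using antisymmetry: epsilon_ji = -epsilon_ij."""
    if i == j:
        return 0.0
    return UPPER[(i, j)] if (i, j) in UPPER else -UPPER[(j, i)]


def sst_violations():
    """Triplets (i, j, k, needed) where eps(i, k) < max(eps(i, j), eps(j, k))."""
    found = []
    for i, j, k in combinations(ORDER, 3):
        needed = max(eps(i, j), eps(j, k))
        if eps(i, k) < needed:
            found.append((i, j, k, needed))
    return found


def min_gamma(best="A"):
    """Smallest gamma >= 1 with gamma * eps(1, k) >= max(eps(1, j), eps(j, k))."""
    others = [b for b in ORDER if b != best]
    gamma = 1.0
    for j, k in combinations(others, 2):
        gamma = max(gamma, max(eps(best, j), eps(j, k)) / eps(best, k))
    return gamma


if __name__ == "__main__":
    for i, j, k, needed in sst_violations():
        print(f"{i} > {k} should be at least {needed:.2f} (via {j}), observed {eps(i, k):.2f}")
    print("smallest gamma:", round(min_gamma(), 2))
```

Only triplets anchored at the best bandit enter `min_gamma`, which reflects that the relaxed condition is stated for b1 rather than for all bandit triplets.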

Example Pairwise Preferences

     A      B      C      D      E      F
A    0      0.05   0.05   0.04   0.11   0.11
B   -0.05   0      0.05   0.06   0.08   0.10
C   -0.05  -0.05   0      0.04   0.01   0.06
D   -0.04  -0.06  -0.04   0      0.04   0.00
E   -0.11  -0.08  -0.01  -0.04   0      0.01
F   -0.11  -0.10  -0.06  -0.00  -0.01   0

γ = 1.5

$$\gamma\,\epsilon_{1k} \ge \max\{\epsilon_{1j}, \epsilon_{jk}\}$$

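Beat-the-Mean

Each cell shows wins/total for the row bandit against the column bandit. Mean is the row bandit's overall win rate over Total comparisons, shown with its lower and upper confidence bounds.

wins/total  A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00
B           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00
C           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00
D           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00
E           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00
F           0/0    0/0    0/0    0/0    0/0    0/0    --    0      0.00         1.00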


Reading the table:

• Columns A to F: Comparison Results
• Columns Mean, Lower Bound, Upper Bound: Mean Score & Confidence Interval
• Row A: A's performance vs rest
• Row A, column Mean: A's mean performance


The first comparisons each update one cell of the table:

• A vs B: A wins (A is 1/1 against B, mean 1.00)
• B vs E: B loses (B is 0/1 against E, mean 0.00)
• C vs F: C wins (C is 1/1 against F, mean 1.00)
• D vs C: D loses (D is 0/1 against C, mean 0.00)
• E vs A: E loses (E is 0/1 against A, mean 0.00)
• F vs C: F loses (F is 0/1 against C, mean 0.00)

wins/total  A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A           0/0    1/1    0/0    0/0    0/0    0/0    1.00  1      0.00         1.00
B           0/0    0/0    0/0    0/0    0/1    0/0    0.00  1      0.00         1.00
C           0/0    0/0    0/0    0/0    0/0    1/1    1.00  1      0.00         1.00
D           0/0    0/0    0/1    0/0    0/0    0/0    0.00  1      0.00         1.00
E           0/1    0/0    0/0    0/0    0/0    0/0    0.00  1      0.00         1.00
F           0/0    0/0    0/1    0/0    0/0    0/0    0.00  1      0.00         1.00


After 150 comparisons per bandit:

wins/total  A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A           13/25  16/24  11/22  16/28  20/30  13/21  0.59  150    0.49         0.69
B           14/30  15/30  13/19  15/20  17/26  20/25  0.63  150    0.53         0.73
C           12/28  10/22  13/23  15/28  20/24  13/25  0.55  150    0.45         0.65
D           9/20   15/28  10/21  11/23  15/28  15/30  0.50  150    0.40         0.60
E           8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F           11/29  4/25   10/18  12/25  14/30  13/23  0.43  150    0.33         0.53

B dominates E!(B’s lower bound greater than E’s upper bound)
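The dominance test can be written as a small Python check over the confidence intervals. The sketch below uses the lower and upper bounds shown after 150 comparisons per bandit; `SNAPSHOT` and `dominated_pairs` are illustrative names, and the strict inequality is an assumption based on the wording "greater than".

```python
# (lower bound, upper bound) per bandit after 150 comparisons each
SNAPSHOT = {
    "A": (0.49, 0.69),
    "B": (0.53, 0.73),
    "C": (0.45, 0.65),
    "D": (0.40, 0.60),
    "E": (0.32, 0.52),
    "F": (0.33, 0.53),
}


def dominated_pairs(bounds):
    """Return (winner, loser) pairs where winner's lower bound > loser's upper bound."""
    return [
        (winner, loser)
        for winner, (lower, _) in bounds.items()
        for loser, (_, upper) in bounds.items()
        if winner != loser and lower > upper
    ]


if __name__ == "__main__":
    print(dominated_pairs(SNAPSHOT))
```

With these bounds only B's interval lies entirely above E's, so E is the single bandit that can be removed at this point.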


After E is removed, Mean and Total for the remaining bandits no longer count comparisons against E:

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            13/25  16/24  11/22  16/28  20/30  13/21  0.58  120    0.49         0.67
B            14/30  15/30  13/19  15/20  15/26  20/25  0.62  124    0.51         0.73
C            12/28  10/22  13/23  15/28  20/24  13/25  0.50  126    0.39         0.61
D            9/20   15/28  10/21  11/23  15/28  15/30  0.49  122    0.38         0.60
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F            11/29  4/25   10/18  12/25  14/30  13/23  0.42  120    0.31         0.53

After one more comparison, in which A beats B, A's record against B becomes 17/25 and A's Total becomes 121 (mean 0.58, bounds 0.49 and 0.67 unchanged).


After further comparisons (145 counted per active bandit):

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            15/30  19/29  14/28  18/33  23/30  15/25  0.56  145    0.46         0.66
B            15/33  17/34  15/24  20/27  15/26  23/27  0.62  145    0.52         0.72
C            13/31  11/28  14/29  15/30  20/24  16/27  0.48  145    0.38         0.68
D            11/26  17/31  12/26  14/29  15/28  17/33  0.49  145    0.39         0.59
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F            12/32  7/30   13/26  13/28  14/30  15/29  0.41  145    0.31         0.51

B dominates F! (B's lower bound is greater than F's upper bound)


After F is removed, Mean and Total no longer count comparisons against E or F:

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            15/30  19/29  14/28  18/33  23/30  15/25  0.55  120    0.43         0.67
B            15/33  17/34  15/24  20/27  15/26  23/27  0.56  118    0.44         0.68
C            13/31  11/28  14/29  15/30  20/24  16/27  0.45  118    0.33         0.57
D            11/26  17/31  12/26  14/29  15/28  17/33  0.48  112    0.36         0.60
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F (removed)  12/32  7/30   13/26  13/28  14/30  15/29  0.41  145    0.31         0.51


After further comparisons (300 counted per active bandit):

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            41/80  44/75  38/70  42/75  23/30  15/25  0.55  300    0.48         0.62
B            31/69  38/78  47/78  51/75  15/26  23/27  0.56  300    0.49         0.63
C            33/77  31/77  35/70  39/76  20/24  16/27  0.46  300    0.49         0.53
D            30/76  27/77  35/74  35/73  15/28  17/33  0.42  300    0.35         0.49
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F (removed)  12/32  7/30   13/26  13/28  14/30  15/29  0.41  145    0.31         0.51

B dominates D! (B's lower bound is greater than D's upper bound)


After D is removed, Mean and Total no longer count comparisons against D, E, or F:

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            41/80  44/75  38/70  42/75  23/30  15/25  0.55  225    0.46         0.64
B            31/69  38/78  47/78  51/75  15/26  23/27  0.52  225    0.43         0.61
C            33/77  31/77  35/70  39/76  20/24  16/27  0.33  225    0.24         0.42
D (removed)  30/76  27/77  35/74  35/73  15/28  17/33  0.42  300    0.35         0.49
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F (removed)  12/32  7/30   13/26  13/28  14/30  15/29  0.41  145    0.31         0.51

A dominates C! (A's lower bound is greater than C's upper bound)


After C is removed:

wins/total   A      B      C      D      E      F      Mean  Total  Lower bound  Upper bound
A            41/80  44/75  38/70  42/75  23/30  15/25  0.51  80     0.38         0.64
B            31/69  38/78  47/78  51/75  15/26  23/27  0.52  147    0.45         0.49
C (removed)  33/77  31/77  35/70  39/76  20/24  16/27  0.33  225    0.24         0.42
D (removed)  30/76  27/77  35/74  35/73  15/28  17/33  0.42  300    0.35         0.49
E (removed)  8/24   11/25  6/22   14/29  14/31  10/19  0.42  150    0.32         0.52
F (removed)  12/32  7/30   13/26  13/28  14/30  15/29  0.41  145    0.31         0.51

Eventually… A is last bandit remaining. A is declared best bandit!
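The walkthrough above can be condensed into a simulation sketch in Python. This is a simplified illustration, not the authors' reference implementation: duels are simulated from a given preference matrix `prefs` (where `prefs[i][j]` is the probability that bandit i beats bandit j), the confidence radius `sqrt(log(1/delta) / n)` is a Hoeffding-style assumption rather than the constant used in the original analysis, and the function name `beat_the_mean` and its parameters are hypothetical.

```python
import math
import random


def beat_the_mean(prefs, horizon, delta=0.05, seed=None):
    """Return the index of the bandit judged best after at most `horizon` duels."""
    rng = random.Random(seed)
    k = len(prefs)
    active = list(range(k))
    wins = [[0] * k for _ in range(k)]   # wins[b][o]: b's wins against o
    plays = [[0] * k for _ in range(k)]  # plays[b][o]: b's duels against o

    def total(b):
        # Only comparisons against still-active opponents are counted.
        return sum(plays[b][o] for o in active)

    def mean(b):
        n = total(b)
        return sum(wins[b][o] for o in active) / n if n else 0.5

    for _ in range(horizon):
        if len(active) == 1:
            break
        # Pick the active bandit with the fewest comparisons (random tie-break).
        b = min(active, key=lambda x: (total(x), rng.random()))
        # Duel it against a uniformly random active bandit (the "mean bandit").
        opponent = rng.choice(active)
        plays[b][opponent] += 1
        if rng.random() < prefs[b][opponent]:
            wins[b][opponent] += 1

        n_star = min(total(x) for x in active)
        if n_star == 0:
            continue
        radius = math.sqrt(math.log(1.0 / delta) / n_star)  # assumed form
        worst = min(active, key=mean)
        best = max(active, key=mean)
        # Remove the worst bandit once it is dominated by the empirical best.
        if mean(worst) + radius <= mean(best) - radius:
            active.remove(worst)

    return max(active, key=mean)
```

Because `total` and `mean` sum only over active opponents, removing a bandit also discards the comparisons made against it, which mirrors how the totals in the tables shrink after each removal.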

Regret Guarantee

• Playing against mean bandit calibrates preference scores
  – Estimates of (active) bandits directly comparable
  – One estimate per active bandit = linear number of estimates
• We can bound comparisons needed to remove worst bandit
  – Varies smoothly with transitivity parameter γ
  – High probability bound
• We can bound the regret incurred by each comparison
  – Varies smoothly with transitivity parameter γ
• Thus, we can bound the total regret with high probability:

  $$R_T = O\left(\frac{\gamma^7 K}{\epsilon} \log T\right)$$

  – ε is the smallest margin by which the best bandit is preferred over another bandit
  – γ is typically close to 1

We also have a similar PAC guarantee.

Not possible with previous approaches!

• Simulation experiment where γ = 1.3
• Light = Beat-the-Mean
• Dark = Interleaved Filter [Yue et al. 2009]

• Beat-the-Mean maintains linear regret guarantee (in the number of bandits)
• Interleaved Filter suffers quadratic regret in the worst case

• Simulation experiment where γ = 1 (original DB setting)
• Light = Beat-the-Mean
• Dark = Interleaved Filter [Yue et al. 2009]

• Beat-the-Mean has high probability bound
• Beat-the-Mean exhibits significantly lower variance

Conclusions

• Online learning approach using pairwise feedback
  – Well-suited for optimizing information retrieval systems from user feedback
  – Models violations in preference transitivity
• Algorithm: Beat-the-Mean
  – Regret linear in #bandits and logarithmic in #iterations
  – Degrades smoothly with transitivity violation
  – Stronger guarantees than previous work
  – Empirically supported
