Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards
Kuan-Hao Huang and Hsuan-Tien Lin
Department of Computer Science & Information Engineering, National Taiwan University
April 20, 2016 (PAKDD)
Contextual Bandit Problem (example)
[Diagram: an online advertisement system chooses among ads of types A-D for incoming users and receives feedback from each user immediately.]
Contextual Bandit Problem (traditional)
Notation
- user: context x ∈ R^d
- ad: action a ∈ {1, 2, ..., K}
- feedback: reward r ∈ [0, 1]

Contextual bandit problem (traditional setting), sketched in code below:
for round t = 1, 2, ..., T
- algorithm A receives a context x_t
- algorithm A selects an action a_t based on the context x_t
- algorithm A receives the reward r_{t,a_t}

Algorithm A tries to maximize the cumulative reward $\sum_{t=1}^{T} r_{t,a_t}$.

Challenge for the contextual bandit problem
- partial feedback: exploitation vs. exploration
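A minimal sketch may make the protocol concrete. The environment and the uniform-random baseline policy are hypothetical stand-ins, not part of the paper; note that only the reward of the chosen action is ever revealed.

```python
import numpy as np

d, K, T = 10, 5, 1000
rng = np.random.default_rng(0)

class UniformRandomPolicy:
    """Hypothetical baseline: ignores the context and picks uniformly."""
    def select(self, x):
        return int(rng.integers(K))
    def update(self, x, a, r):
        pass  # a learning algorithm would use the partial feedback here

policy = UniformRandomPolicy()
cumulative_reward = 0.0
for t in range(T):
    x_t = rng.normal(size=d)      # context arrives
    a_t = policy.select(x_t)      # algorithm selects an action
    r_t = rng.uniform()           # reward revealed for the chosen action only
    policy.update(x_t, a_t, r_t)  # partial feedback: only (x_t, a_t, r_t)
    cumulative_reward += r_t
```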
Contextual Bandit Problem (piled-reward example)
[Diagram: an online advertisement system serves ads of types A-D through an ad exchange; feedback from users does not arrive immediately but accumulates as a pile of feedback.]
Contextual Bandit Problem (piled-reward)
Notation
- user: context x ∈ R^d
- ad: action a ∈ {1, 2, ..., K}
- feedback: reward r ∈ [0, 1]

Contextual bandit problem (piled-reward setting), sketched in code below:
for round t = 1, 2, ..., T
- for i = 1, 2, ..., n
  - algorithm A receives a context x_{t_i}
  - algorithm A selects an action a_{t_i} based on the context x_{t_i}
- algorithm A receives the n rewards r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, ..., r_{t_n,a_{t_n}}

Algorithm A tries to maximize the cumulative reward $\sum_{t=1}^{T} \sum_{i=1}^{n} r_{t_i,a_{t_i}}$.
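The key difference from the traditional loop is that all n rewards of a round arrive only after all n actions have been chosen. A minimal sketch, reusing the hypothetical policy interface above:

```python
n = 50
for t in range(T):
    batch = []
    for i in range(n):
        x = rng.normal(size=d)
        a = policy.select(x)                  # no reward feedback within the round
        batch.append((x, a))
    rewards = [rng.uniform() for _ in batch]  # the pile of rewards arrives
    for (x, a), r in zip(batch, rewards):
        policy.update(x, a, r)                # updates happen at the round boundary
```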
Linear Upper Confidence Bound (LinUCB)
LinUCB [Li et al., 2010]
- state-of-the-art algorithm for the traditional setting (n = 1)
- for each round t and context x_t, LinUCB gives every action a a score
- selected action: a_t = argmax_a score_{t,a}(x_t)

$\mathrm{score}_{t,a}(x_t) = \text{estimated reward} + \text{uncertainty (confidence bound)} = \mathbf{w}_{t,a}^\top x_t + \alpha \sqrt{x_t^\top (I + X_{t-1,a}^\top X_{t-1,a})^{-1} x_t}$

- the estimated reward, used for exploitation, is obtained by (ridge) regression on the pairs (x_τ, r_{τ,a}) observed for action a
- the uncertainty, used for exploration, estimates how confident the algorithm is in the estimated reward
- the scoring function is updated whenever a reward is received (see the sketch below)
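A minimal LinUCB sketch following the score above: per-action ridge regression plus an upper confidence bound. The class layout and the default alpha are illustrative choices, not the authors' code.

```python
import numpy as np

class LinUCB:
    def __init__(self, d, K, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]    # I + X_a^T X_a per action
        self.b = [np.zeros(d) for _ in range(K)]  # X_a^T r_a per action

    def score(self, a, x):
        A_inv = np.linalg.inv(self.A[a])
        w = A_inv @ self.b[a]                     # ridge-regression weights
        return w @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def select(self, x):
        return max(range(len(self.A)), key=lambda a: self.score(a, x))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```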
Applying LinUCB to Piled-reward setting
LinUCB under the piled-reward setting
for round t = 1, 2, ..., T
- for i = 1, 2, ..., n
  - LinUCB receives context x_{t_i}
  - LinUCB selects an action a_{t_i} with the same scoring function
- LinUCB receives the n rewards r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, ..., r_{t_n,a_{t_n}}
- LinUCB updates the scoring function with the n rewards

Problem for LinUCB under the piled-reward setting
- no update to the scoring function within the round
- for some contexts, LinUCB keeps selecting an action with high uncertainty but low estimated reward
- such contexts come again and again within the round → low reward
- strategic exploration within the round is needed
Strategic Exploration
Our solution
- use the previous contexts x_{t_1}, x_{t_2}, ..., x_{t_{i-1}} in this round to help select the action for x_{t_i}
- give each previous context x_{t_τ} a pseudo reward p_{t_τ,a_{t_τ}}
- use the pseudo reward as a stand-in for the (not yet revealed) true reward
- we design two pseudo rewards (sketched below):
  - estimated reward: the estimated reward itself
  - underestimated reward: the estimated reward minus the confidence bound

Score after the update with a pseudo reward:

pseudo reward          | estimated reward | uncertainty
estimated reward       | no change        | becomes lower
underestimated reward  | becomes lower    | becomes lower

- both achieve strategic exploration
- underestimated reward is more aggressive than estimated reward
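A sketch of the two pseudo rewards, assuming the LinUCB class above; the helper name and the "ER"/"UR" flags are our own labels, not the authors' code.

```python
def pseudo_reward(model, a, x, kind="ER"):
    A_inv = np.linalg.inv(model.A[a])
    w = A_inv @ model.b[a]
    estimated = w @ x                              # estimated reward
    bound = model.alpha * np.sqrt(x @ A_inv @ x)   # confidence bound
    if kind == "ER":
        return estimated                           # estimated reward
    return estimated - bound                       # "UR": underestimated reward
```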
Linear Upper Confidence Bound with Pseudo Reward
A novel algorithm
- Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR)
  - LinUCBPR-ER: estimated reward as the pseudo reward
  - LinUCBPR-UR: underestimated reward as the pseudo reward

LinUCBPR under the piled-reward setting (see the sketch below)
for round t = 1, 2, ..., T
- for i = 1, 2, ..., n
  - LinUCBPR receives context x_{t_i}
  - LinUCBPR selects an action a_{t_i} with the scoring function
  - LinUCBPR updates the scoring function with the pseudo reward p_{t_i,a_{t_i}}
- LinUCBPR receives the n true rewards r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, ..., r_{t_n,a_{t_n}}
- LinUCBPR discards the changes caused by the pseudo rewards
- LinUCBPR updates the scoring function with the n true rewards
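One way to sketch a LinUCBPR round, assuming the LinUCB class and the pseudo_reward helper above: snapshot the statistics at the start of the round so the pseudo-reward updates can be discarded when the true rewards arrive. The function name and the get_true_rewards callback are hypothetical.

```python
import copy

def linucbpr_round(model, contexts, get_true_rewards, kind="ER"):
    snapshot = (copy.deepcopy(model.A), copy.deepcopy(model.b))
    chosen = []
    for x in contexts:
        a = model.select(x)
        model.update(x, a, pseudo_reward(model, a, x, kind))  # temporary update
        chosen.append((x, a))
    model.A, model.b = snapshot               # discard pseudo-reward changes
    rewards = get_true_rewards(chosen)        # the pile of true rewards arrives
    for (x, a), r in zip(chosen, rewards):
        model.update(x, a, r)                 # permanent true-reward updates
    return rewards
```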
Theoretical Analysis
Regret for algorithm A:

$\mathrm{Regret}(A) = \sum_{t=1}^{T} \sum_{i=1}^{n} r_{t_i,a^*_{t_i}} - \sum_{t=1}^{T} \sum_{i=1}^{n} r_{t_i,a_{t_i}}$

Theorem. For some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, with probability $1-\delta$, the regret bounds of LinUCB and LinUCBPR-ER under the piled-reward setting are both $O(\sqrt{d\,n^2 T K \ln^3(nTK/\delta)})$.

- when the total number of contexts (nT) is held constant, the regret bound is proportional to √n
- LinUCB and LinUCBPR-ER enjoy the same regret bound
Artificial Datasets
Artificial data
- $\mathbf{u}_1, \mathbf{u}_2, ..., \mathbf{u}_K \in \mathbb{R}^d$, one for each of the K actions
- $r_{t,a} = \mathbf{u}_a^\top x_t + \varepsilon_t$, where $\varepsilon_t \in [-0.05, 0.05]$ (generator sketched below)
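A minimal sketch of this generator; the context distribution and the unit normalizations are our assumptions, since the slide only specifies the reward model and the noise range.

```python
import numpy as np

d, K = 10, 100
rng = np.random.default_rng(0)
U = rng.normal(size=(K, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # one weight vector per action

def draw_context_and_rewards():
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                      # assumed: unit-norm contexts
    eps = rng.uniform(-0.05, 0.05, size=K)      # bounded noise from the slide
    return x, U @ x + eps                       # r_{t,a} = u_a^T x_t + eps_t
```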
[Figure (d = 10, K = 100, n = 500): (a) average cumulative reward (ACR) and (b) regret vs. normalized rounds, comparing Ideal, ε-Greedy, LinUCB, LinUCBPR-ER, LinUCBPR-UR, and QMP-D.]
- LinUCBPR outperforms the others, especially in the early rounds
- LinUCBPR-ER is better than LinUCBPR-UR
Simple Supervised-to-contextual-bandit Datasets
- apply the supervised-to-contextual-bandit transform [Dudík et al., 2011] to 8 multiclass datasets
[Figure: ACR vs. normalized rounds on (a) acoustic, (b) covtype, (c) letter, (d) pendigits, (e) poker, (f) satimage, (g) shuttle, (h) usps; methods: Ideal, ε-Greedy, LinUCB, LinUCBPR-ER, LinUCBPR-UR, QMP-D.]
- LinUCBPR-ER again achieves the best performance
Real-world Datasets
News recommendation dataset by Yahoo!
- featured in the ICML 2012 workshop competition
- the only public dataset for the contextual bandit problem
- dynamic action set
[Figure: CTR vs. normalized rounds on (a) R6A, (b) R6B, (c) R6B-clean; methods: ε-Greedy, LinUCB, LinUCBPR-ER, LinUCBPR-UR, QMP-D.]
- LinUCBPR-ER is stable and promising
Conclusion
- formalize the piled-reward setting for the contextual bandit problem
- demonstrate how LinUCB can be applied to the piled-reward setting, and prove its regret bound
- propose LinUCBPR, and prove the regret bound of LinUCBPR-ER
- validate the promising performance of LinUCBPR-ER
Thank you! Any questions?