
1

Reinforcement Learning: The one with the bandits

2

REINFORCEMENT LEARNING

- Evaluative feedback
  - judges the actions taken
  - does not instruct which action is the best
- Contrast to purely instructive feedback (supervised learning)
- k-armed bandit: single-state RL, non-associative

3

EXPLORATION - EXPLOITATION DILEMMA

- Online decision making involves a fundamental choice:
  - Exploration: take a random action to gather information
  - Exploitation: perform the best action according to current knowledge
- How to choose the best behaviour?
- Gather enough information to learn the best overall decisions
- May need to sacrifice short-term for long-term rewards: myopic vs. far-sighted.

4

The k-armed Bandit Problem

5

THE K-ARMED BANDIT PROBLEM

                     a1    a2    a3    ...    ak
    Expected value    5     6     4    ...     7

    Observed rewards (one arm pulled per trial):
      trial 1: 4,   trial 2: 7,   trial 3: 4,   ...,   trial n: 5
      trial n+1: ?

10

THE K-ARMED BANDIT PROBLEM

- Choose repeatedly from one of k actions; each choice is a play
- After each play, receive a reward drawn from a stationary distribution that depends on the chosen action
- Objective: maximize the expected total reward over some time period
- Analogy: a slot machine, but with many arms


11

THE K-ARMED BANDIT PROBLEM

- k possible actions at each time step t
- The action chosen at time step t is denoted A_t
- The corresponding reward is R_t
- Each action has an expected reward, its value:

    q_*(a) = E[R_t | A_t = a]

- If the values were known, the problem would be trivial: always select the highest-valued action.
- We can only estimate them: Q_t(a)
- We want Q_t(a) to be as close as possible to q_*(a)
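As a concrete illustration (not from the slides), here is a minimal Python sketch of such a bandit; the class name, the Gaussian true values, and the unit-variance reward noise are assumptions borrowed from the 10-armed testbed described later.

```python
import numpy as np

class KArmedBandit:
    """Stationary k-armed bandit: pulling arm a yields a reward centred on q_*(a)."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true (unknown) action values

    def step(self, action):
        # Single state: the reward depends only on the chosen action.
        return self.rng.normal(self.q_star[action], 1.0)
```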


12

THE K-ARMED BANDIT PROBLEM

- Suppose you maintain estimates such that Q_t(a) ≈ q_*(a) as t grows
- The greedy action is a*_t = argmax_a Q_t(a)
- a_t = a*_t → exploitation
- a_t ≠ a*_t → exploration
- Constant exploration is a bad idea.
- Constant exploitation is a bad idea.
- Stopping exploration is a bad idea.
- Reducing exploration over time: maybe good?
- There is an inherent conflict between exploring and exploiting.


13

TO EXPLORE OR TO EXPLOIT? THAT IS THE QUESTION

- Many sophisticated methods exist for different formulations.
- They make strong assumptions about stationarity and prior knowledge.
- These assumptions are hard to verify, or are altogether violated, in the full RL problem.
- Here we only worry about balancing exploration and exploitation to some extent.

14

RANDOM EXPLORATION

- Simplest form of action selection
- Good for exploration
- Bad for everything else

    a_t = random action


15

ACTION-VALUE METHODS

Methods that estimate action values, and nothing else, and use these estimates to select actions.

Sample-average estimate:

    Q_t(a) ≐ ( Σ_{i=1}^{t-1} R_i · 1[A_i = a] ) / ( Σ_{i=1}^{t-1} 1[A_i = a] )

- By the law of large numbers, Q_t(a) converges to q_*(a) as the denominator goes to infinity
- Select the action with the highest estimated value: A_t ≐ argmax_a Q_t(a)
- This is greedy action selection
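A small sketch of the sample-average estimate with greedy selection (illustrative; the function name and the convention of returning 0 for untried arms are assumptions).

```python
import numpy as np

def sample_average_estimates(actions, rewards, k):
    """Q_t(a): mean of the rewards received when arm a was chosen (0 if never chosen)."""
    counts = np.zeros(k)
    totals = np.zeros(k)
    for a, r in zip(actions, rewards):
        counts[a] += 1
        totals[a] += r
    return np.where(counts > 0, totals / np.maximum(counts, 1), 0.0)

# Greedy action selection with respect to the current estimates:
# a_greedy = int(np.argmax(sample_average_estimates(past_actions, past_rewards, k=10)))
```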


16

ε-GREEDY

- A simple way to balance exploration and exploitation
- ε controls how to act:

    a_t = a*_t            with probability 1 − ε
          random action   with probability ε

- In the limit, every action is sampled infinitely often
- Hence the estimates will converge (Q_t(a) → q_*(a))
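A minimal sketch of ε-greedy selection (the function name and tie-breaking via np.argmax are assumptions).

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random arm, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    return int(np.argmax(q_estimates))

# rng = np.random.default_rng(0); a = epsilon_greedy(q_estimates, 0.1, rng)
```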


17

THE 10-ARMED TESTBED

18

THE 10-ARMED TESTBED

- k = 10: ten possible actions
- Each q_*(a) is drawn from a normal distribution with zero mean and unit variance, N(0, 1)
- R_t is drawn from N(q_*(A_t), 1)
- 1000 time steps = 1 run
- Repeat for 2000 runs and average the results
- Action-value estimates use the sample-average method
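A compact sketch of this experiment, assuming ε-greedy with incremental sample-average updates; the function name and default parameters are illustrative, not the original code.

```python
import numpy as np

def run_testbed(k=10, steps=1000, runs=2000, epsilon=0.1, seed=0):
    """Average reward per step of epsilon-greedy on the 10-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, k)     # true values, fixed within a run
        q_est = np.zeros(k)                  # sample-average estimates
        counts = np.zeros(k)
        for t in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(k))
            else:
                a = int(np.argmax(q_est))
            r = rng.normal(q_star[a], 1.0)   # reward ~ N(q_*(a), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
            avg_reward[t] += r
    return avg_reward / runs
```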

19

THE 10-ARMED TESTBED

20

THE 10-ARMED TESTBED

- Alternatively, decay ε over time
- Always exploring: total regret grows linearly
- Always exploiting (greedy): total regret grows linearly
- Decaying ε appropriately: sublinear total regret

[David Silver course]
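One possible decay schedule, as a hedged sketch: a 1/t-style schedule that keeps exploring but less and less over time. The constant c and the floor eps_min are illustrative knobs, not values from the course.

```python
def decayed_epsilon(t, c=1.0, eps_min=0.0):
    """Exploration rate that shrinks roughly like c / t (t counted from 1)."""
    return max(eps_min, min(1.0, c / max(t, 1)))
```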

21

THE 10-ARMED TESTBED

- Is ε-greedy always better than greedy?
- If the reward variance is larger?
  - ...then ε-greedy is even better: more exploration is needed to find the best action.
- If the variance is zero?
  - ...then greedy finds the true best action quickly.
- If the problem is nonstationary?
  - ...nongreedy actions might become better than the current greedy one, so we need to keep exploring.
- RL problems are commonly nonstationary and require a trade-off.


22

INCREMENTAL IMPLEMENTATION

- Action-value methods estimate sample averages. Done naively (store all rewards, re-average at each step) this is computationally inefficient.
- Looking at a single action:

    Q_n ≐ (R_1 + R_2 + ... + R_{n-1}) / (n − 1)

- As an incremental (running) update:

    Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

    NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
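A sketch of the incremental form, keeping only the current estimate and a count; the class name is an assumption.

```python
class IncrementalAverage:
    """Incrementally maintained sample average: Q += (1/n) * (R - Q)."""

    def __init__(self):
        self.q = 0.0
        self.n = 0

    def update(self, reward):
        self.n += 1
        self.q += (reward - self.q) / self.n  # NewEstimate = Old + StepSize * (Target - Old)
        return self.q
```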


23

INCREMENTAL IMPLEMENTATION

- [Target − OldEstimate] represents the error in the estimate.
- We want to move towards the target, even though it may be noisy.
- The StepSize here changes with n; in general it is denoted by α.


24

NONSTATIONARY PROBLEMS

- Rewards can change over time (nonstationarity)
- Weight recent rewards more than long-past ones: use a constant step size α ∈ (0, 1]:

    Q_{n+1} ≐ Q_n + α [R_n − Q_n]

- This results in a weighted average:

    Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i

- Also called an exponential recency-weighted average.
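A sketch of the constant step-size update; the class name and the default α = 0.1 are assumptions.

```python
class RecencyWeightedAverage:
    """Constant step-size update for nonstationary rewards (exponential recency weighting)."""

    def __init__(self, alpha=0.1, q_init=0.0):
        self.alpha = alpha
        self.q = q_init

    def update(self, reward):
        # Recent rewards end up with weight alpha * (1 - alpha)^(n - i)
        self.q += self.alpha * (reward - self.q)
        return self.q
```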


25

ON CONVERGENCE

- Stochastic approximation theory says that, to converge with probability 1, the step sizes must satisfy:

    Σ_{n=1}^{∞} α_n(a) = ∞   and   Σ_{n=1}^{∞} α_n²(a) < ∞

- Steps must be large enough to eventually overcome initial conditions, yet small enough to assure convergence.
- Satisfied by the sample-average step size (1/n) but not by a constant α.
- Not converging is actually desirable in nonstationary problems.
- These conditions matter for theoretical work, less so in practice.
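A quick check of the two cases mentioned above: for the sample-average step size α_n = 1/n, Σ 1/n diverges while Σ 1/n² = π²/6 < ∞, so both conditions hold and the estimates converge. For a constant α > 0, Σ α = ∞ but Σ α² = ∞ as well, so the second condition fails: the estimates never fully converge and keep responding to the most recent rewards, which is exactly what we want when the problem is nonstationary.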

26

OPTIMISTIC INITIAL VALUES

- Initial estimates bias behaviour! Optimistic initial values encourage early exploration.
- Example: the 10-armed bandit with Q_1(a) = +5 for all a (the true values are around 0, so every first pull is "disappointing" and the greedy agent keeps trying other arms).
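A sketch of a purely greedy learner with optimistic initialization; the function name, step count, and constant step size are assumptions.

```python
import numpy as np

def optimistic_greedy(q_star, steps=200, q_init=5.0, alpha=0.1, seed=0):
    """Purely greedy agent whose initial estimates Q_1(a) = q_init are optimistically high."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    q_est = np.full(k, q_init)          # optimistic start: every arm looks great
    actions = []
    for _ in range(steps):
        a = int(np.argmax(q_est))       # greedy choice
        r = rng.normal(q_star[a], 1.0)  # early rewards fall below the estimate...
        q_est[a] += alpha * (r - q_est[a])  # ...so that estimate drops and other arms get tried
        actions.append(a)
    return actions, q_est
```

Because every early reward falls below the optimistic estimate, the greedy choice keeps rotating through the arms before settling.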

27

UPPER CONFIDENCE BOUND ACTION SELECTION

- Select non-greedy actions according to their potential of actually being optimal:

    A_t ≐ argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]

  where N_t(a) is the number of times action a has been selected so far and c > 0 controls the degree of exploration.
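A sketch of UCB selection; forcing untried arms to be chosen first and the default c = 2 are assumptions.

```python
import numpy as np

def ucb_action(q_est, counts, t, c=2.0):
    """UCB selection: untried arms first, then argmax of Q + c*sqrt(ln t / N)."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))            # treat N_t(a) = 0 as maximally uncertain
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_est) + bonus))
```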

28

UPPER CONFIDENCE BOUND ACTION SELECTION

- The ln t in the bonus means the increase slows over time, but it is still unbounded, so all actions will eventually be tried.
- Actions with low estimates, or that have already been chosen frequently, are selected with decreasing frequency.
- UCB is difficult to extend beyond bandits to the general RL problem:
  - nonstationarity is a problem
  - large state spaces are a problem

29

SOFTMAX ACTION SELECTION

- Grade action probabilities by their estimated values.
- Commonly a Gibbs (Boltzmann) distribution is used: choose action a at time t with probability

    e^{Q_t(a)/τ} / Σ_{b=1}^{k} e^{Q_t(b)/τ}

- τ is called the temperature and controls exploration (high τ → nearly uniform, low τ → nearly greedy).
- It is usually easier to set ε with confidence than to set τ.
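A sketch of softmax (Boltzmann) action selection; subtracting the maximum before exponentiating is a numerical-stability detail, not something from the slides.

```python
import numpy as np

def softmax_action(q_est, tau, rng):
    """Sample an arm from the Boltzmann distribution over the value estimates."""
    prefs = np.asarray(q_est) / tau
    prefs -= prefs.max()                 # stabilize the exponentials
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```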

30

GRADIENT BANDIT

- Learn a numerical preference H_t(a) for each action, rather than action values. The larger the preference, the more often the action is chosen.
- Only the relative preferences matter; adding 1000 to all of them makes no difference to the probabilities (a softmax / Gibbs distribution):

    Pr{A_t = a} ≐ e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} ≐ π_t(a)

- At the start all preferences are the same, so all actions are equally likely.

31

GRADIENT BANDIT

- The idea comes from stochastic gradient ascent. Update:

    H_{t+1}(A_t) ≐ H_t(A_t) + α (R_t − R̄_t)(1 − π_t(A_t)),   and
    H_{t+1}(a)  ≐ H_t(a) − α (R_t − R̄_t) π_t(a)   for all a ≠ A_t

- R̄_t serves as a baseline for comparison.
- If the reward is below the baseline, the probability of the taken action is decreased; if above, it is increased.
- The remaining actions move in the opposite direction.
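A sketch of one preference update; the caller is assumed to supply the baseline R̄_t (typically the running average of all rewards so far), and h is assumed to be a NumPy array of preferences.

```python
import numpy as np

def gradient_bandit_step(h, action, reward, baseline, alpha=0.1):
    """One update of the action preferences H_t, following the equations above."""
    pi = np.exp(h - h.max())
    pi /= pi.sum()                              # pi_t(a): softmax of the preferences
    h = h - alpha * (reward - baseline) * pi    # every action: -alpha*(R_t - baseline)*pi_t(a)
    h[action] += alpha * (reward - baseline)    # net for chosen action: +alpha*(R_t - baseline)*(1 - pi_t(A_t))
    return h
```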

32

ASSOCIATIVE SEARCH - CONTEXTUAL BANDITS

- So far, nonassociative tasks: no different situations, a single state.
  - find the single best action when the task is stationary
  - or track the best action as it changes over time when nonstationary
- In RL generally we face different situations, and must find the best action for each situation.
- If a slot machine changes its action values quickly, will the methods we discussed work?
- If it signals the change, e.g. by changing its display colour, can we associate that signal with an action?
  - yes, with the right policy
- This is like full RL (we search for a policy) but also like the k-armed bandit (actions only affect the immediate reward).
- If actions also affect the next state and future rewards, we have the full RL problem.


33

CONCLUSION

- Despite their simplicity, these methods are fairly state of the art.
- But which one is best?

34

CONCLUSION

- None of these is a fully satisfactory answer to the general exploration-exploitation dilemma.
- Gittins indices: a special kind of action value.
  - In some cases tractable and leads to the optimal solution, given complete prior knowledge.
  - Does not generalize to the full RL problem.
- Bayesian approach: assume a known prior over the action values and update it after each step.
  - Select actions based on the posterior: posterior sampling, or Thompson sampling. Performs similarly to distribution-free methods.
  - In the Bayesian setting one can, in principle, compute the full information state over a given horizon, but it grows very large and is not feasible; approximate solutions are a topic of research.

35

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

A Markov decision process consists of:
- A set of states S
- A set of actions A
- A transition probability matrix P:

    P^a_{ss'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]

- A reward function R:

    R^a_s = E[R_{t+1} | S_t = s, A_t = a]

- A discount factor γ ∈ [0, 1]

It assumes the Markov property, i.e. the next state and reward are independent of the history given the current state (and action).
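One possible way to hold these ingredients in code; the array shapes and field names are assumptions, not a standard.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    """Finite MDP: states and actions are indices into the arrays below."""
    P: np.ndarray      # transitions, shape (|A|, |S|, |S|); each P[a, s] sums to 1
    R: np.ndarray      # expected rewards, shape (|A|, |S|)
    gamma: float       # discount factor in [0, 1]
```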

36

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

- The goal is to maximize the expected long-term future reward.
- Usually the discounted sum:

    Σ_{t=0}^{∞} γ^t r_{t+1}

- Not the same as maximizing immediate rewards, because of γ.
- Learn a policy π(a|s). In contrast to what we saw before, this requires actions to be chosen for specific states.
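A tiny worked example of the discounted sum for a finite reward sequence; the choice of γ = 0.9 is arbitrary.

```python
def discounted_return(rewards, gamma=0.9):
    """Computes the sum over t of gamma**t * rewards[t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# discounted_return([1.0, 0.0, 2.0])  ->  1.0 + 0.0 + 0.81 * 2.0 = 2.62
```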

37

SNEAK PEEK AT MDPS - EXAMPLE: RECYCLING ROBOT
