Reinforcement Learning - The one with the bandits




1

Reinforcement Learning - The one with the bandits


2

REINFORCEMENT LEARNING

- Evaluative feedback: judges the actions taken; it does not instruct which action is best
- Contrast with purely instructive feedback (supervised learning)
- The k-armed bandit: single-state, non-associative RL


3

EXPLORATION - EXPLOITATION DILEMMA

- Online decision making involves a fundamental choice:
  - Exploration: take a random action to gather information
  - Exploitation: take the best action according to current knowledge
- How do we choose the best behaviour?
- Gather enough information to learn the best overall decisions
- May need to sacrifice short-term for long-term reward: myopic vs. far-sighted


4

The k-armed Bandit Problem


5

THE K-ARMED BANDIT PROBLEM

                 a1   a2   a3   ...   ak
Expected value    5    6    4   ...    7


9

THE K-ARMED BANDIT PROBLEM

                 a1   a2   a3   ...   ak
Expected value    5    6    4   ...    7
trial 1      (observed reward 4)
trial 2      (observed reward 7)
trial 3      (observed reward 4)
   ...
trial n      (observed reward 5)
trial n+1         ?    ?    ?   ...    ?


10

THE K-ARMED BANDIT PROBLEM

- Choose repeatedly from k actions; each choice is a play
- After each play, receive a reward drawn from a stationary distribution that depends on the chosen action
- Objective: maximize the expected total reward over some time period
- Analogous to a slot machine, but with many arms


11

THE K-ARMED BANDIT PROBLEM

- k possible actions at each time step t
- The action chosen at time step t is denoted A_t
- The corresponding reward is R_t
- Each action has an expected reward, its value: q*(a) = E[R_t | A_t = a]
- If the values were known, the problem would be trivial: select the action with the highest value
- Instead we estimate them: Q_t(a)
- We want Q_t(a) to be as close as possible to q*(a) (see the sketch below)
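
To make q*(a) and Q_t(a) concrete, here is a minimal numpy sketch (my own illustration, assuming Gaussian rewards as in the testbed later): pulling each arm many times and averaging its rewards recovers the expected values.

    import numpy as np

    rng = np.random.default_rng(0)
    k = 4
    q_star = np.array([5.0, 6.0, 4.0, 7.0])   # true action values, as in the table above

    # Pull each arm many times; the sample mean of its rewards estimates q*(a).
    n_pulls = 10_000
    rewards = rng.normal(loc=q_star, scale=1.0, size=(n_pulls, k))  # assumed R ~ N(q*(a), 1)
    Q = rewards.mean(axis=0)                  # Q_t(a): sample-average estimate of q*(a)
    print(Q)                                  # close to [5, 6, 4, 7]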


12

THE K-ARMED BANDIT PROBLEM

- Suppose you compute estimates; then Q_t(a) ≈ q*(a) as t grows
- The greedy action is a*_t = argmax_a Q_t(a)
- a_t = a*_t → exploitation
- a_t ≠ a*_t → exploration
- Constant exploration is a bad idea
- Constant exploitation is a bad idea
- Stopping exploration is a bad idea
- Reducing exploration over time: maybe good?
- There is a conflict between exploring and exploiting


13

TO EXPLORE OR TO EXPLOIT? THAT IS THE QUESTION

- Many sophisticated methods exist for different formulations
- They make strong assumptions about stationarity and prior knowledge
- These assumptions are hard to verify, or are violated outright, in the full RL problem
- Here we only worry about balancing the two to some degree


14

RANDOM EXPLORATION

- Simplest form of action selection
- Good for exploration
- Bad for everything else

a_t = random action


15

ACTION-VALUE METHODS

Methods that estimate action values, and nothing else, and use these estimates to select actions.

Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbf{1}[A_i = a]}

- By the law of large numbers, Q_t(a) converges to q*(a) as the denominator goes to infinity
- Select the action with the highest estimated value: A_t \doteq argmax_a Q_t(a)
- This is greedy action selection (sketch below)
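
A sketch of an action-value method with greedy selection, written with the indicator sums from the formula above (my own illustration; the Gaussian reward model and the true values are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    q_star = np.array([5.0, 6.0, 4.0, 7.0])        # hypothetical true action values
    k = len(q_star)

    actions, rewards = [], []                      # history of A_i and R_i
    Q = np.zeros(k)                                # estimates Q_t(a)

    for t in range(1, 201):
        a = int(np.argmax(Q))                      # greedy selection: A_t = argmax_a Q_t(a)
        actions.append(a)
        rewards.append(rng.normal(q_star[a], 1.0)) # assumed Gaussian reward
        A, R = np.array(actions), np.array(rewards)
        for b in range(k):                         # sample-average estimate from the indicator sums
            if (A == b).any():
                Q[b] = R[A == b].sum() / (A == b).sum()
    # Note: pure greedy can lock onto whichever arm looked good first,
    # which is exactly the conflict the following slides address.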


16

ε-GREEDY

- A simple way to balance exploration and exploitation
- ε determines how to act (sketch below):

a_t = \begin{cases} a^*_t & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}

- In the limit, every action is sampled an infinite number of times
- So the estimates will converge
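
A minimal sketch of ε-greedy selection (my own illustration, assuming numpy):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        """Greedy action with probability 1 - epsilon, uniform random with probability epsilon."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))      # explore
        return int(np.argmax(Q))                  # exploit

    # usage with hypothetical estimates
    rng = np.random.default_rng(2)
    Q = np.array([0.1, 0.5, 0.2])
    a = epsilon_greedy(Q, epsilon=0.1, rng=rng)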


17

THE 10-ARMED TESTBED


18

THE 10-ARMED TESTBED

- k = 10: ten possible actions
- Each q*(a) is drawn from a zero-mean, unit-variance normal distribution N(0, 1)
- R_t is drawn from N(q*(A_t), 1)
- 1000 time steps = 1 run
- Repeat for 2000 runs and average the results
- Action-value estimates use the sample average (sketch below)
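
A condensed sketch of this experimental setup (my own implementation, assuming numpy; reduce `runs` for a quick check, since the full 2000 x 1000 loop is slow in pure Python):

    import numpy as np

    def run_testbed(epsilon, runs=2000, steps=1000, k=10, seed=0):
        """Average reward per step of epsilon-greedy with sample-average estimates."""
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(steps)
        for _ in range(runs):
            q_star = rng.normal(0.0, 1.0, k)          # true values ~ N(0, 1)
            Q = np.zeros(k)
            N = np.zeros(k)
            for t in range(steps):
                if rng.random() < epsilon:
                    a = int(rng.integers(k))          # explore
                else:
                    a = int(np.argmax(Q))             # exploit
                r = rng.normal(q_star[a], 1.0)        # reward ~ N(q*(a), 1)
                N[a] += 1
                Q[a] += (r - Q[a]) / N[a]             # incremental sample average
                avg_reward[t] += r
        return avg_reward / runs

    # e.g. compare pure greedy with epsilon-greedy:
    # greedy = run_testbed(0.0); eps01 = run_testbed(0.1)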


19

THE 10-ARMED TESTBED


20

THE 10-ARMED TESTBED

- Alternatively, decay ε over time (sketch below)
- Exploring forever: total regret grows linearly
- Exploiting forever (greedy only): total regret grows linearly
- Decaying ε: total regret can grow sublinearly

[David Silver course]
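
One possible decay schedule (my own example; the exponential form and the parameter values are assumptions, not from the slides):

    import math

    def epsilon_schedule(t, eps_start=1.0, eps_min=0.01, decay=1e-3):
        """Exponentially decaying exploration rate; parameter values are illustrative."""
        return eps_min + (eps_start - eps_min) * math.exp(-decay * t)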


21

THE 10-ARMED TESTBED

- Is ε-greedy always better?
- If the reward variance is larger?
  - ...then ε-greedy is even better
- If the variance is zero?
  - ...then greedy finds the true best action quickly
- If the problem is nonstationary?
  - ...non-greedy actions might become better than the greedy one; we need to explore
- RL problems are commonly nonstationary and require a trade-off


22

INCREMENTAL IMPLEMENTATION

- Action-value methods estimate sample averages; computed naively this is inefficient
- Looking at a single action:

Q_n \doteq \frac{R_1 + R_2 + \dots + R_{n-1}}{n - 1}

- As a running update (see the sketch below):

Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right]

NewEstimate ← OldEstimate + StepSize [Target - OldEstimate]
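
A sketch of the incremental update (my own illustration); with step size 1/n it reproduces the plain sample average:

    def update(old_estimate, target, step_size):
        """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
        return old_estimate + step_size * (target - old_estimate)

    # Sample-average case: after the n-th reward for an action, step_size = 1/n
    Q, n = 0.0, 0
    for reward in [4.0, 7.0, 4.0, 5.0]:          # observed rewards for one action
        n += 1
        Q = update(Q, reward, 1.0 / n)
    # Q now equals the plain average (5.0) of the rewards seen so far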


23

INCREMENTAL IMPLEMENTATION

- [Target - OldEstimate] represents the error
- We want to move towards the target, even though it may be noisy
- The StepSize changes here with n; it is generally denoted by α


24

NONSTATIONARY PROBLEMS

- Rewards can change over time
- Weight recent rewards more than long-past ones: use a constant step size α ∈ (0, 1]:

Q_{n+1} \doteq Q_n + \alpha \left[ R_n - Q_n \right]

- This results in a weighted average (sketch below):

Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i

- Also called an exponential recency-weighted average
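
A sketch of the constant step-size update, plus a check that it matches the weighted-average form above (my own illustration, assuming numpy):

    import numpy as np

    def constant_alpha_update(Q, reward, alpha):
        """Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards dominate old ones."""
        return Q + alpha * (reward - Q)

    def recency_weighted_average(Q1, rewards, alpha):
        """The same estimate written as the exponential recency-weighted average."""
        n = len(rewards)
        weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))   # alpha * (1 - alpha)^(n - i)
        return (1 - alpha) ** n * Q1 + np.dot(weights, rewards)

    rng = np.random.default_rng(3)
    rewards = rng.normal(size=20)
    Q = 0.0
    for r in rewards:
        Q = constant_alpha_update(Q, r, alpha=0.1)
    assert np.isclose(Q, recency_weighted_average(0.0, rewards, alpha=0.1))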


25

ON CONVERGENCE

- Stochastic approximation theory says that to converge with probability 1 the step sizes must satisfy:

\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty

- Large enough to overcome initial conditions, small enough to assure convergence
- Satisfied by the sample-average step size, but not by a constant α
- Not converging is actually desirable in nonstationary problems
- Good for theoretical work, less so in practice (worked example below)
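
Worked example: the sample-average step size satisfies both conditions, while a constant step size fails the second one:

    \alpha_n(a) = \tfrac{1}{n}: \quad \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \quad \sum_{n=1}^{\infty} \tfrac{1}{n^2} = \tfrac{\pi^2}{6} < \infty \quad \text{(both conditions hold)}

    \alpha_n(a) = \alpha \ \text{(constant)}: \quad \sum_{n=1}^{\infty} \alpha = \infty, \quad \sum_{n=1}^{\infty} \alpha^2 = \infty \quad \text{(second condition fails; no convergence guarantee)}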


26

OPTIMISTIC INITIAL VALUES

- Initial estimates bias behaviour!
- Example: a 10-armed bandit with Q_1(a) = +5 for all a (sketch below)
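
A minimal sketch of the trick (my own illustration): start every estimate at +5, well above any plausible reward, so even pure greedy selection is driven to try each arm early on.

    import numpy as np

    rng = np.random.default_rng(4)
    k = 10
    q_star = rng.normal(0.0, 1.0, k)

    Q = np.full(k, 5.0)              # optimistic initial values: Q_1(a) = +5 for all a
    N = np.zeros(k)

    for t in range(1000):
        a = int(np.argmax(Q))        # pure greedy, yet it explores early on
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # each disappointing reward pulls Q[a] down,
                                     # so the next (still optimistic) arm gets tried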


27

UPPER CONFIDENCE BOUND ACTION SELECTION

- Select non-greedy actions according to their potential for actually being optimal (sketch below):

A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
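
A sketch of this selection rule (my own illustration; taking untried actions first is the usual convention):

    import numpy as np

    def ucb_action(Q, N, t, c=2.0):
        """Upper-confidence-bound selection; untried actions (N == 0) are taken first."""
        untried = np.where(N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))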


28

UPPER CONFIDENCE BOUND ACTION SELECTION

- The ln means the exploration bonus grows more slowly over time, yet it is unbounded, so every action will eventually be tried
- Actions with low estimates, or actions chosen frequently, are selected with decreasing frequency
- UCB is difficult to extend beyond bandits to the general RL problem
- Nonstationarity is a problem
- Large state spaces are a problem


29

SOFTMAX ACTION SELECTION

- Grade action probabilities by their estimated values
- Commonly a Gibbs or Boltzmann distribution is used: choose action a at time t with probability (sketch below):

\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{k} e^{Q_t(b)/\tau}}

- τ controls exploration and is called the temperature
- It is easier to set ε with confidence than τ
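
A sketch of softmax selection (my own illustration, assuming numpy; the max-subtraction is only for numerical stability):

    import numpy as np

    def softmax_action(Q, tau, rng):
        """Sample an action from the Boltzmann distribution over value estimates."""
        prefs = np.asarray(Q) / tau
        prefs -= prefs.max()                      # numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(Q), p=probs))

    # High tau: nearly uniform (more exploration). Low tau: nearly greedy.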


30

GRADIENT BANDIT

- Learn a numerical preference H_t(a) for each action rather than an action value; the larger the preference, the more often the action is chosen
- Only the relative preferences of actions over other actions matter; adding 1000 to all of them makes no difference to the probabilities (usually a Gibbs/softmax distribution):

\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)

- At the start, all preferences are equal


31

GRADIENT BANDIT

- The idea comes from stochastic gradient ascent. Update (sketch below):

H_{t+1}(A_t) \doteq H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t)), and
H_{t+1}(a) \doteq H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a), for all a ≠ A_t

- \bar{R}_t serves as a baseline for comparison
- If the reward is below the baseline, the probability of the taken action is decreased
- The other actions move in the opposite direction
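
A sketch of one preference update (my own illustration; `baseline` would typically be the running average reward, and numpy is assumed):

    import numpy as np

    def gradient_bandit_update(H, action, reward, baseline, alpha):
        """One preference update: the chosen action moves with (R - baseline), the others against it."""
        pi = np.exp(H - H.max())
        pi /= pi.sum()                                  # current softmax policy pi_t
        H = H - alpha * (reward - baseline) * pi        # all actions: -alpha * (R - Rbar) * pi(a)
        H[action] += alpha * (reward - baseline)        # chosen action gains the (1 - pi(A_t)) term
        return H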


32

ASSOCIATIVE SEARCH - CONTEXTUAL BANDITS

- So far we considered nonassociative tasks: no different situations, a single state
  - find the single best action when the problem is stationary
  - or track the best action as it changes over time when it is nonstationary
- In RL we generally face different situations: find the best action for each situation
- If a slot machine changes its action values quickly, will the methods we talked about work?
- If it signals the change, e.g. by its display colour, can we associate that signal with an action?
  - yes, with the right policy (sketch below)
- Like full RL (we search for a policy), but like the k-armed bandit (actions only affect the immediate reward)
- If actions also affect the next state and reward, it is the full RL problem
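
A minimal sketch of the associative idea (my own illustration, not from the slides): keep one row of estimates per signal, so the observed signal selects which estimates drive the ε-greedy choice.

    import numpy as np

    rng = np.random.default_rng(5)
    n_signals, k = 2, 10                      # e.g. signal = display colour (red / green)
    Q = np.zeros((n_signals, k))              # one row of value estimates per signal
    N = np.zeros((n_signals, k))

    def act(signal, epsilon=0.1):
        if rng.random() < epsilon:
            return int(rng.integers(k))
        return int(np.argmax(Q[signal]))

    def learn(signal, action, reward):
        N[signal, action] += 1
        Q[signal, action] += (reward - Q[signal, action]) / N[signal, action]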


33

CONCLUSION

- Despite their simplicity, these methods are fairly close to the state of the art
- But which one is best?


34

CONCLUSION

- None of this is a fully satisfactory answer to the general exploration-exploitation dilemma
- Gittins-index action values:
  - in some cases tractable and lead to the optimal solution, given complete knowledge
  - do not generalize to the full RL problem
- In a Bayesian setting one can compute the information state given a horizon
  - Bayesian approach: known prior, updated after each step
  - select actions based on the posterior: posterior or Thompson sampling; performs similarly to distribution-free methods
  - the computation grows very large and is not feasible; approximate solutions are a topic for research


35

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

A Markov decision process consists of:
- A set of states S
- A set of actions A
- A transition probability matrix P:

P^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]

- A reward function R:

R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]

- A discount factor γ ∈ [0, 1]

This assumes the Markov property, i.e. the next state and reward are independent of the history given the current state.


36

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

- The goal is to maximize the expected long-term future reward
- Usually a discounted sum (see the sketch below):

\sum_{t=0}^{\infty} \gamma^t r_{t+1}

- Not the same as maximizing immediate rewards, because of γ
- Learn a policy π(a|s); in contrast to what we saw before, this requires actions to be chosen for specific states
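
A small sketch of the discounted return for a finite reward sequence (my own illustration):

    def discounted_return(rewards, gamma):
        """Compute sum_{t>=0} gamma^t * r_{t+1} for a finite reward sequence."""
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G

    # The same rewards are worth less the later they arrive:
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 ≈ 2.71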


37

SNEAK PEEK AT MDPS - EXAMPLE: RECYCLING ROBOT