
Page 1: Part 9: Bandit Problems

Part 9: Bandit Problems

Introduction to Reinforcement Learning

Page 2: Part 9: Bandit Problems

N-armed Bandits

❐ The simplest problems with evaluative feedback (rewards)

❐ Evaluating actions vs. instructing by giving correct actions

❐ Supervised learning is instructive; optimization is evaluative

❐ Associative vs. Nonassociative:
   – Associative: inputs mapped to actions; learn the best action for each input
   – Nonassociative: “learn” (find) one best action

❐ The n-armed bandit (at least how we treat it) is:
   – Nonassociative
   – Evaluative feedback

Sutton & Barto, Chapter 2

Page 3: Part 9: Bandit Problems

The n-Armed Bandit Problem

❐ Choose repeatedly from one of n actions; each choice is called a play

❐ After each play At, you get a reward Rt, where

      E{ Rt } = q*(At)

   These are the unknown action values; the distribution of Rt depends only on At

❐ Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them

Page 4: Part 9: Bandit Problems

The Exploration/Exploitation Dilemma

❐ Suppose you form action-value estimates   Qt(a) ≈ q*(a)

❐ The greedy action at t is   At* = argmax_a Qt(a)

      At = At*  ⇒  exploitation
      At ≠ At*  ⇒  exploration

❐ You can’t exploit all the time; you can’t explore all the time

❐ You can never stop exploring; but maybe you should always reduce exploring. Maybe not.

Page 5: Part 9: Bandit Problems

Action-Value Methods

❐ Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen Ka times, producing rewards R1, R2, ..., RKa; then

      Qt(a) = (R1 + R2 + ... + RKa) / Ka        “sample average”

      lim (Ka → ∞)  Qt(a) = q*(a)
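A minimal sketch of the sample-average estimate in Python (not from the slides); the `reward_history` dictionary is a hypothetical bookkeeping structure for the rewards each action has produced so far:

```python
# Sample-average estimate: Qt(a) = mean of the rewards action a has produced.
# reward_history maps each action index to its list of observed rewards
# (an illustrative data structure, chosen here for clarity).

def sample_average(reward_history, action, default=0.0):
    rewards = reward_history.get(action, [])
    if not rewards:                      # never tried: no estimate yet
        return default
    return sum(rewards) / len(rewards)   # (R1 + ... + RKa) / Ka

# Example: action 2 was played three times with rewards 1, 0, 2.
print(sample_average({2: [1.0, 0.0, 2.0]}, 2))   # -> 1.0
```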

Page 6: Part 9: Bandit Problems

ε-Greedy Action Selection

❐ Greedy action selection:

      At = At* = argmax_a Qt(a)

❐ ε-Greedy:

      At =  At*               with probability 1 − ε
            a random action   with probability ε

. . . the simplest way to balance exploration and exploitation
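A minimal sketch of ε-greedy selection, assuming the estimates are held in a plain list `Q` indexed by action (names here are illustrative):

```python
import random

def epsilon_greedy(Q, epsilon):
    """Return a random action with probability epsilon, else a greedy one.

    Q is a list of current estimates Qt(a); ties among greedy actions
    are broken at random.
    """
    if random.random() < epsilon:
        return random.randrange(len(Q))               # explore
    best = max(Q)
    greedy = [a for a, q in enumerate(Q) if q == best]
    return random.choice(greedy)                      # exploit

# Example: with epsilon = 0.1, action 1 is picked on most plays.
print(epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1))
```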

Page 7: Part 9: Bandit Problems

10-Armed Testbed

❐ n = 10 possible actions

❐ Each q*(a) is chosen randomly from a normal distribution: N(0, 1)

❐ Each reward Rt is also normal: N(q*(At), 1)

❐ 1000 plays

❐ Repeat the whole thing 2000 times and average the results
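A sketch of this testbed in Python/NumPy, paired with the ε-greedy learner from the previous slide and the incremental sample-average update that appears later in this part (the function name and defaults are illustrative):

```python
import numpy as np

def run_testbed(n_arms=10, n_plays=1000, n_runs=2000, epsilon=0.1):
    """Average reward per play for epsilon-greedy on the 10-armed testbed."""
    avg_reward = np.zeros(n_plays)
    for _ in range(n_runs):
        q_star = np.random.normal(0.0, 1.0, n_arms)   # true values ~ N(0, 1)
        Q = np.zeros(n_arms)                          # estimates
        N = np.zeros(n_arms)                          # play counts
        for t in range(n_plays):
            if np.random.rand() < epsilon:
                a = np.random.randint(n_arms)         # explore
            else:
                a = int(np.argmax(Q))                 # exploit
            R = np.random.normal(q_star[a], 1.0)      # reward ~ N(q*(a), 1)
            N[a] += 1
            Q[a] += (R - Q[a]) / N[a]                 # incremental sample average
            avg_reward[t] += R
    return avg_reward / n_runs

# Fewer runs give a quick, noisier version of the averaged learning curve.
curve = run_testbed(n_runs=200)
```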

Page 8: Part 9: Bandit Problems

ε-Greedy Methods on the 10-Armed Testbed

Page 9: Part 9: Bandit Problems

Upper Confidence Bound (UCB) action selection

❐ A better way of reducing exploration over time
   – Compute bounds on the likely range of q*(a), ∀a
   – Select the action with the largest upper bound
        · Adapts both to more plays of action a
        · and to more plays of the other actions

      At = argmax_a [ Qt(a) + c √( log t / Kt(a) ) ]
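A minimal sketch of this selection rule; treating untried actions (Kt(a) = 0) as maximal is a common convention and an assumption here, not something stated on the slide:

```python
import math

def ucb_action(Q, K, t, c=2.0):
    """Upper-confidence-bound selection: argmax_a Q[a] + c*sqrt(log t / K[a]).

    Q[a] is the value estimate, K[a] the number of times action a has been
    played, and t the current play number (starting at 1).
    """
    for a, k in enumerate(K):
        if k == 0:
            return a                                   # untried: pick it first
    n = len(Q)
    return max(range(n), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / K[a]))
```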


Page 10: Part 9: Bandit Problems

Incremental Implementation

Recall the sample average estimation method. The average of the first k rewards is (dropping the dependence on a):

      Qk+1 = (R1 + R2 + ... + Rk) / k

Can we do this incrementally (without storing all the rewards)?

Could keep a running sum and count (and divide), or, equivalently:

      Qk+1 = Qk + (1/k) [ Rk − Qk ]

This is a common form for update rules:

      NewEstimate = OldEstimate + StepSize [ Target – OldEstimate ]
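A minimal sketch of the incremental form; the class name and bookkeeping are illustrative:

```python
class IncrementalAverage:
    """Sample average without storing rewards: Q <- Q + (1/k)(R - Q)."""

    def __init__(self, Q0=0.0):
        self.Q = Q0      # current estimate (OldEstimate)
        self.k = 0       # number of rewards seen so far

    def update(self, R):
        self.k += 1
        self.Q += (R - self.Q) / self.k   # StepSize = 1/k, Target = R
        return self.Q

est = IncrementalAverage()
for R in [1.0, 0.0, 2.0]:
    est.update(R)
print(est.Q)   # 1.0, identical to (1.0 + 0.0 + 2.0) / 3
```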

Page 11: Part 9: Bandit Problems

Tracking a Nonstationary Problem

Choosing Qk to be a sample average is appropriate in a stationary problem, i.e., when none of the q*(a) change over time.

But not in a nonstationary problem.

Better in the nonstationary case is:

      Qk+1 = Qk + α [ Rk − Qk ]        for constant α, 0 < α ≤ 1

           = (1 − α)^k Q0 + Σ_{i=1..k} α (1 − α)^(k−i) Ri

      an exponential, recency-weighted average
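A minimal sketch of the constant step-size update; the toy reward sequence is only meant to show that recent rewards dominate the estimate:

```python
def constant_alpha_update(Q, R, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return Q + alpha * (R - Q)

# Older rewards are down-weighted by (1 - alpha)^age, so the estimate keeps
# tracking a target that changes, unlike the 1/k sample average.
Q = 0.0
for R in [0.0, 0.0, 0.0, 5.0]:        # the underlying value suddenly jumps
    Q = constant_alpha_update(Q, R)
print(Q)   # has already moved a fixed fraction of the way toward 5.0
```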

Page 12: Part 9: Bandit Problems

Optimistic Initial Values

❐ All methods so far depend on Q0(a), i.e., they are biased

❐ Suppose we initialize the action values optimistically, e.g., on the 10-armed testbed, use Q0(a) = 5 for all a, and α = 0.1
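A sketch of that setup, reusing the constant step-size update from the previous slide; purely greedy selection is deliberate here, since the optimistic Q0(a) = 5 is what drives the early exploration:

```python
import numpy as np

n_arms = 10
alpha = 0.1
Q = np.full(n_arms, 5.0)                     # optimistic initial values Q0(a) = 5
q_star = np.random.normal(0.0, 1.0, n_arms)  # true values of one testbed task

for t in range(1000):
    a = int(np.argmax(Q))                    # purely greedy selection
    R = np.random.normal(q_star[a], 1.0)
    Q[a] += alpha * (R - Q[a])               # every real reward "disappoints",
                                             # so the untried optimistic arms get tried
```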

Page 13: Part 9: Bandit Problems

Conclusions

❐ These are all very simple methods
   – but they are complicated enough that we will build on them
   – we should understand them completely

❐ Ideas for improvements:
   – estimating uncertainties . . . interval estimation
   – UCB (Upper Confidence Bound)
   – approximating Bayes optimal solutions
   – Gittins indices

❐ The full RL problem offers some ideas for solution . . .