Reinforcement Learning - The one with the bandits




1

Reinforcement Learning - The one with the bandits


2

REINFORCEMENT LEARNING

- Evaluative feedback: judges the actions taken; it does not instruct which action is best
- Contrast with purely instructive feedback (supervised learning)
- The k-armed bandit: single-state, non-associative RL


3

EXPLORATION - EXPLOITATION DILEMMA

- Online decision making involves a fundamental choice:
  - Exploration: take a random action to gather information
  - Exploitation: take the best action according to current knowledge
- How do we choose the best behaviour?
- Gather enough information to learn the best overall decisions
- May need to sacrifice short-term for long-term reward: myopic vs. far-sighted


4

The k-armed Bandit Problem


5

THE K-ARMED BANDIT PROBLEM

                 a1   a2   a3   ...   ak
Expected value    5    6    4   ...    7


9

THE K-ARMED BANDIT PROBLEM

                 a1   a2   a3   ...   ak
Expected value    5    6    4   ...    7
trial 1      (observed reward 4)
trial 2      (observed reward 7)
trial 3      (observed reward 4)
   ...
trial n      (observed reward 5)
trial n+1         ?    ?    ?   ...    ?


10

THE K-ARMED BANDIT PROBLEM

- Choose repeatedly from k actions; each choice is a play
- After each play, receive a reward drawn from a stationary distribution that depends on the chosen action
- Objective: maximize the expected total reward over some time period
- Analogous to a slot machine, but with many arms


11

THE K-ARMED BANDIT PROBLEM

- k possible actions at each time step t
- The action chosen at time step t is denoted A_t
- The corresponding reward is R_t
- Each action has an expected reward, its value: q*(a) = E[R_t | A_t = a]
- If the values were known, the problem would be trivial: select the action with the highest value
- Instead we estimate them: Q_t(a)
- We want Q_t(a) to be as close as possible to q*(a) (see the sketch below)
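
To make q*(a) and Q_t(a) concrete, here is a minimal numpy sketch (my own illustration, assuming Gaussian rewards as in the testbed later): pulling each arm many times and averaging its rewards recovers the expected values.

    import numpy as np

    rng = np.random.default_rng(0)
    k = 4
    q_star = np.array([5.0, 6.0, 4.0, 7.0])   # true action values, as in the table above

    # Pull each arm many times; the sample mean of its rewards estimates q*(a).
    n_pulls = 10_000
    rewards = rng.normal(loc=q_star, scale=1.0, size=(n_pulls, k))  # assumed R ~ N(q*(a), 1)
    Q = rewards.mean(axis=0)                  # Q_t(a): sample-average estimate of q*(a)
    print(Q)                                  # close to [5, 6, 4, 7]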


12

THE K-ARMED BANDIT PROBLEM

- Suppose you compute estimates; then Q_t(a) ≈ q*(a) as t grows
- The greedy action is a*_t = argmax_a Q_t(a)
- a_t = a*_t → exploitation
- a_t ≠ a*_t → exploration
- Constant exploration is a bad idea
- Constant exploitation is a bad idea
- Stopping exploration is a bad idea
- Reducing exploration over time: maybe good?
- There is a conflict between exploring and exploiting


13

TO EXPLORE OR TO EXPLOIT? THAT IS THE QUESTION

- Many sophisticated methods exist for different formulations
- They make strong assumptions about stationarity and prior knowledge
- These assumptions are hard to verify, or are violated outright, in the full RL problem
- Here we only worry about balancing the two to some degree


14

RANDOM EXPLORATION

- Simplest form of action selection
- Good for exploration
- Bad for everything else

a_t = random action


15

ACTION-VALUE METHODS

Methods that estimate action values, and nothing else, and use these estimates to select actions.

Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbf{1}[A_i = a]}

- By the law of large numbers, Q_t(a) converges to q*(a) as the denominator goes to infinity
- Select the action with the highest estimated value: A_t \doteq argmax_a Q_t(a)
- This is greedy action selection (sketch below)
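
A sketch of an action-value method with greedy selection, written with the indicator sums from the formula above (my own illustration; the Gaussian reward model and the true values are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    q_star = np.array([5.0, 6.0, 4.0, 7.0])        # hypothetical true action values
    k = len(q_star)

    actions, rewards = [], []                      # history of A_i and R_i
    Q = np.zeros(k)                                # estimates Q_t(a)

    for t in range(1, 201):
        a = int(np.argmax(Q))                      # greedy selection: A_t = argmax_a Q_t(a)
        actions.append(a)
        rewards.append(rng.normal(q_star[a], 1.0)) # assumed Gaussian reward
        A, R = np.array(actions), np.array(rewards)
        for b in range(k):                         # sample-average estimate from the indicator sums
            if (A == b).any():
                Q[b] = R[A == b].sum() / (A == b).sum()
    # Note: pure greedy can lock onto whichever arm looked good first,
    # which is exactly the conflict the following slides address.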


16

ε-GREEDY

- A simple way to balance exploration and exploitation
- ε determines how to act (sketch below):

a_t = \begin{cases} a^*_t & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}

- In the limit, every action is sampled an infinite number of times
- So the estimates will converge
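
A minimal sketch of ε-greedy selection (my own illustration, assuming numpy):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        """Greedy action with probability 1 - epsilon, uniform random with probability epsilon."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))      # explore
        return int(np.argmax(Q))                  # exploit

    # usage with hypothetical estimates
    rng = np.random.default_rng(2)
    Q = np.array([0.1, 0.5, 0.2])
    a = epsilon_greedy(Q, epsilon=0.1, rng=rng)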


17

THE 10-ARMED TESTBED


18

THE 10-ARMED TESTBED

- k = 10: ten possible actions
- Each q*(a) is drawn from a zero-mean, unit-variance normal distribution N(0, 1)
- R_t is drawn from N(q*(A_t), 1)
- 1000 time steps = 1 run
- Repeat for 2000 runs and average the results
- Action-value estimates use the sample average (sketch below)
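
A condensed sketch of this experimental setup (my own implementation, assuming numpy; reduce `runs` for a quick check, since the full 2000 x 1000 loop is slow in pure Python):

    import numpy as np

    def run_testbed(epsilon, runs=2000, steps=1000, k=10, seed=0):
        """Average reward per step of epsilon-greedy with sample-average estimates."""
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(steps)
        for _ in range(runs):
            q_star = rng.normal(0.0, 1.0, k)          # true values ~ N(0, 1)
            Q = np.zeros(k)
            N = np.zeros(k)
            for t in range(steps):
                if rng.random() < epsilon:
                    a = int(rng.integers(k))          # explore
                else:
                    a = int(np.argmax(Q))             # exploit
                r = rng.normal(q_star[a], 1.0)        # reward ~ N(q*(a), 1)
                N[a] += 1
                Q[a] += (r - Q[a]) / N[a]             # incremental sample average
                avg_reward[t] += r
        return avg_reward / runs

    # e.g. compare pure greedy with epsilon-greedy:
    # greedy = run_testbed(0.0); eps01 = run_testbed(0.1)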


19

THE 10-ARMED TESTBED


20

THE 10-ARMED TESTBED

- Alternatively, decay ε over time (sketch below)
- Exploring forever: total regret grows linearly
- Exploiting forever (greedy only): total regret grows linearly
- Decaying ε: total regret can grow sublinearly

[David Silver course]
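
One possible decay schedule (my own example; the exponential form and the parameter values are assumptions, not from the slides):

    import math

    def epsilon_schedule(t, eps_start=1.0, eps_min=0.01, decay=1e-3):
        """Exponentially decaying exploration rate; parameter values are illustrative."""
        return eps_min + (eps_start - eps_min) * math.exp(-decay * t)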


21

THE 10-ARMED TESTBED

- Is ε-greedy always better?
- If the reward variance is larger?
  - ...then ε-greedy is even better
- If the variance is zero?
  - ...then greedy finds the true best action quickly
- If the problem is nonstationary?
  - ...non-greedy actions might become better than the greedy one; we need to explore
- RL problems are commonly nonstationary and require a trade-off


22

INCREMENTAL IMPLEMENTATION

- Action-value methods estimate sample averages; computed naively this is inefficient
- Looking at a single action:

Q_n \doteq \frac{R_1 + R_2 + \dots + R_{n-1}}{n - 1}

- As a running update (see the sketch below):

Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right]

NewEstimate ← OldEstimate + StepSize [Target - OldEstimate]
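
A sketch of the incremental update (my own illustration); with step size 1/n it reproduces the plain sample average:

    def update(old_estimate, target, step_size):
        """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
        return old_estimate + step_size * (target - old_estimate)

    # Sample-average case: after the n-th reward for an action, step_size = 1/n
    Q, n = 0.0, 0
    for reward in [4.0, 7.0, 4.0, 5.0]:          # observed rewards for one action
        n += 1
        Q = update(Q, reward, 1.0 / n)
    # Q now equals the plain average (5.0) of the rewards seen so far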


23

INCREMENTAL IMPLEMENTATION

- [Target - OldEstimate] represents the error
- We want to move towards the target, even though it may be noisy
- The StepSize changes here with n; it is generally denoted by α


24

NONSTATIONARY PROBLEMS

- Rewards can change over time
- Weight recent rewards more than long-past ones: use a constant step size α ∈ (0, 1]:

Q_{n+1} \doteq Q_n + \alpha \left[ R_n - Q_n \right]

- This results in a weighted average (sketch below):

Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i

- Also called an exponential recency-weighted average
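
A sketch of the constant step-size update, plus a check that it matches the weighted-average form above (my own illustration, assuming numpy):

    import numpy as np

    def constant_alpha_update(Q, reward, alpha):
        """Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards dominate old ones."""
        return Q + alpha * (reward - Q)

    def recency_weighted_average(Q1, rewards, alpha):
        """The same estimate written as the exponential recency-weighted average."""
        n = len(rewards)
        weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))   # alpha * (1 - alpha)^(n - i)
        return (1 - alpha) ** n * Q1 + np.dot(weights, rewards)

    rng = np.random.default_rng(3)
    rewards = rng.normal(size=20)
    Q = 0.0
    for r in rewards:
        Q = constant_alpha_update(Q, r, alpha=0.1)
    assert np.isclose(Q, recency_weighted_average(0.0, rewards, alpha=0.1))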


25

ON CONVERGENCE

- Stochastic approximation theory says that to converge with probability 1 the step sizes must satisfy:

\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty

- Large enough to overcome initial conditions, small enough to assure convergence
- Satisfied by the sample-average step size, but not by a constant α
- Not converging is actually desirable in nonstationary problems
- Good for theoretical work, less so in practice (worked example below)
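
Worked example: the sample-average step size satisfies both conditions, while a constant step size fails the second one:

    \alpha_n(a) = \tfrac{1}{n}: \quad \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \quad \sum_{n=1}^{\infty} \tfrac{1}{n^2} = \tfrac{\pi^2}{6} < \infty \quad \text{(both conditions hold)}

    \alpha_n(a) = \alpha \ \text{(constant)}: \quad \sum_{n=1}^{\infty} \alpha = \infty, \quad \sum_{n=1}^{\infty} \alpha^2 = \infty \quad \text{(second condition fails; no convergence guarantee)}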


26

OPTIMISTIC INITIAL VALUES

- Initial estimates bias behaviour!
- Example: a 10-armed bandit with Q_1(a) = +5 for all a (sketch below)
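
A minimal sketch of the trick (my own illustration): start every estimate at +5, well above any plausible reward, so even pure greedy selection is driven to try each arm early on.

    import numpy as np

    rng = np.random.default_rng(4)
    k = 10
    q_star = rng.normal(0.0, 1.0, k)

    Q = np.full(k, 5.0)              # optimistic initial values: Q_1(a) = +5 for all a
    N = np.zeros(k)

    for t in range(1000):
        a = int(np.argmax(Q))        # pure greedy, yet it explores early on
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # each disappointing reward pulls Q[a] down,
                                     # so the next (still optimistic) arm gets tried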


27

UPPER CONFIDENCE BOUND ACTION SELECTION

- Select non-greedy actions according to their potential for actually being optimal (sketch below):

A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
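
A sketch of this selection rule (my own illustration; taking untried actions first is the usual convention):

    import numpy as np

    def ucb_action(Q, N, t, c=2.0):
        """Upper-confidence-bound selection; untried actions (N == 0) are taken first."""
        untried = np.where(N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))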


28

UPPER CONFIDENCE BOUND ACTION SELECTION

- The ln means the exploration bonus grows more slowly over time, yet it is unbounded, so every action will eventually be tried
- Actions with low estimates, or actions chosen frequently, are selected with decreasing frequency
- UCB is difficult to extend beyond bandits to the general RL problem
- Nonstationarity is a problem
- Large state spaces are a problem


29

SOFTMAX ACTION SELECTION

- Grade action probabilities by their estimated values
- Commonly a Gibbs or Boltzmann distribution is used: choose action a at time t with probability (sketch below):

\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{k} e^{Q_t(b)/\tau}}

- τ controls exploration and is called the temperature
- It is easier to set ε with confidence than τ
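
A sketch of softmax selection (my own illustration, assuming numpy; the max-subtraction is only for numerical stability):

    import numpy as np

    def softmax_action(Q, tau, rng):
        """Sample an action from the Boltzmann distribution over value estimates."""
        prefs = np.asarray(Q) / tau
        prefs -= prefs.max()                      # numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(Q), p=probs))

    # High tau: nearly uniform (more exploration). Low tau: nearly greedy.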


30

GRADIENT BANDIT

- Learn a numerical preference H_t(a) for each action rather than an action value; the larger the preference, the more often the action is chosen
- Only the relative preferences of actions over other actions matter; adding 1000 to all of them makes no difference to the probabilities (usually a Gibbs/softmax distribution):

\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)

- At the start, all preferences are equal


31

GRADIENT BANDIT

- The idea comes from stochastic gradient ascent. Update (sketch below):

H_{t+1}(A_t) \doteq H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t)), and
H_{t+1}(a) \doteq H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a), for all a ≠ A_t

- \bar{R}_t serves as a baseline for comparison
- If the reward is below the baseline, the probability of the taken action is decreased
- The other actions move in the opposite direction
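
A sketch of one preference update (my own illustration; `baseline` would typically be the running average reward, and numpy is assumed):

    import numpy as np

    def gradient_bandit_update(H, action, reward, baseline, alpha):
        """One preference update: the chosen action moves with (R - baseline), the others against it."""
        pi = np.exp(H - H.max())
        pi /= pi.sum()                                  # current softmax policy pi_t
        H = H - alpha * (reward - baseline) * pi        # all actions: -alpha * (R - Rbar) * pi(a)
        H[action] += alpha * (reward - baseline)        # chosen action gains the (1 - pi(A_t)) term
        return H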


32

ASSOCIATIVE SEARCH - CONTEXTUAL BANDITS

- So far we considered nonassociative tasks: no different situations, a single state
  - find the single best action when the problem is stationary
  - or track the best action as it changes over time when it is nonstationary
- In RL we generally face different situations: find the best action for each situation
- If a slot machine changes its action values quickly, will the methods we talked about work?
- If it signals the change, e.g. by its display colour, can we associate that signal with an action?
  - yes, with the right policy (sketch below)
- Like full RL (we search for a policy), but like the k-armed bandit (actions only affect the immediate reward)
- If actions also affect the next state and reward, it is the full RL problem
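
A minimal sketch of the associative idea (my own illustration, not from the slides): keep one row of estimates per signal, so the observed signal selects which estimates drive the ε-greedy choice.

    import numpy as np

    rng = np.random.default_rng(5)
    n_signals, k = 2, 10                      # e.g. signal = display colour (red / green)
    Q = np.zeros((n_signals, k))              # one row of value estimates per signal
    N = np.zeros((n_signals, k))

    def act(signal, epsilon=0.1):
        if rng.random() < epsilon:
            return int(rng.integers(k))
        return int(np.argmax(Q[signal]))

    def learn(signal, action, reward):
        N[signal, action] += 1
        Q[signal, action] += (reward - Q[signal, action]) / N[signal, action]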


33

CONCLUSION

- Despite their simplicity, these methods are fairly close to the state of the art
- But which one is best?


34

CONCLUSION

- None of this is a fully satisfactory answer to the general exploration-exploitation dilemma
- Gittins-index action values:
  - in some cases tractable and lead to the optimal solution, given complete knowledge
  - do not generalize to the full RL problem
- In a Bayesian setting one can compute the information state given a horizon
  - Bayesian approach: known prior, updated after each step
  - select actions based on the posterior: posterior or Thompson sampling; performs similarly to distribution-free methods
  - the computation grows very large and is not feasible; approximate solutions are a topic for research


35

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

A Markov decision process consists of:
- A set of states S
- A set of actions A
- A transition probability matrix P:

P^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]

- A reward function R:

R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]

- A discount factor γ ∈ [0, 1]

This assumes the Markov property, i.e. the next state and reward are independent of the history given the current state.


36

SNEAK PEEK AT MDPS - FORMAL DEFINITION OF RL

- The goal is to maximize the expected long-term future reward
- Usually a discounted sum (see the sketch below):

\sum_{t=0}^{\infty} \gamma^t r_{t+1}

- Not the same as maximizing immediate rewards, because of γ
- Learn a policy π(a|s); in contrast to what we saw before, this requires actions to be chosen for specific states
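
A small sketch of the discounted return for a finite reward sequence (my own illustration):

    def discounted_return(rewards, gamma):
        """Compute sum_{t>=0} gamma^t * r_{t+1} for a finite reward sequence."""
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G

    # The same rewards are worth less the later they arrive:
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 ≈ 2.71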


37

SNEAK PEEK AT MDPS - EXAMPLE: RECYCLING ROBOT