
Dr. Itamar Arel
College of Engineering
Electrical Engineering & Computer Science Department
The University of Tennessee
Fall 2009
August 24, 2009

ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 2: Evaluative Feedback (Exploration vs. Exploitation)


Changes to the schedule

Please note changes to the schedule (on course webpage)

Key: mid-term moved to Oct. 5 (one week prior)


Outline

Recap

What is evaluative feedback

n-Armed Bandit Problem (test case)
Action-Value Methods
Softmax Action Selection
Incremental Implementation
Tracking nonstationary rewards
Optimistic Initial Values
Reinforcement Comparison
Pursuit methods
Associative search


Recap

RL revolves around learning from experience by interacting with the environment
Unsupervised learning discipline
Trial-and-error based
Delayed reward – main concept (value functions, etc.)
Policy maps from situations to actions
Exploitation vs. Exploration is the key challenge
We looked at the Tic-Tac-Toe example, where:

V(s) ← V(s) + α [ V(s') − V(s) ]
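A minimal Python sketch of this tabular value update (illustrative only; the board-state keys, step size, and function name are my own assumptions, not from the lecture):

```python
# Tabular value update from the Tic-Tac-Toe recap:
# V(s) <- V(s) + alpha * (V(s') - V(s))
from collections import defaultdict

alpha = 0.1                      # step size (assumed value)
V = defaultdict(float)           # value estimate per state, default 0.0

def td_update(state, next_state):
    """Move V(state) a fraction alpha toward V(next_state)."""
    V[state] += alpha * (V[next_state] - V[state])

# Example: after a move from board "X..O....." to "X..OX....":
V["X..OX...."] = 0.5             # assumed current estimate of the successor state
td_update("X..O.....", "X..OX....")
print(V["X..O....."])            # 0.05
```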


What is Evaluative Feedback?

RL uses training information that evaluates the actions taken, rather than instructs by giving the correct actions

Necessitates trial-and-error search for good behavior
Creates the need for active exploration

Pure evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible

Pure instructive feedback, on the other hand, indicates the correct action to take, independent of the action actually taken

Corresponds to supervised learning
e.g. artificial neural networks


n-Armed Bandit Problem

Let's look at a simple version of the n-armed bandit problem

First step in understanding the full RL problem

Here is the problem description:
An agent is repeatedly faced with making one out of n actions
After each step a reward value is provided, drawn from a stationary probability distribution that depends on the action selected
The agent's objective is to maximize the expected total reward over time
Each action selection is called a play or iteration

Extension of the classic slot machine ("one-armed bandit")


n-Armed Bandit Problem (cont.)

Each action has a value – an expected or mean reward given that the action is selected

If the agent knew the value of each action – the problem would be trivial

The agent maintains estimates of the values, and chooses the highest
Greedy algorithm
Directly associated with (policy) exploitation

If the agent chooses non-greedily – we say it explores
Under uncertainty the agent must explore
A balance must be found between exploration & exploitation

Initial condition: all levers assumed to yield reward = 0
We'll see several simple balancing methods and show that they work much better than methods that always exploit


Action-Value Methods

We'll look at simple methods for estimating the values of actions

Let Q*(a) denote the true (actual) value of a, and Q_t(a) its estimate at time t
The true value equals the mean reward for that action

Let's assume that by iteration (play) t, action a has been taken k_a times – hence we may use the sample-average:

Q_t(a) = ( r_1 + r_2 + ... + r_{k_a} ) / k_a

The greedy policy selects the highest sample-average, i.e.

a_t* = argmax_a Q_t(a)   (equivalently, Q_t(a_t*) = max_a Q_t(a))
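A small Python sketch of sample-average estimation with greedy selection (a rough illustration; the variable names and the Gaussian reward model are my own assumptions):

```python
import random

n = 10
counts = [0] * n          # k_a: number of times each action was taken
q_est = [0.0] * n         # Q_t(a): sample-average estimate per action

def greedy_action():
    """Pick the action with the highest current estimate (ties -> lowest index)."""
    return max(range(n), key=lambda a: q_est[a])

def update(action, reward):
    """Recompute the sample average for this action after observing a reward."""
    counts[action] += 1
    q_est[action] += (reward - q_est[action]) / counts[action]

# Tiny usage example with made-up true mean rewards:
true_means = [random.gauss(0, 1) for _ in range(n)]
for _ in range(1000):
    a = greedy_action()
    r = random.gauss(true_means[a], 1)   # noisy reward for the chosen lever
    update(a, r)
```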


Action-Value Methods (cont.)

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select an action at random

This is called an ε-greedy method

We simulate the 10-armed bandit problem, where …
r_a ~ N(Q*(a), 1)   (noisy readings of rewards)
Q*(a) ~ N(0, 1)   (actual, true mean reward for action a)

We further assume that there are 2000 machines (tasks), each with 10 levers
The reward distributions are drawn independently for each machine

Each iteration, choose a lever on each machine and calculate the average reward over all 2000 machines
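A rough Python sketch of this testbed with ε-greedy selection (parameter values and function names are assumptions; this is not the exact code behind the lecture's plots):

```python
import random

def run_bandit(epsilon, n_tasks=2000, n_arms=10, n_plays=1000):
    """Average reward per play of epsilon-greedy over many independent bandit tasks."""
    avg_reward = [0.0] * n_plays
    for _ in range(n_tasks):
        true_q = [random.gauss(0, 1) for _ in range(n_arms)]    # Q*(a) ~ N(0,1)
        q_est = [0.0] * n_arms                                   # initial estimates = 0
        counts = [0] * n_arms
        for t in range(n_plays):
            if random.random() < epsilon:
                a = random.randrange(n_arms)                     # explore
            else:
                a = max(range(n_arms), key=lambda i: q_est[i])   # exploit
            r = random.gauss(true_q[a], 1)                       # r_a ~ N(Q*(a), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]               # sample-average update
            avg_reward[t] += r / n_tasks
    return avg_reward

# e.g. compare greedy (epsilon=0) with epsilon = 0.01 and 0.1
```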


Action-Value Methods (cont.)



Side note: the optimal average reward

(Figure: expected value of the maximal reward, E[ max(z_1, z_2, ..., z_n) ] with z_i ~ N(0, 1), plotted against the number of levers n = 2, ..., 20; the vertical axis spans 0 to 2.)
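For intuition, a quick Monte Carlo sketch of this quantity (my own illustrative code, not from the slides):

```python
import random

def expected_max_reward(n_levers, n_samples=100_000):
    """Monte Carlo estimate of E[max(z_1,...,z_n)] with z_i ~ N(0,1)."""
    total = 0.0
    for _ in range(n_samples):
        total += max(random.gauss(0, 1) for _ in range(n_levers))
    return total / n_samples

for n in (2, 5, 10, 20):
    print(n, round(expected_max_reward(n), 2))   # roughly 0.56, 1.16, 1.54, 1.87
```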


Action-Value Methods (cont.)

The advantage of ε-greedy methods depends on the task
If the rewards have high variance, ε-greedy has a stronger advantage
If the rewards had zero variance, a greedy algorithm would have sufficed

If the problem were non-stationary (true reward values changing slowly over time), ε-greedy would have been a must

Q: Perhaps some better methods exist?


Softmax Action Selection

So far we assumed that while exploring (using ε-greedy) we chose equally among the alternatives

This means we could have chosen really badly, as opposed (for example) to choosing the next-best action

The obvious solution is to rank the alternatives …
Generate a probability density/mass function to estimate the rewards from each action
All actions are ranked/weighted
Typically use the Boltzmann distribution, i.e. choose action a on iteration t with probability

Pr{ a_t = a } = e^{ Q_t(a)/τ } / Σ_{b=1}^{n} e^{ Q_t(b)/τ }

(τ is the temperature parameter)
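A minimal Python sketch of Boltzmann (softmax) action selection (illustrative; the temperature value and helper name are assumptions):

```python
import math
import random

def softmax_action(q_est, tau=0.5):
    """Sample an action with probability proportional to exp(Q_t(a)/tau)."""
    prefs = [math.exp(q / tau) for q in q_est]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_est)), weights=probs)[0]

# Example: action 2 is chosen most often, but the others keep nonzero probability
print(softmax_action([0.1, 0.5, 1.2, -0.3]))
```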


Softmax Action Selection (cont.)

(Figure: four panels, each plotting average value against action index for four actions, shown on progressively smaller vertical scales.)


Incremental Implementation

Sample-average methods require linearly-increasing memory (storage of the reward history)

We need a more memory-efficient method …

Q_{k+1} = ( 1/(k+1) ) Σ_{i=1}^{k+1} r_i
        = ( 1/(k+1) ) [ r_{k+1} + Σ_{i=1}^{k} r_i ]
        = ( 1/(k+1) ) [ r_{k+1} + k Q_k ]
        = Q_k + ( 1/(k+1) ) [ r_{k+1} − Q_k ]
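A small Python sketch of this constant-memory form of the sample average, with names of my own choosing:

```python
class IncrementalAverage:
    """Maintains Q_k without storing the reward history."""
    def __init__(self):
        self.k = 0        # number of rewards seen so far
        self.q = 0.0      # current estimate Q_k

    def update(self, reward):
        # Q_{k+1} = Q_k + 1/(k+1) * (r_{k+1} - Q_k)
        self.k += 1
        self.q += (reward - self.q) / self.k
        return self.q

est = IncrementalAverage()
for r in (1.0, 0.0, 2.0):
    est.update(r)
print(est.q)   # 1.0, the mean of the three rewards
```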


Recurring theme in RL

The previous result is consistent with a recurring theme in RL, which is

New_Estimate ← Old_Estimate + StepSize × [ Target − Old_Estimate ]

The StepSize may be fixed or adaptive (in accordance with the specific application)


Tracking a Nonstationary Problem

So far we have considered stationary problems
In reality, many problems are effectively nonstationary
A popular approach is to weigh recent rewards more heavily than older ones
One such technique is called fixed step size

This is a weighted average in which the weights decrease exponentially:

Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]
        = α r_{k+1} + (1 − α) Q_k
        = α r_{k+1} + (1 − α) α r_k + (1 − α)^2 Q_{k−1}
        = ...
        = (1 − α)^{k+1} Q_0 + Σ_{i=1}^{k+1} α (1 − α)^{k+1−i} r_i
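A brief Python sketch contrasting the fixed step-size estimate with the plain sample average on a drifting reward (all parameter values are assumptions for illustration):

```python
import random

alpha = 0.1                 # fixed step size
q_fixed, q_avg, k = 0.0, 0.0, 0
true_value = 0.0            # slowly drifting true mean reward

for t in range(5000):
    true_value += random.gauss(0, 0.01)          # nonstationary drift
    r = random.gauss(true_value, 1)
    k += 1
    q_avg += (r - q_avg) / k                     # sample average (weights all history equally)
    q_fixed += alpha * (r - q_fixed)             # recency-weighted average

print(round(true_value, 2), round(q_avg, 2), round(q_fixed, 2))
# q_fixed typically tracks the drifting true value more closely than q_avg
```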


Optimistic Initial Values

All methods discussed so far depend, to some extent, on the initial action-value estimates, Q_0(a)

For sample-average methods – this bias disappears once all actions have been selected at least once

For fixed step-size methods, the bias disappears with time (geometrically decreasing)

In the 10-armed bandit example with α = 0.1 …
If we were to set all initial reward guesses to +5 (instead of zero)
Exploration is guaranteed, since the true values are ~N(0, 1)
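A short Python sketch of this idea: optimistic initialization with a purely greedy policy (parameter values assumed):

```python
import random

n_arms = 10
true_q = [random.gauss(0, 1) for _ in range(n_arms)]    # true values ~ N(0,1)
alpha = 0.1
q_est = [5.0] * n_arms        # optimistic initial guesses: +5 instead of 0

for t in range(1000):
    a = max(range(n_arms), key=lambda i: q_est[i])       # always greedy
    r = random.gauss(true_q[a], 1)
    q_est[a] += alpha * (r - q_est[a])                   # estimate falls toward the truth,
                                                         # so every arm gets tried early on
```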


Reinforcement Comparison

An intuitive element in RL is that …
higher rewards – made more likely to occur
lower rewards – made less likely to occur

How is the learner to know what constitutes a high or low reward?

To make a judgment, one must compare the reward to a reference reward, r̄_t
Natural choice – the average of previously received rewards
These methods are called reinforcement comparison methods

The agent maintains an action preference value, p_t(a), for each action a
The preference might be used to select an action according to a softmax relationship:

π_t(a) = Pr{ a_t = a } = e^{ p_t(a) } / Σ_{b=1}^{n} e^{ p_t(b) }


Reinforcement Comparison (cont.)

The reinforcement comparison idea is used in updating the action preferences

A high reward increases the probability of an action being selected, and vice versa

Following the action preference update, the agent updates the reference reward

Using a separate step size allows us to differentiate between the update rates for r̄_t and p_t

p_{t+1}(a_t) = p_t(a_t) + β [ r_t − r̄_t ]
r̄_{t+1} = r̄_t + α [ r_t − r̄_t ]
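A compact Python sketch of these two updates (a guess at a concrete implementation; the step sizes, reward model, and softmax helper are assumptions):

```python
import math
import random

n_arms = 4
alpha, beta = 0.1, 0.1
prefs = [0.0] * n_arms          # p_t(a): action preferences
ref_reward = 0.0                # reference reward r_bar_t
true_q = [0.2, -0.1, 0.8, 0.0]  # made-up true mean rewards

def softmax_sample(p):
    """Sample an action with probability proportional to exp(p_t(a))."""
    w = [math.exp(x) for x in p]
    s = sum(w)
    return random.choices(range(len(p)), weights=[x / s for x in w])[0]

for t in range(2000):
    a = softmax_sample(prefs)
    r = random.gauss(true_q[a], 1)
    prefs[a] += beta * (r - ref_reward)        # p_{t+1}(a_t) = p_t(a_t) + beta*(r_t - r_bar_t)
    ref_reward += alpha * (r - ref_reward)     # r_bar_{t+1} = r_bar_t + alpha*(r_t - r_bar_t)
```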


Pursuit Methods

Another class of effective learning methods are pursuit methods

They maintain both action-value estimates and action preferences
The preferences continually "pursue" the greedy actions

Letting a*_{t+1} = argmax_a Q_{t+1}(a) denote the greedy action, the update rules for the selection probabilities are:

π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [ 1 − π_t(a*_{t+1}) ]
π_{t+1}(a) = π_t(a) + β [ 0 − π_t(a) ]    for all a ≠ a*_{t+1}

The action-value estimates, Q_{t+1}(a), are updated using one of the ways described earlier (e.g. sample averages of observed rewards)
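An illustrative Python sketch of a pursuit update (assumed parameter values; the probability vector plays the role of the action preferences here):

```python
import random

n_arms = 5
beta = 0.01
probs = [1.0 / n_arms] * n_arms        # pi_t(a): selection probabilities
q_est = [0.0] * n_arms
counts = [0] * n_arms
true_q = [random.gauss(0, 1) for _ in range(n_arms)]

for t in range(5000):
    a = random.choices(range(n_arms), weights=probs)[0]
    r = random.gauss(true_q[a], 1)
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]            # sample-average value estimate
    greedy = max(range(n_arms), key=lambda i: q_est[i])
    for i in range(n_arms):                           # probabilities "pursue" the greedy action
        target = 1.0 if i == greedy else 0.0
        probs[i] += beta * (target - probs[i])
```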


Pursuit Methods (cont.)


Associative Search

So far we've considered nonassociative tasks, in which there was no association of actions with states
Find the single best action when the task is stationary, or
Track the best action as it changes over time

However, in RL the goal is to learn a policy (i.e. state-to-action mappings)

A natural extension of the n-armed bandit:
Assume you have K machines, but only one is played at a time
The agent maps the state (i.e. which machine is being played) to the action

This would be called an associative search task
It is like the full RL problem in that it involves a policy
However, it lacks the long-term reward prospect of the full RL problem
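A tiny Python sketch of such an associative bandit, keeping value estimates per (machine, lever) pair (everything here is an assumed illustration, not from the slides):

```python
import random

K, n_arms = 3, 10
epsilon = 0.1
# True mean rewards differ per machine (state); one table per machine
true_q = [[random.gauss(0, 1) for _ in range(n_arms)] for _ in range(K)]
q_est = [[0.0] * n_arms for _ in range(K)]
counts = [[0] * n_arms for _ in range(K)]

for t in range(10000):
    s = random.randrange(K)                              # which machine we face this play
    if random.random() < epsilon:
        a = random.randrange(n_arms)                      # explore
    else:
        a = max(range(n_arms), key=lambda i: q_est[s][i]) # exploit for this state
    r = random.gauss(true_q[s][a], 1)
    counts[s][a] += 1
    q_est[s][a] += (r - q_est[s][a]) / counts[s][a]       # per-state sample average

# The learned mapping s -> argmax_a q_est[s][a] is a simple policy
```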


Summary

We've looked at various action-selection schemes

Balancing exploration vs. exploitation
ε-greedy
Softmax techniques

There is no single best solution for all problems

We'll see more of this issue later …