
Page 1: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Reinforcement Learning: A Tutorial

Rich Sutton
AT&T Labs -- Research

www.research.att.com/~sutton

Page 2: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Outline

• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration

• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups

• Lessons for Artificial Intelligence

• Concluding Remarks

Page 3: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

RL is Learning from Interaction

[Figure: the agent-environment loop. The Agent sends an action to the Environment; the Environment returns the next state and a reward.]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect the environment
• environment is stochastic & uncertain

Page 4: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Ball-Balancing Demo

• RL applied to a real physical system

• Illustrates online learning

• An easy problem...but learns as you watch!

• Courtesy of GTE Labs: Hamid Benbrahim, Judy Franklin, John Doleac, Oliver Selfridge

• Same learning algorithm as early pole-balancer

Page 5: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Markov Decision Processes (MDPs)

In RL, the environment is modeled as an MDP, defined by:

S – set of states of the environment

A(s) – set of actions possible in state s ∈ S
P(s,s',a) – probability of transition from s to s' given a
R(s,s',a) – expected reward on transition from s to s' given a
γ – discount rate for delayed reward

discrete time, t = 0, 1, 2, . . .

. . . s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, . . .

Page 6: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

The Objective is to Maximize Long-term Total Discounted Reward

Find a policy π : s ∈ S → a ∈ A(s)  (could be stochastic)

that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, π }

and of each state-action pair s, a:

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, a_t = a, π }

These are called value functions - cf. evaluation functions
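To make the expectation concrete, here is a minimal Python sketch (my illustration, not from the slides) that computes the discounted return for one observed reward sequence; the value function is just the expectation of this quantity over trajectories generated by π.

def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+1+k} over one trajectory's reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the end: g = r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62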

Page 7: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Optimal Value Functions and Policies

There exist optimal value functions:

V*(s) = max_π V^π(s)        Q*(s,a) = max_π Q^π(s,a)

And corresponding optimal policies:

π*(s) = arg max_a Q*(s,a)

π* is the greedy policy with respect to Q*

Page 8: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

1-Step Tabular Q-Learning

(Watkins, 1989)

On each state transition  s_t --a_t--> r_{t+1}, s_{t+1},  update the table entry:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

The bracketed term is the TD error.

As t → ∞:  Q(s,a) → Q*(s,a)  and  π_t → π*

Assumes finite MDP

Optimal behavior found without a model of the environment!
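As a concrete illustration of the update above, here is a minimal tabular Q-learning sketch in Python (my own illustration, not code from the tutorial). It assumes a Gym-style env with reset()/step(a) and a discrete action space, and uses the ε-greedy exploration rule discussed later in the slides.

import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # one table entry per (state, action)
    actions = list(range(env.action_space.n))   # assumes a discrete action space
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (see the Action Selection slide)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done, _ = env.step(a)
            # 1-step Q-learning backup: step size times TD error
            target = r + (0 if done else gamma * max(Q[(s_next, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

Note that the backup uses max over next actions regardless of what the agent actually does next; that is what makes Q-learning work without a model of the environment.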

Page 9: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Example of Value Functions

Page 10: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

4x4 Grid Example

[Figure: a 4x4 gridworld with terminal states (T) in two corners, reward −1 per step. Left column: V_k for the equiprobable random policy at k = 0, 1, 2, 3, 10, and ∞ (e.g., at k = 1 all non-terminal values are −1.0; at k = 3 they range from −2.4 to −3.0; at k = ∞ from −14 to −22). Right column: the greedy policy w.r.t. V_k, which becomes the optimal policy after only a few iterations, long before V_k converges to the random policy's values.]

V_{k+1}(s) = E{ r + γ V_k(s') }
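A minimal Python sketch of that sweep for the equiprobable random policy on the standard 4x4 gridworld (written from the standard example, not taken from the slides): γ = 1, reward −1 per step, states 0 and 15 terminal.

import numpy as np

def random_policy_evaluation(n_sweeps=10):
    V = np.zeros(16)                 # states 0..15; 0 and 15 are terminal
    terminal = {0, 15}
    moves = {'up': -4, 'down': 4, 'left': -1, 'right': 1}
    for _ in range(n_sweeps):
        V_new = V.copy()
        for s in range(16):
            if s in terminal:
                continue
            total = 0.0
            for name, d in moves.items():
                s2 = s + d
                # moves off the grid leave the state unchanged
                if name == 'up' and s < 4: s2 = s
                if name == 'down' and s > 11: s2 = s
                if name == 'left' and s % 4 == 0: s2 = s
                if name == 'right' and s % 4 == 3: s2 = s
                total += 0.25 * (-1.0 + V[s2])     # reward -1, gamma = 1
            V_new[s] = total
        V = V_new
    return V.reshape(4, 4)

print(np.round(random_policy_evaluation(3), 1))   # compare with the k = 3 values above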

Page 11: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Policy Iteration

Page 12: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Summary of RL Theory:Generalized Policy Iteration

[Diagram: generalized policy iteration. The policy π and the value function V, Q drive each other: policy evaluation (value learning) makes V, Q consistent with π; policy improvement (greedification) makes π greedy with respect to V, Q. The process converges to π* and V*, Q*.]

Page 13: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Outline

• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration

• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups

• Lessons for Artificial Intelligence

• Concluding Remarks

Page 14: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Sarsa Backup

On transition  s_t --a_t--> r_{t+1}, s_{t+1} --a_{t+1}-->,  update:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

The value of the pair (s_t, a_t) is backed up from the value of the next pair (s_{t+1}, a_{t+1}); the bracketed term is the TD error.

(Rummery & Niranjan; Holland; Sutton)
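For contrast with the Q-learning sketch earlier, here is a minimal Sarsa loop (my illustration; same assumed Gym-style env and ε-greedy choice as before).

import random
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    acts = list(range(env.action_space.n))            # assumes discrete actions
    def pick(s):                                      # epsilon-greedy selection
        return random.choice(acts) if random.random() < epsilon \
               else max(acts, key=lambda a: Q[(s, a)])
    for _ in range(n_episodes):
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = pick(s2)                             # the action actually taken next
            # Sarsa backup: bootstrap from Q(s', a') of the chosen action, not the max
            target = r + (0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

The only difference from Q-learning is the backup target: Sarsa uses the value of the action the agent will actually take, making it an on-policy method.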

Page 15: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Dimensions of RL algorithms

1. Action Selection and Exploration

2. Function Approximation

3. Eligibility Trace

4. On-line Experience vs Simulated Experience

5. Kind of Backups

. . .

Page 16: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

1. Action Selection (Exploration)

OK, you've learned a value function Q ≈ Q*. How do you pick actions?

Greedy Action Selection: always select the action that looks best:

π(s) = arg max_a Q(s,a)

ε-Greedy Action Selection: be greedy most of the time; occasionally take a random action

Surprisingly, this is the state of the art.

Exploitation vs Exploration!
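Both selection rules as minimal Python sketches (illustration only; Q is assumed to be a dict keyed by (state, action) pairs):

import random

def greedy(Q, s, actions):
    """Always pick the action with the highest estimated value."""
    return max(actions, key=lambda a: Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Be greedy most of the time; with probability epsilon, explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy(Q, s, actions)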

Page 17: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

2. Function Approximation

[Diagram: state s and action a are fed into a function approximator that outputs Q(s,a); it is trained by gradient-descent methods from targets or errors.]

The function approximator could be:
• a table
• a backprop neural network
• a radial-basis-function network
• tile coding (CMAC)
• nearest neighbor, memory-based methods
• a decision tree

Page 18: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Adjusting Network Weights

e.g., gradient-descent Sarsa, with Q(s,a) = f(s,a,w) for weight vector w:

w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w)

Here r_{t+1} + γ Q(s_{t+1}, a_{t+1}) is the target value, Q(s_t, a_t) is the estimated value, and ∇_w f(s_t, a_t, w) is the standard backprop gradient.
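A minimal sketch of this update for a linear approximator, where Q(s,a) = w · φ(s,a) and the gradient is just the feature vector φ(s,a). This is my illustration; phi is an assumed feature function, such as the tile coder on the following slides.

import numpy as np

def sarsa_fa_update(w, phi, s, a, r, s_next, a_next,
                    alpha=0.1, gamma=1.0, done=False):
    """One gradient-descent Sarsa step for a linear Q(s,a) = w . phi(s,a)."""
    x = phi(s, a)                                   # gradient of f(s,a,w) w.r.t. w
    q = np.dot(w, x)                                # estimated value
    q_next = 0.0 if done else np.dot(w, phi(s_next, a_next))
    target = r + gamma * q_next                     # target value
    w += alpha * (target - q) * x                   # move weights toward the target
    return w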

Page 19: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Sparse Coarse Coding

[Diagram: the input passes through a fixed, expansive re-representation into many features, followed by a linear last layer.]

Coarse: large receptive fields
Sparse: few features present at one time

Page 20: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

The Mountain Car Problem

Minimum-Time-to-Goal Problem

Moore, 1990

Gravity wins

SITUATIONS: car's position and velocity

ACTIONS: three thrusts: forward, reverse, none

REWARDS: always –1 until car reaches the goal

No Discounting

Page 21: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

RL Applied to the Mountain Car

1. TD method: Sarsa backup

2. Action selection: greedy, with optimistic initial values Q_0(s,a) = 0

3. Function approximator: CMAC (tile coding)

4. Eligibility traces: replacing

(Sutton, 1995)  Video Demo

Page 22: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Tile Coding applied to Mountain Car, a.k.a. CMACs (Albus, 1980)

[Figure: two overlapping tilings (Tiling #1 and Tiling #2), each a 5x5 grid over car position, offset from one another.]

Shape/size of tiles ⇒ generalization

# Tilings ⇒ resolution of the final approximation
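A minimal tile-coding sketch over two continuous inputs (written for illustration; real CMAC implementations usually hash the tile indices, and the tile widths and offsets here are arbitrary choices, not the tutorial's):

import numpy as np

def active_tiles(x, y, n_tilings=10, tiles_per_dim=5, lo=(0.0, 0.0), hi=(1.0, 1.0)):
    """Return one active tile index per tiling for the 2-D point (x, y)."""
    features = []
    width = ((hi[0] - lo[0]) / tiles_per_dim, (hi[1] - lo[1]) / tiles_per_dim)
    for t in range(n_tilings):
        # each tiling is offset by a fraction of a tile width
        ox = lo[0] - t * width[0] / n_tilings
        oy = lo[1] - t * width[1] / n_tilings
        ix = int((x - ox) / width[0])
        iy = int((y - oy) / width[1])
        # flatten (tiling, ix, iy) into a single feature index
        features.append(t * (tiles_per_dim + 1) ** 2 + ix * (tiles_per_dim + 1) + iy)
    return features   # exactly n_tilings features are active: sparse and coarse

Q(s,a) is then approximated as the sum of one learned weight per active tile (kept separately for each action), which is exactly the linear form used in the gradient-descent Sarsa sketch above.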

Page 23: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

The Acrobot Problem

e.g., Dejong & Spong, 1994

Sutton, 1995

Minimum–Time–to–Goal:

4 state variables: 2 joint angles (θ_1, θ_2), 2 angular velocities

CMAC of 48 layers

RL setup same as Mountain Car

Goal: raise the tip above the line; torque is applied at the second joint

Reward = −1 per time step

Page 24: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

3. Eligibility Traces (*)

• The λ in TD(λ), Q(λ), Sarsa(λ) . . .

• A more primitive way of dealing with delayed reward

• The link to Monte Carlo methods

• Passing credit back farther than one action

• The general form is:

  change in Q(s,a) at time t = (TD error at time t) • (eligibility of s,a at time t)

  The TD error is global; the eligibility is specific to s,a - e.g., was s,a taken at t, or shortly before t?

Page 25: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Eligibility Traces

[Figure: a time line of visits to s,a, with the resulting traces below: an accumulating trace increments on each visit and decays in between; a replacing trace is reset to 1 on each visit and decays in between.]
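A minimal sketch of the two trace-update rules (illustration only; e maps state-action pairs to eligibilities and is decayed by γλ every step):

def decay_traces(e, gamma, lam):
    """All eligibilities fade between visits."""
    for key in e:
        e[key] *= gamma * lam

def accumulate_trace(e, s, a):
    e[(s, a)] = e.get((s, a), 0.0) + 1.0   # accumulating: repeated visits add up

def replace_trace(e, s, a):
    e[(s, a)] = 1.0                        # replacing: each visit resets to 1

Each Q(s,a) is then moved by α · (TD error) · e[(s,a)] on every step, which is the general form given on the previous slide.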

Page 26: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Complex TD Backups (*)

[Figure: a trial decomposed into primitive TD backups of increasing depth, weighted 1−λ, (1−λ)λ, (1−λ)λ², . . . , with the full-trial (Monte Carlo) backup receiving the remaining weight; the weights sum to 1.]

e.g., TD(λ): incrementally computes a weighted mixture of these backups as states are visited

λ = 0   Simple TD
λ = 1   Simple Monte Carlo
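A quick check (illustration only) that the weights (1−λ), (1−λ)λ, (1−λ)λ², . . . on the 1-step, 2-step, 3-step, . . . backups really form a proper mixture:

lam = 0.7
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 50)]
print(sum(weights))      # approaches 1.0 as more n-step backups are included
print(weights[:3])       # (1-lam), (1-lam)*lam, (1-lam)*lam**2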

Page 27: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Kinds of Backups

[Figure: the space of backups along two dimensions. Full vs. sample backups: dynamic programming and exhaustive search use full backups; temporal-difference learning and Monte Carlo use sample backups. Shallow vs. deep backups: dynamic programming and temporal-difference learning use shallow backups (bootstrapping); exhaustive search and Monte Carlo use deep backups.]

Page 28: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Outline

• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration

• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups

• Lessons for Artificial Intelligence

• Concluding Remarks

Page 29: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

World-Class Applications of RL

• TD-Gammon and Jellyfish (Tesauro; Dahl) - World's best backgammon player

• Elevator Control (Crites & Barto) - World's best down-peak elevator controller

• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis) - 10-15% improvement over industry-standard methods

• Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin) - World's best assigner of radio channels to mobile telephone calls

Page 30: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Lesson #1: Approximate the Solution, Not the Problem

• All these applications are large, stochastic, optimal control problems

– Too hard for conventional methods to solve exactly

– Require problem to be simplified

• RL just finds an approximate solution . . . which can be much better!

Page 31: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Backgammon

SITUATIONS: configurations of the playing board (about 10^20)

ACTIONS: moves

REWARDS: win: +1, lose: −1, else: 0

Pure delayed reward

Tesauro, 1992,1995

Page 32: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

TD-Gammon

Tesauro, 1992–1995

Start with a random network

Play millions of games against self

Learn a value function from this simulated experience

This produces arguably the best player in the world

Action selection by 2-3 ply search

[Figure: the network's Value output is trained from the TD error V_{t+1} − V_t.]

Page 33: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Lesson #2: The Power of Learning from Experience

[Figure: performance against gammontool (from 50% to 70%) as a function of the number of hidden units (0, 10, 20, 40, 80), comparing TD-Gammon trained by self-play (Tesauro, 1992) with Neurogammon - the same network, but trained from 15,000 expert-labeled examples.]

Expert examples are expensive and scarce
Experience is cheap and plentiful!
And teaches the real solution

Page 34: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Lesson #3: The Central Role of Value Functions

. . . of modifiable, moment-by-moment estimates of how well things are going

All RL methods learn value functions
All state-space planners compute value functions
Both are based on "backing up" value

Recognizing and reacting to the ups and downs of life is an important part of intelligence

Page 35: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Lesson #4: Learning and Planning can be radically similar

Historically, planning and trial-and-error learning have been seen as opposites

But RL treats both as processing of experience

[Diagram: the environment produces experience via on-line interaction, and an environment model produces experience via simulation; in both cases the experience is processed into a value function/policy - learning in the first case, planning in the second.]

Page 36: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

1-Step Tabular Q-Planning

1. Generate a state, s, and action, a

2. Consult the model for the next state, s', and reward, r

3. Learn from the experience s, a, r, s':

   Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]

4. Go to 1

With function approximation and cleverness in search control (Step 1), this is a viable, perhaps even superior, approach to state-space planning

Page 37: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Lesson #5: Accretive Computation

"Solves" Planning Dilemmas

Processes are only loosely coupled
Can proceed in parallel, asynchronously
Quality of solution accumulates in the value & model memories

Reactivity/Deliberation dilemma: "solved" simply by not opposing search and memory

Intractability of planning: "solved" by anytime improvement of the solution

[Diagram: the Dyna loop - acting on the environment produces experience; direct RL updates the value function/policy from that experience; model learning updates the model from experience; planning updates the value function/policy from the model.]

Page 38: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Dyna Algorithm

1. s ← current state
2. Choose an action, a, and take it
3. Receive next state, s', and reward, r
4. Apply an RL backup to s, a, s', r (e.g., the Q-learning update)
5. Update Model(s, a) with s', r
6. Repeat k times:
   - select a previously seen state-action pair s, a
   - s', r ← Model(s, a)
   - apply the RL backup to s, a, s', r
7. Go to 1
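A minimal Dyna-Q sketch following the steps above (my illustration; same assumed Gym-style env and ε-greedy choice as in the earlier sketches, with k planning backups per real step; terminal-state bootstrapping is omitted to keep it short).

import random
from collections import defaultdict

def dyna_q(env, n_steps, k=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (s', r); deterministic model
    acts = list(range(env.action_space.n))
    def pick(s):
        return random.choice(acts) if random.random() < epsilon \
               else max(acts, key=lambda a: Q[(s, a)])
    def backup(s, a, r, s2):
        best = max(Q[(s2, a2)] for a2 in acts)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
    s = env.reset()
    for _ in range(n_steps):
        a = pick(s)                              # 2. choose an action and take it
        s2, r, done, _ = env.step(a)             # 3. receive next state and reward
        backup(s, a, r, s2)                      # 4. direct RL (Q-learning) backup
        model[(s, a)] = (s2, r)                  # 5. update the model
        for _ in range(k):                       # 6. planning: k simulated backups
            ps, pa = random.choice(list(model))  #    a previously seen state-action pair
            ps2, pr = model[(ps, pa)]
            backup(ps, pa, pr, ps2)
        s = env.reset() if done else s2
    return Q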

Page 39: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Actions → Behaviors

MDPs seem too flat, too low-level

Need to learn/plan/design agents at a higher level - at the level of behaviors rather than just low-level actions

e.g., open-the-door rather than twitch-muscle-17; walk-to-work rather than 1-step-forward

Behaviors are based on entire policies, directed toward subgoals

enable abstraction, hierarchy, modularity, transfer...

Page 40: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Rooms Example

[Figure: a gridworld of 4 rooms connected by 4 hallways, with two example goal locations marked G. Each room has two multi-step hallway options (8 in total) in addition to the primitive actions up, down, left, right.]

4 rooms
4 hallways
8 multi-step options (to each room's 2 hallways)

4 unreliable primitive actions (up, down, left, right) - fail 33% of the time

Given a goal location, quickly plan the shortest route

Goal states are given a terminal value of 1; γ = .9; all rewards zero

Page 41: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Planning by Value Iteration

[Figure: value iteration with V(goal) = 1. Top row: iterations #1, #2, #3 with primitive actions (cell-to-cell) - values spread only a few cells per iteration. Bottom row: iterations #1, #2, #3 with behaviors (room-to-room) - values propagate across whole rooms per iteration.]

Page 42: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Behaviors are much like Primitive Actions

• Define a higher-level Semi-MDP, overlaid on the original MDP

• Can be used with the same planning methods
  - faster planning
  - guaranteed correct results

• Interchangeable with primitive actions

• Models and policies can be learned by TD methods

Lesson #6: Generality is no impediment to working with higher-level knowledge

Behaviors provide a clear semantics for multi-level, re-usable knowledge in the general MDP context

Page 43: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Summary of Lessons

1. Approximate the Solution, Not the Problem

2. The Power of Learning from Experience

3. The Central Role of Value Functions in finding optimal sequential behavior

4. Learning and Planning can be Radically Similar

5. Accretive Computation "Solves Dilemmas"

6. A General Approach need not be Flat, Low-level

All stem from trying to solve MDPs

Page 44: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Outline

• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration

• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups

• Lessons for Artificial Intelligence

• Concluding Remarks

Page 45: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Strands of History of RL

[Diagram: three historical strands leading to modern RL - trial-and-error learning, temporal-difference learning, and optimal control/value functions - eventually merging. Names on the diagram: Thorndike (1911), Minsky, Klopf, Barto et al., secondary reinforcement, Samuel, Witten, Sutton, Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins, Holland.]

Page 46: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Radical Generality of the RL/MDP problem

[Diagram: classical assumptions vs. the RL/MDP setting.
Classical: determinism, goals of achievement, complete knowledge, closed world, unlimited deliberation.
RL/MDPs: general stochastic dynamics, general goals, uncertainty, reactive decision-making.]

Page 47: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Getting the degree of abstraction right

• Actions can be low level (e.g. voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention)

• Situations: direct sensor readings, or symbolic descriptions, or "mental" situations (e.g., the situation of being lost)

• An RL agent is not like a whole animal or robot, which may consist of many RL agents as well as other components

• The environment is not necessarily unknown to the RL agent, only incompletely controllable

• Reward computation is in the RL agent's environment because the RL agent cannot change it arbitrarily

Page 48: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Some of the Open Problems

• Incomplete state information

• Exploration

• Structured states and actions

• Incorporating prior knowledge

• Using teachers

• Theory of RL with function approximators

• Modular and hierarchical architectures

• Integration with other problem-solving and planning methods

Page 49: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Dimensions of RL Algorithms

• TD --- Monte Carlo (λ)

• Model --- Model free

• Kind of Function Approximator

• Exploration method

• Discounted --- Undiscounted

• Computed policy --- Stored policy

• On-policy --- Off-policy

• Action values --- State values

• Sample model --- Distribution model

• Discrete actions --- Continuous actions

Page 50: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

A Unified View

• There is a space of methods, with tradeoffs and mixtures - rather than discrete choices

• Value functions are key
  - all the methods are about learning, computing, or estimating values
  - all the methods store a value function

• All methods are based on backups
  - different methods just have backups of different shapes and sizes, sources and mixtures

Page 51: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Psychology and RL

PAVLOVIAN CONDITIONING
The single most common and important kind of learning
Ubiquitous - from people to paramecia
Learning to predict and anticipate, learning affect

TD methods are a compelling match to the data
Most modern, real-time methods are based on Temporal-Difference Learning

INSTRUMENTAL LEARNING (OPERANT CONDITIONING)
"Actions followed by good outcomes are repeated"
Goal-directed learning, the Law of Effect
The biological version of Reinforcement Learning

But careful comparisons to data have not yet been made

(Hawkins & Kandel; Hopfield & Tank; Klopf; Moore et al.; Byrne et al.; Sutton & Barto; Malaka)

Page 52: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Neuroscience and RL

RL interpretations of brain structures are being explored in several areas:

Basal Ganglia; Dopamine Neurons; Value Systems

Page 53: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Final Summary

• RL is about a problem:
  - goal-directed, interactive, real-time decision making
  - all hail the problem!

• Value functions seem to be the key to solving it, and TD methods the key new algorithmic idea

• There are a host of immediate applications
  - large-scale stochastic optimal control and scheduling

• There are also a host of scientific problems

• An exciting time

Page 54: Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research sutton

Bibliography

Sutton, R.S. and Barto, A.G., Reinforcement Learning: An Introduction. MIT Press, 1998.

Kaelbling, L.P., Littman, M.L., and Moore, A.W., "Reinforcement learning: A survey". Journal of Artificial Intelligence Research, 4, 1996.

Special issues on Reinforcement Learning: Machine Learning, 8(3/4), 1992, and 22(1/2/3), 1996.

Neuroscience: Schultz, W., Dayan, P., and Montague, P.R., "A neural substrate of prediction and reward". Science, 275:1593-1598, 1997.

Animal Learning: Sutton, R.S. and Barto, A.G., "Time-derivative models of Pavlovian reinforcement". In Gabriel, M. and Moore, J., Eds., Learning and Computational Neuroscience, pp. 497-537. MIT Press, 1990.

See also web pages, e.g., starting from http://www.cs.umass.edu/~rich.