Introduction to Reinforcement Learning E0397 Lecture Slides by Richard Sutton (With small changes)



Reinforcement Learning: Basic Idea
Learn to take correct actions over time through experience, similar to how humans learn: "trial and error".
Try an action, "see" what happens, and "remember" what happens.
Use this to change the choice of action the next time you are in the same situation.
"Converge" to learning the correct actions.
Focus on the long-term return, not the immediate benefit: an action may not seem beneficial in the short term, but its long-term return may be good.

RL in Cloud Computing
Why are we talking about RL in this class?
Which problems do you think can be solved by a learning approach?
Is a focus on long-term benefit appropriate?

RL Agent
The agent's objective is to affect the environment; the environment is stochastic and uncertain.
[Diagram: the agent sends an action to the environment; the environment returns the new state and a reward to the agent.]

What is Reinforcement Learning?
An approach to Artificial Intelligence.
Learning from interaction; goal-oriented learning.
Learning about, from, and while interacting with an external environment.
Learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal.

Key Features of RL
The learner is not told which actions to take.
Trial-and-error search.
Possibility of delayed reward: sacrifice short-term gains for greater long-term gains.
The need to explore and to exploit.
Considers the whole problem of a goal-directed agent interacting with an uncertain environment.

Examples of Reinforcement Learning
Robocup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000.
Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods.
Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls.
Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller.
Many robots: navigation, bipedal walking, grasping, switching between skills...
TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player.

Supervised Learning
[Diagram: inputs go into the supervised learning system, which produces outputs; training info = desired (target) outputs; error = (target output - actual output).]
• Best examples: classification/identification systems, e.g. fault classification, WiFi location determination, workload identification.
• E.g. classification systems: give the agent lots of data that has already been classified; the agent can "train" itself using this knowledge.
• The training phase is non-trivial.
• (Normally) no learning happens in the on-line phase.

Reinforcement Learning
[Diagram: inputs go into the RL system, which produces outputs ("actions"); training info = evaluations ("rewards" / "penalties").]
Objective: get as much reward as possible.
• Absolutely no pre-existing training data.
• The agent learns ONLY by experience.

Elements of RL
State: the state of the system.
Actions: suggested by the RL agent, taken by actuators in the system; actions change the state (deterministically or stochastically).
Reward: the immediate gain when taking action a in state s.
Value: the long-term benefit of taking action a in state s, i.e. the rewards gained over a long time horizon.
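To make these elements concrete, here is a minimal sketch of the agent-environment interaction loop in Python. The environment interface (reset/step) and the RandomAgent are illustrative assumptions, not part of the slides; a real RL agent would update value estimates in learn().

import random

class RandomAgent:
    """Placeholder agent that picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

    def learn(self, state, action, reward, next_state):
        pass  # a learning agent would update its value estimates here

def run_episode(env, agent, max_steps=100):
    """Basic interaction loop: observe state, take action, receive reward and next state."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward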

The n-Armed Bandit Problem
Choose repeatedly from one of $n$ actions; each choice is called a play.
After each play $a_t$, you get a reward $r_t$, where $E\{r_t\} = Q^*(a_t)$.
These are the unknown action values; the distribution of $r_t$ depends only on $a_t$.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.
To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.

The Exploration/Exploitation Dilemma
Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$.
The greedy action at time $t$ is $a_t^* = \arg\max_a Q_t(a)$.
Choosing $a_t = a_t^*$ is exploitation; choosing $a_t \neq a_t^*$ is exploration.
You can't exploit all the time; you can't explore all the time.
You can never stop exploring, but you should always reduce exploring. Maybe.

Action-Value Methods
Methods that adapt action-value estimates and nothing else. For example, suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$; then the "sample average" estimate is
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a},$$
and $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$.

ε-Greedy Action Selection
Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$.
ε-greedy action selection: $a_t = a_t^*$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$.
... the simplest way to balance exploration and exploitation.
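A minimal sketch of ε-greedy action selection with sample-average estimates, assuming the bandit is exposed as a function pull(action) that returns a reward; these names are placeholders, not from the slides.

import random

def epsilon_greedy(Q, epsilon):
    """Greedy action with probability 1 - epsilon, a random action otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def run_bandit(pull, n_actions, n_plays=1000, epsilon=0.1):
    Q = [0.0] * n_actions   # sample-average action-value estimates
    k = [0] * n_actions     # number of times each action has been chosen
    for _ in range(n_plays):
        a = epsilon_greedy(Q, epsilon)
        r = pull(a)                    # reward from the (unknown) bandit
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]      # incremental sample average (see below)
    return Q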

10-Armed Testbed
$n = 10$ possible actions.
Each $Q^*(a)$ is chosen randomly from a normal distribution; each reward $r_t$ is also normal.
1000 plays per run; repeat the whole thing 2000 times and average the results.

ε-Greedy Methods on the 10-Armed Testbed

Incremental Implementation
Recall the sample-average estimation method. The average of the first $k$ rewards is (dropping the dependence on $a$):
$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}.$$
Can we do this incrementally (without storing all the rewards)?
We could keep a running sum and count, or, equivalently:
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right].$$
This is a common form for update rules:
NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
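As a small illustration of the NewEstimate = OldEstimate + StepSize [Target - OldEstimate] form, here is a sketch of the incremental sample-average computation; the function name and the example rewards are made up for illustration.

def incremental_average(rewards):
    """Compute the sample average Q_k incrementally, without storing the rewards."""
    Q = 0.0
    for k, r in enumerate(rewards, start=1):
        Q += (r - Q) / k        # Q_k = Q_{k-1} + (1/k) [r_k - Q_{k-1}]
    return Q

# e.g. incremental_average([1.0, 0.0, 2.0]) returns 1.0, the ordinary average.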

Tracking a Nonstationary Problem
Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem.
Better in the nonstationary case is
$$Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right] \quad \text{for constant } \alpha,\ 0 < \alpha \le 1,$$
which gives
$$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i,$$
an exponential, recency-weighted average.
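A sketch contrasting the sample-average update with the constant-α update on a nonstationary reward stream; the drifting reward sequence is a made-up illustration, not from the slides.

import random

def track(rewards, alpha=None):
    """Estimate a value from a reward stream.
    alpha=None uses the sample-average (1/k) step size; otherwise a constant step size."""
    Q, k = 0.0, 0
    for r in rewards:
        k += 1
        step = alpha if alpha is not None else 1.0 / k
        Q += step * (r - Q)
    return Q

# Nonstationary example: the true value drifts upward over time.
rewards = [t / 100.0 + random.gauss(0, 0.1) for t in range(1000)]
print("sample average:", track(rewards))        # lags near the overall mean (about 5)
print("constant alpha:", track(rewards, 0.1))   # tracks recent values (about 10)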

Formal Description of RL

The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
At step $t$ the agent observes the state $s_t \in S$, produces an action $a_t \in A(s_t)$, and gets the resulting reward $r_{t+1}$ and the resulting next state $s_{t+1}$, giving the trajectory
$$\ldots,\ s_t,\ a_t,\ r_{t+1},\ s_{t+1},\ a_{t+1},\ r_{t+2},\ s_{t+2},\ a_{t+2},\ r_{t+3},\ s_{t+3},\ \ldots$$

The Agent Learns a Policy
The policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s, a)$ is the probability that $a_t = a$ when $s_t = s$.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.

Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time.
Actions can be low-level (e.g., voltages to motors), high-level (e.g., accept a job offer), "mental" (e.g., a shift in the focus of attention), etc.
States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
An RL agent is not like a whole animal or robot.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
The environment is not necessarily unknown to the agent, only incompletely controllable.

Goals and Rewards
Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, and thus outside the agent.
The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize?
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. Then
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
$\gamma$ close to 0: shortsighted; $\gamma$ close to 1: farsighted.

An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task where the episode ends upon failure:
reward = +1 for each step before failure; return = number of steps before failure.
As a continuing task with discounted return:
reward = -1 upon failure, 0 otherwise; return = $-\gamma^k$ for $k$ steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.

A Unified Notation
In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step $t$ of episode $j$.
Think of each episode as ending in an absorbing state that always produces a reward of zero.
We can then cover all cases by writing
$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property
By "the state" at step $t$ we mean whatever information is available to the agent at step $t$ about its environment.
The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
$$\Pr\{s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t\}$$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
the state and action sets;
the one-step "dynamics" defined by the transition probabilities
$$P_{ss'}^{a} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s);$$
and the expected rewards
$$R_{ss'}^{a} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
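As a data-structure sketch (not from the slides), the dynamics $P_{ss'}^{a}$ and $R_{ss'}^{a}$ of a small finite MDP can be stored as dictionaries keyed by state and action; the toy two-state MDP below is invented purely for illustration and is reused by the later planning sketches.

# P[(s, a)] = list of (next_state, probability); R[(s, a, s_next)] = expected reward.
P = {
    ("s0", "go"):   [("s1", 0.9), ("s0", 0.1)],
    ("s0", "stay"): [("s0", 1.0)],
    ("s1", "go"):   [("s0", 1.0)],
    ("s1", "stay"): [("s1", 1.0)],
}
R = {
    ("s0", "go", "s1"): 1.0,   # reward for reaching s1 from s0 via "go"
}

def expected_reward(s, a):
    """E[r_{t+1} | s_t = s, a_t = a] = sum over s' of P^a_{ss'} * R^a_{ss'}."""
    return sum(p * R.get((s, a, s2), 0.0) for s2, p in P[(s, a)])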

Recycling Robot: An Example Finite MDP
At each step, the robot has to decide whether to (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high or low.
Reward = number of cans collected.

Recycling Robot MDP
$$S = \{\text{high}, \text{low}\}, \quad A(\text{high}) = \{\text{search}, \text{wait}\}, \quad A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$$
$R^{\text{search}}$ = expected no. of cans while searching; $R^{\text{wait}}$ = expected no. of cans while waiting; $R^{\text{search}} > R^{\text{wait}}$.

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

Bellman Equation for a Policy $\pi$
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$$
Or, without the expectation operator:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$

More on the Bellman Equation
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$
This is a set of equations (in fact, linear), one for each state. The value function for $\pi$ is its unique solution.
[Backup diagrams for $V^{\pi}$ and $Q^{\pi}$ omitted.]
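Because the equations are linear, small MDPs can be solved by repeatedly applying the Bellman equation as an update (iterative policy evaluation). A minimal sketch, assuming the dict-based P and R from the earlier MDP sketch and a policy pi[s] that maps each state to a dict of action probabilities:

def policy_evaluation(states, pi, P, R, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: sweep the states, applying the Bellman equation for pi."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                prob_a * sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                             for s2, p in P[(s, a)])
                for a, prob_a in pi[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V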

Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$.
There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all $\pi^*$.
Optimal policies share the same optimal state-value function:
$$V^*(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S.$$
Optimal policies also share the same optimal action-value function:
$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \text{for all } s \in S \text{ and } a \in A(s).$$
This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$$V^*(s) = \max_{a \in A(s)} Q^*(s, a) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^*(s')\right]$$
$V^*$ is the unique solution of this system of nonlinear equations.
[Backup diagram omitted.]

Bellman Optimality Equation for Q*
$$Q^*(s, a) = E\Big\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\Big|\, s_t = s, a_t = a\Big\} = \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma \max_{a'} Q^*(s', a')\right]$$
$Q^*$ is the unique solution of this system of nonlinear equations.
[Backup diagram omitted.]

Gridworld
Actions: north, south, east, west; deterministic.
An action that would take the agent off the grid leaves it in place but gives reward = -1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.
[Figure: state-value function for the equiprobable random policy, $\gamma = 0.9$.]

Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^*$ is an optimal policy.
Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
[Figure: e.g., back to the gridworld, showing $V^*$ and a corresponding optimal policy $\pi^*$.]

What About Optimal Action-Value Functions?
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$

Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman Optimality Equation requires: accurate knowledge of the environment dynamics; enough space and time to do the computation; the Markov Property.
How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
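One such dynamic programming method is value iteration, which turns the Bellman Optimality Equation into an update rule. A minimal sketch, assuming the dict-based P and R layout above and a function actions(s) returning the list A(s); these helper names are assumptions, not from the slides.

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality update until the value function stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in P[(s, a)])
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # A policy that is greedy with respect to V* is optimal.
    policy = {
        s: max(actions(s),
               key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                 for s2, p in P[(s, a)]))
        for s in states
    }
    return V, policy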

Solving MDPs
The goal in an MDP is to find the policy that maximizes the long-term reward (value).
If all the information is there, this can be done by explicitly solving the equations, or by iterative methods.

Policy Iteration
$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^* \rightarrow \pi^*$$
Alternate policy evaluation with policy improvement ("greedification"), as sketched below.
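A policy-iteration sketch built on the policy_evaluation helper above; greedification picks, in every state, the action that looks best under the current value function. The actions(s) helper and the deterministic-policy representation are assumptions, not a transcription of the slides.

def policy_iteration(states, actions, P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: actions(s)[0] for s in states}   # arbitrary initial deterministic policy
    while True:
        # Policy evaluation (deterministic policy written as action probabilities).
        pi = {s: {policy[s]: 1.0} for s in states}
        V = policy_evaluation(states, pi, P, R, gamma)
        # Policy improvement ("greedification").
        stable = True
        for s in states:
            best_a = max(actions(s),
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in P[(s, a)]))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V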

From MDPs to reinforcement learning
MDPs are great if we know the transition probabilities and the expected immediate rewards.
But in real-life tasks this is precisely what we do NOT know! This is what turns the task into a learning problem.
The simplest method to learn: Monte Carlo.

MDPs vs RL
These are the quantities that are given in an MDP but unknown in RL:
$$P_{ss'}^{a} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}, \qquad R_{ss'}^{a} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}.$$

Monte Carlo Methods
Monte Carlo methods learn from complete sample returns; they are only defined for episodic tasks.
Monte Carlo methods learn directly from experience.
On-line: no model necessary, and still attains optimality.
Simulated: no need for a full model.

Monte Carlo Policy Evaluation
Goal: learn $V^{\pi}(s)$.
Given: some number of episodes under $\pi$ which contain $s$.
Idea: average the returns observed after visits to $s$.
Every-visit MC: average the returns for every time $s$ is visited in an episode.
First-visit MC: average the returns only for the first time $s$ is visited in an episode.
Both converge asymptotically.
[Algorithm box: first-visit Monte Carlo policy evaluation.]
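A first-visit Monte Carlo policy-evaluation sketch. Episodes are assumed to be given as lists of (state, reward) pairs, where the reward is the one received on leaving that state; this encoding is an assumption for illustration, not from the slides.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:                # episode = [(s0, r1), (s1, r2), ...]
        G = 0.0
        first_visit_return = {}
        # Walk backwards, accumulating the discounted return from each step onward;
        # overwriting keeps the return from the EARLIEST (first) visit to each state.
        for s, r in reversed(episode):
            G = r + gamma * G
            first_visit_return[s] = G
        for s, G in first_visit_return.items():
            returns_sum[s] += G
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}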

Monte Carlo Estimation of Action Values (Q)
Monte Carlo is most useful when a model is not available: we want to learn $Q^*$.
$Q^{\pi}(s, a)$: the average return starting from state $s$ and action $a$, thereafter following $\pi$.
Also converges asymptotically if every state-action pair is visited.
Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

Monte Carlo Control
MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
Policy improvement step: greedify with respect to the value (or action-value) function.
Monte Carlo with exploring starts: the fixed point is the optimal policy $\pi^*$ (now proven, almost).
Problem with Monte Carlo: learning happens only after the end of an entire episode.

TD Prediction
Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^{\pi}$.
Recall the simple every-visit Monte Carlo method:
$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$$
(the target is the actual return after time $t$).
The simplest TD method, TD(0):
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
(the target is an estimate of the return).
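A TD(0) prediction sketch for a fixed policy. The env.reset/env.step interface and the policy callable follow the earlier sketches and are assumptions, not from the slides.

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """TD(0): after each step, move V(s_t) toward r_{t+1} + gamma * V(s_{t+1})."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])
            s = s2
    return dict(V)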

Simple Monte Carlo
$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right],$$
where $R_t$ is the actual return following state $s_t$.
[Backup diagram: the update backs up the complete return of an entire sampled episode starting at $s_t$.]

Simplest TD Method
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
[Backup diagram: the update backs up a single sampled step, from $s_t$ to $s_{t+1}$ with reward $r_{t+1}$.]

cf. Dynamic Programming
$$V(s_t) \leftarrow E_{\pi}\left\{r_{t+1} + \gamma V(s_{t+1})\right\}$$
[Backup diagram: the DP update backs up an expectation over all possible actions and next states, not a single sample.]

TD methods bootstrap and sample
Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
Sampling: the update does not involve an expected value. MC samples; DP does not sample; TD samples.

Example: Driving Home

State                Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office               0                    30                     30
reach car, raining           5                    35                     40
exit highway                20                    15                     35
behind truck                30                    10                     40
home street                 40                     3                     43
arrive home                 43                     0                     43

(The elapsed time between successive rows is 5, 15, 10, 10, and 3 minutes.)

Driving Home
[Figure: changes recommended by Monte Carlo methods ($\alpha = 1$) vs. changes recommended by TD methods ($\alpha = 1$).]

Advantages of TD Learning
TD methods do not require a model of the environment, only experience.
TD, but not MC, methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome, i.e. from incomplete sequences.
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example
[Figure: values learned by TD(0) after various numbers of episodes.]

TD and MC on the Random Walk
[Figure: learning curves, data averaged over 100 sequences of episodes.]

Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update the estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small $\alpha$.
Constant-$\alpha$ MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times.

You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V(A)? What is V(B)?

You are the Predictor
[Figure: the Markov process suggested by the data; what value does it imply for V(A)?]

You are the Predictor
The prediction that best matches the training data is V(A) = 0. This minimizes the mean-square error on the training set, and it is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75: every A is followed by B with reward 0, and B is followed by termination with reward 1 in 6 of the 8 episodes, so V(B) = 0.75 and hence V(A) = 0 + V(B) = 0.75.
This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts.
This is called the certainty-equivalence estimate. It is what TD(0) gets.

Learning An Action-Value Function
Estimate $Q^{\pi}$ for the current behavior policy $\pi$.
After every transition from a nonterminal state $s_t$, do:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right].$$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate, as in the sketch below.
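An on-policy Sarsa control sketch. The ε-greedy exploration and the env.reset/env.step interface follow the earlier sketches and are assumptions, not a transcription of the slides' algorithm box.

import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: update Q toward r + gamma * Q(s', a') for the action actually taken next."""
    Q = defaultdict(float)              # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]   # Q of a terminal state is 0
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q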

Windy Gridworld

Undiscounted, episodic, reward = -1 until the goal is reached.

Results of Sarsa on the Windy Gridworld

Q-Learning: Off-Policy TD Control
One-step Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$
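A Q-learning sketch in the same style as the Sarsa sketch; the only change is that the update target uses the maximum over next actions, regardless of which action the behavior policy actually takes next. Interface names are assumptions.

import random
from collections import defaultdict

def q_learning(env, actions, n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: update Q toward r + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily...
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # ...but learn about the greedy policy.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q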

Planning and Learning
Use of environment models; integration of planning and learning methods.

The Original Idea (Sutton, 1990)
[Figures omitted.]

Models
Model: anything the agent can use to predict how the environment will respond to its actions.
Distribution model: a description of all possibilities and their probabilities, e.g., $P_{ss'}^{a}$ and $R_{ss'}^{a}$ for all $s$, $s'$, and $a \in A(s)$.
Sample model: produces sample experiences, e.g., a simulation model.
Both types of models can be used to produce simulated experience.
Often sample models are much easier to come by.

Planning
Planning: any computational process that uses a model to create or improve a policy.
Planning in AI: state-space planning; plan-space planning (e.g., partial-order planners).
We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience.

Planning Cont.
Classical DP methods are state-space planning methods.
Heuristic search methods are state-space planning methods.
A planning method based on Q-learning: random-sample one-step tabular Q-planning.
[Algorithm box omitted.]

Learning, Planning, and Acting
Two uses of real experience:
model learning: to improve the model;
direct RL: to directly improve the value function and policy.
Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Direct vs. Indirect RL
Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions.
Direct methods are simpler and are not affected by bad models.
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

The Dyna Architecture (Sutton 1990)
[Figure omitted.]

The Dyna-Q Algorithm
[Algorithm box with its components labeled: model learning, planning, direct RL.]
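A tabular Dyna-Q sketch combining the three labeled components. This follows the standard Dyna-Q scheme rather than transcribing the slides' algorithm box; the environment interface and the deterministic learned model are assumptions.

import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=100, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                   # model[(s, a)] = (r, s', done)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            q_update(s, a, r, s2, done)          # direct RL
            model[(s, a)] = (r, s2, done)        # model learning
            for _ in range(n_planning):          # planning on simulated experience
                ps, pa = random.choice(list(model.keys()))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q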

Dyna-Q on a Simple Maze
Reward = 0 until the goal is reached, where reward = 1.

Value Prediction with FA
As usual, policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^{\pi}$.
In earlier chapters, value functions were stored in lookup tables.
Here, the value function estimate at time $t$, $V_t$, depends on a parameter vector $\theta_t$, and only the parameter vector is updated.
E.g., $\theta_t$ could be the vector of connection weights of a neural network.
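As a sketch of what "only the parameter vector is updated" can look like, here is TD(0)-style prediction with a linear function approximator; the feature function phi(s) and the environment interface are assumptions, not from the slides.

def td0_linear(env, policy, phi, n_features, n_episodes=1000, alpha=0.01, gamma=1.0):
    """TD(0) with linear function approximation: V_t(s) = theta . phi(s).
    Only theta is updated; there is no table of values."""
    theta = [0.0] * n_features

    def V(s):
        return sum(t * f for t, f in zip(theta, phi(s)))

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V(s2))
            delta = target - V(s)
            features = phi(s)
            for i in range(n_features):
                theta[i] += alpha * delta * features[i]   # gradient of V(s) w.r.t. theta is phi(s)
            s = s2
    return theta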

Adapt Supervised Learning Algorithms
[Diagram: inputs go into the supervised learning system, which produces outputs; training info = desired (target) outputs; error = (target output - actual output).]
Training example = {input, target output}
