
An Introduction to Reinforcement Learning (Part 1)


Page 1: An Introduction to Reinforcement Learning  (Part 1)

An Introduction to Reinforcement Learning (Part 1)

Jeremy Wyatt

Intelligent Robotics Lab

School of Computer Science

University of Birmingham

[email protected]

www.cs.bham.ac.uk/~jlw

www.cs.bham.ac.uk/research/robotics

Page 2: An Introduction to Reinforcement Learning  (Part 1)

2

What is Reinforcement Learning (RL)?

• Learning from punishments and rewards
• Agent moves through the world, observing states and rewards
• Adapts its behaviour to maximise some function of reward

[Figure: an example trajectory s1, a1, r1, s2, a2, … s9, a9, r9, with rewards +3, −1, −1 and +50 collected along the way.]

Page 3: An Introduction to Reinforcement Learning  (Part 1)

3

Return: a long-term measure of performance

• Let's assume our agent acts according to some rules, called a policy, π
• The return $R_t$ is a measure of the long-term reward collected after time t

[Figure: the same trajectory, with rewards +3, −1, −1 and +50.]

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1$$

e.g. $R_0 = \gamma^{3}(+3) + \gamma^{4}(-1) + \cdots + \gamma^{8}(+50)$
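For a finite episode the sum truncates, so the return can be computed by working backwards through the rewards; a minimal Python sketch (the reward list and γ = 0.9 below are illustrative values, not taken from the slide):

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite episode.
def discounted_return(rewards, gamma):
    """rewards[k] holds r_{t+k+1}; returns R_t."""
    R = 0.0
    for r in reversed(rewards):   # work backwards: R_k = r_{k+1} + gamma * R_{k+1}
        R = r + gamma * R
    return R

# Example: a reward sequence like the figure's (+3, -1, -1, ..., +50).
print(discounted_return([0, 0, 0, 3, -1, -1, 0, 0, 50], gamma=0.9))
```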

Page 4: An Introduction to Reinforcement Learning  (Part 1)

4

Value = Utility = Expected Return

• Rt is a random variable

• So it has an expected value in a state under a given policy

• The RL problem is to find an optimal policy that maximises the expected value in every state

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$V^{\pi}(s) = E\{R_t \mid s_t = s, \pi\} = E\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, \pi\Big\}$$

where the policy is $\pi(s,a) = \Pr(a_t = a \mid s_t = s)$ for $s \in S$, $a \in A$.

Page 5: An Introduction to Reinforcement Learning  (Part 1)

5

Markov Decision Processes (MDPs)

• The transitions between states are uncertain
• The probabilities depend only on the current state
• Transition matrix P, and reward function R

[Figure: a two-state MDP with states 1 and 2, actions a1 and a2, reward $r = 0$ in state 1 and $r = 2$ in state 2, and transition probabilities $p^1_{11}, p^1_{12}, p^2_{11}, p^2_{12}$.]

e.g. $p^3_{12} = \Pr(s_{t+1} = 2 \mid s_t = 1, a_t = 3)$

$$P = \begin{pmatrix} p^{1}_{11} & p^{1}_{12} \\ p^{2}_{11} & p^{2}_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}$$
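A tabular MDP of this kind is conveniently stored as one transition matrix per action plus a reward vector; a minimal sketch in which the probabilities are hypothetical placeholders (only R = (0, 2) comes from the slide):

```python
import numpy as np

# P[a, s, s'] = Pr(s_{t+1}=s' | s_t=s, a_t=a); each row must sum to 1.
P = np.array([
    [[0.8, 0.2],    # action a1: from state 1, from state 2
     [0.1, 0.9]],
    [[0.5, 0.5],    # action a2: from state 1, from state 2
     [0.3, 0.7]],
])
# R[s] = reward in state s, as on the slide: 0 in state 1, 2 in state 2.
R = np.array([0.0, 2.0])

assert np.allclose(P.sum(axis=2), 1.0)   # sanity check: valid distributions
```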

Page 6: An Introduction to Reinforcement Learning  (Part 1)

6

Summary of the story so far…
• Some key elements of RL problems:

– A class of sequential decision making problems

– We want an optimal policy π*

– Performance metric: Short term → Long term

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

$$V^{\pi}(s) = E\{R_t \mid s_t = s, \pi\} = E\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, \pi\Big\}$$

[Figure: the agent–environment interaction as a sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, \ldots$]

Page 7: An Introduction to Reinforcement Learning  (Part 1)

7

Summary of the story so far…
• Some common elements of RL solutions

– Exploit conditional independence

– Randomised interaction

[Figure: the interaction sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots$ again.]

Page 8: An Introduction to Reinforcement Learning  (Part 1)

8

Bellman equations

• Conditional independence allows us to define expected return in terms of a recurrence relation:

$$V^{\pi}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'}\left\{R^{\pi(s)}_{ss'} + \gamma V^{\pi}(s')\right\}$$

where $p^{\pi(s)}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = \pi(s))$

and $R^{\pi(s)}_{ss'} = E\{r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = \pi(s)\}$

Similarly, for state–action values:

$$Q^{*}(s,a) = \sum_{s' \in S} p^{a}_{ss'}\left\{R^{a}_{ss'} + \gamma V^{*}(s')\right\}$$

where $V^{*}(s) = \max_{a \in A} Q^{*}(s,a)$

[Figure: a one-step backup diagram from state $s$ under $\pi(s)$ to its successor states.]
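The Bellman equation for $V^{\pi}$ can be solved by repeatedly applying the backup on its right-hand side until the values stop changing (iterative policy evaluation, the dynamic-programming case of the next slide); a sketch assuming P and R are indexed as $p^a_{ss'}$ and $R^a_{ss'}$ and the policy is deterministic:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iteratively solve V(s) = sum_s' p[pi(s),s,s'] * (R[pi(s),s,s'] + gamma*V(s')).
    P[a,s,s'] are transition probabilities, R[a,s,s'] expected one-step rewards,
    policy[s] is a deterministic action index."""
    n = P.shape[1]
    V = np.zeros(n)                      # arbitrary initial values
    while True:
        V_new = np.array([
            sum(P[policy[s], s, s2] * (R[policy[s], s, s2] + gamma * V[s2])
                for s2 in range(n))
            for s in range(n)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new                 # fixed point of the Bellman backup
        V = V_new
```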

Page 9: An Introduction to Reinforcement Learning  (Part 1)

9

Two types of bootstrapping

• We can bootstrap using explicit knowledge of P and R

(Dynamic Programming)

• Or we can bootstrap using samples from P and R

(Temporal Difference learning)

Dynamic programming backup:

$$\hat{V}_{n+1}(s) \leftarrow \sum_{s' \in S} p^{\pi(s)}_{ss'}\left\{R^{\pi(s)}_{ss'} + \gamma \hat{V}_{n}(s')\right\}$$

Temporal difference backup:

$$\hat{V}_{t+1}(s_t) \leftarrow \hat{V}_t(s_t) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right)$$

[Figure: a full-width backup over all successors of $s$ under $\pi(s)$, versus a single sampled transition $s_t, a_t = \pi(s_t), r_{t+1}, s_{t+1}$.]

Page 10: An Introduction to Reinforcement Learning  (Part 1)

10

TD(0) learning

t := 0
π is the policy to be evaluated
Initialise $\hat{V}(s)$ arbitrarily for all $s \in S$
Repeat
    select an action $a_t$ from $\pi(s_t)$
    observe the transition $(s_t, a_t, r_{t+1}, s_{t+1})$
    update $\hat{V}(s_t)$ according to
    $$\hat{V}_{t+1}(s_t) \leftarrow \hat{V}_t(s_t) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right)$$
    t := t + 1
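As a concrete illustration of the pseudocode above, a minimal tabular TD(0) sketch in Python; the `env` object with `reset()` and `step(a)` methods and the callable `policy` are hypothetical stand-ins, not part of the slides:

```python
from collections import defaultdict

def td0(env, policy, gamma, alpha, n_steps):
    """Tabular TD(0) policy evaluation, assuming a hypothetical `env` with
    reset() -> s and step(a) -> (s_next, reward, done)."""
    V = defaultdict(float)                  # V-hat, initialised arbitrarily (here: 0)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                       # select a_t from pi(s_t)
        s_next, r, done = env.step(a)       # observe the transition
        # V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma*V(s_{t+1}) - V(s_t))
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = env.reset() if done else s_next
    return V
```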

Page 11: An Introduction to Reinforcement Learning  (Part 1)

11

On and Off policy learning

• On policy: evaluate the policy you are following, e.g. TD learning

• Off-policy: evaluate one policy while following another policy

• E.g. One step Q-learning

TD learning:

$$\hat{V}_{t+1}(s_t) \leftarrow \hat{V}_t(s_t) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right)$$

One-step Q-learning:

$$\hat{Q}_{t+1}(s_t,a_t) \leftarrow \hat{Q}_t(s_t,a_t) + \alpha_t\left(r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1},b) - \hat{Q}_t(s_t,a_t)\right)$$

[Figure: the sampled transition $s_t, a_t, r_{t+1}, s_{t+1}$ used by both updates.]
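The one-step Q-learning backup above translates directly into code; a minimal sketch, where the dictionary-of-pairs representation of $\hat{Q}$ and the helper name `q_learning_update` are my own illustrative choices:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One-step Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Usage: Q maps (state, action) pairs to estimates, defaulting to 0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=2.0, s_next=1, actions=[0, 1], alpha=0.1, gamma=0.9)
```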

Page 12: An Introduction to Reinforcement Learning  (Part 1)

12

Off policy learning of control

• Q-learning is powerful because
  – it allows us to evaluate the optimal policy π*
  – while taking non-greedy actions (explore)

• ε-greedy is a simple and popular exploration rule:
  – take a greedy action with probability ε
  – take a random action with probability 1−ε

• Q-learning is guaranteed to converge for MDPs (with the right exploration policy)

• Is there a way of finding π* with an on-policy learner?
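A sketch of the ε-greedy rule as stated on this slide (greedy with probability ε, random otherwise); the function name and the dictionary representation of $\hat{Q}$ are illustrative assumptions:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """Pick an action in state s following the slide's convention:
    greedy w.r.t. Q with probability eps, uniformly random with
    probability 1 - eps (so eps near 1 means mostly exploitation)."""
    if random.random() < eps:
        return max(actions, key=lambda a: Q[(s, a)])   # greedy action
    return random.choice(actions)                      # random (exploratory) action
```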

Page 13: An Introduction to Reinforcement Learning  (Part 1)

13

On policy learning of control: Sarsa

• Discard the max operator

• Learn about the policy you are following

• Change the policy gradually

• Guaranteed to converge for Greedy in the Limit with Infinite Exploration (GLIE) policies

$$\hat{Q}_{t+1}(s_t,a_t) \leftarrow \hat{Q}_t(s_t,a_t) + \alpha_t\left(r_{t+1} + \gamma \hat{Q}_t(s_{t+1},a_{t+1}) - \hat{Q}_t(s_t,a_t)\right)$$

[Figure: the transition $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$ used by the Sarsa update.]

Page 14: An Introduction to Reinforcement Learning  (Part 1)

14

On policy learning of control: Sarsa

t := 0
Initialise $\hat{Q}(s,a)$ arbitrarily for all $s \in S$, $a \in A$
select an action $a_t$ from explore($\hat{Q}(s_t, \cdot)$)
Repeat
    observe the transition $(s_t, a_t, r_{t+1}, s_{t+1})$
    select an action $a_{t+1}$ from explore($\hat{Q}(s_{t+1}, \cdot)$)
    update $\hat{Q}(s_t, a_t)$ according to
    $$\hat{Q}_{t+1}(s_t,a_t) \leftarrow \hat{Q}_t(s_t,a_t) + \alpha_t\left(r_{t+1} + \gamma \hat{Q}_t(s_{t+1},a_{t+1}) - \hat{Q}_t(s_t,a_t)\right)$$
    t := t + 1
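The same algorithm as a minimal Python sketch; the `env` interface and the `explore(Q, s)` rule (e.g. an ε-greedy rule with the action set bound in) are hypothetical stand-ins:

```python
from collections import defaultdict

def sarsa(env, explore, gamma, alpha, n_steps):
    """Tabular Sarsa, assuming a hypothetical `env` with reset() -> s and
    step(a) -> (s_next, reward, done), and an explore(Q, s) action rule."""
    Q = defaultdict(float)                    # Q-hat, initialised arbitrarily (here: 0)
    s = env.reset()
    a = explore(Q, s)                         # select a_t from explore(Q(s_t, .))
    for _ in range(n_steps):
        s_next, r, done = env.step(a)         # observe the transition
        a_next = explore(Q, s_next)           # select a_{t+1}
        # Q(s_t,a_t) <- Q(s_t,a_t) + alpha*(r_{t+1} + gamma*Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        if done:
            s = env.reset()
            a = explore(Q, s)
        else:
            s, a = s_next, a_next
    return Q
```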

Page 15: An Introduction to Reinforcement Learning  (Part 1)

15

Summary: TD, Q-learning, Sarsa

• TD learning

$$\hat{V}_{t+1}(s_t) \leftarrow \hat{V}_t(s_t) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right)$$

• One-step Q-learning

$$\hat{Q}_{t+1}(s_t,a_t) \leftarrow \hat{Q}_t(s_t,a_t) + \alpha_t\left(r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1},b) - \hat{Q}_t(s_t,a_t)\right)$$

• Sarsa learning

$$\hat{Q}_{t+1}(s_t,a_t) \leftarrow \hat{Q}_t(s_t,a_t) + \alpha_t\left(r_{t+1} + \gamma \hat{Q}_t(s_{t+1},a_{t+1}) - \hat{Q}_t(s_t,a_t)\right)$$

Page 16: An Introduction to Reinforcement Learning  (Part 1)

16

Speeding up learning: Eligibility traces, TD(λ)

• TD learning only passes the TD error to one state:

$$\hat{V}_{t+1}(s_t) \leftarrow \hat{V}_t(s_t) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right)$$

[Figure: a trajectory $\ldots, s_{t-2}, a_{t-2}, r_{t-1}, s_{t-1}, a_{t-1}, r_t, s_t, a_t, r_{t+1}, s_{t+1}$.]

• We add an eligibility for each state:

$$e_t(s) = \begin{cases} 1 & \text{if } s = s_t \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases} \qquad 0 \le \lambda \le 1$$

• Update in every state $s \in S$ in proportion to its eligibility:

$$\hat{V}_{t+1}(s) \leftarrow \hat{V}_t(s) + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right) e_t(s)$$
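One TD(λ) step with replacing traces, following the definitions above; a sketch assuming $\hat{V}$ and $e$ are dictionaries that default to zero:

```python
from collections import defaultdict

def td_lambda_update(V, e, s, r, s_next, alpha, gamma, lam):
    """One TD(lambda) step with replacing traces: decay every trace by
    gamma*lambda, set e(s_t) = 1, then move every state's estimate in
    proportion to its eligibility times the single TD error."""
    delta = r + gamma * V[s_next] - V[s]      # the TD error at time t
    for state in list(e):
        e[state] *= gamma * lam               # e(s) <- gamma*lambda*e(s) otherwise
    e[s] = 1.0                                # e(s_t) <- 1 (replacing trace)
    for state, elig in e.items():
        V[state] += alpha * delta * elig
    return V, e

V, e = defaultdict(float), defaultdict(float)
```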

Page 17: An Introduction to Reinforcement Learning  (Part 1)

17

Eligibility traces for learning control: Q(λ)

• There are various eligibility trace methods for Q-learning
• Update for every $(s,a)$ pair:

$$\hat{Q}_{t+1}(s,a) \leftarrow \hat{Q}_t(s,a) + \alpha_t\left(r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1},b) - \hat{Q}_t(s_t,a_t)\right) e_t(s,a)$$

• Pass information backwards through a non-greedy action
• Lose the convergence guarantee of one-step Q-learning
• Watkins's solution: zero all eligibilities after a non-greedy action
• Problem: you lose most of the benefit of eligibility traces

[Figure: a trajectory $s_{t-1}, a_{t-1}, r_t, s_t, a_t, r_{t+1}, s_{t+1}$ where the action taken at $s_{t+1}$ differs from the greedy action $a_{\text{greedy}}$.]

Page 18: An Introduction to Reinforcement Learning  (Part 1)

18

Eligibility traces for learning control: Sarsa(λ)

• Solution: use Sarsa, since it's on-policy

• Update for every s,a pair

• Keeps convergence guarantees

$$\hat{Q}_{t+1}(s,a) \leftarrow \hat{Q}_t(s,a) + \alpha_t\left(r_{t+1} + \gamma \hat{Q}_t(s_{t+1},a_{t+1}) - \hat{Q}_t(s_t,a_t)\right) e_t(s,a)$$

[Figure: the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}$.]
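The corresponding Sarsa(λ) backup, again as a sketch with dictionary-backed $\hat{Q}$ and $e$ and replacing traces (the slide does not fix the trace type):

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One Sarsa(lambda) backup: update every (s,a) pair in proportion to
    its eligibility; replacing traces, matching the TD(lambda) slide."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    for key in list(e):
        e[key] *= gamma * lam                 # decay all traces
    e[(s, a)] = 1.0                           # trace for the pair just visited
    for key, elig in e.items():
        Q[key] += alpha * delta * elig
    return Q, e

Q, e = defaultdict(float), defaultdict(float)
```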

Page 19: An Introduction to Reinforcement Learning  (Part 1)

19

Approximate Reinforcement Learning

• Why?

– To learn in reasonable time and space

(avoid Bellman’s curse of dimensionality)

– To generalise to new situations

• Solutions

– Approximate the value function

– Search in the policy space

– Approximate a model (and plan)

Page 20: An Introduction to Reinforcement Learning  (Part 1)

20

Linear Value Function Approximation

• Simplest useful class of problems

• Some convergence results

• We'll focus on linear TD(λ)

Weight vector at time t: $\theta_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^{\mathrm T}$

Feature vector for state s: $\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^{\mathrm T}$

Our value estimate: $\hat{V}_t(s) = \theta_t^{\mathrm T} \phi_s$

Our objective is to minimise:

$$\mathrm{MSE}(\theta_t) = \sum_{s \in S} P(s)\left[V^{\pi}(s) - \hat{V}_t(s)\right]^2$$

Page 21: An Introduction to Reinforcement Learning  (Part 1)

21

Value Function Approximation: features

• There are numerous schemes, CMACs and RBFs are popular

• CMAC: n tiles in the space

(aggregate over all tilings)

• Features: $\phi_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^{\mathrm T}$, with each $\phi_s(i) = 1$ or $0$

• Properties
  – Coarse coding
  – Regular tiling → efficient access
  – Use random hashing to reduce memory
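A minimal one-dimensional tile-coding sketch in the CMAC spirit: several offset tilings, each contributing exactly one active binary feature; the offset scheme and parameter names are illustrative assumptions:

```python
import numpy as np

def tiling_features(x, low, high, n_tiles, n_tilings):
    """Binary CMAC-style features for a scalar state x in [low, high]:
    n_tilings overlapping tilings of n_tiles each, every tiling shifted
    by a fraction of a tile width; phi_s(i) is 1 or 0."""
    phi = np.zeros(n_tilings * n_tiles)
    width = (high - low) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings          # shift each tiling slightly
        idx = int((x - low + offset) / width)
        idx = min(max(idx, 0), n_tiles - 1)     # clip to the valid tile range
        phi[t * n_tiles + idx] = 1.0            # one active tile per tiling
    return phi
```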

Page 22: An Introduction to Reinforcement Learning  (Part 1)

22

Linear Value Function Approximation

• We perform gradient descent using the gradient $\nabla_{\theta_t}\hat{V}_t(s) = \phi_s$

• The update equation for TD becomes

$$\theta_{t+1} = \theta_t + \alpha_t\left(r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)\right) e_t$$

• where the eligibility trace is an n-dimensional vector updated using $e_t = \gamma\lambda\, e_{t-1} + \phi_{s_t}$

• If the states are presented with the frequency they would be seen under the policy you are evaluating, TD converges close to $\theta^{*}$
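Putting the pieces together, one linear TD(λ) update with the vector eligibility trace above; a sketch assuming NumPy arrays for θ, e and the feature vectors:

```python
import numpy as np

def linear_td_lambda_update(theta, e, phi_s, r, phi_s_next, alpha, gamma, lam):
    """Gradient-descent TD(lambda) with a linear value function V(s) = theta . phi_s.
    The eligibility trace e is an n-dimensional vector: e <- gamma*lambda*e + phi_s."""
    e = gamma * lam * e + phi_s
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s   # TD error
    theta = theta + alpha * delta * e
    return theta, e
```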

Page 23: An Introduction to Reinforcement Learning  (Part 1)

23

Value Function Approximation (VFA): convergence results

• Linear TD converges if we visit states using the on-policy distribution

• Off-policy linear TD(λ) and linear Q-learning are known to diverge in some cases

• Q-learning, and value iteration used with some averagers (including k-Nearest Neighbour and decision trees), have almost-sure convergence if particular exploration policies are used

• A special case of policy iteration with Sarsa style updates and linear function approximation converges

• Residual algorithms are guaranteed to converge but only very slowly

Page 24: An Introduction to Reinforcement Learning  (Part 1)

24

Value Function Approximation (VFA): TD-Gammon

• TD(λ) learning and a backprop net with one hidden layer

• 1,500,000 training games (self play)

• Equivalent in skill to the top dozen human players

• Backgammon has ~$10^{20}$ states, so it can't be solved using DP

Page 25: An Introduction to Reinforcement Learning  (Part 1)

25

Model-based RL: structured models

• Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP)

• V is represented as a tree

• Backups look like goal regression operators

• Converging with the AI planning community

Page 26: An Introduction to Reinforcement Learning  (Part 1)

26

Reinforcement Learning with Hidden State

• Learning in a POMDP, or k-Markov environment

• Planning in POMDPs is intractable

• Factored POMDPs look promising

• Policy search can work well

[Figure: a POMDP trajectory with hidden states $s_t, s_{t+1}, s_{t+2}$, observations $o_t, o_{t+1}, o_{t+2}$, actions $a_t, a_{t+1}, a_{t+2}$ and rewards $r_{t+1}, r_{t+2}$.]

Page 27: An Introduction to Reinforcement Learning  (Part 1)

27

Policy Search

• Why not search directly for a policy?
• Policy gradient methods and evolutionary methods
• Particularly good for problems with hidden state

Page 28: An Introduction to Reinforcement Learning  (Part 1)

28

Other RL applications

• Elevator Control (Barto & Crites)

• Space shuttle job scheduling (Zhang & Dietterich)

• Dynamic channel allocation in cellphone networks (Singh & Bertsekas)

Page 29: An Introduction to Reinforcement Learning  (Part 1)

29

Hot Topics in Reinforcement Learning

• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical reinforcement learning
• Learning in processes with hidden state (e.g. POMDPs)
• Policy search methods

Page 30: An Introduction to Reinforcement Learning  (Part 1)

30

Reinforcement Learning: key papers

Overviews

R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

J. Wyatt, Reinforcement Learning: A Brief Overview. Perspectives on Adaptivity and Learning. Springer Verlag, 2003.

L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Value Function Approximation

D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Eligibility Traces

S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.

Page 31: An Introduction to Reinforcement Learning  (Part 1)

31

Reinforcement Learning: key papers

Structured Models and Planning

C. Boutilier, T. Dean and S. Hanks. Decision Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.

R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.

B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.

K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.

Page 32: An Introduction to Reinforcement Learning  (Part 1)

32

Reinforcement Learning: key papers

Policy Search

R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, 2000.

Hierarchical Reinforcement Learning

R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.

R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD Thesis, University of California, Berkeley, 1998.

A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13:41-77, 2003.

Page 33: An Introduction to Reinforcement Learning  (Part 1)

33

Reinforcement Learning: key papers

Exploration

N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.

J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of 18th International Conference on Machine Learning, 2001.

POMDPs

L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.