
Page 1

Decision Making Under Uncertainty
Lec #8: Reinforcement Learning

UIUC CS 598: Section EA

Professor: Eyal Amir
Spring Semester 2006

Most slides by Jeremy Wyatt (U Birmingham)

Page 2

So Far

• MDPs
  – Known transition model
  – Known rewards

Page 3

Return: A long-term measure of performance

• Let’s assume our agent acts according to some rules, called a policy, π

• The return $R_t$ is a measure of the long-term reward collected after time t:

[Figure: a trajectory with rewards $r_1 = +3$, $r_4 = -1$, $r_5 = -1$, $r_9 = +50$]

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1$$

For the trajectory above: $R_0 = 3 - \gamma^3 - \gamma^4 + 50\gamma^8$
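
To make the worked example concrete, here is a minimal Python sketch of the discounted return; the reward list and the value of gamma are illustrative choices, not from the slides.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, where rewards[k] holds r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards r_1..r_9 from the figure: +3 immediately, -1 at steps 4 and 5, +50 at step 9.
rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]
print(discounted_return(rewards, gamma=0.9))  # equals 3 - 0.9**3 - 0.9**4 + 50 * 0.9**8
```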

Page 4

Bellman equations

• Conditional independence allows us to define expected return in terms of a recurrence relation:

$$V^{\pi}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left\{ R^{\pi(s)}_{ss'} + \gamma V^{\pi}(s') \right\}$$

where

$$p^{\pi(s)}_{ss'} = \Pr\big(s_{t+1} = s' \mid s_t = s,\ a_t = \pi(s)\big)$$

and

$$R^{\pi(s)}_{ss'} = \mathrm{E}\big[\, r_{t+1} \mid s_{t+1} = s',\ s_t = s,\ a_t = \pi(s) \,\big]$$

The optimal value functions satisfy

$$Q^{*}(s,a) = \sum_{s' \in S} p^{a}_{ss'} \left\{ R^{a}_{ss'} + \gamma V^{*}(s') \right\}, \qquad V^{*}(s) = \max_{a \in A} Q^{*}(s,a)$$

[Figure: state s under action π(s), with successor states 3, 4, 5]
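
As a concrete companion to these equations, here is a minimal NumPy sketch of iterative policy evaluation; the array layout (P[s, a, s'] for transition probabilities, R[s, a, s'] for expected rewards), the function name, and the tolerance are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Repeat the Bellman backup
    V(s) <- sum_s' P[s, pi(s), s'] * (R[s, pi(s), s'] + gamma * V(s'))
    until the values stop changing."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            np.dot(P[s, policy[s]], R[s, policy[s]] + gamma * V)
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```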

Page 5

Today: Reinforcement Learning (RL)

• Learning from punishments and rewards
• Agent moves through world, observing states and rewards
• Adapts its behaviour to maximise some function of reward

• Issues:
  – Unknown transition model
  – Action selection while finding optimal policy

Page 6

Two types of RL

• We can bootstrap using explicit knowledge of P and R

(Dynamic Programming)

• Or we can bootstrap using samples from P and R

(Temporal Difference learning)

Dynamic programming backup (full model):

$$\hat{V}_{n+1}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left\{ R^{\pi(s)}_{ss'} + \gamma \hat{V}_{n}(s') \right\}$$

Temporal difference backup (sampled transition):

$$\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)$$

[Figure: the DP backup sums over all successors of s (states 3, 4, 5 under π(s)); the TD backup uses a single sampled transition s_t → s_{t+1} with reward r_{t+1} and action a_t = π(s_t)]

Page 7

TD(0) learning

t := 0
π is the policy to be evaluated
Initialise $\hat{V}_t(s)$ arbitrarily for all $s \in S$
Repeat
    select an action $a_t$ from $\pi(s_t)$
    observe the transition $s_t \xrightarrow{a_t} s_{t+1},\ r_{t+1}$
    update $\hat{V}_t(s_t)$ according to
        $\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)$
    t := t + 1
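
A minimal tabular Python sketch of this loop. The episodic environment interface (env.reset(), env.step(action) returning (next_state, reward, done)) and the policy-as-a-function convention are assumptions for illustration; they are not part of the slides.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0): after each transition, move V(s_t) toward the
    one-step target r_{t+1} + gamma * V(s_{t+1})."""
    V = defaultdict(float)                 # V(s) initialised to 0 for every state
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                  # follow the policy being evaluated
            s_next, r, done = env.step(a)  # observe s_t -> s_{t+1}, r_{t+1}
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```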

Page 8

On- and off-policy learning

• On-policy: evaluate the policy you are following, e.g. TD learning

• Off-policy: evaluate one policy while following another

• E.g. one-step Q-learning:

$$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \big)$$

compared with the on-policy TD update

$$\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)$$

[Figure: sampled transition s_t, a_t → s_{t+1} with reward r_{t+1}]

Page 9

Off-policy learning of control

• Q-learning is powerful because
  – it allows us to evaluate the optimal policy π*
  – while taking non-greedy actions (explore)

• ε-greedy is a simple and popular exploration rule:
  – take a greedy action with probability ε
  – take a random action with probability 1 − ε

• Q-learning is guaranteed to converge for MDPs (with the right exploration policy)

• Is there a way of finding π* with an on-policy learner?
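
A minimal sketch of one-step Q-learning with the slide's ε-greedy rule (greedy with probability ε, random otherwise), using the same hypothetical env interface as the TD(0) sketch above; the action set and hyperparameter values are placeholders.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.9,
               actions=(0, 1, 2, 3)):
    """Tabular one-step Q-learning: off-policy, so the backup always uses
    the greedy value max_b Q(s', b) even when an exploratory action is taken."""
    Q = defaultdict(float)                                  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = max(actions, key=lambda b: Q[(s, b)])   # greedy action
            else:
                a = random.choice(actions)                  # random exploratory action
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```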

Page 10

On-policy learning of control: Sarsa

• Discard the max operator
• Learn about the policy you are following
• Change the policy gradually
• Guaranteed to converge for Greedy in the Limit with Infinite Exploration (GLIE) policies

$$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \big)$$

[Figure: sampled transition s_t, a_t → s_{t+1}, a_{t+1} with reward r_{t+1}]

Page 11

On-policy learning of control: Sarsa

t := 0
Initialise $\hat{Q}_t(s, a)$ arbitrarily for all $s \in S,\ a \in A$
select an action $a_t$ from explore($\hat{Q}_t(s_t, \cdot)$)
Repeat
    observe the transition $s_t \xrightarrow{a_t} s_{t+1},\ r_{t+1}$
    select an action $a_{t+1}$ from explore($\hat{Q}_t(s_{t+1}, \cdot)$)
    update $\hat{Q}_t(s_t, a_t)$ according to
        $\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \big)$
    t := t + 1
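
A minimal tabular Sarsa sketch under the same assumed env interface and ε-greedy explore rule; the only change from the Q-learning sketch above is that the target uses the action actually selected in s_{t+1} rather than the max.

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.9,
          actions=(0, 1, 2, 3)):
    """Tabular Sarsa: on-policy, so the backup uses Q(s_{t+1}, a_{t+1})
    for the action the agent will actually take next."""
    Q = defaultdict(float)

    def explore(s):
        # Slide's epsilon-greedy rule: greedy with probability epsilon, random otherwise.
        if random.random() < epsilon:
            return max(actions, key=lambda b: Q[(s, b)])
        return random.choice(actions)

    for _ in range(episodes):
        s = env.reset()
        a = explore(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = explore(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```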

Page 12

Summary: TD, Q-learning, Sarsa

• TD learning

$$\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)$$

• One-step Q-learning

$$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \big)$$

• Sarsa learning

$$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \big)$$

Page 13

Speeding up learning: Eligibility traces, TD(λ)

• TD learning only passes the TD error to one state:

$$\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)$$

[Figure: trajectory s_{t-2}, a_{t-2} → s_{t-1}, a_{t-1} → s_t, a_t → s_{t+1} with rewards r_{t-1}, r_t, r_{t+1}]

• We add an eligibility for each state:

$$e_t(s) = \begin{cases} 1 & \text{if } s = s_t \\ \gamma \lambda\, e_{t-1}(s) & \text{otherwise} \end{cases} \qquad 0 \le \lambda \le 1$$

• Update every state $s \in S$ in proportion to its eligibility:

$$\hat{V}_{t+1}(s) = \hat{V}_t(s) + \alpha_t \big( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \big)\, e_t(s)$$
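
A minimal sketch of tabular TD(λ) with replacing traces, again assuming the hypothetical env/policy interface used above; the lam value is a placeholder.

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=500, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with replacing traces: every visited state keeps an
    eligibility, and the current TD error updates all states in proportion to it."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)             # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            for x in e:                    # decay every existing trace
                e[x] *= gamma * lam
            e[s] = 1.0                     # replacing trace for the current state
            for x, ex in e.items():        # broadcast the TD error
                V[x] += alpha * delta * ex
            s = s_next
    return V
```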

Page 14

Eligibility traces for learning control: Q(λ)

• There are various eligibility trace methods for Q-learning
• Update every (s, a) pair in proportion to its eligibility:

$$\hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \big)\, e_t(s, a)$$

• Pass information backwards through a non-greedy action
• Lose the convergence guarantee of one-step Q-learning
• Watkins's solution: zero all eligibilities after a non-greedy action
• Problem: you lose most of the benefit of eligibility traces

[Figure: trajectory s_{t-1}, a_{t-1} → s_t, a_t → s_{t+1}, where a_{t+1} differs from the greedy action a_greedy]

Page 15

Eligibility traces for learning control: Sarsa(λ)

• Solution: use Sarsa, since it is on-policy
• Update every (s, a) pair in proportion to its eligibility:

$$\hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \big( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \big)\, e_t(s, a)$$

• Keeps convergence guarantees

[Figure: trajectory s_t, a_t → s_{t+1}, a_{t+1} → s_{t+2}, a_{t+2} with rewards r_{t+1}, r_{t+2}]
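
Combining the two previous sketches, here is a minimal tabular Sarsa(λ) with replacing traces over (s, a) pairs, under the same assumed env interface and ε-greedy rule.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.9, lam=0.8,
                 epsilon=0.9, actions=(0, 1, 2, 3)):
    """Tabular Sarsa(lambda): the Sarsa TD error is broadcast to every
    (s, a) pair in proportion to its eligibility trace."""
    Q = defaultdict(float)

    def explore(s):
        if random.random() < epsilon:
            return max(actions, key=lambda b: Q[(s, b)])
        return random.choice(actions)

    for _ in range(episodes):
        e = defaultdict(float)              # traces over (s, a) pairs
        s = env.reset()
        a = explore(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = explore(s_next)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            for k in e:                     # decay every existing trace
                e[k] *= gamma * lam
            e[(s, a)] = 1.0                 # replacing trace for the visited pair
            for k, ek in e.items():
                Q[k] += alpha * delta * ek  # broadcast the TD error
            s, a = s_next, a_next
    return Q
```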

Page 16

Hot Topics in Reinforcement Learning

• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical reinforcement learning
• Learning in processes with hidden state (e.g. POMDPs)
• Policy search methods

Page 17

Reinforcement Learning: key papers

Overviews

R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

J. Wyatt, Reinforcement Learning: A Brief Overview. Perspectives on Adaptivity and Learning. Springer Verlag, 2003.

L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Value Function Approximation

D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1998.

Eligibility Traces

S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.

Page 18

Reinforcement Learning: key papers

Structured Models and Planning

C. Boutilier, T. Dean and S. Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.

R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.

B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.

K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.

Page 19

Reinforcement Learning: key papers

Policy Search

R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, 2000.

Hierarchical Reinforcement Learning

R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.

R. Parr. Hierarchical Control and Learning for Markov Decision Processes. Ph.D. Thesis, University of California, Berkeley, 1998.

A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Systems Journal 13: 41-77, 2003.

Page 20

Reinforcement Learning: key papers

Exploration

N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local Measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.

J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of 18th International Conference on Machine Learning, 2001.

POMDPs

L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.