Decision Making Under Uncertainty
Lec #8: Reinforcement Learning

UIUC CS 598: Section EA
Professor: Eyal Amir, Spring Semester 2006
Most slides by Jeremy Wyatt (U Birmingham)
So Far
• MDPs
  – Known transition model
  – Known rewards
Return: A long term measure of performance
• Let's assume our agent acts according to some rules, called a policy, π
• The return R_t is a measure of the long-term reward collected after time t
[Figure: an example trajectory whose rewards include r_4 = +3, r_5 = -1, -1, and r_9 = +50]

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1
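To make the definition concrete, a minimal Python sketch (not from the slides; the reward sequence below is invented) that evaluates a truncated return:

```python
# Discounted return R_t = sum_{k>=0} gamma^k * r_{t+k+1}, truncated to the
# finite reward sequence actually observed after time t.
def discounted_return(rewards, gamma):
    """rewards[k] holds r_{t+k+1}; gamma is the discount factor, 0 <= gamma <= 1."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Invented reward sequence: a small gain, two penalties, then a large payoff.
print(discounted_return([3, -1, -1, 50], gamma=0.9))
# 3 - 0.9*1 - 0.81*1 + 0.729*50 = 37.74
```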
Bellman equations
• Conditional independence allows us to define the expected return in terms of a recurrence relation:

    V^\pi(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right]

  where

    p^{\pi(s)}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = \pi(s))

  and

    R^{\pi(s)}_{ss'} = E\left[ r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = \pi(s) \right]

• The same recurrence defines the optimal action values:

    Q^*(s, a) = \sum_{s' \in S} p^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^*(s') \right], \qquad V^*(s) = \max_{a \in A} Q^*(s, a)
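Because V^π appears linearly on both sides, the recurrence for a fixed policy is a linear system that can be solved directly. A minimal sketch on an invented three-state MDP (P, R, and γ below are illustrative assumptions, not from the slides):

```python
import numpy as np

gamma = 0.9
# Invented model under a fixed policy pi:
# P[s, s'] = p_{ss'}^{pi(s)}, R[s, s'] = expected reward on the s -> s' step.
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9]])
R = np.array([[0.0,  1.0, 0.0],
              [0.0, -1.0, 2.0],
              [5.0,  0.0, 0.0]])

# Bellman: V = (P * R).sum(axis=1) + gamma * P @ V, i.e. the linear system
# (I - gamma * P) V = r_pi, where r_pi is the expected one-step reward.
r_pi = (P * R).sum(axis=1)
V = np.linalg.solve(np.eye(3) - gamma * P, r_pi)
print(V)  # V^pi(s) for each state
```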
Today: Reinforcement Learning (RL)
• Learning from punishments and rewards
• Agent moves through the world, observing states and rewards
• Adapts its behaviour to maximise some function of reward
• Issues:
  – Unknown transition model
  – Action selection while finding the optimal policy
Two types of RL
• We can bootstrap using explicit knowledge of P and R (Dynamic Programming):

    \hat{V}_{n+1}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma \hat{V}_n(s') \right]

• Or we can bootstrap using samples from P and R (Temporal Difference learning):

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

  where the agent acts with a_t = \pi(s_t) and observes the transition s_t \to s_{t+1} with reward r_{t+1}
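The contrast is easy to see in code. In this hedged sketch (array conventions as in the earlier example), the DP backup consumes the whole model P, R, while the TD update consumes a single sampled transition:

```python
import numpy as np

# Dynamic programming: one synchronous backup over all states, using the model.
def dp_backup(V, P, R, gamma):
    # V_{n+1}(s) = sum_s' P[s, s'] * (R[s, s'] + gamma * V_n(s'))
    return (P * (R + gamma * V[None, :])).sum(axis=1)

# Temporal difference: one update from a single observed transition.
def td_update(V, s, r, s_next, alpha, gamma):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```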
TD(0) learning
t := 0
π is the policy to be evaluated
Initialise \hat{V}(s) arbitrarily for all s \in S
Repeat
  select an action a_t from \pi(s_t)
  observe the transition s_t \to s_{t+1}, r_{t+1}
  update \hat{V}(s_t) according to

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

  t := t + 1
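The loop translates almost line for line into Python. A sketch assuming a Gym-style environment whose `step(a)` returns `(s_next, r, done)` and a callable `policy`; neither interface is from the slides:

```python
from collections import defaultdict

def td0_evaluate(env, policy, gamma=0.9, alpha=0.1, n_steps=10_000):
    """TD(0) evaluation of `policy` from sampled transitions."""
    V = defaultdict(float)                 # V-hat, initialised arbitrarily (to 0)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                      # select a_t from pi(s_t)
        s_next, r, done = env.step(a)      # observe s_t -> s_{t+1}, r_{t+1}
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])    # the TD(0) update above
        s = env.reset() if done else s_next
    return V
```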
On and Off policy learning
• On-policy: evaluate the policy you are following, e.g. TD learning:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• Off-policy: evaluate one policy while following another policy, e.g. one-step Q-learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right)
Off policy learning of control
• Q-learning is powerful because
  – it allows us to evaluate π*
  – while taking non-greedy actions (explore)
• ε-greedy is a simple and popular exploration rule:
  – take a greedy action with probability ε
  – take a random action with probability 1 − ε
• Q-learning is guaranteed to converge for MDPs (with the right exploration policy)
• Is there a way of finding π* with an on-policy learner?
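A hedged sketch of one-step Q-learning with the slide's ε-greedy rule (greedy with probability ε); the `env` interface and `actions` list are assumptions carried over from the earlier sketch:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, eps=0.9, n_steps=100_000):
    """One-step Q-learning; eps is the probability of acting greedily."""
    Q = defaultdict(float)                 # Q-hat over (s, a) pairs
    s = env.reset()
    for _ in range(n_steps):
        if random.random() < eps:          # greedy action with probability eps
            a = max(actions, key=lambda b: Q[(s, b)])
        else:                              # random action with probability 1 - eps
            a = random.choice(actions)
        s_next, r, done = env.step(a)
        # Off-policy target: bootstrap from the greedy action in s_{t+1},
        # regardless of which action the behaviour policy will take next.
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = env.reset() if done else s_next
    return Q
```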
On policy learning of control: Sarsa
• Discard the max operator
• Learn about the policy you are following
• Change the policy gradually
• Guaranteed to converge for Greedy in the Limit with Infinite Exploration (GLIE) policies

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)
On policy learning of control: Sarsa

t := 0
Initialise \hat{Q}(s, a) arbitrarily for all s \in S, a \in A
select an action a_t from explore(\hat{Q}(s_t, \cdot))
Repeat
  observe the transition s_t \to s_{t+1}, r_{t+1}
  select an action a_{t+1} from explore(\hat{Q}(s_{t+1}, \cdot))
  update \hat{Q}(s_t, a_t) according to

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)

  t := t + 1
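The same loop in Python. Again a sketch under the assumed `env` interface; `explore` is the ε-greedy rule from the Q-learning sketch, and the only change to the target is bootstrapping from the action actually selected:

```python
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, eps=0.9, n_steps=100_000):
    """Sarsa: on-policy control, no max operator in the target."""
    Q = defaultdict(float)

    def explore(s):                        # slide's rule: greedy w.p. eps
        if random.random() < eps:
            return max(actions, key=lambda b: Q[(s, b)])
        return random.choice(actions)

    s = env.reset()
    a = explore(s)                         # select a_t before the loop
    for _ in range(n_steps):
        s_next, r, done = env.step(a)      # observe the transition
        a_next = None if done else explore(s_next)
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # bootstrap from a_{t+1}
        if done:
            s = env.reset()
            a = explore(s)
        else:
            s, a = s_next, a_next
    return Q
```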
Summary: TD, Q-learning, Sarsa

• TD learning:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• One-step Q-learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right)

• Sarsa learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)
Speeding up learning: Eligibility traces, TD(λ)

• TD learning only passes the TD error to one state:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• We add an eligibility for each state:

    e_{t+1}(s) = \begin{cases} 1 & \text{if } s = s_t \\ \lambda e_t(s) & \text{otherwise} \end{cases} \qquad \text{where } 0 \le \lambda \le 1

• Update in every state proportional to the eligibility:

    \hat{V}_{t+1}(s) = \hat{V}_t(s) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right) e_t(s), \qquad \text{for all } s \in S
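A sketch of TD(λ) with the replacing traces above; note the trace here decays by λ alone, matching the slide's rule (many presentations decay by γλ instead). The `env`/`policy` interface is the same assumption as in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda(env, policy, gamma=0.9, alpha=0.1, lam=0.8, n_steps=10_000):
    """TD(lambda) with replacing eligibility traces."""
    V = defaultdict(float)
    e = defaultdict(float)                 # eligibility e(s) per visited state
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] = 1.0                         # replacing trace: e(s_t) <- 1
        for state in list(e):              # update every state, weighted by e
            V[state] += alpha * delta * e[state]
            e[state] *= lam                # decay the eligibilities
        if done:
            e.clear()                      # traces reset at episode boundaries
            s = env.reset()
        else:
            s = s_next
    return V
```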
Eligibility traces for learning control: Q(λ)

• There are various eligibility trace methods for Q-learning
• Update for every s, a pair:

    \hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right) e_t(s, a)

• Pass information backwards through a non-greedy action
• Lose the convergence guarantee of one-step Q-learning
• Watkins's solution: zero all eligibilities after a non-greedy action
• Problem: you lose most of the benefit of eligibility traces
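Watkins's fix is a single branch in the trace bookkeeping. A hypothetical helper (name and signature invented) makes the trade-off visible:

```python
def watkins_trace_step(e, Q, s_next, a_next, actions, lam):
    """Watkins's Q(lambda) rule: traces keep decaying while the agent acts
    greedily, but are zeroed the moment a non-greedy action is taken."""
    greedy = max(actions, key=lambda b: Q[(s_next, b)])
    if a_next == greedy:
        for key in e:
            e[key] *= lam                  # keep passing credit backwards
    else:
        e.clear()                          # exploratory step: zero everything,
                                           # which is where most of the benefit
                                           # of the traces is lost
```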
Eligibility traces for learning control: Sarsa(λ)

• Solution: use Sarsa, since it's on-policy
• Update for every s, a pair:

    \hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right) e_t(s, a)

• Keeps convergence guarantees
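A sketch of Sarsa(λ) under the same assumed interfaces (`explore(Q, s)` is the exploration rule passed in); because the traces follow the policy actually being executed, no Watkins-style zeroing is needed:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, explore, gamma=0.9, alpha=0.1, lam=0.8,
                 n_steps=100_000):
    """Sarsa(lambda) with replacing traces over (s, a) pairs."""
    Q = defaultdict(float)
    e = defaultdict(float)                 # eligibility e(s, a)
    s = env.reset()
    a = explore(Q, s)
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = None if done else explore(Q, s_next)
        delta = (r + (0.0 if done else gamma * Q[(s_next, a_next)])
                 - Q[(s, a)])
        e[(s, a)] = 1.0                    # replacing trace for the current pair
        for key in list(e):                # update every pair, weighted by e
            Q[key] += alpha * delta * e[key]
            e[key] *= lam
        if done:
            e.clear()
            s = env.reset()
            a = explore(Q, s)
        else:
            s, a = s_next, a_next
    return Q
```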
Hot Topics in Reinforcement Learning
• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical reinforcement learning
• Learning in processes with hidden state (e.g. POMDPs)
• Policy search methods
Reinforcement Learning: key papers
Overviews
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
J. Wyatt. Reinforcement Learning: A Brief Overview. In Perspectives on Adaptivity and Learning. Springer Verlag, 2003.
L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
Value Function Approximation
D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
Eligibility Traces
S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
Reinforcement Learning: key papers
Structured Models and Planning
C. Boutilier, T. Dean and S. Hanks. Decision Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.
R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.
B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.
K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.
Reinforcement Learning: key papers
Policy Search
R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, 2000.
Hierarchical Reinforcement Learning
R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD Thesis, University of California, Berkeley, 1998.
A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Systems Journal 13: 41-77, 2003.
Reinforcement Learning: key papers
Exploration
N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local Measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.
J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of 18th International Conference on Machine Learning, 2001.
POMDPs
L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.