Decision Making Under Uncertainty
Lec #8: Reinforcement Learning

UIUC CS 598: Section EA
Professor: Eyal Amir, Spring Semester 2006
Most slides by Jeremy Wyatt (U Birmingham)
So Far
• MDPs
  – Known transition model
  – Known rewards
Return: A long term measure of performance
• Let's assume our agent acts according to some rules, called a policy, π
• The return R_t is a measure of the long-term reward collected after time t
[Figure: an example trajectory whose rewards include r_4 = +3, r_5 = -1, -1, and r_9 = +50]

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1
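To make the definition concrete, a minimal Python sketch (not from the slides; the reward sequence below is invented) that evaluates a truncated return:

```python
# Discounted return R_t = sum_{k>=0} gamma^k * r_{t+k+1}, truncated to the
# finite reward sequence actually observed after time t.
def discounted_return(rewards, gamma):
    """rewards[k] holds r_{t+k+1}; gamma is the discount factor, 0 <= gamma <= 1."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Invented reward sequence: a small gain, two penalties, then a large payoff.
print(discounted_return([3, -1, -1, 50], gamma=0.9))
# 3 - 0.9*1 - 0.81*1 + 0.729*50 = 37.74
```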
Bellman equations
• Conditional independence allows us to define the expected return in terms of a recurrence relation:

    V^\pi(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right]

  where

    p^{\pi(s)}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = \pi(s))

  and

    R^{\pi(s)}_{ss'} = E\left[ r_{t+1} \mid s_{t+1} = s', s_t = s, a_t = \pi(s) \right]

• The same recurrence defines the optimal action values:

    Q^*(s, a) = \sum_{s' \in S} p^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^*(s') \right], \qquad V^*(s) = \max_{a \in A} Q^*(s, a)
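Because V^π appears linearly on both sides, the recurrence for a fixed policy is a linear system that can be solved directly. A minimal sketch on an invented three-state MDP (P, R, and γ below are illustrative assumptions, not from the slides):

```python
import numpy as np

gamma = 0.9
# Invented model under a fixed policy pi:
# P[s, s'] = p_{ss'}^{pi(s)}, R[s, s'] = expected reward on the s -> s' step.
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9]])
R = np.array([[0.0,  1.0, 0.0],
              [0.0, -1.0, 2.0],
              [5.0,  0.0, 0.0]])

# Bellman: V = (P * R).sum(axis=1) + gamma * P @ V, i.e. the linear system
# (I - gamma * P) V = r_pi, where r_pi is the expected one-step reward.
r_pi = (P * R).sum(axis=1)
V = np.linalg.solve(np.eye(3) - gamma * P, r_pi)
print(V)  # V^pi(s) for each state
```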
Today: Reinforcement Learning (RL)
• Learning from punishments and rewards
• Agent moves through the world, observing states and rewards
• Adapts its behaviour to maximise some function of reward
• Issues:
  – Unknown transition model
  – Action selection while finding the optimal policy
Two types of RL
• We can bootstrap using explicit knowledge of P and R (Dynamic Programming):

    \hat{V}_{n+1}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma \hat{V}_n(s') \right]

• Or we can bootstrap using samples from P and R (Temporal Difference learning):

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

  where the agent acts with a_t = \pi(s_t) and observes the transition s_t \to s_{t+1} with reward r_{t+1}
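The contrast is easy to see in code. In this hedged sketch (array conventions as in the earlier example), the DP backup consumes the whole model P, R, while the TD update consumes a single sampled transition:

```python
import numpy as np

# Dynamic programming: one synchronous backup over all states, using the model.
def dp_backup(V, P, R, gamma):
    # V_{n+1}(s) = sum_s' P[s, s'] * (R[s, s'] + gamma * V_n(s'))
    return (P * (R + gamma * V[None, :])).sum(axis=1)

# Temporal difference: one update from a single observed transition.
def td_update(V, s, r, s_next, alpha, gamma):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```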
TD(0) learning
t := 0
π is the policy to be evaluated
Initialise \hat{V}(s) arbitrarily for all s \in S
Repeat
  select an action a_t from \pi(s_t)
  observe the transition s_t \to s_{t+1}, r_{t+1}
  update \hat{V}(s_t) according to

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

  t := t + 1
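The loop translates almost line for line into Python. A sketch assuming a Gym-style environment whose `step(a)` returns `(s_next, r, done)` and a callable `policy`; neither interface is from the slides:

```python
from collections import defaultdict

def td0_evaluate(env, policy, gamma=0.9, alpha=0.1, n_steps=10_000):
    """TD(0) evaluation of `policy` from sampled transitions."""
    V = defaultdict(float)                 # V-hat, initialised arbitrarily (to 0)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                      # select a_t from pi(s_t)
        s_next, r, done = env.step(a)      # observe s_t -> s_{t+1}, r_{t+1}
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])    # the TD(0) update above
        s = env.reset() if done else s_next
    return V
```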
On and Off policy learning
• On-policy: evaluate the policy you are following, e.g. TD learning:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• Off-policy: evaluate one policy while following another policy, e.g. one-step Q-learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right)
Off policy learning of control
• Q-learning is powerful because
  – it allows us to evaluate π*
  – while taking non-greedy actions (explore)
• ε-greedy is a simple and popular exploration rule:
  – take a greedy action with probability ε
  – take a random action with probability 1 − ε
• Q-learning is guaranteed to converge for MDPs (with the right exploration policy)
• Is there a way of finding π* with an on-policy learner?
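A hedged sketch of one-step Q-learning with the slide's ε-greedy rule (greedy with probability ε); the `env` interface and `actions` list are assumptions carried over from the earlier sketch:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, eps=0.9, n_steps=100_000):
    """One-step Q-learning; eps is the probability of acting greedily."""
    Q = defaultdict(float)                 # Q-hat over (s, a) pairs
    s = env.reset()
    for _ in range(n_steps):
        if random.random() < eps:          # greedy action with probability eps
            a = max(actions, key=lambda b: Q[(s, b)])
        else:                              # random action with probability 1 - eps
            a = random.choice(actions)
        s_next, r, done = env.step(a)
        # Off-policy target: bootstrap from the greedy action in s_{t+1},
        # regardless of which action the behaviour policy will take next.
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = env.reset() if done else s_next
    return Q
```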
On policy learning of control: Sarsa
• Discard the max operator
• Learn about the policy you are following
• Change the policy gradually
• Guaranteed to converge for Greedy in the Limit with Infinite Exploration (GLIE) policies

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)
On policy learning of control: Sarsa

t := 0
Initialise \hat{Q}(s, a) arbitrarily for all s \in S, a \in A
select an action a_t from explore(\hat{Q}(s_t, \cdot))
Repeat
  observe the transition s_t \to s_{t+1}, r_{t+1}
  select an action a_{t+1} from explore(\hat{Q}(s_{t+1}, \cdot))
  update \hat{Q}(s_t, a_t) according to

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)

  t := t + 1
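The same loop in Python. Again a sketch under the assumed `env` interface; `explore` is the ε-greedy rule from the Q-learning sketch, and the only change to the target is bootstrapping from the action actually selected:

```python
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, eps=0.9, n_steps=100_000):
    """Sarsa: on-policy control, no max operator in the target."""
    Q = defaultdict(float)

    def explore(s):                        # slide's rule: greedy w.p. eps
        if random.random() < eps:
            return max(actions, key=lambda b: Q[(s, b)])
        return random.choice(actions)

    s = env.reset()
    a = explore(s)                         # select a_t before the loop
    for _ in range(n_steps):
        s_next, r, done = env.step(a)      # observe the transition
        a_next = None if done else explore(s_next)
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # bootstrap from a_{t+1}
        if done:
            s = env.reset()
            a = explore(s)
        else:
            s, a = s_next, a_next
    return Q
```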
Summary: TD, Q-learning, Sarsa

• TD learning:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• One-step Q-learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right)

• Sarsa learning:

    \hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right)
Speeding up learning: Eligibility traces, TD(λ)

• TD learning only passes the TD error to one state:

    \hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right)

• We add an eligibility for each state:

    e_{t+1}(s) = \begin{cases} 1 & \text{if } s = s_t \\ \lambda e_t(s) & \text{otherwise} \end{cases} \qquad \text{where } 0 \le \lambda \le 1

• Update in every state proportional to the eligibility:

    \hat{V}_{t+1}(s) = \hat{V}_t(s) + \alpha_t \left( r_{t+1} + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t) \right) e_t(s), \qquad \text{for all } s \in S
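A sketch of TD(λ) with the replacing traces above; note the trace here decays by λ alone, matching the slide's rule (many presentations decay by γλ instead). The `env`/`policy` interface is the same assumption as in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda(env, policy, gamma=0.9, alpha=0.1, lam=0.8, n_steps=10_000):
    """TD(lambda) with replacing eligibility traces."""
    V = defaultdict(float)
    e = defaultdict(float)                 # eligibility e(s) per visited state
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] = 1.0                         # replacing trace: e(s_t) <- 1
        for state in list(e):              # update every state, weighted by e
            V[state] += alpha * delta * e[state]
            e[state] *= lam                # decay the eligibilities
        if done:
            e.clear()                      # traces reset at episode boundaries
            s = env.reset()
        else:
            s = s_next
    return V
```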
Eligibility traces for learning control: Q(λ)

• There are various eligibility trace methods for Q-learning
• Update for every s, a pair:

    \hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \left( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_t(s_{t+1}, b) - \hat{Q}_t(s_t, a_t) \right) e_t(s, a)

• Pass information backwards through a non-greedy action
• Lose the convergence guarantee of one-step Q-learning
• Watkins's solution: zero all eligibilities after a non-greedy action
• Problem: you lose most of the benefit of eligibility traces
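Watkins's fix is a single branch in the trace bookkeeping. A hypothetical helper (name and signature invented) makes the trade-off visible:

```python
def watkins_trace_step(e, Q, s_next, a_next, actions, lam):
    """Watkins's Q(lambda) rule: traces keep decaying while the agent acts
    greedily, but are zeroed the moment a non-greedy action is taken."""
    greedy = max(actions, key=lambda b: Q[(s_next, b)])
    if a_next == greedy:
        for key in e:
            e[key] *= lam                  # keep passing credit backwards
    else:
        e.clear()                          # exploratory step: zero everything,
                                           # which is where most of the benefit
                                           # of the traces is lost
```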
Eligibility traces for learning control: Sarsa(λ)

• Solution: use Sarsa, since it's on-policy
• Update for every s, a pair:

    \hat{Q}_{t+1}(s, a) = \hat{Q}_t(s, a) + \alpha_t \left( r_{t+1} + \gamma \hat{Q}_t(s_{t+1}, a_{t+1}) - \hat{Q}_t(s_t, a_t) \right) e_t(s, a)

• Keeps convergence guarantees
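A sketch of Sarsa(λ) under the same assumed interfaces (`explore(Q, s)` is the exploration rule passed in); because the traces follow the policy actually being executed, no Watkins-style zeroing is needed:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, explore, gamma=0.9, alpha=0.1, lam=0.8,
                 n_steps=100_000):
    """Sarsa(lambda) with replacing traces over (s, a) pairs."""
    Q = defaultdict(float)
    e = defaultdict(float)                 # eligibility e(s, a)
    s = env.reset()
    a = explore(Q, s)
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = None if done else explore(Q, s_next)
        delta = (r + (0.0 if done else gamma * Q[(s_next, a_next)])
                 - Q[(s, a)])
        e[(s, a)] = 1.0                    # replacing trace for the current pair
        for key in list(e):                # update every pair, weighted by e
            Q[key] += alpha * delta * e[key]
            e[key] *= lam
        if done:
            e.clear()
            s = env.reset()
            a = explore(Q, s)
        else:
            s, a = s_next, a_next
    return Q
```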
Hot Topics in Reinforcement Learning
• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical reinforcement learning
• Learning in processes with hidden state (e.g. POMDPs)
• Policy search methods
Reinforcement Learning: key papers
Overviews
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
J. Wyatt. Reinforcement Learning: A Brief Overview. In Perspectives on Adaptivity and Learning. Springer Verlag, 2003.
L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
Value Function Approximation
D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
Eligibility Traces
S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
Reinforcement Learning: key papers
Structured Models and Planning
C. Boutilier, T. Dean and S. Hanks. Decision Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.
R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.
B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.
K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.
Reinforcement Learning: key papers
Policy Search
R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, 2000.
Hierarchical Reinforcement Learning
R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD Thesis, University of California, Berkeley, 1998.
A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Systems Journal 13: 41-77, 2003.
Reinforcement Learning: key papers
Exploration
N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local Measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.
J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of 18th International Conference on Machine Learning, 2001.
POMDPs
L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.