Kunstmatige Intelligentie / RuG
KI2 - 11
Reinforcement Learning
Johan Everts
What is Learning?
Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that percepts received by an agent should be used not only for acting, but also for improving the agent’s ability to behave optimally in the future to achieve its goal.
Learning Types
Supervised learning: sample (input, output) pairs of the function to be learned can be perceived or are given.
Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
Unsupervised learning: no information at all is given about the correct output.
Reinforcement Learning
Task: learn how to behave successfully to achieve a goal while interacting with an external environment, that is, learn through experience.
Examples:
Game playing: the agent knows whether it has won or lost, but it doesn’t know the appropriate action in each state.
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
Elements of RL
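Transition model $\delta$: how actions influence states
Reward $R$: immediate value of a state-action transition
Policy $\pi$: maps states to actions
[Figure: agent-environment loop. The agent, following its policy, sends an action to the environment; the environment returns a state and a reward.]
$s_0 \xrightarrow{a_0,\ r_0} s_1 \xrightarrow{a_1,\ r_1} s_2 \xrightarrow{a_2,\ r_2} \dots$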
[Figure: r(state, action) immediate reward values on an example grid world with goal state G. Two transitions carry reward 100; all other transitions carry reward 0.]
Value function: maps states to state values
Discount factor $\gamma \in [0, 1)$ (here 0.9)
$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
[Figure: the grid world shown with r(state, action) immediate reward values and with V*(state) values.]
V*(state) values (G is the goal state):
90   100   0 (G)
81   90    100
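The V* values above can be checked by summing discounted rewards along a shortest path to G. The following is a minimal sketch, assuming the reward is 100 on the final step into G and 0 on every earlier step; the function name `discounted_return` is made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Assumed reward sequences along shortest paths to the goal state G.
one_step = discounted_return([100])          # state adjacent to G
two_steps = discounted_return([0, 100])      # two moves away from G
three_steps = discounted_return([0, 0, 100]) # three moves away from G
print(one_step, two_steps, three_steps)
```

With a discount factor of 0.9 these come out at roughly 100, 90 and 81, matching the grid of V* values.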
RL task (restated)
Execute actions in the environment, observe the results.
Learn an action policy $\pi : \text{state} \to \text{action}$ that maximizes the expected discounted reward
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$
from any starting state in $S$.
Reinforcement Learning
Target function is $\pi : \text{state} \to \text{action}$
RL differs from other function approximation tasks:
Partially observable states
Exploration vs. exploitation
Delayed reward -> temporal credit assignment
However, we have no training examples of the form <state, action>.
Training examples are of the form <<state, action>, reward>.
Utility-based agents
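Try to learn $V^{\pi^*}$ (abbreviated $V^*$), then perform lookahead search to choose the best action from any state $s$:
$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$
Works well if the agent knows
$\delta : \text{state} \times \text{action} \to \text{state}$
$r : \text{state} \times \text{action} \to \mathbb{R}$
When the agent doesn’t know $\delta$ and $r$, it cannot choose actions this way.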
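One-step lookahead with a known model can be sketched as follows. The grid layout (2 rows by 3 columns, G in the top-right corner, moves to adjacent cells only) and the reward of 100 for entering G are assumptions made for illustration; the V* values are the ones listed earlier, and all function names are hypothetical.

```python
GAMMA = 0.9
GOAL = (0, 2)  # assumed position of G: top-right cell, as (row, col)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# V*(state) values from the slides, laid out on the assumed 2x3 grid.
V_STAR = {(0, 0): 90, (0, 1): 100, (0, 2): 0,
          (1, 0): 81, (1, 1): 90, (1, 2): 100}


def delta(s, a):
    """Known transition model: the cell reached by taking action a in state s."""
    dr, dc = MOVES[a]
    return (s[0] + dr, s[1] + dc)


def legal_actions(s):
    """Actions that stay on the grid; G is treated as absorbing (no actions)."""
    if s == GOAL:
        return []
    return [a for a in MOVES if delta(s, a) in V_STAR]


def reward(s, a):
    """Known reward model (assumed): 100 for entering G, 0 otherwise."""
    return 100 if delta(s, a) == GOAL else 0


def lookahead_policy(s):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(legal_actions(s),
               key=lambda a: reward(s, a) + GAMMA * V_STAR[delta(s, a)])


print(lookahead_policy((0, 0)), lookahead_policy((0, 1)), lookahead_policy((1, 2)))
```

Note that the policy needs both `delta` and `reward`; without them the bracketed term cannot be evaluated, which is the limitation the slide points out.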
Q-learning
Define a new function very similar to $V^*$:
$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$
If the agent learns $Q$, it can choose the optimal action even without knowing $\delta$ or $r$.
Using learned Q:
$\pi^*(s) = \arg\max_a Q(s, a)$
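Choosing an action from a learned Q table needs only a lookup and an argmax, with no transition or reward model. A minimal sketch, assuming the table is stored as a nested dictionary; the state name and the Q-values below are illustrative, not taken from the slides.

```python
def greedy_action(q_table, state):
    """pi*(s) = argmax_a Q(s, a), read directly from the table."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# Hypothetical table entry for a single state with three available actions.
q_table = {
    "next_to_goal": {"left": 81.0, "right": 100.0, "down": 81.0},
}

print(greedy_action(q_table, "next_to_goal"))
```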
Learning the Q-value
Note: $Q$ and $V^*$ are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
This allows us to write $Q$ recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
FOR each <s, a> DO
    Initialize table entry: $\hat{Q}(s, a) \leftarrow 0$
Observe current state s
WHILE (true) DO
    Select action a and execute it
    Receive immediate reward r
    Observe new state s’
    Update table entry for $\hat{Q}(s, a)$ as follows:
        $\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$
    Move: record transition from s to s’ ($s \leftarrow s'$)
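The table-update loop above can be tried out on a small deterministic grid world. This is a sketch under several assumptions that are not stated in the slides: a 2x3 grid with G in the top-right corner, reward 100 for entering G and 0 otherwise, purely random action selection, and a restart from a random non-goal state whenever G is reached (replacing the endless WHILE loop with episodes). All names are hypothetical.

```python
import random

GAMMA = 0.9
GOAL = (0, 2)  # assumed goal cell, as (row, col)
STATES = [(r, c) for r in range(2) for c in range(3)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def legal_actions(s):
    """Moves that stay on the grid; the goal state is absorbing."""
    if s == GOAL:
        return []
    return [a for a, (dr, dc) in MOVES.items() if (s[0] + dr, s[1] + dc) in STATES]


def step(s, a):
    """Environment: return (new state, immediate reward). Unknown to the learner."""
    dr, dc = MOVES[a]
    s_next = (s[0] + dr, s[1] + dc)
    return s_next, (100 if s_next == GOAL else 0)


def q_learning(episodes=1000, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry to 0.
    q = {(s, a): 0.0 for s in STATES for a in legal_actions(s)}
    start_states = [s for s in STATES if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(start_states)           # observe current state
        while s != GOAL:
            a = rng.choice(legal_actions(s))   # select an action and execute it
            s_next, r = step(s, a)             # receive reward, observe new state
            best_next = max((q[(s_next, a2)] for a2 in legal_actions(s_next)),
                            default=0.0)       # the goal has no actions: value 0
            q[(s, a)] = r + GAMMA * best_next  # update the table entry
            s = s_next                         # move
    return q


q = q_learning()
# V*(s) can be read off the learned table as the best Q-value in each state.
v = {s: max(q[(s, a)] for a in legal_actions(s)) for s in STATES if s != GOAL}
print(v)
```

Because the assumed world is deterministic, each update overwrites the entry completely; in a stochastic world a learning rate would normally be used to average over outcomes.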
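[Figure: the example grid world with goal state G, shown three times: with r(state, action) immediate reward values (100 on two transitions, 0 elsewhere), with Q(state, action) values (72, 81, 90 or 100 per transition), and with V*(state) values (top row 90, 100, 0 at G; bottom row 81, 90, 100).]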
Q-learning
Q-learning learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a)).
Q-learning
Demonstration
http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
eps: probability of taking a random action instead of the action the current policy considers best
gam: discount factor; the closer to 1, the more weight is given to future reinforcements
alpha: learning rate
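How these three parameters usually enter a Q-learning implementation can be sketched as follows. The slides do not show the demo's own code, so this is the common textbook form, not necessarily what the demonstration does; both function names are hypothetical.

```python
import random


def epsilon_greedy(q_row, eps, rng=random):
    """Pick an action for one state; q_row maps action -> estimated Q-value.

    With probability eps a uniformly random action is taken (exploration),
    otherwise the action with the highest estimate (exploitation).
    """
    if rng.random() < eps:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)


def soft_update(q_old, r, best_next, alpha, gam):
    """Move the old estimate a fraction alpha towards r + gam * max_a' Q(s', a')."""
    target = r + gam * best_next
    return (1 - alpha) * q_old + alpha * target


row = {"left": 81.0, "right": 100.0}
print(epsilon_greedy(row, eps=0.1))
print(soft_update(q_old=50.0, r=0, best_next=100.0, alpha=0.5, gam=0.9))
```

With alpha = 1 the update reduces to the deterministic table update, and with eps = 0 the agent never explores.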
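Q-learning estimates a one-time-step difference:
$Q^{(1)}(s_t, a_t) = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$
Why not for n steps?
$Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$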
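The n-step estimate is the discounted sum of n observed rewards plus a discounted bootstrap from the current table. A minimal sketch, where the argument names are assumptions: `rewards` holds the observed rewards from time t onwards and `q_next_max` is the table's best value in the state reached after n steps.

```python
def n_step_estimate(rewards, q_next_max, gamma=0.9):
    """Q^(n)(s_t, a_t) for n = len(rewards).

    rewards    -- [r_t, r_{t+1}, ..., r_{t+n-1}], the n observed rewards
    q_next_max -- max_a Q_hat(s_{t+n}, a), taken from the current table
    """
    n = len(rewards)
    discounted_rewards = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma ** n * q_next_max


# n = 1 is the ordinary Q-learning target; n = 2 looks one step further ahead.
print(n_step_estimate([0], q_next_max=100.0))
print(n_step_estimate([0, 0], q_next_max=100.0))
```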
Temporal Difference Learning:
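TD($\lambda$) formula
Intuitive idea: use a constant $0 \le \lambda \le 1$ to combine estimates from various lookahead distances (note the normalization factor $(1 - \lambda)$):
$Q^{\lambda}(s_t, a_t) = (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots \right]$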
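The weighted combination can be sketched for a finite list of n-step estimates. The formula itself is an infinite sum; the truncation rule used here (the last available estimate receives all remaining weight, so the weights still sum to 1) is an assumption added for illustration, as is the function name.

```python
def lambda_return(estimates, lam):
    """Combine [Q^(1), Q^(2), ..., Q^(N)] into a single TD(lambda) estimate.

    Estimate n < N gets weight (1 - lam) * lam**(n - 1).
    The final estimate Q^(N) gets the leftover weight lam**(N - 1).
    """
    if not estimates:
        raise ValueError("need at least one n-step estimate")
    total = 0.0
    for n, q_n in enumerate(estimates[:-1], start=1):
        total += (1 - lam) * lam ** (n - 1) * q_n
    total += lam ** (len(estimates) - 1) * estimates[-1]
    return total


# lam = 0 keeps only the one-step estimate; lam = 1 keeps only the longest one.
print(lambda_return([10.0, 20.0, 30.0], lam=0.0))
print(lambda_return([10.0, 20.0, 30.0], lam=0.5))
print(lambda_return([10.0, 20.0, 30.0], lam=1.0))
```

The factor (1 - lam) is what makes the weights 1, lam, lam^2, ... sum to 1, so the result stays on the same scale as the individual estimates.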
Genetic algorithms
Imagine the individuals as agent functions.
The fitness function serves as the performance measure or reward function.
No attempt is made to learn the relationship between the rewards and the actions taken by an agent.
The algorithm simply searches the space of individuals directly to find one that maximizes the fitness function.
Represent an individual as a binary string.
Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely as Y to be selected for reproduction.
Reproduction is accomplished by cross-over and mutation.
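The three operators described above can be sketched on bit strings as follows. The toy fitness function (number of 1 bits plus one, so that no weight is ever zero), the single-point cross-over, the mutation rate and all function names are assumptions made for illustration.

```python
import random


def select(population, fitness, rng):
    """Fitness-proportionate selection: twice the fitness, twice the chance."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """Single-point cross-over: head of one parent, tail of the other."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rate, rng):
    """Flip each bit independently with probability rate."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in individual
    )


def next_generation(population, fitness, rng, mutation_rate=0.01):
    """Build a new population of the same size by select, cross-over, mutate."""
    return [
        mutate(crossover(select(population, fitness, rng),
                         select(population, fitness, rng), rng),
               mutation_rate, rng)
        for _ in population
    ]


def toy_fitness(individual):
    """Hypothetical fitness: count of 1 bits, plus 1 to keep weights positive."""
    return individual.count("1") + 1


rng = random.Random(0)
population = ["0000", "0101", "1100", "0010"]
for _ in range(20):
    population = next_generation(population, toy_fitness, rng)
print(population)
```

Nothing here models which action earned which reward; the fitness score of a whole individual is the only feedback used.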
Cart – Pole balancing
Demonstration
http://www.bovine.net/~jlawson/hmc/pole/sane.html
Summary
RL addresses the problem of learning control strategies for autonomous agents.
In Q-learning, an evaluation function over states and actions is learned.
TD algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times.
In the genetic approach, the relation between rewards and actions is not learned; the space of individuals is simply searched for one that maximizes the fitness function.