Reinforcement Learning
Course 9.54
Final review
Agent learning to act in an unknown environment
Reinforcement Learning Setup
[Figure: agent–environment interaction; state St]
Background and setup
• The environment is initially unknown or partially known
• It is also stochastic: the agent cannot fully predict what will happen next
• What is a ‘good’ action to select under these conditions?
• Animal learning seeks to maximize reward
Formal Setup
• The agent is in one of a set of states, {S1, S2,…Sn }
• At each state, it can take an action from a set of available actions {A1, A2,…Ak}
• From state Si, taking action Aj → a new state Sj and a possible reward
[Figure: from state S1, actions A1, A2, A3 lead to new states (S1, S2, S3); some transitions carry rewards R = 2 and R = -1]
Stochastic transitions
The consequence of an action:
• (S,A) → (S', R)
• Governed by:
• P(S' | S, A)
• P(R | S, A, S')
• These probabilities are properties of the world. (‘Contingencies’)
• We assume the transitions are Markovian: the next state and reward depend only on the current state and action
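As an illustration (not part of the original slides), a minimal Python sketch of such stochastic, Markovian contingencies; the states, actions, probabilities and rewards below are made-up example values:

import random

# Hypothetical contingencies P(S' | S, A) together with the reward r(S, A, S').
# Keys are (state, action); values are lists of (next_state, probability, reward).
contingencies = {
    ("S1", "A1"): [("S2", 0.8, 2.0), ("S3", 0.2, -1.0)],
    ("S1", "A2"): [("S1", 1.0, 0.0)],
}

def step(state, action):
    """Sample (next_state, reward); Markovian: depends only on the current (S, A)."""
    outcomes = contingencies[(state, action)]
    u, cumulative = random.random(), 0.0
    for next_state, p, reward in outcomes:
        cumulative += p
        if u <= cumulative:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]   # numerical safety fallback

print(step("S1", "A1"))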
Policy
• The goal is to learn a policy π: S → A
• The policy determines the future of the agent:
[Figure: the policy π assigns an action (A2, A1, A3) to each state (S1, S2, S3)]
Model-based vs. Model-free RL
• Model-based methods assume that the probabilities:
– P(S' | S, A)
– P(R | S, A, S')
are known and can be used in planning
• In model-free methods:
– The ‘contingencies’ are not known
– Need to be learned by exploration as a part of policy learning
[Figure: a trajectory S → S(1) → S(2) under actions a1, a2, collecting rewards r1, r2]
Step 1: defining Vπ(S)
• Start from S and just follow the policy π
• We find ourselves in state S(1) and receive reward r1, and so on
• Vπ(S) = < r1 + γ r2 + γ² r3 + … >
• The expected (discounted) reward from S on.
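For concreteness, a small sketch (example values, not from the slides) of the discounted sum whose expectation over trajectories defines Vπ(S):

gamma = 0.9                      # discount factor
rewards = [1.0, 0.0, 2.0, -1.0]  # r1, r2, r3, r4 observed along one trajectory from S

# Discounted return for this single trajectory: r1 + γ r2 + γ² r3 + ...
G = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(G)
# Vπ(S) is the average of G over many trajectories that start at S and follow π.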
Step 2: equations for V(S)
• Vπ(S) = < r1 + γ r2 + γ² r3 + … >
• = < r1 + γ (r2 + γ r3 + … ) >
• = < r1 + γ Vπ(S') >
• These are equations relating V(S) for different states.
• Next, write them explicitly in terms of the known parameters (contingencies):
[Figure: from state S, action A leads to S1, S2, or S3, with rewards r1, r2, r3]
Equations for V(S)
• Vπ(S) = < r1 + γ Vπ(S') > = Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
• E.g., with the three successors above:
Vπ(S) = 0.2 (r1 + γ Vπ(S1)) + 0.5 (r2 + γ Vπ(S2)) + 0.3 (r3 + γ Vπ(S3))
• Linear equations, the unknowns are the Vπ(Si)
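A minimal sketch of solving these equations by iteration for a toy fixed-policy model; the transition table, rewards and γ below are illustrative assumptions:

# For each state, the policy's transitions: a list of (probability, reward, next_state).
model = {
    "S":  [(0.2, 1.0, "S1"), (0.5, 0.5, "S2"), (0.3, -1.0, "S3")],
    "S1": [(1.0, 0.0, "S")],
    "S2": [(1.0, 2.0, "S")],
    "S3": [(1.0, 0.0, "S3")],
}
gamma = 0.9

# Iterative policy evaluation: V(S) <- Σ_S' p [ r + γ V(S') ].
V = {s: 0.0 for s in model}
for _ in range(1000):
    V = {s: sum(p * (r + gamma * V[s2]) for p, r, s2 in model[s]) for s in model}
print(V)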
Improving the Policy
• Given the policy π, we can find the values Vπ(S) by solving the linear equations iteratively
• Convergence is guaranteed (the system of equations is strongly diagonally dominant)
• Given Vπ(S), we can improve the policy: in each state, choose the action with the largest expected return (see the sketch below)
• We can combine these steps to find the optimal policy
[Figure: improving the policy at state S; the action A given by the current π is compared with an alternative action A', using the values of the successors S1, S2, S3 and the rewards r1, r2, r3]
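A sketch of the improvement step under the same kind of hypothetical model: given Vπ, choose in each state the action with the largest expected one-step return:

gamma = 0.9

# Hypothetical contingencies per (state, action): a list of (probability, reward, next_state).
P = {
    ("S", "A"):  [(0.2, 1.0, "S1"), (0.5, 0.5, "S2"), (0.3, -1.0, "S3")],
    ("S", "A'"): [(1.0, 2.0, "S2")],
}
V = {"S1": 0.0, "S2": 1.5, "S3": -2.0}   # values obtained from the evaluation step

def improved_action(state, actions):
    """Greedy improvement: argmax over a of Σ_S' p(S' | S, a) [ r + γ V(S') ]."""
    return max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(state, a)]))

print(improved_action("S", ["A", "A'"]))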
Value Iteration
Learning V and π when the ‘contingencies’ are known:
Value Iteration Algorithm
* Value iteration is used in the problem set
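The algorithm itself is not reproduced in the text above; below is a minimal sketch of standard tabular value iteration, assuming a hypothetical model P[(state, action)] of (probability, reward, next_state) triples:

gamma, theta = 0.9, 1e-6

P = {
    ("S1", "A1"): [(1.0,  2.0, "S2")],
    ("S1", "A2"): [(0.5, -1.0, "S1"), (0.5, 0.0, "S3")],
    ("S2", "A1"): [(1.0,  0.0, "S3")],
    ("S3", "A1"): [(1.0,  0.0, "S3")],
}
states  = {s for s, _ in P}
actions = {s: [a for (s2, a) in P if s2 == s] for s in states}

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(s, a)])
                    for a in actions[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:     # stop once the values have converged
        break

# Read off the (greedy) policy from the converged values.
policy = {s: max(actions[s],
                 key=lambda a: sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(s, a)]))
          for s in states}
print(V, policy)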
Q-learning
• The main algorithm used for model-free RL
Q-values (state-action)
[Figure: from state S1, actions A1, A2, A3 lead to states S1, S2, S3 with rewards R = 2 and R = -1; Q(S1, A1) and Q(S1, A3) label the corresponding state–action pairs]
Qπ (S,a) is the expected return starting from S, taking the action a, and thereafter following policy π
Q-value (state-action)
• The same update is done on Q-values rather than on V
• Used in most practical algorithms and some brain models
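The Q-learning update rule itself is not spelled out in the text above; a standard tabular form, with α, γ and the Q table as assumed names, is:

alpha, gamma = 0.1, 0.9

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy tabular update: Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example with made-up values:
Q = {("S1", "A1"): 0.0, ("S1", "A3"): 0.0, ("S2", "A1"): 0.0, ("S2", "A3"): 0.0}
q_learning_update(Q, "S1", "A1", 2.0, "S2", ["A1", "A3"])
print(Q)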
SARSA
• It is called SARSA because it uses: s(t), a(t), r(t+1), s(t+1), a(t+1)
• A step like this uses the current π, so that each S has its action according to the policy π: a = π(S)
SARSA RL Algorithm
* ε-greedy: with probability ε, do not select the greedy action; instead, select with equal probability among all actions.
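Since the SARSA algorithm itself appears only as a figure, here is a minimal sketch of one SARSA step with the ε-greedy selection described above; α, γ, ε and the Q table are assumed names:

import random

alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(Q, s, actions):
    """With probability ε choose uniformly among all actions, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update from the quintuple (s, a, r, s', a'):
    Q(s,a) <- Q(s,a) + α [ r + γ Q(s',a') - Q(s,a) ]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])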
TD learning
Biology: dopamine
Behavioral support for ‘prediction error’
Associating light cue with food
‘Blocking’: first the light alone is paired with food; then light + bell together are paired with food; finally the bell is tested on its own
• No response to the bell
• The bell and food were consistently associated
• There was no prediction error (the light already predicted the food)
• Conclusion: prediction error, not association, drives learning!
Learning and Dopamine
• Learning is driven by the prediction error:
δ(t) = r + γ V(S') − V(S)
• Computed by the dopamine system (Here too, if there is no error, no learning will take place)
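A sketch of how this prediction error drives a standard TD(0) value update; the learning rate α is an assumed parameter and the update rule is the usual TD(0) form, not stated on the slide:

alpha, gamma = 0.1, 0.9

def td0_update(V, s, r, s_next):
    """δ = r + γ V(S') - V(S); if the prediction is exact (δ = 0), nothing changes."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

# Example with made-up values:
V = {"S1": 0.0, "S2": 1.0}
print(td0_update(V, "S1", 0.5, "S2"), V)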
Dopaminergic neurons
• Dopamine is a neuromodulator
• Dopaminergic neurons are located in the:
– VTA (ventral tegmental area)
– Substantia Nigra
• These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex.
Major players in RL
Effects of dopamine: why it is associated with reward and reward-related learning
• Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
• Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation:
– Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet
Dopamine and prediction error
• The animal (rat, monkey) gets a cue (visual or auditory)
• A reward is delivered after a delay (1 sec)
Dopamine and prediction error