Reinforcement Learning
Course 9.54
Final review
Agent learning to act in an unknown environment
Reinforcement Learning Setup
[Figure: agent–environment interaction; state St]
Background and setup
• The environment is initially unknown or partially known
• It is also stochastic: the agent cannot fully predict what will happen next
• What is a ‘good’ action to select under these conditions?
• Animal learning seeks to maximize reward
Formal Setup
• The agent is in one of a set of states, {S1, S2,…Sn }
• At each state, it can take an action from a set of available actions {A1, A2,…Ak}
• From state Si, taking action Aj → a new state Sj and a possible reward
[Figure: from state S1, actions A1, A2, A3 lead to new states (S1, S2, S3); some transitions carry rewards R = 2 and R = -1]
Stochastic transitions
The consequence of an action:
• (S,A) → (S', R)
• Governed by:
• P(S' | S, A)
• P(R | S, A, S')
• These probabilities are properties of the world. (‘Contingencies’)
• We assume the transitions are Markovian: the next state and reward depend only on the current state and action
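As an illustration (not part of the original slides), a minimal Python sketch of such stochastic, Markovian contingencies; the states, actions, probabilities and rewards below are made-up example values:

import random

# Hypothetical contingencies P(S' | S, A) together with the reward r(S, A, S').
# Keys are (state, action); values are lists of (next_state, probability, reward).
contingencies = {
    ("S1", "A1"): [("S2", 0.8, 2.0), ("S3", 0.2, -1.0)],
    ("S1", "A2"): [("S1", 1.0, 0.0)],
}

def step(state, action):
    """Sample (next_state, reward); Markovian: depends only on the current (S, A)."""
    outcomes = contingencies[(state, action)]
    u, cumulative = random.random(), 0.0
    for next_state, p, reward in outcomes:
        cumulative += p
        if u <= cumulative:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]   # numerical safety fallback

print(step("S1", "A1"))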
Policy
• The goal is to learn a policy π: S → A
• The policy determines the future of the agent:
[Figure: the policy π assigns an action (A2, A1, A3) to each state (S1, S2, S3)]
Model-based vs. Model-free RL
• Model-based methods assume that the probabilities:
– P(S' | S, A)
– P(R | S, A, S')
are known and can be used in planning
• In model-free methods:
– The ‘contingencies’ are not known
– Need to be learned by exploration as a part of policy learning
[Figure: a trajectory S → S(1) → S(2) under actions a1, a2, collecting rewards r1, r2]
Step 1: defining Vπ(S)
• Start from S and just follow the policy π
• We find ourselves in state S(1) and receive reward r1, and so on
• Vπ(S) = < r1 + γ r2 + γ² r3 + … >
• The expected (discounted) reward from S on.
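For concreteness, a small sketch (example values, not from the slides) of the discounted sum whose expectation over trajectories defines Vπ(S):

gamma = 0.9                      # discount factor
rewards = [1.0, 0.0, 2.0, -1.0]  # r1, r2, r3, r4 observed along one trajectory from S

# Discounted return for this single trajectory: r1 + γ r2 + γ² r3 + ...
G = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(G)
# Vπ(S) is the average of G over many trajectories that start at S and follow π.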
Step 2: equations for V(S)
• Vπ(S) = < r1 + γ r2 + γ² r3 + … >
• = < r1 + γ (r2 + γ r3 + … ) >
• = < r1 + γ Vπ(S') >
• These are equations relating V(S) for different states.
• Next, write them explicitly in terms of the known parameters (contingencies):
[Figure: from state S, action A leads to S1, S2, or S3, with rewards r1, r2, r3]
Equations for V(S)
• Vπ(S) = < r1 + γ Vπ(S') > = Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
• E.g., with the three successors above:
Vπ(S) = 0.2 (r1 + γ Vπ(S1)) + 0.5 (r2 + γ Vπ(S2)) + 0.3 (r3 + γ Vπ(S3))
• Linear equations, the unknowns are the Vπ(Si)
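A minimal sketch of solving these equations by iteration for a toy fixed-policy model; the transition table, rewards and γ below are illustrative assumptions:

# For each state, the policy's transitions: a list of (probability, reward, next_state).
model = {
    "S":  [(0.2, 1.0, "S1"), (0.5, 0.5, "S2"), (0.3, -1.0, "S3")],
    "S1": [(1.0, 0.0, "S")],
    "S2": [(1.0, 2.0, "S")],
    "S3": [(1.0, 0.0, "S3")],
}
gamma = 0.9

# Iterative policy evaluation: V(S) <- Σ_S' p [ r + γ V(S') ].
V = {s: 0.0 for s in model}
for _ in range(1000):
    V = {s: sum(p * (r + gamma * V[s2]) for p, r, s2 in model[s]) for s in model}
print(V)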
Improving the Policy
• Given the policy π, we can find the values Vπ(S) by solving the linear equations iteratively
• Convergence is guaranteed (the system of equations is strongly diagonally dominant)
• Given Vπ(S), we can improve the policy: in each state, choose the action with the largest expected return (see the sketch below)
• We can combine these steps to find the optimal policy
[Figure: improving the policy at state S; the action A given by the current π is compared with an alternative action A', using the values of the successors S1, S2, S3 and the rewards r1, r2, r3]
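A sketch of the improvement step under the same kind of hypothetical model: given Vπ, choose in each state the action with the largest expected one-step return:

gamma = 0.9

# Hypothetical contingencies per (state, action): a list of (probability, reward, next_state).
P = {
    ("S", "A"):  [(0.2, 1.0, "S1"), (0.5, 0.5, "S2"), (0.3, -1.0, "S3")],
    ("S", "A'"): [(1.0, 2.0, "S2")],
}
V = {"S1": 0.0, "S2": 1.5, "S3": -2.0}   # values obtained from the evaluation step

def improved_action(state, actions):
    """Greedy improvement: argmax over a of Σ_S' p(S' | S, a) [ r + γ V(S') ]."""
    return max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(state, a)]))

print(improved_action("S", ["A", "A'"]))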
Value Iteration
Learning V and π when the ‘contingencies’ are known:
Value Iteration Algorithm
* Value iteration is used in the problem set
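The algorithm itself is not reproduced in the text above; below is a minimal sketch of standard tabular value iteration, assuming a hypothetical model P[(state, action)] of (probability, reward, next_state) triples:

gamma, theta = 0.9, 1e-6

P = {
    ("S1", "A1"): [(1.0,  2.0, "S2")],
    ("S1", "A2"): [(0.5, -1.0, "S1"), (0.5, 0.0, "S3")],
    ("S2", "A1"): [(1.0,  0.0, "S3")],
    ("S3", "A1"): [(1.0,  0.0, "S3")],
}
states  = {s for s, _ in P}
actions = {s: [a for (s2, a) in P if s2 == s] for s in states}

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(s, a)])
                    for a in actions[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:     # stop once the values have converged
        break

# Read off the (greedy) policy from the converged values.
policy = {s: max(actions[s],
                 key=lambda a: sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(s, a)]))
          for s in states}
print(V, policy)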
Q-learning
• The main algorithm used for model-free RL
Q-values (state-action)
[Figure: from state S1, actions A1, A2, A3 lead to states S1, S2, S3 with rewards R = 2 and R = -1; Q(S1, A1) and Q(S1, A3) label the corresponding state–action pairs]
Qπ (S,a) is the expected return starting from S, taking the action a, and thereafter following policy π
Q-value (state-action)
• The same update is done on Q-values rather than on V
• Used in most practical algorithms and some brain models
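The Q-learning update rule itself is not spelled out in the text above; a standard tabular form, with α, γ and the Q table as assumed names, is:

alpha, gamma = 0.1, 0.9

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy tabular update: Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example with made-up values:
Q = {("S1", "A1"): 0.0, ("S1", "A3"): 0.0, ("S2", "A1"): 0.0, ("S2", "A3"): 0.0}
q_learning_update(Q, "S1", "A1", 2.0, "S2", ["A1", "A3"])
print(Q)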
SARSA
• It is called SARSA because it uses: s(t), a(t), r(t+1), s(t+1), a(t+1)
• A step like this uses the current π, so that each S has its action according to the policy π: a = π(S)
SARSA RL Algorithm
* ε-greedy: with probability ε, do not select the greedy action; instead, select with equal probability among all actions.
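Since the SARSA algorithm itself appears only as a figure, here is a minimal sketch of one SARSA step with the ε-greedy selection described above; α, γ, ε and the Q table are assumed names:

import random

alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(Q, s, actions):
    """With probability ε choose uniformly among all actions, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update from the quintuple (s, a, r, s', a'):
    Q(s,a) <- Q(s,a) + α [ r + γ Q(s',a') - Q(s,a) ]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])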
TD learning
Biology: dopamine
Behavioral support for ‘prediction error’
Associating light cue with food
‘Blocking’: first the light alone is paired with food; then light + bell together are paired with food; finally the bell is tested on its own
• No response to the bell
• The bell and food were consistently associated
• There was no prediction error (the light already predicted the food)
• Conclusion: prediction error, not association, drives learning!
Learning and Dopamine
• Learning is driven by the prediction error:
δ(t) = r + γ V(S') − V(S)
• Computed by the dopamine system (Here too, if there is no error, no learning will take place)
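A sketch of how this prediction error drives a standard TD(0) value update; the learning rate α is an assumed parameter and the update rule is the usual TD(0) form, not stated on the slide:

alpha, gamma = 0.1, 0.9

def td0_update(V, s, r, s_next):
    """δ = r + γ V(S') - V(S); if the prediction is exact (δ = 0), nothing changes."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

# Example with made-up values:
V = {"S1": 0.0, "S2": 1.0}
print(td0_update(V, "S1", 0.5, "S2"), V)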
Dopaminergic neurons
• Dopamine is a neuromodulator
• Dopaminergic neurons are located in the:
– VTA (ventral tegmental area)
– Substantia Nigra
• These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex.
Major players in RL
Effects of dopamine: why it is associated with reward and reward-related learning
• Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
• Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation:
– Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet
Dopamine and prediction error
• The animal (rat, monkey) gets a cue (visual or auditory)
• A reward is delivered after a delay (1 sec)
Dopamine and prediction error