
Reinforcement Learning – Embodied AI VL WS16/17 Supplement

Oswald Berthold

Adaptive Systems Group, Dept. of Computer Science, Humboldt-Universität zu Berlin

February 6, 2017


Outline

Intro

Classical RL

Bio-plausible RL

Summary

References


1. Intro

Basics

Reinforcement Learning (German: Verstärkungslernen)

This means

There is something (a signal) that reinforces

There is something (structure) which is reinforced


1. Intro

Plain language definition

Learn from, and while interacting with, the environment. The reinforcing signal comes from this interaction.

Amplify the correlation between any action and its causes (states) when it yields favourable consequences. The associative link between states and actions is reinforced.

Assertion: Any algorithm that "solves" this is an RL algorithm. Question: What do we accept as a solution? This is where RL and DR disagree?


1. Intro

Quote: Any algorithm . . .

Reinforcement learning, [. . . ], is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods. Reinforcement learning problems involve learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. In an essential way these are closed-loop problems because the learning system's actions influence its later inputs. Moreover, the learner is not told which actions to take, as in many forms of machine learning, but instead must discover which actions yield the most reward by trying them out. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These three characteristics — being closed-loop in an essential way, not having direct instructions as to what actions to take, and where the consequences of actions, including reward signals, play out over extended time periods — are the three most important distinguishing features of the reinforcement learning problem. [. . . ] Any method that is well suited to solving such problems we consider to be a reinforcement learning method. 1

1 SB pg. 1-2

Aha!


1. Intro

Genealogy

What is the history of ideas leading to Reinforcement Learning?


Psychology: conditioning, behaviourism, explaining how animals acquire skills and behaviour ↘

RL

↖ Computational approach: formal definition of the problem, AI algorithms for solving it


1. Intro

Machine learning context

[Figure: feedback spectrum, with Unsupervised, Reinforcement (?), and Supervised learning placed along an axis "type of feedback"]

Feedback spectrum: types of learning principles populate a spectrum of feedback types 2

2 Images from Sullivan, https://commons.wikimedia.org/wiki/File:Explosions.jpg, PD, and Benedicto16, https://de.wikipedia.org/wiki/Datei:Fire_exit.svg, CC-BY-SA


1. Intro

Methodology

We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. 3

3 Sutton & Barto: Reinforcement Learning - An Introduction (2nd Ed.), 2017, Unpublished, https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html (SB)


2. Classical RL

Formal definition: Markov Decision Process

MDP: Markov Decision Process, one way to formalize the RL problem, a 5-tuple

MDP := (S, A, T, r, γ)

with states S, actions A, state transition probabilities through an action T, reward r and discount factor γ.

The Markov property allows considering only the current s ∈ S, but puts the burden on the state representation.
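As a concrete illustration (not from the slides), a tiny 3-state chain MDP can be written down directly as this 5-tuple; all names and numbers below are made up for the sketch, and the reward is simplified to an expected reward r(s, a).

# A minimal sketch of the 5-tuple (S, A, T, r, gamma) for a toy 3-state chain MDP.
S = [0, 1, 2]                      # states
A = ["left", "right"]              # actions
gamma = 0.9                        # discount factor

# T[s][a] is a list of (probability, next_state) pairs, i.e. p(s'|s,a)
T = {
    0: {"left": [(1.0, 0)],            "right": [(0.9, 1), (0.1, 0)]},
    1: {"left": [(0.9, 0), (0.1, 1)],  "right": [(0.9, 2), (0.1, 1)]},
    2: {"left": [(0.9, 1), (0.1, 2)],  "right": [(1.0, 2)]},
}

# r[s][a]: expected immediate reward; moving towards state 2 is rewarded
r = {s: {a: (1.0 if any(sn == 2 for _, sn in T[s][a]) else 0.0) for a in A} for s in S}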


2. Classical RL

Solving the MDP

The solution to an MDP is a policy π, like a = π(s), or π(a|s), ideally π∗ denoting the optimal policy.

Idea: Guide the search for a policy with a value function. It describes the "value" (expected return) of a given state s or state-action pair (s, a) with respect to a given policy.

vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s ]

or

qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s, At = a ]


2. Classical RL

Bellman equations

Bellman equation for vπ 4

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

It expresses a relationship between the value of a state and the values of its successor states.

4 SB pg. 63/81, Eq. 3.12


2. Classical RL

Bellman optimality equation

Bellman optimality equations for v and q, for completeness' sake 5

v∗(s) = max_a E[ R_{t+1} + γ v∗(S_{t+1}) | St = s, At = a ]
      = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v∗(s′) ]

q∗(s, a) = E[ R_{t+1} + γ max_{a′} q∗(S_{t+1}, a′) | St = s, At = a ]
        = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q∗(s′, a′) ]

which can be turned into iterative update equations for successive approximations of v or q.

5 SB pg. 79/97, pg. 80/98
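A minimal sketch of such an iterative scheme is value iteration, here assuming the toy dict-based MDP representation from the earlier sketch (transition lists T[s][a] and an expected reward r[s][a] instead of the full p(s′, r|s, a)).

def value_iteration(S, A, T, r, gamma, tol=1e-8):
    """Turn the Bellman optimality equation into an iterative update:
    v(s) <- max_a sum_{s'} p(s'|s,a) [r(s,a) + gamma * v(s')]."""
    v = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a])
                for a in A
            )
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup
        if delta < tol:
            break
    # greedy policy with respect to the converged value function
    pi = {s: max(A, key=lambda a: sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a]))
          for s in S}
    return v, pi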


2. Classical RL

Three main methods

Dynamic Programming (DP)

Temporal Difference Learning (TD)

Monte Carlo Methods (MC)

TD is a "bridge" connecting DP (bootstrapping) and MC(sampling).


2. Classical RL

DP solution

Assumes a perfect model, i.e. knowledge of T.

Fundamental method: Policy iteration 6: Policy evaluation computes vπ; Policy improvement adopts any action improving v which is not in the current policy. Alternative correspondences: Evaluation / Prediction, Improvement / Control.

The variant Value iteration 7 is a modified Policy iteration. These modifications lead to Generalized Policy Iteration (GPI): alternating between Policy evaluation and Policy improvement with different grades of convergence for each.

6 SB pg. 87/105
7 SB pg. 90/108
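A compact sketch of policy iteration under the same assumed MDP representation as before; the inner loop is policy evaluation, the outer loop greedy policy improvement.

def policy_iteration(S, A, T, r, gamma, eval_tol=1e-8):
    """Alternate policy evaluation (compute v_pi) and greedy policy improvement."""
    pi = {s: A[0] for s in S}                      # arbitrary initial policy
    while True:
        # policy evaluation: iterate v(s) <- sum_{s'} p(s'|s,pi(s)) [r + gamma v(s')]
        v = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                vs = sum(p * (r[s][pi[s]] + gamma * v[sn]) for p, sn in T[s][pi[s]])
                delta = max(delta, abs(vs - v[s]))
                v[s] = vs
            if delta < eval_tol:
                break
        # policy improvement: act greedily with respect to v
        stable = True
        for s in S:
            best = max(A, key=lambda a: sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, v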


2. Classical RL

Monte Carlo solution

Monte Carlo: requires no model apart from an "explorer" and uses only experience (sampling).

MC prediction: Sample a full episode; for each occurring state s, accumulate the return from the first occurrence of s into vπ.

Policy improvement same as in DP using GPI.
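A first-visit MC prediction sketch; sample_episode and the (state, reward) episode format are assumptions made for the example, not part of the slides.

from collections import defaultdict

def mc_prediction(sample_episode, policy, gamma, n_episodes=1000):
    """First-visit Monte Carlo prediction of v_pi.
    `sample_episode(policy)` is assumed to return a list of (state, reward) pairs."""
    returns = defaultdict(list)
    v = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(policy)
        # index of the first occurrence of each state in the episode
        first_visit = {s: i for i, (s, _) in reversed(list(enumerate(episode)))}
        G = 0.0
        # walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(episode))):
            s, rew = episode[t]
            G = gamma * G + rew
            if first_visit[s] == t:                 # only the first occurrence of s
                returns[s].append(G)
                v[s] = sum(returns[s]) / len(returns[s])
    return v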


2. Classical RL

Temporal Difference Learning

Combine MC (sampling) and DP (bootstrapping). The simplest method is TD(0) prediction,

V(S) ← V(S) + α [ R + γ V(S′) − V(S) ]

where R + γ V(S′) is the target and the whole bracketed term is the TD error,

and the variants Q-Learning (off-policy control)

Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S′, a) − Q(S, A) ]

and SARSA (on-policy control)

Q(S, A) ← Q(S, A) + α [ R + γ Q(S′, A′) − Q(S, A) ]

Still using GPI.
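The tabular Q-Learning and SARSA updates can be written down almost literally; the helper names (epsilon_greedy, q_learning_step, sarsa_step) and the hyperparameter values are invented for this sketch.

from collections import defaultdict
import random

Q = defaultdict(float)          # tabular action values Q[(s, a)]
alpha, gamma, eps = 0.1, 0.9, 0.1

def epsilon_greedy(s, actions):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_step(s, a, r, s_next, actions):
    # off-policy: bootstrap from the greedy action in s_next
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_step(s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action actually taken in s_next
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])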


2. Classical RL

Temporal Difference Learning example

Learned v and q functions using vanilla tables, using rl.py from 8

[Figure, three panels: Agent 0 state (position on grid); Agent 0 state value v(s); Agent 0 state-action value q(s,a) for actions nop, w, nw, n, ne, e, se, s, sw. Q_Q: min = 0.000000, max = 1.093734; Q_SARSA: min = 0.000000, max = 1.083789.]

TD(0) learning of v and q, 1 agent, using td_0_on_policy_control

8 https://github.com/x75/playground


2. Classical RL

Temporal Difference Learning example

Same settings, different goal state; v and q functions using a Multilayer Perceptron approximation.

[Figure, three panels: Agent 0 state (position on grid); Agent 0 state value v(s); Agent 0 state-action value q(s,a) for actions nop, w, nw, n, ne, e, se, s, sw. Q_Q: min = -0.254582, max = 1.412525; Q_SARSA: min = -0.229976, max = 1.408011.]

TD(0) learning of v and q, 1 agent, using td_0_on_policy_control


2. Classical RL

Continuous state-action spaces

From discrete to continuous spaces: tables can be replaced with learnable Function Approximators (FA) like Neural Networks (Connectionist RL), Gaussian Processes (GP), etc. In this case, the update equations from one slide ago have to be changed to comply with the training paradigm, e.g. for a Q approximator using the SARSA update:

X = (S_{t−1}, A_{t−1})   (input)
y = R + γ FA((S_t, A_t))   (target)
FA(X) → y   (update FA(X) towards y)
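A minimal sketch of this FA-based update with a linear approximator over state-action features; the class, the feature interface phi(s, a), and the hyperparameters are assumptions for the example.

import numpy as np

class LinearQ:
    """Q approximated by a linear model over a state-action feature vector,
    trained with the SARSA-style target y = R + gamma * FA((S_t, A_t))."""
    def __init__(self, n_features, alpha=0.01, gamma=0.9):
        self.w = np.zeros(n_features)
        self.alpha, self.gamma = alpha, gamma

    def predict(self, phi):                       # phi = features of (s, a)
        return float(self.w @ phi)

    def update(self, phi_prev, reward, phi_next):
        # X = (S_{t-1}, A_{t-1}), y = R + gamma * FA((S_t, A_t))
        y = reward + self.gamma * self.predict(phi_next)
        error = y - self.predict(phi_prev)
        self.w += self.alpha * error * phi_prev   # one gradient step towards y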


2. Classical RL

Continuous state-action spaces

Policy search: Use a direct encoding of a policy, e.g. parameterized functions of the state, a = π(s, θ). Evolutionary algorithms can be seen as policy search.

Actor-Critic methods: On-policy methods; the actor proposes an action (= policy), the critic evaluates the action and its outcome (value). The actor is then changed using the critic's feedback.

Policy gradient: Sample episodes controlled by the current policy while putting noise on the parameters or on the output. Estimate the gradient of the loss with respect to the noise dimensions and update the policy using that gradient.
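A rough sketch of policy gradient via parameter noise (finite-difference / evolution-strategies flavour); run_episode, a flat 1-D parameter vector theta, and the Gaussian perturbation scheme are assumptions of the example, not a specific published algorithm.

import numpy as np

def parameter_noise_policy_gradient(run_episode, theta, sigma=0.1, alpha=0.01, n_rollouts=20):
    """Perturb the parameters theta with Gaussian noise, run one episode per
    perturbation (`run_episode(theta) -> return`), and regress the obtained
    returns onto the noise directions to estimate a gradient."""
    noises, returns = [], []
    for _ in range(n_rollouts):
        eps = sigma * np.random.randn(*theta.shape)
        noises.append(eps)
        returns.append(run_episode(theta + eps))
    noises, returns = np.array(noises), np.array(returns)
    # score each noise direction by its baseline-subtracted return
    baseline = returns.mean()
    grad = ((returns - baseline)[:, None] * noises).mean(axis=0) / sigma**2
    return theta + alpha * grad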


2. Classical RL

Dangling bits

Episodic vs. non-episodic: How many lives?

On-policy vs. off-policy: Change behaviour immediately, or observe for a while and change later.

Exploration / Exploitation: Are the known options good? Any options we haven't tried yet?

Temporal Credit Assignment: Which part of the learner at what time is to blame for good or bad consequences?

Eligibility traces: Extend the temporal horizon over which action/reward correlations are observed in a continuous manner. Usually parameterized by λ, ranging from λ = 0 (TD(0)) to λ = 1 (MC); see the sketch below.
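A sketch of TD(λ) prediction with accumulating eligibility traces, assuming episodes given as (s, r, s′) transitions and a defaultdict value table; names and defaults are illustrative.

from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.9):
    """TD(lambda) prediction with accumulating eligibility traces.
    `episode`: list of (s, r, s_next) transitions; V: defaultdict(float) of
    state values, updated in place. lam = 0 recovers TD(0), lam = 1 approaches MC."""
    e = defaultdict(float)                      # eligibility trace per state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]    # TD error
        e[s] += 1.0                             # accumulate trace for the visited state
        for x in list(e):
            V[x] += alpha * delta * e[x]        # credit recently visited states
            e[x] *= gamma * lam                 # decay all traces
    return V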


2. Classical RL

Current trends

Many recent results from combining deep learning and reinforcement learning.

Some buzzwords are Neural Fitted Q-Learning, Deep Q-Networks, deep learning of dynamics models, end-to-end reinforcement learning, From pixels to torques, Atari (Pong), OpenAI Gym and Universe, TorchCraft, Malmo, AlphaGo, . . .

Example: Learning to play Pong with visual input (OpenAI Gym) using Policy gradient.


2. Classical RL

Example: Policy Gradient with the Signed Derivative

Learning obstacle avoidance with the Signed Derivative Policy Gradient method, DA Nestler 2016. Learn a linear combination of visual input features to control speed and turning without hitting obstacles.

[Figure: (a) Optical flow input, (b) Left/right input, (c) Left/right input, (d) Turtlebot]


2. Classical RL

Example: Policy Gradient with Motion Primitives

Playing table tennis with a robot arm 9 from J. Peters' lab at TU Darmstadt.

9 Mülling and others, 2013, Learning to Select and Generalize Striking Movements in Robot Table Tennis, http://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kupcsik_AAAI_2013.pdf, Videos: https://www.youtube.com/watch?v=SH3bADiB7uQ, https://www.youtube.com/watch?v=BcJ4S4L1n78


2. Classical RL

Example: Policy Gradient with Motion Primitives

[Figure: (a) Robot setup]


2. Classical RL

Examples: more of them (incomplete)

I Tak Kit Lau: Prisca: A Policy Search Method for Extreme Trajectory Following
I J. Zico Kolter: Policy Search via the Signed Derivative (PGSD)
I Lau & Liu: Stunt Driving via Policy Search
I Lau & Liu: Learning Hover with Scarce Samples
I Tedrake: Learning to Fly like a Bird
I Michels, Saxena, Ng: High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning
I Rottmann et al.: Autonomous Blimp Control using Model-free Reinforcement Learning
I Riedmiller and others: The Neuro Slot Car Racer: Reinforcement Learning in a Real World Setting; Learn to Swing Up and Balance a Real Pole Based on Raw Visual Input Data; Autonomous reinforcement learning on raw visual input data in a real world application


3. Bio-plausible RL

Overview of different methods and their relationship

[Figure: Overview of different methods, spanning Machine Learning, Classical Conditioning and Synaptic Plasticity. Left branch, evaluative feedback (rewards), example-based Reinforcement Learning: Dynamic Programming (Bellman Eq.), Monte Carlo Control, Q-Learning, SARSA, TD(λ) (often λ = 0), TD(1), TD(0), Actor/Critic (technical & Basal Ganglia), neural TD models ("Critic"), neural TD formalism, Rescorla/Wagner, neuronal reward systems (Basal Ganglia), biophysics of synaptic plasticity (Dopamine, Glutamate). Right branch, non-evaluative feedback (correlations), correlation-based Unsupervised Learning: δ-rule (supervised L.), Hebb rule, differential Hebb rule ("fast" / "slow"), ISO-Learning, ISO-Model of STDP, ISO-Control, correlation-based control (non-evaluative), STDP models (biophysical & network), STDP, LTP (LTD = anti). Eligibility traces link the two branches; the left side concerns anticipatory control of actions and prediction of values, the right side correlation of signals.]

Borrowed from F. Wörgötter, Lecture Slides Learning and Adaptive Algorithms 2014, with permission.


3. Bio-plausible RL

Neuromodulation and the reward system

Reward system: Cortico-basal ganglia-thalamic loop

Neuron groups related to Dopamine, Glutamate, and GABA, which act as modulators on other groups of neurons.

Temporal difference reward prediction error: unpredicted reward leads to activation, fully predicted reward to no response, omission of predicted reward to depression of activity.


3. Bio-plausible RL

Conditioning

Rescorla-Wagner model of classical conditioning: association between conditioned (CS) and unconditioned (US) stimulus. Looks Hebbian.

ΔV_X^{n+1} = α_X · β · (λ − V_tot)

with ΔV_X^{n+1} the weight change, α_X relating to the CS (pre), β to the US (post), and (λ − V_tot) the stabilisation / modulation term.

. . . the dopamine response seems to convey the crucial learning term (λ − V) . . . 10

10 W. Schultz, http://www.scholarpedia.org/article/Reward_signals, 2017-01
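A minimal sketch of a single Rescorla-Wagner trial update, assuming a dict of associative strengths per CS and per-CS salience α; the function and variable names are illustrative.

def rescorla_wagner_trial(V, present, reward, alpha, beta=1.0, lam=1.0):
    """One trial of the Rescorla-Wagner model.
    V: dict of associative strengths per CS (updated in place);
    present: CSs present on this trial; alpha: per-CS salience dict.
    The shared prediction error (lam - V_tot) uses lam only if the US (reward) occurs."""
    V_tot = sum(V[cs] for cs in present)            # combined prediction of all present CSs
    target = lam if reward else 0.0
    error = target - V_tot                          # the crucial learning term (lambda - V)
    for cs in present:
        V[cs] += alpha[cs] * beta * error           # Delta V_X = alpha_X * beta * (lambda - V_tot)
    return V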


3. Bio-plausible RL

Reward modulation

This modulatory input can be modelled as an additional multiplicative term in a Hebbian update rule.

dW = η · x · y · r · …

with x the presynaptic activity (pre), y the postsynaptic activity (post), r the reward, and possibly further factors.

A binary reward ∈ {0, 1} leads to greedy updates towards "improvements", depending on how the reward is computed.

Learning is faster with a dense reward. Learning a value function with FA can be used to obtain a continuous reward prediction from an initially sparse primary reward signal.
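A sketch of such a three-factor (reward-modulated) Hebbian update for a weight matrix with pre- and postsynaptic activity vectors; the learning rate and the outer-product form are assumptions of the example.

import numpy as np

def reward_modulated_hebb(W, x, y, reward, eta=1e-3):
    """Three-factor Hebbian update: dW = eta * pre * post * reward.
    x: presynaptic activity (vector), y: postsynaptic activity (vector),
    reward: scalar modulatory signal (binary, or a continuous value / reward prediction).
    W has shape (len(y), len(x))."""
    dW = eta * reward * np.outer(y, x)     # outer product: pre * post for every synapse
    return W + dW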


3. Bio-plausible RL

Reward modulation example

Exploratory Hebbian (EH) Learning, e.g. for acquiring behaviour on the Sphero robot 11. A prior value function provides a continuous reward. Fast bootstrapping of dynamic control.

[Figure: Block diagram of the closed loop. A Reservoir with readouts y1, y2 (perturbed by exploration noise ν1, ν2) produces commands Cmd vx, Cmd vy for the Sphero; the sensors (vx, vy) feed back as input u and are compared against a target, yielding errors ex, ey / reward r that drive the learning rule ΔW_out.]

11 Berthold & Hafner, 2015, Closed-loop acquisition of behaviour on the Sphero robot, https://mitpress.mit.edu/sites/default/files/titles/content/ecal2015/ch084.html, https://www.youtube.com/watch?v=NepvHrF7AfU


4. Summary

Summary

Classical RL is complete but at the cost of exhaustive exploration.

Accelerate learning with approximators (NFQ, DMPs, EH, . . . ) and on-policy learning, at the cost of suboptimality.

Explore sampling with exploration functions that observe the state of the learner (confidence, error behaviour, information-theoretic measures).

Priors and coarse estimates for value functions as initializations?

Are Temporal Difference / Prediction Learning necessary prerequisites for adaptive behaviour? "Prediction scope" as a measure of an agent's complexity?


5. References

Further reading

I Sutton & Barto, Reinforcement Learning - An Introduction (2nd Ed.), 2017, Unpublished, https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html
I Wörgötter, Lecture Slides: Learning and Adaptive Algorithms (BCCN Göttingen / Dept. Comp. Neurosci.)
I Dayan & Abbott, Theoretical Neuroscience, 2000
I Yael Niv, Reinforcement learning in the brain, JMathPsy, 2009
I Hasselt, Insights in Reinforcement Learning, 2010, PhD Thesis
I Riedmiller, 10 Steps and Some Tricks To Set Up Neural Reinforcement Controllers, 2011?
I Deisenroth & others, A Survey on Policy Search for Robotics, 2011
I Kober & others, Reinforcement Learning in Robotics: A Survey, 2012
I Nguyen-Tuong & Peters, Model Learning for Robot Control: A Survey, 2010


5. References

Software environments

Some packages that can serve as starting points for exploring closed-loop learning

I pybrain: http://pybrain.org/
I openai: gym: https://gym.openai.com/, and universe: https://universe.openai.com/
I Modular Data Processing Toolkit (steaming hot) with Online and RL Nodes: https://github.com/varunrajk/mdp-toolkit
I explauto: https://github.com/flowersteam/explauto
