
Reinforcement Learning – Embodied AI VL WS16/17 Supplement

Oswald Berthold

Adaptive Systems Group, Dept. of Computer Science, Humboldt-Universität zu Berlin

February 6, 2017


Outline

Intro

Classical RL

Bio-plausible RL

Summary

References


1. Intro

Basics

Reinforcement Learning (German: Verstärkungslernen)

This means

There is something (a signal) that reinforces

There is something (structure) which is reinforced


1. Intro

Plain language definition

Learn from, and while interacting with, the environment. The reinforcing signal comes from this interaction.

Amplify the correlation between any action and its causes (states) when it yields favourable consequences. The associative link between states and actions is reinforced.

Assertion: Any algorithm that "solves" this is an RL algorithm. Question: What do we accept as a solution? This is where RL and DR disagree?


1. Intro

Quote: Any algorithm . . .

Reinforcement learning, [. . . ], is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods. Reinforcement learning problems involve learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. In an essential way these are closed-loop problems because the learning system's actions influence its later inputs. Moreover, the learner is not told which actions to take, as in many forms of machine learning, but instead must discover which actions yield the most reward by trying them out. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These three characteristics — being closed-loop in an essential way, not having direct instructions as to what actions to take, and where the consequences of actions, including reward signals, play out over extended time periods — are the three most important distinguishing features of the reinforcement learning problem. [. . . ] Any method that is well suited to solving such problems we consider to be a reinforcement learning method. 1

1 SB pg. 1-2

Aha!


1. Intro

Genealogy

What is the history of ideas leading to Reinforcement Learning?


Psychology: conditioning, behaviourism, explaining how animals acquire skills and behaviour ↘

RL

↖ Computational approach: formal definition of the problem, AI algorithms for solving it


1. Intro

Machine learning context

[Figure: feedback spectrum, with Unsupervised, Reinforcement (?), and Supervised learning placed along an axis "type of feedback"]

Feedback spectrum: types of learning principles populate a spectrum of feedback types 2

2 Images from Sullivan, https://commons.wikimedia.org/wiki/File:Explosions.jpg, PD, and Benedicto16, https://de.wikipedia.org/wiki/Datei:Fire_exit.svg, CC-BY-SA


1. Intro

Methodology

We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. 3

3 Sutton & Barto: Reinforcement Learning - An Introduction (2nd Ed.), 2017, Unpublished, https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html (SB)


2. Classical RL

Formal definition: Markov Decision Process

MDP: Markov Decision Process, one way to formalize the RL problem, a 5-tuple

MDP := (S, A, T, r, γ)

with states S, actions A, state transition probabilities through an action T, reward r and discount factor γ.

The Markov property allows considering only the current s ∈ S, but puts the burden on the state representation.
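As a concrete illustration (not from the slides), a tiny 3-state chain MDP can be written down directly as this 5-tuple; all names and numbers below are made up for the sketch, and the reward is simplified to an expected reward r(s, a).

# A minimal sketch of the 5-tuple (S, A, T, r, gamma) for a toy 3-state chain MDP.
S = [0, 1, 2]                      # states
A = ["left", "right"]              # actions
gamma = 0.9                        # discount factor

# T[s][a] is a list of (probability, next_state) pairs, i.e. p(s'|s,a)
T = {
    0: {"left": [(1.0, 0)],            "right": [(0.9, 1), (0.1, 0)]},
    1: {"left": [(0.9, 0), (0.1, 1)],  "right": [(0.9, 2), (0.1, 1)]},
    2: {"left": [(0.9, 1), (0.1, 2)],  "right": [(1.0, 2)]},
}

# r[s][a]: expected immediate reward; moving towards state 2 is rewarded
r = {s: {a: (1.0 if any(sn == 2 for _, sn in T[s][a]) else 0.0) for a in A} for s in S}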


2. Classical RL

Solving the MDP

The solution to an MDP is a policy π, like a = π(s), or π(a|s), ideally π∗ denoting the optimal policy.

Idea: Guide the search for a policy with a value function. It describes the "value" (expected return) of a given state s or state-action pair (s, a) with respect to a given policy.

vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s ]

or

qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s, At = a ]


2. Classical RL

Bellman equations

Bellman equation for vπ 4

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

It expresses a relationship between the value of a state and the values of its successor states.

4 SB pg. 63/81, Eq. 3.12


2. Classical RL

Bellman optimality equation

Bellman optimality equations for v and q, for completeness' sake 5

v∗(s) = max_a E[ R_{t+1} + γ v∗(S_{t+1}) | St = s, At = a ]
      = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v∗(s′) ]

q∗(s, a) = E[ R_{t+1} + γ max_{a′} q∗(S_{t+1}, a′) | St = s, At = a ]
        = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q∗(s′, a′) ]

which can be turned into iterative update equations for successive approximations of v or q.

5 SB pg. 79/97, pg. 80/98
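A minimal sketch of such an iterative scheme is value iteration, here assuming the toy dict-based MDP representation from the earlier sketch (transition lists T[s][a] and an expected reward r[s][a] instead of the full p(s′, r|s, a)).

def value_iteration(S, A, T, r, gamma, tol=1e-8):
    """Turn the Bellman optimality equation into an iterative update:
    v(s) <- max_a sum_{s'} p(s'|s,a) [r(s,a) + gamma * v(s')]."""
    v = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a])
                for a in A
            )
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup
        if delta < tol:
            break
    # greedy policy with respect to the converged value function
    pi = {s: max(A, key=lambda a: sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a]))
          for s in S}
    return v, pi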


2. Classical RL

Three main methods

Dynamic Programming (DP)

Temporal Difference Learning (TD)

Monte Carlo Methods (MC)

TD is a "bridge" connecting DP (bootstrapping) and MC(sampling).


2. Classical RL

DP solution

Assumes a perfect model, i.e. knowledge of T.

Fundamental method: Policy iteration 6: Policy evaluation computes vπ; Policy improvement adopts any action improving v which is not in the current policy. Alternative correspondences: Evaluation / Prediction, Improvement / Control.

The variant Value iteration 7 is a modified Policy iteration. These modifications lead to Generalized Policy Iteration (GPI): alternating between Policy evaluation and Policy improvement with different grades of convergence for each.

6 SB pg. 87/105
7 SB pg. 90/108
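A compact sketch of policy iteration under the same assumed MDP representation as before; the inner loop is policy evaluation, the outer loop greedy policy improvement.

def policy_iteration(S, A, T, r, gamma, eval_tol=1e-8):
    """Alternate policy evaluation (compute v_pi) and greedy policy improvement."""
    pi = {s: A[0] for s in S}                      # arbitrary initial policy
    while True:
        # policy evaluation: iterate v(s) <- sum_{s'} p(s'|s,pi(s)) [r + gamma v(s')]
        v = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                vs = sum(p * (r[s][pi[s]] + gamma * v[sn]) for p, sn in T[s][pi[s]])
                delta = max(delta, abs(vs - v[s]))
                v[s] = vs
            if delta < eval_tol:
                break
        # policy improvement: act greedily with respect to v
        stable = True
        for s in S:
            best = max(A, key=lambda a: sum(p * (r[s][a] + gamma * v[sn]) for p, sn in T[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, v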


2. Classical RL

Monte Carlo solution

Monte Carlo: requires no model apart from an "explorer" and uses only experience (sampling).

MC prediction: Sample a full episode; for each occurring state s, accumulate the return from the first occurrence of s into vπ.

Policy improvement same as in DP using GPI.
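A first-visit MC prediction sketch; sample_episode and the (state, reward) episode format are assumptions made for the example, not part of the slides.

from collections import defaultdict

def mc_prediction(sample_episode, policy, gamma, n_episodes=1000):
    """First-visit Monte Carlo prediction of v_pi.
    `sample_episode(policy)` is assumed to return a list of (state, reward) pairs."""
    returns = defaultdict(list)
    v = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(policy)
        # index of the first occurrence of each state in the episode
        first_visit = {s: i for i, (s, _) in reversed(list(enumerate(episode)))}
        G = 0.0
        # walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(episode))):
            s, rew = episode[t]
            G = gamma * G + rew
            if first_visit[s] == t:                 # only the first occurrence of s
                returns[s].append(G)
                v[s] = sum(returns[s]) / len(returns[s])
    return v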


2. Classical RL

Temporal Difference Learning

Combine MC (sampling) and DP (bootstrapping). The simplest method is TD(0) prediction,

V(S) ← V(S) + α [ R + γ V(S′) − V(S) ]

where R + γ V(S′) is the target and the whole bracketed term is the TD error,

and the variants Q-Learning (off-policy control)

Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S′, a) − Q(S, A) ]

and SARSA (on-policy control)

Q(S, A) ← Q(S, A) + α [ R + γ Q(S′, A′) − Q(S, A) ]

Still using GPI.
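The tabular Q-Learning and SARSA updates can be written down almost literally; the helper names (epsilon_greedy, q_learning_step, sarsa_step) and the hyperparameter values are invented for this sketch.

from collections import defaultdict
import random

Q = defaultdict(float)          # tabular action values Q[(s, a)]
alpha, gamma, eps = 0.1, 0.9, 0.1

def epsilon_greedy(s, actions):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_step(s, a, r, s_next, actions):
    # off-policy: bootstrap from the greedy action in s_next
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_step(s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action actually taken in s_next
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])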


2. Classical RL

Temporal Difference Learning example

Learned v and q functions using vanilla tables, using rl.py from 8

[Figure, three panels: Agent 0 state (position on grid); Agent 0 state value v(s); Agent 0 state-action value q(s,a) for actions nop, w, nw, n, ne, e, se, s, sw. Q_Q: min = 0.000000, max = 1.093734; Q_SARSA: min = 0.000000, max = 1.083789.]

TD(0) learning of v and q, 1 agent, using td_0_on_policy_control

8 https://github.com/x75/playground


2. Classical RL

Temporal Difference Learning example

Same settings, different goal state; v and q functions using a Multilayer Perceptron approximation.

[Figure, three panels: Agent 0 state (position on grid); Agent 0 state value v(s); Agent 0 state-action value q(s,a) for actions nop, w, nw, n, ne, e, se, s, sw. Q_Q: min = -0.254582, max = 1.412525; Q_SARSA: min = -0.229976, max = 1.408011.]

TD(0) learning of v and q, 1 agent, using td_0_on_policy_control


2. Classical RL

Continuous state-action spaces

From discrete to continuous spaces: tables can be replaced with learnable Function Approximators (FA) like Neural Networks (Connectionist RL), Gaussian Processes (GP), etc. In this case, the update equations from one slide ago have to be changed to comply with the training paradigm, e.g. for a Q approximator using the SARSA update:

X = (S_{t−1}, A_{t−1})   (input)
y = R + γ FA((S_t, A_t))   (target)
FA(X) → y   (update FA(X) towards y)
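A minimal sketch of this FA-based update with a linear approximator over state-action features; the class, the feature interface phi(s, a), and the hyperparameters are assumptions for the example.

import numpy as np

class LinearQ:
    """Q approximated by a linear model over a state-action feature vector,
    trained with the SARSA-style target y = R + gamma * FA((S_t, A_t))."""
    def __init__(self, n_features, alpha=0.01, gamma=0.9):
        self.w = np.zeros(n_features)
        self.alpha, self.gamma = alpha, gamma

    def predict(self, phi):                       # phi = features of (s, a)
        return float(self.w @ phi)

    def update(self, phi_prev, reward, phi_next):
        # X = (S_{t-1}, A_{t-1}), y = R + gamma * FA((S_t, A_t))
        y = reward + self.gamma * self.predict(phi_next)
        error = y - self.predict(phi_prev)
        self.w += self.alpha * error * phi_prev   # one gradient step towards y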


2. Classical RL

Continuous state-action spaces

Policy search: Use a direct encoding of a policy, e.g. parameterized functions of the state, a = π(s, θ). Evolutionary algorithms can be seen as policy search.

Actor-Critic methods: On-policy methods; the actor proposes an action (= policy), the critic evaluates the action and its outcome (value). The actor is then changed using the critic's feedback.

Policy gradient: Sample episodes controlled by the current policy while putting noise on the parameters or on the output. Estimate the gradient of the loss with respect to the noise dimensions and update the policy using that gradient.
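A rough sketch of policy gradient via parameter noise (finite-difference / evolution-strategies flavour); run_episode, a flat 1-D parameter vector theta, and the Gaussian perturbation scheme are assumptions of the example, not a specific published algorithm.

import numpy as np

def parameter_noise_policy_gradient(run_episode, theta, sigma=0.1, alpha=0.01, n_rollouts=20):
    """Perturb the parameters theta with Gaussian noise, run one episode per
    perturbation (`run_episode(theta) -> return`), and regress the obtained
    returns onto the noise directions to estimate a gradient."""
    noises, returns = [], []
    for _ in range(n_rollouts):
        eps = sigma * np.random.randn(*theta.shape)
        noises.append(eps)
        returns.append(run_episode(theta + eps))
    noises, returns = np.array(noises), np.array(returns)
    # score each noise direction by its baseline-subtracted return
    baseline = returns.mean()
    grad = ((returns - baseline)[:, None] * noises).mean(axis=0) / sigma**2
    return theta + alpha * grad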


2. Classical RL

Dangling bits

Episodic vs. non-episodic: How many lives?

On-policy vs. off-policy: Change behaviour immediately, or observe for a while and change later.

Exploration / Exploitation: Are the known options good? Any options we haven't tried yet?

Temporal Credit Assignment: Which part of the learner at what time is to blame for good or bad consequences?

Eligibility traces: Extend the temporal horizon over which action/reward correlations are observed in a continuous manner. Usually parameterized by λ, ranging from λ = 0 (TD(0)) to λ = 1 (MC); see the sketch below.
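A sketch of TD(λ) prediction with accumulating eligibility traces, assuming episodes given as (s, r, s′) transitions and a defaultdict value table; names and defaults are illustrative.

from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.9):
    """TD(lambda) prediction with accumulating eligibility traces.
    `episode`: list of (s, r, s_next) transitions; V: defaultdict(float) of
    state values, updated in place. lam = 0 recovers TD(0), lam = 1 approaches MC."""
    e = defaultdict(float)                      # eligibility trace per state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]    # TD error
        e[s] += 1.0                             # accumulate trace for the visited state
        for x in list(e):
            V[x] += alpha * delta * e[x]        # credit recently visited states
            e[x] *= gamma * lam                 # decay all traces
    return V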


2. Classical RL

Current trends

Many recent results from combining deep learning and reinforcement learning.

Some buzzwords are Neural Fitted Q-Learning, Deep Q-Networks, deep learning of dynamics models, end-to-end reinforcement learning, From pixels to torques, Atari (Pong), OpenAI Gym and Universe, TorchCraft, Malmo, AlphaGo, . . .

Example: Learning to play Pong with visual input (OpenAI Gym) using Policy gradient.


2. Classical RL

Example: Policy Gradient with the Signed Derivative

Learning obstacle avoidance with the Signed Derivative Policy Gradient method, DA Nestler 2016. Learn a linear combination of visual input features to control speed and turning without hitting obstacles.

[Figure: (a) Optical flow input, (b) Left/right input, (c) Left/right input, (d) Turtlebot]


2. Classical RL

Example: Policy Gradient with Motion Primitives

Playing table tennis with a robot arm 9 from J. Peters' lab at TU Darmstadt.

9 Mülling and others, 2013, Learning to Select and Generalize Striking Movements in Robot Table Tennis, http://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kupcsik_AAAI_2013.pdf, Videos: https://www.youtube.com/watch?v=SH3bADiB7uQ, https://www.youtube.com/watch?v=BcJ4S4L1n78


2. Classical RL

Example: Policy Gradient with Motion Primitives

[Figure: (a) Robot setup]


2. Classical RL

Examples: more of them (incomplete)

I Tak Kit Lau: Prisca: A Policy Search Method for Extreme Trajectory Following
I J. Zico Kolter: Policy Search via the Signed Derivative (PGSD)
I Lau & Liu: Stunt Driving via Policy Search
I Lau & Liu: Learning Hover with Scarce Samples
I Tedrake: Learning to Fly like a Bird
I Michels, Saxena, Ng: High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning
I Rottmann et al.: Autonomous Blimp Control using Model-free Reinforcement Learning
I Riedmiller and others: The Neuro Slot Car Racer: Reinforcement Learning in a Real World Setting; Learn to Swing Up and Balance a Real Pole Based on Raw Visual Input Data; Autonomous reinforcement learning on raw visual input data in a real world application


3. Bio-plausible RL

Overview of different methods and their relationship

[Figure: Overview of different methods, spanning Machine Learning, Classical Conditioning and Synaptic Plasticity. Left branch, evaluative feedback (rewards), example-based Reinforcement Learning: Dynamic Programming (Bellman Eq.), Monte Carlo Control, Q-Learning, SARSA, TD(λ) (often λ = 0), TD(1), TD(0), Actor/Critic (technical & Basal Ganglia), neural TD models ("Critic"), neural TD formalism, Rescorla/Wagner, neuronal reward systems (Basal Ganglia), biophysics of synaptic plasticity (Dopamine, Glutamate). Right branch, non-evaluative feedback (correlations), correlation-based Unsupervised Learning: δ-rule (supervised L.), Hebb rule, differential Hebb rule ("fast" / "slow"), ISO-Learning, ISO-Model of STDP, ISO-Control, correlation-based control (non-evaluative), STDP models (biophysical & network), STDP, LTP (LTD = anti). Eligibility traces link the two branches; the left side concerns anticipatory control of actions and prediction of values, the right side correlation of signals.]

Borrowed from F. Wörgötter, Lecture Slides Learning and Adaptive Algorithms 2014, with permission.


3. Bio-plausible RL

Neuromodulation and the reward system

Reward system: Cortico-basal ganglia-thalamic loop

Neuron groups related to Dopamine, Glutamate, and GABA, which act as modulators on other groups of neurons.

Temporal difference reward prediction error: unpredicted reward leads to activation, fully predicted reward to no response, omission of predicted reward to depression of activity.


3. Bio-plausible RL

Conditioning

Rescorla-Wagner model of classical conditioning: association between conditioned (CS) and unconditioned (US) stimulus. Looks Hebbian.

ΔV_X^{n+1} = α_X · β · (λ − V_tot)

with ΔV_X^{n+1} the weight change, α_X relating to the CS (pre), β to the US (post), and (λ − V_tot) the stabilisation / modulation term.

. . . the dopamine response seems to convey the crucial learning term (λ − V) . . . 10

10 W. Schultz, http://www.scholarpedia.org/article/Reward_signals, 2017-01
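A minimal sketch of a single Rescorla-Wagner trial update, assuming a dict of associative strengths per CS and per-CS salience α; the function and variable names are illustrative.

def rescorla_wagner_trial(V, present, reward, alpha, beta=1.0, lam=1.0):
    """One trial of the Rescorla-Wagner model.
    V: dict of associative strengths per CS (updated in place);
    present: CSs present on this trial; alpha: per-CS salience dict.
    The shared prediction error (lam - V_tot) uses lam only if the US (reward) occurs."""
    V_tot = sum(V[cs] for cs in present)            # combined prediction of all present CSs
    target = lam if reward else 0.0
    error = target - V_tot                          # the crucial learning term (lambda - V)
    for cs in present:
        V[cs] += alpha[cs] * beta * error           # Delta V_X = alpha_X * beta * (lambda - V_tot)
    return V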


3. Bio-plausible RL

Reward modulation

This modulatory input can be modelled as an additional multiplicative term in a Hebbian update rule.

dW = η · x · y · r · …

with x the presynaptic activity (pre), y the postsynaptic activity (post), r the reward, and possibly further factors.

A binary reward ∈ {0, 1} leads to greedy updates towards "improvements", depending on how the reward is computed.

Learning is faster with a dense reward. Learning a value function with FA can be used to obtain a continuous reward prediction from an initially sparse primary reward signal.
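A sketch of such a three-factor (reward-modulated) Hebbian update for a weight matrix with pre- and postsynaptic activity vectors; the learning rate and the outer-product form are assumptions of the example.

import numpy as np

def reward_modulated_hebb(W, x, y, reward, eta=1e-3):
    """Three-factor Hebbian update: dW = eta * pre * post * reward.
    x: presynaptic activity (vector), y: postsynaptic activity (vector),
    reward: scalar modulatory signal (binary, or a continuous value / reward prediction).
    W has shape (len(y), len(x))."""
    dW = eta * reward * np.outer(y, x)     # outer product: pre * post for every synapse
    return W + dW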


3. Bio-plausible RL

Reward modulation example

Exploratory Hebbian (EH) Learning, e.g. for acquiring behaviour on the Sphero robot 11. A prior value function provides a continuous reward. Fast bootstrapping of dynamic control.

[Figure: Block diagram of the closed loop. A Reservoir with readouts y1, y2 (perturbed by exploration noise ν1, ν2) produces commands Cmd vx, Cmd vy for the Sphero; the sensors (vx, vy) feed back as input u and are compared against a target, yielding errors ex, ey / reward r that drive the learning rule ΔW_out.]

11 Berthold & Hafner, 2015, Closed-loop acquisition of behaviour on the Sphero robot, https://mitpress.mit.edu/sites/default/files/titles/content/ecal2015/ch084.html, https://www.youtube.com/watch?v=NepvHrF7AfU


4. Summary

Summary

Classical RL is complete but at the cost of exhaustive exploration.

Accelerate learning with approximators (NFQ, DMPs, EH, . . . ) and on-policy learning, at the cost of suboptimality.

Explore sampling with exploration functions that observe the state of the learner (confidence, error behaviour, information-theoretic measures).

Priors and coarse estimates for value functions as initializations?

Are Temporal Difference / Prediction Learning necessary prerequisites for adaptive behaviour? "Prediction scope" as a measure of an agent's complexity?


5. References

Further reading

I Sutton & Barto, Reinforcement Learning - An Introduction (2nd Ed.), 2017, Unpublished, https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html
I Wörgötter, Lecture Slides: Learning and Adaptive Algorithms (BCCN Göttingen / Dept. Comp. Neurosci.)
I Dayan & Abbott, Theoretical Neuroscience, 2000
I Yael Niv, Reinforcement learning in the brain, JMathPsy, 2009
I Hasselt, Insights in Reinforcement Learning, 2010, PhD Thesis
I Riedmiller, 10 Steps and Some Tricks To Set Up Neural Reinforcement Controllers, 2011?
I Deisenroth & others, A Survey on Policy Search for Robotics, 2011
I Kober & others, Reinforcement Learning in Robotics: A Survey, 2012
I Nguyen-Tuong & Peters, Model Learning for Robot Control: A Survey, 2010


5. References

Software environments

Some packages that can serve as starting points for exploring closed-loop learning

I pybrain: http://pybrain.org/
I openai: gym: https://gym.openai.com/, and universe: https://universe.openai.com/
I Modular Data Processing Toolkit (steaming hot) with Online and RL Nodes: https://github.com/varunrajk/mdp-toolkit
I explauto: https://github.com/flowersteam/explauto
