
Apprentissage par Renforcement

Reinforcement Learning

Kenji Doya (doya@atr.co.jp)

ATR Human Information Science Laboratories
CREST, Japan Science and Technology Corporation

Outline

Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics

RL in Continuous Space and Time: model-free and model-based approaches

Learning to Stand Up: discrete plans and continuous control

Modular Decomposition: multiple model-based RL (MMRL)

Learning to Walk (Doya & Nakano, 1985)

Action: cycle of 4 postures
Reward: speed sensor output

Multiple solutions: creeping, jumping,…


Markov Decision Process (MDP)

Environment: dynamics P(s'|s,a); reward P(r|s,a)

Agent: policy P(a|s)

Goal: maximize the cumulative future reward E[ r(t+1) + γ r(t+2) + γ² r(t+3) + … ], where 0 ≤ γ ≤ 1 is the discount factor

[Diagram: agent-environment loop; the agent sends action a to the environment and receives state s and reward r.]

Value Function and TD error

State value function: V(s) = E[ r(t+1) + γ r(t+2) + γ² r(t+3) + … | s(t)=s, P(a|s) ]

0 ≤ γ ≤ 1: discount factor

Consistency condition: δ(t) = r(t) + γ V(s(t)) − V(s(t−1)) = 0  (new estimate minus old estimate)

Dual role of the temporal difference (TD) error δ(t):
- Reward prediction: δ(t) → 0 on average
- Action selection: δ(t) > 0 means better than average
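A minimal sketch of tabular TD(0) learning under these definitions; the learning rate alpha, the toy chain environment, and all variable names are illustrative assumptions rather than part of the original slides:

```python
import numpy as np

def td0_update(V, s_prev, r, s, gamma=0.9, alpha=0.1):
    """One TD(0) step: delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1))."""
    delta = r + gamma * V[s] - V[s_prev]   # new estimate minus old estimate
    V[s_prev] += alpha * delta             # move the old estimate toward the new one
    return delta

# Toy usage: a 5-state chain; reward 1 on reaching the terminal state 4.
V = np.zeros(5)
rng = np.random.default_rng(0)
s = 0
for _ in range(1000):
    s_next = min(s + int(rng.integers(0, 2)), 4)   # stay or move one step forward
    r = 1.0 if s_next == 4 else 0.0
    td0_update(V, s, r, s_next)
    s = 0 if s_next == 4 else s_next               # restart after reaching the goal
```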


Example: Navigation

[Figure: reward field for the navigation task and the corresponding value functions for discount factors γ = 0.9 and γ = 0.5.]

Actor-Critic Architecture

Critic: future reward prediction; update the value V(s(t−1)) in proportion to δ(t)

Actor: action reinforcement; increase P(a(t−1)|s(t−1)) if δ(t) > 0

[Diagram: actor-critic architecture; the critic V(s) computes the TD error from the reward r and state s, and the TD error trains both the critic and the actor P(a|s), which sends action a to the environment.]

Q Learning

Action value function:
Q(s,a) = E[ r(t+1) + γ r(t+2) + … | s(t)=s, a(t)=a, P(a|s) ]
       = E[ r(t+1) + γ V(s(t+1)) | s(t)=s, a(t)=a ]

Action selection: a(t) = argmax_a Q(s(t),a) with probability 1 − ε

Update (Q-learning): Q(s(t),a(t)) := r(t+1) + γ max_a Q(s(t+1),a)
Update (SARSA): Q(s(t),a(t)) := r(t+1) + γ Q(s(t+1),a(t+1))
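The action-value updates above can be written as a short tabular sketch; ε-greedy selection follows the slide, while the learning rate alpha and the array layout Q[s, a] are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """a(t) = argmax_a Q(s(t), a) with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """Off-policy target: r(t+1) + gamma * max_a' Q(s(t+1), a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    """On-policy (SARSA) target: r(t+1) + gamma * Q(s(t+1), a(t+1))."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```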

Dynamic Programming and RL

Dynamic Programming: given models P(s'|s,a) and P(r|s,a), off-line solution of the Bellman equation

V*(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s' V*(s') P(s'|s,a) ]

Reinforcement Learning: on-line learning with the TD error

δ(t) = r(t) + γ V(s(t)) − V(s(t−1)),  ΔV(s(t−1)) ∝ δ(t),  ΔQ(s(t−1),a(t−1)) ∝ δ(t)

Model-free and Model-based RL

Model-free: e.g., learn action values
Q(s,a) := r(s,a) + γ Q(s',a'),  a = argmax_a Q(s,a)

Model-based: learn a forward model P(s'|s,a)
- action selection: a = argmax_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ]
- simulation: learn V(s) and/or Q(s,a) off-line
- dynamic programming: solve the Bellman equation
  V(s) = max_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ]
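For the model-based case, a minimal value-iteration sketch of the Bellman equation above, assuming the model is given as arrays P[s, a, s'] and R[s, a] (these names and the stopping tolerance are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = P(s'|s,a)
    R: expected immediate rewards, shape (S, A)
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V            # Q[s, a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value function and a greedy policy
        V = V_new
```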

Current Topics

Convergence proofs
- with function approximators

Learning with hidden states: POMDP
- estimate belief states
- reactive, stochastic policies
- parameterized finite-state policies

Hierarchical architectures
- learn to select fixed sub-modules
- train sub-modules
- both

Partially Observable Markov Decision Process (POMDP)

Update the belief state
- observation P(o|s): not the identity
- belief state b = (P(s1), P(s2), …): real valued

P(s_k | o) ∝ P(o|s_k) Σ_i P(s_k|s_i, a) P(s_i)
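A small sketch of this belief update with a toy tiger-style observation model; the array shapes and names are assumptions made for illustration:

```python
import numpy as np

def belief_update(b, a, o, P_trans, P_obs):
    """b'(s_k) proportional to P(o|s_k) * sum_i P(s_k|s_i, a) * b(s_i).

    b:       belief over states, shape (S,)
    P_trans: transitions, shape (A, S, S), P_trans[a, s, s'] = P(s'|s,a)
    P_obs:   observations, shape (S, O), P_obs[s', o] = P(o|s')
    """
    predicted = b @ P_trans[a]          # sum_i P(s'|s_i, a) b(s_i)
    b_new = P_obs[:, o] * predicted     # weight by the observation likelihood
    return b_new / b_new.sum()          # normalize to a probability vector

# Toy usage: two hidden states, a 'listen' action, 15%-noisy observation.
b = np.array([0.5, 0.5])
P_trans = np.array([[[1.0, 0.0], [0.0, 1.0]]])      # listening does not move the tiger
P_obs = np.array([[0.85, 0.15], [0.15, 0.85]])
b = belief_update(b, a=0, o=0, P_trans=P_trans, P_obs=P_obs)   # -> [0.85, 0.15]
```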

Tiger Problem (Kaelbling et al., 1998)

State: a tiger is in {left, right}
Action: {left, right, listen}

Observation: noisy, with 15% error

[Figure: policy tree and equivalent finite-state policy for the tiger problem.]

Outline

Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics

RL in Continuous Space and Time: model-free and model-based approaches

Learning to Stand Up: discrete plans and continuous control

Modular Decomposition: multiple model-based RL (MMRL)

Why Continuous?

Analog control problems
- discretization leads to poor control performance
- how to discretize?

Better theoretical properties
- differential algorithms
- use of local linear models

Continuous TD learning

Dynamics: ẋ = f(x, u)

Value function: V(x(t)) = ∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds

TD error: δ(t) = r(t) + V̇(t) − (1/τ) V(t)

Discount factor: τ = Δt / (1 − γ), i.e. γ = 1 − Δt/τ

Gradient policy: u(t) = g( ∂V/∂x · ∂f/∂u ) evaluated at x(t)
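A sketch of the continuous-time TD error with the time derivative of V approximated by an Euler difference over the step Δt; the function name and arguments are illustrative:

```python
def continuous_td_error(r_t, V_t, V_prev, dt, tau=1.0):
    """delta(t) = r(t) + dV/dt - V(t)/tau, with dV/dt approximated by
    (V(t) - V(t - dt)) / dt.  Multiplying by dt gives
    r(t)*dt + gamma*V(t) - V(t-dt) with gamma = 1 - dt/tau,
    i.e. the discrete-time TD error with a rescaled reward."""
    V_dot = (V_t - V_prev) / dt
    return r_t + V_dot - V_t / tau
```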

On-line Learning of State Value

State: x = (angle, angular velocity)

Learned value function V(x)

Example: Cart-pole Swing up

Reward: height of the tip of the pole
Punishment: crashing into the wall


Fast Learning by Internal Models

Pole balancing (Stefan Schaal, USC)

Forward model of pole dynamics

Inverse model of arm dynamics


Internal Models for Planning

Devil sticking (Chris Atkeson, CMU)


Outline

Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics

RL in Continuous Space and Time: model-free and model-based approaches

Learning to Stand Up: discrete plans and continuous control

Modular Decomposition: multiple model-based RL (MMRL)

Need for Hierarchical Architecture

Performance of control
- many high-precision sensors and actuators
- prohibitively long learning time

Speed of learning
- search in a low-dimensional, low-resolution space

Learning to Stand Up (Morimoto & Doya, 1998)

Reward: height of the head
Punishment: tumbling
State: pitch and joint angles and their derivatives

Simulation: many thousands of trials to learn

Hierarchical Architecture

Upper level: discrete state/time
- kinematics
- action: subgoals
- reward: total task

Lower level: continuous state/time
- dynamics
- action: motor torque
- reward: achieving subgoals

[Diagram: the upper level learns Q(S,A) over a sequence of subgoals; each subgoal is pursued by a lower-level controller with value function V(s) and policy a = g(s).]
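A heavily simplified sketch of the two-level loop described above; every component here (env, upper_policy, lower_controller) is a hypothetical stub used only to show the control flow, not the actual implementation:

```python
def hierarchical_episode(upper_policy, lower_controller, env, max_subgoals=10):
    """Upper level picks discrete subgoals; the lower level runs a continuous
    controller until each subgoal is reached (or the robot fails)."""
    S = env.coarse_state()                     # discrete kinematic state
    total_reward = 0.0
    for _ in range(max_subgoals):
        subgoal = upper_policy(S)              # upper level: e.g. argmax_A Q(S, A)
        s = env.continuous_state()
        while not env.reached(subgoal) and not env.failed():
            a = lower_controller(s, subgoal)   # lower level: torque command a = g(s)
            s, r = env.step(a)                 # r includes the subgoal-achievement reward
            total_reward += r
        if env.failed():                       # e.g. the robot tumbled
            break
        S = env.coarse_state()
    return total_reward
```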

Learning in Simulation

Early learning vs. after ~700 trials

Upper level: subgoals
Lower level: control


Learning with Real Hardware (Morimoto & Doya, 2001)

after simulation

after ~100 physical trials

Adaptation by lower control modules


Outline

Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics

RL in Continuous Space and Time: model-free and model-based approaches

Learning to Stand Up: discrete plans and continuous control

Modular Decomposition: multiple model-based RL (MMRL)

Modularity in Motor Learning

Fast de-adaptation and re-adaptation: switching rather than re-learning

Combination of learned modules: serial/parallel/sigmoidal mixture

‘Soft’ Switching of Adaptive Modules

‘Hard’ switching based on prediction errors (Narendra et al., 1995)

Can result in sub-optimal task decomposition with initially poor prediction models.

‘Soft’ switching by ‘softmax’ of prediction errors

(Wolpert and Kawato, 1998)

Can use ‘annealing’ for optimal decomposition.

(Pawelzik et al., 1996)

Responsibility by Competition

Predict state change: ẋ̂_i(t) = f_i(x(t), u(t))

Responsibility:
λ_i(t) = exp( −‖ẋ̂_i(t) − ẋ(t)‖² / 2σ² ) / Σ_{j=1}^n exp( −‖ẋ̂_j(t) − ẋ(t)‖² / 2σ² )
       = softmax( −‖ẋ̂_i(t) − ẋ(t)‖² / 2σ² )

Weight output/learning: u(t) = Σ_{i=1}^n λ_i(t) g_i(x(t))

[Diagram: multiple model-based RL (MMRL) architecture. Each module i contains a state predictor ẋ̂_i(t) = f_i(x(t), u(t)), a responsibility predictor, and an RL controller with value function V_i(x) and policy μ_i(x). The responsibility signal λ_i(t), a softmax of −E_i(t)/2σ² over the prediction errors E_i(t), gates both the module outputs u_i(t) and their learning signals δ_i(t); the combined action u(t) is sent to the environment, which returns state x(t), state change ẋ(t), and reward r(t).]
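A compact sketch of the responsibility computation and the responsibility-weighted action; sigma, the predictor/controller callables, and the array shapes are illustrative assumptions:

```python
import numpy as np

def responsibilities(x_dot, x_dot_pred, sigma=1.0):
    """lambda_i = softmax_i( -||x_dot_hat_i - x_dot||^2 / (2 sigma^2) )."""
    errors = np.sum((x_dot_pred - x_dot) ** 2, axis=1)   # squared prediction errors E_i
    logits = -errors / (2.0 * sigma ** 2)
    logits -= logits.max()                               # for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def mmrl_action(x, x_dot, predictors, controllers, u_prev, sigma=1.0):
    """u(t) = sum_i lambda_i(t) g_i(x(t)), with lambda_i from the forward models f_i."""
    x_dot_pred = np.array([f(x, u_prev) for f in predictors])   # f_i(x, u)
    lam = responsibilities(x_dot, x_dot_pred, sigma)
    outputs = np.array([g(x) for g in controllers])             # g_i(x)
    return lam @ outputs, lam
```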

Multiple Linear Quadratic Controllers

Linear dynamic models: ẋ̂_i(t) = A_i (x(t) − x̄_i) + B_i u(t)

Quadratic reward models: r̂_i(x(t), u(t)) = r̄_i − ½ (x(t) − x̄_i)′ Q_i (x(t) − x̄_i) − ½ u(t)′ R_i u(t)

Value functions: V_i(x(t)) = ∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds = −½ (x(t) − x̄_i)′ P_i (x(t) − x̄_i),
where P_i solves the Riccati equation 0 = −P_i A_i − A_i′ P_i + P_i B_i R_i⁻¹ B_i′ P_i − Q_i + (1/τ) P_i

Action outputs: u(t) = −Σ_{i=1}^n λ_i(t) K_i (x(t) − x̄_i),  with K_i = R_i⁻¹ B_i′ P_i
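A sketch of computing one module's feedback gain K_i = R_i⁻¹ B_i′ P_i with SciPy's standard continuous-time Riccati solver; folding the 1/τ discount in by shifting A_i is an algebraic convenience assumed here, and all names are illustrative:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_module_gain(A, B, Q, R, tau=np.inf):
    """Gain K = R^-1 B' P for one linear-quadratic module.

    The discounted Riccati equation
        0 = -P A - A' P + P B R^-1 B' P - Q + (1/tau) P
    matches the standard continuous-time Riccati equation with A replaced
    by A - I/(2 tau), so we solve that instead; tau = inf is undiscounted.
    """
    A_shifted = A - np.eye(A.shape[0]) / (2.0 * tau)
    P = solve_continuous_are(A_shifted, B, Q, R)
    K = np.linalg.solve(R, B.T @ P)
    return K, P

def mlqc_action(x, modules, lam):
    """u(t) = -sum_i lambda_i(t) K_i (x(t) - x_bar_i); modules = [(K_i, x_bar_i), ...]."""
    return -sum(l * (K @ (x - x_bar)) for l, (K, x_bar) in zip(lam, modules))
```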

Swing-up control of a pendulum

Red: module 1 Green: module 2

[Figure: swing-up trajectory in the (x1 [π], x2 [π]) phase plane up to t = 20.0, colored by the responsible module.]

Non-linearity and Non-stationarity

Specialization by predictability in space and time

[Figure: (a, b) module responsibilities across trials under stationary and non-stationary conditions, with module pairs 1,2 and 3,4 specializing to the two pendulum settings {l=1, m=1} and {l=10, m=0.2}; (c) pendulum angle q [rad] before learning, (d) after 50 trials, (e) after 200 trials.]

Swing-up control of an ‘Acrobot’

Reward: height of the center of mass
Linearized around four fixed points

Swing-up motions

R=0.001 R=0.002


Module switching

Trajectories x(t) for R=0.001 and R=0.002

Responsibility λ_i: a symbol-like representation of module switching
1-2-1-2-1-3-4-1-3-4-3-4  and  1-2-1-2-1-2-1-3-4-1-3-4

Stand Up by Multiple Modules

Seven locally linear models

[Figure: responsibility signals of the seven modules (1-7) over time during two stand-up trials.]

Segmentation of Observed Trajectory

Predicted motor output: u_i^o(t) = g_i(x^o(t))

Predicted state change: ẋ̂_i^o(t) = f_i(x^o(t), u_i^o(t))

Predicted responsibility:
λ_i^o(t) = exp( −‖ẋ̂_i^o(t) − ẋ^o(t)‖² / 2σ² ) / Σ_{j=1}^n exp( −‖ẋ̂_j^o(t) − ẋ^o(t)‖² / 2σ² )

[Diagram: the demonstrator's state x^o(t) is fed to each module's controller and predictor; a softmax over the prediction errors against the observed state change ẋ^o(t) yields the responsibilities λ_i^o(t), which segment the observed trajectory.]

Imitation of Acrobot Swing-up

θ1(0) = π/12,  θ1(0) = π/6,  θ1(0) = π/12 (imitation)


Outline

Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics

RL in Continuous Space and Time: model-free and model-based approaches

Learning to Stand Up: discrete plans and continuous control

Modular Decomposition: multiple model-based RL (MMRL)

Future Directions

Autonomous learning agents
- tuning of meta-parameters
- design of rewards
- selection of necessary/sufficient state coding

Neural mechanisms of RL
- dopamine neurons: encoding the TD error
- basal ganglia: value-based action selection
- cerebellum: internal models
- cerebral cortex: modular decomposition

What is Reward for a robot?

Should be grounded by:
- self-preservation: self-recharging
- self-reproduction: copying the control program

Cyber Rodent

The Cyber Rodent Project

Learning mechanisms under realistic constraints of self-preservation and self-reproduction

- acquisition of task-oriented internal representations
- metalearning algorithms
- constraints of finite time and energy
- mechanisms for collaborative behaviors
- roles of communication: abstract/emotional, concrete/symbolic
- gene-exchange rules for evolution

Input/Output

Sensory: CCD camera, range sensor, 8 IR proximity sensors, acceleration/gyro sensor, 2 microphones

Motor: two wheels, jaw, R/G/B LED, speaker

Computation/Communication

Computation: Hitachi SH-4 CPU, FPGA image processor, I/O modules

Communication: IR port, wireless LAN

Software: learning/evolution, dynamic simulation
