Prediction, Control and Decisions
Kenji Doya
Initial Research Project, OIST
ATR Computational Neuroscience Laboratories
CREST, Japan Science and Technology Agency
Nara Institute of Science and Technology
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Learning to Walk (Doya & Nakano, 1985)
Action: cycle of 4 postures
Reward: speed sensor output
Multiple solutions: creeping, jumping,…
Learning to Stand Up (Morimoto & Doya, 2001)
early trials
after learning
Reward: height of the head
No desired trajectory
Framework for learning state-action mapping (policy) by exploration and reward feedback
Critic: reward prediction
Actor: action selection
Learning:
external reward r
internal reward δ: difference from prediction
Reinforcement Learning (RL)
environment
reward r
action a
state s
agent
critic
actor
Reinforcement Learning Methods
Model-free Methods
Episode-based: parameterize policy P(a|s; θ)
Temporal difference:
state value function V(s)
(state-)action value function Q(s,a)
Model-based Methods
Dynamic Programming: forward model P(s'|s,a)
Temporal Difference Learning
Predict reward: value function
V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]
Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]
Select action
greedy: a = argmax_a Q(s,a)
Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
Update prediction: TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
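A minimal runnable sketch of these update rules on a made-up two-state problem (the environment, constants, and variable names are illustrative assumptions, not from the talk):

```python
import numpy as np

# Tabular Q-learning with Boltzmann action selection on a toy 2-state,
# 2-action problem (the dynamics and reward below are made up).
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, beta, gamma = 0.1, 2.0, 0.9    # learning rate, inverse temperature, discount

def step(s, a):
    """Hypothetical dynamics: the action sets the next state; reward for action 1 in state 1."""
    return a, (1.0 if (s == 1 and a == 1) else 0.0)

def boltzmann(q_row):
    """P(a|s) ∝ exp[β Q(s,a)] (max subtracted for numerical stability)."""
    p = np.exp(beta * (q_row - q_row.max()))
    return p / p.sum()

s = 0
for t in range(5000):
    a = np.random.choice(n_actions, p=boltzmann(Q[s]))
    s_next, r = step(s, a)
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD error δ(t)
    Q[s, a] += alpha * delta                         # ΔQ(s(t),a(t)) = α δ(t)
    s = s_next

print(Q)
```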
Dynamic Programming and RL
Dynamic Programming: model-based, off-line
solve the Bellman equation
V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γV(s')]
Reinforcement Learning: model-free, on-line
learn by TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
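For the model-based side, a minimal value-iteration sketch of the Bellman equation above (the random transition model P and reward r are made-up stand-ins):

```python
import numpy as np

# Value iteration on a toy MDP: V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γ V(s')].
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition probabilities
r = rng.random((nS, nA, nS))                    # r(s, a, s'): made-up rewards

V = np.zeros(nS)
for _ in range(200):
    # Expected backup over s', then a greedy max over actions
    V = np.max(np.einsum('ijk,ijk->ij', P, r + gamma * V[None, None, :]), axis=1)

print(V)
```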
Discrete vs. Continuous RL (Doya, 2000)
Discrete time:
V(s) = E[r(t) + γr(t+Δt) + γ²r(t+2Δt) + …]
δ(t) = r(t) + γV(t+Δt) − V(t)
Continuous time:
V(x) = ∫_t^∞ e^−(s−t)/τ r(s) ds
δ(t) = r(t) + V̇(t) − (1/τ)V(t)
Correspondence: τ = Δt/(1−γ), γ = 1 − Δt/τ
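A short sketch of where this correspondence comes from (standard algebra, filling in a step the slide leaves implicit):

```latex
% One step of discrete discounting matched to exponential discounting,
% then expanded to first order for small \Delta t:
\gamma = e^{-\Delta t/\tau} \approx 1 - \frac{\Delta t}{\tau}
\quad\Longleftrightarrow\quad
\tau \approx \frac{\Delta t}{1-\gamma}
```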
Questions
Computational Questions
How to learn:
direct policy P(a|s)
value functions V(s), Q(s,a)
forward models P(s'|s,a)
When to use which method?
Biological Questions
Where in the brain?
How are they represented/updated?
How are they selected/coordinated?
Brain Hierarchy
Forebrain
Cerebral cortex (a)
neocortex
paleocortex: olfactory cortex
archicortex: basal forebrain, hippocampus
Basal nuclei (b)
neostriatum: caudate, putamen
paleostriatum: globus pallidus
archistriatum: amygdala
Diencephalon
thalamus (c)
hypothalamus (d)
Brain stem & Cerebellum
Midbrain (e)
Hindbrain
pons (f)
cerebellum (g)
Medulla (h)
Spinal cord (i)
Just for Motor Control? (Middleton & Strick 1994)
[Diagram: parallel loops linking the prefrontal cortex (area 46) with the basal ganglia (globus pallidus) and the cerebellum (dentate nucleus) via the thalamus; SN: substantia nigra, IO: inferior olive]
Cerebellum: Supervised Learning
[diagram: input → output; the output is compared with a target, and the error (target − output) drives learning]
Basal Ganglia: Reinforcement Learning
[diagram: input → output; a scalar reward signal drives learning]
Cerebral Cortex: Unsupervised Learning
[diagram: input → output; no explicit teaching signal]
Specialization by Learning Algorithms
(Doya, 1999)
Cerebellum
Purkinje cells
~10^5 parallel fibers
single climbing fiber
long-term depression
Supervised learning
perceptron hypothesis
internal models
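A minimal sketch of this supervised-learning idea: delta-rule fitting of a forward model s' ≈ f(s, a), with the prediction error playing the role of the teaching signal (the linear model and toy dynamics are assumptions for illustration):

```python
import numpy as np

# Delta-rule learning of a forward model s' = f(s, a).
# The "true" linear dynamics below are made up for demonstration.
rng = np.random.default_rng(1)
A_true, B_true = 0.8, 0.5             # hidden dynamics: s' = A s + B a
w = np.zeros(2)                        # learned weights [A_hat, B_hat]
eta = 0.05                             # learning rate

for _ in range(2000):
    s, a = rng.standard_normal(2)
    s_next = A_true * s + B_true * a   # teacher signal (the target)
    x = np.array([s, a])
    error = s_next - w @ x             # supervised error, cf. the climbing-fiber signal
    w += eta * error * x               # delta rule

print(w)   # converges to approximately [0.8, 0.5]
```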
Internal Models in the Cerebellum (Imamizu et al., 2000)
Learning to use a 'rotated' mouse
[fMRI panels: early learning vs. after learning]
Motor Imagery (Luft et al. 1998)
Finger movement vs. imagery of movement
Basal Ganglia
Striatum
striosome & matrix
dopamine-dependent plasticity
Dopamine neurons
reward-predictive response
TD learning
[Figure: reward r, dopamine cell activity, and reward prediction V (a) before learning, (b) after learning, (c) with reward omitted]
Dopamine Neurons and TD Error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
before learning
after learning
omit reward
(Schultz et al. 1997)
Reward-predicting Activities of Striatal Neurons
Delayed saccade task (Kawagoe et al., 1998)
Not just actions, but resulting rewards
[Panels: target direction (Right, Up, Left, Down) × rewarded direction (Right, Up, Left, Down, All)]
Cerebral Cortex
Recurrent connections
Hebbian plasticity
Unsupervised learning, e.g., PCA, ICA
Replicating V1 Receptive Fields
(Olshausen & Field, 1996)
Infomax and sparseness
Hebbian plasticity and recurrent inhibition
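A minimal sketch of Hebbian unsupervised learning in this spirit: Oja's rule, whose weight vector converges to the first principal component of its inputs (the toy data and constants are assumptions):

```python
import numpy as np

# Oja's rule: Hebbian term y*x with a normalizing decay y^2*w.
# The anisotropic toy inputs below are made up.
rng = np.random.default_rng(2)
X = rng.standard_normal((5000, 2)) * np.array([3.0, 1.0])   # input variance 9 vs. 1
w = rng.standard_normal(2)
eta = 0.01

for x in X:
    y = w @ x                       # postsynaptic activity
    w += eta * y * (x - y * w)      # Hebbian update with Oja's normalization

print(w / np.linalg.norm(w))        # ≈ ±[1, 0]: the dominant input direction
```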
Specialization by Learning?
Cerebellum: supervised learning
error signal by climbing fibers
forward model s' = f(s,a) and policy a = g(s)
Basal ganglia: reinforcement learning
reward signal by dopamine fibers
value functions V(s) and Q(s,a)
Cerebral cortex: unsupervised learning
Hebbian plasticity and recurrent inhibition
representation of state s and action a
But how are they recruited and combined?
Multiple Action Selection Schemes
Model-free: a = argmax_a Q(s,a)
Model-based: a = argmax_a [r + V(f(s,a))]
forward model: f(s,a)
Encapsulation: a = g(s)
[Diagrams: s → Q → a (model-free); s, a → f → s' → V (model-based); s → g → a (encapsulated policy); see the sketch below]
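A minimal sketch contrasting the two selection rules (all tables below are made up for illustration):

```python
import numpy as np

# Model-free vs. model-based action selection on a toy problem.
nS, nA = 3, 2
Q = np.array([[0.2, 0.5], [0.1, 0.0], [0.3, 0.4]])   # learned action values Q(s, a)
f = np.array([[1, 2], [0, 2], [0, 1]])               # deterministic forward model: s' = f[s, a]
r = np.array([[0.0, 1.0], [0.0, 0.0], [0.5, 0.0]])   # immediate reward r(s, a)
V = np.array([0.1, 0.8, 0.2])                        # learned state values V(s)

s = 0
a_model_free  = np.argmax(Q[s])              # a = argmax_a Q(s, a)
a_model_based = np.argmax(r[s] + V[f[s]])    # a = argmax_a [r + V(f(s, a))]
print(a_model_free, a_model_based)
```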
Lectures at OCNC 2005
Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato
Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O'Doherty, Minoru Kimura, Wolfram Schultz
State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Framework for learning state-action mapping (policy) by exploration and reward feedback
Critic: reward prediction
Actor: action selection
Learning:
external reward r
internal reward δ: difference from prediction
Reinforcement Learning (RL)
environment
reward r
action a
state s
agent
critic
actor
Reinforcement Learning
Predict reward: value function
V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]
Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]
Select action
greedy: a = argmax_a Q(s,a)
Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
Update prediction: TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
Cyber Rodent Project
Robots with the same constraints as biological agents
What is the origin of rewards?
What should be learned, and what should be evolved?
Self-preservation: capture batteries
Self-reproduction: exchange programs through IR ports
Cyber Rodent: Hardware
camera, range sensor, proximity sensors, gyro
battery latch, two wheels
IR port, speaker, microphones, R/G/B LED
Evolving Robot Colony
Survival: catch battery packs
Reproduction: copy 'genes' through IR ports
Discounting Future Reward
large γ vs. small γ
Setting of Reward Function
Reward r = r_main + r_supp − r_cost
e.g., a supplementary reward for having a battery in view
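A minimal sketch of such a composite reward (the signals and weights are illustrative assumptions):

```python
# Composite reward r = r_main + r_supp - r_cost; all terms below are made up.
def reward(captured_battery: bool, battery_in_view: bool, motor_power: float) -> float:
    r_main = 1.0 if captured_battery else 0.0   # main goal: capturing a battery
    r_supp = 0.1 if battery_in_view else 0.0    # supplementary shaping term
    r_cost = 0.01 * motor_power                 # movement cost
    return r_main + r_supp - r_cost

print(reward(False, True, 2.0))   # 0.08: shaped reward before the goal is reached
```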
Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003)
Fluctuations in the metaparameters correlate with average reward
[Plots: reward and battery level over time; inverse temperature β (0–14) as a function of battery level (0–1)]
Randomness Control by Battery Level
Greedier action at both extremes
Neuromodulators for Metalearning
(Doya, 2002)
Metaparameter tuning is critical in RL.
How does the brain tune them?
Dopamine: TD error δ
Acetylcholine: learning rate α
Noradrenaline: inverse temperature β
Serotonin: discount factor γ
Learning Rate
ΔV(s(t−1)) = α δ(t)
ΔQ(s(t−1), a(t−1)) = α δ(t)
small α: slow learning
large α: unstable learning
Acetylcholine (basal forebrain)
regulates memory update and retention (Hasselmo et al.)
LTP in cortex and hippocampus
top-down and bottom-up information flow
Inverse Temperature
Greediness in action selection
P(a_i|s) ∝ exp[β Q(s,a_i)]
small β: exploration
large β: exploitation
Noradrenaline (locus coeruleus)
correlation with performance accuracy (Aston-Jones et al.)
modulation of cellular I/O gain (Cohen et al.)
[Plot: P(a₁) as a function of Q(s,a₁) − Q(s,a₂) for β = 0, 1, 10]
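A minimal sketch reproducing the shape of those curves (for two actions the Boltzmann rule reduces to a logistic in the value difference; the β values follow the slide):

```python
import numpy as np

# For two actions, P(a1|s) = exp(βQ1) / (exp(βQ1) + exp(βQ2))
#                          = 1 / (1 + exp(-β (Q1 - Q2))).
def p_a1(dq, beta):
    return 1.0 / (1.0 + np.exp(-beta * dq))

dq = np.linspace(-4, 4, 5)               # value difference Q(s,a1) - Q(s,a2)
for beta in (0.0, 1.0, 10.0):            # β = 0: uniformly random; β = 10: near-greedy
    print(beta, np.round(p_a1(dq, beta), 3))
```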
Serotonin (dorsal raphe)
low activity is associated with impulsivity
depression, bipolar disorders
aggression, eating disorders
Discount Factor
[Plots: a reward sequence over time steps 1–10 with a delayed reward; the predicted value is V = −0.093 with γ = 0.5 and V = +0.062 with γ = 0.9]
V(s(t)) = E[r(t+1) + γr(t+2) + γ²r(t+3) + …]
Balance between short- and long-term results
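A minimal sketch of how γ shifts this balance, using an assumed reward sequence (a small cost at each step followed by a delayed reward; the numbers are illustrative, chosen to show the same sign flip as in the figure):

```python
# Effect of the discount factor on the value of a delayed reward.
# Made-up sequence: a running cost of -0.05 for 9 steps, then +1.0.
rewards = [-0.05] * 9 + [1.0]

for gamma in (0.5, 0.9):
    V = sum(gamma**k * r for k, r in enumerate(rewards))
    print(gamma, round(V, 3))   # γ = 0.5: V < 0; γ = 0.9: V > 0
```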
TD Error
δ(t) = r(t) + γV(s(t)) − V(s(t−1))
Global learning signal
reward prediction: ΔV(s(t−1)) = α δ(t)
reinforcement: ΔQ(s(t−1), a(t−1)) = α δ(t)
Dopamine (substantia nigra, VTA)
responds to errors in reward prediction
reinforcement of actions
addiction
TD Model of the Basal Ganglia (Houk et al. 1995, Montague et al. 1996, Schultz et al. 1997, ...)
Striosome: state value V(s)
Matrix: action value Q(s,a)
[Diagram: sensory input → cerebral cortex (state representation s) → striatum (evaluation V(s), action selection Q(s,a)) → SNr/GP → thalamus → action output; dopamine neurons carry the TD signal δ(t), driven by reward r]
SNr/GPi: action selection, Q(s,a) → a
NA? ACh? 5-HT?
Possible Control of Discount Factor
Modulation of TD error
Selection/weighting of parallel networks
[Diagram: parallel striatal networks V₁, V₂, V₃ with discount factors γ₁, γ₂, γ₃; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))]
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
Markov Decision Task (Tanaka et al., 2004)
State transition and reward functions
Stimulus and response
Behavioral Results
All subjects successfully learned optimal behavior
Block-Design Analysis
SHORT vs. NO (p < 0.001 uncorrected)
LONG vs. SHORT (p < 0.0001 uncorrected)
OFC, insula, striatum, cerebellum
cerebellum, striatum, dorsal raphe, DLPFC, VLPFC, IPC, PMd
Different brain areas involved in immediate and future reward prediction
Ventro-Dorsal Difference
Lateral PFC, insula, striatum
Estimate V(t) and δ(t) from subjects' performance data
Regression analysis of fMRI data
Model-based Regressor Analysis
[Diagram: the agent (policy, value function V(s), TD error δ(t)) interacts with the environment through state s(t), action a(t), and reward r(t) (20 yen); model-derived V(t) and δ(t) serve as regressors for the fMRI data]
Explanatory Variables (subject NS)
Reward prediction V(t) at γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
Reward prediction error δ(t) at γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
[Time courses over trials 1–312]
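A minimal sketch of how such regressors can be built: discounted reward predictions V(t) computed from one reward series at each γ (the reward series and the recursive form are illustrative assumptions, not the study's exact model):

```python
import numpy as np

# V(t) = E[r(t+1) + γ r(t+2) + γ² r(t+3) + ...], computed backward
# through one made-up trial-by-trial reward series.
rng = np.random.default_rng(3)
rewards = rng.choice([0.0, 1.0], size=312, p=[0.8, 0.2])
gammas = [0.0, 0.3, 0.6, 0.8, 0.9, 0.99]

regressors = {}
for g in gammas:
    V = np.zeros_like(rewards)
    for t in range(len(rewards) - 2, -1, -1):
        V[t] = g * (rewards[t + 1] + V[t + 1])   # V(t) = γ [r(t+1) + V(t+1)]
    regressors[g] = V

print({g: np.round(v[:5], 3) for g, v in regressors.items()})
```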
Regression Analysis
Reward prediction V: mPFC (x = −2 mm), insula (x = −42 mm)
Reward prediction error δ: striatum (z = 2 mm)
Tryptophan Depletion/Loading
Tryptophan: precursor of serotonin
Depletion/loading affects central serotonin levels (e.g., Bjork et al. 2001; Luciana et al. 2001)
100 g amino acid drink; experiments after 6 hours
Day 1: Tr− (depletion, no tryptophan)
Day 2: Tr0 (control, 2.3 g of tryptophan)
Day 3: Tr+ (loading, 10.3 g of tryptophan)
Blood Tryptophan Levels
N.D. (< 3.9 μg/ml)
Delayed Reward Choice Task
Sessions | Initial black patches (Yellow, White) | Patches/step (Yellow, White)
1,2,7,8 | 72, 24, 18, 9 | 8, 2, 6, 2
3 | 72, 24, 18, 9 | 8, 2, 14, 2
4 | 72, 24, 18, 9 | 16, 2, 14, 2
5,6 | 72, 24, 18, 9 | 16, 2, 6, 2
yellow: large reward with long delay
white: small reward with short delay
Choice Behaviors
The shift of the indifference line was not consistent among the 12 subjects
Modulation of Striatal Response
[Plots: striatal responses at γ = 0.6, 0.7, 0.8, 0.9, 0.99 under Tr−, Tr0, and Tr+]
Modulation by Tr Levels
Changes in Correlation Coefficient
γ = 0.6 (28, 0, −4): Tr− > Tr+, correlation with V at small γ in ventral putamen
γ = 0.99 (16, 2, 28): Tr− < Tr+, correlation with V at large γ in dorsal putamen
[Bar plots: regression slopes under Tr−, Tr0, Tr+]
ROI (region of interest) analysis
Summary
Immediate reward: lateral OFC
Future reward: parietal, PMd, DLPFC, lateral cerebellum, dorsal raphe
Ventro-dorsal gradient: insula, striatum
Serotonergic modulation
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Collaborators
Kyoto PUM: Minoru Kimura, Yasumasa Ueda
Hiroshima U: Shigeto Yamawaki, Yasumasa Okamoto, Go Okada, Kazutaka Ueda, Shuji Asahi, Kazuhiro Shishida
ATR: Jun Morimoto, Kazuyuki Samejima
CREST: Nicolas Schweighofer, Genci Capi
NAIST: Saori Tanaka
OIST: Eiji Uchibe, Stefan Elfwing