Prediction, Control and Decisions
Kenji Doya
Initial Research Project, OIST
ATR Computational Neuroscience Laboratories
CREST, Japan Science and Technology Agency
Nara Institute of Science and Technology
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Learning to Walk (Doya & Nakano, 1985)
Action: cycle of 4 postures
Reward: speed sensor output
Multiple solutions: creeping, jumping,…
Learning to Stand Up (Morimoto & Doya, 2001)
early trials
after learning
Reward: height of the head
No desired trajectory
Framework for learning state-action mapping (policy) by exploration and reward feedback
Critic: reward prediction
Actor: action selection
Learning:
external reward r
internal reward δ: difference from prediction
Reinforcement Learning (RL)
environment
reward r
action a
state s
agent
critic
actor
Reinforcement Learning Methods
Model-free Methods
Episode-based: parameterize policy P(a|s; θ)
Temporal difference:
state value function V(s)
(state-)action value function Q(s,a)
Model-based Methods
Dynamic Programming: forward model P(s'|s,a)
Temporal Difference Learning
Predict reward: value function
V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]
Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]
Select action
greedy: a = argmax_a Q(s,a)
Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
Update prediction: TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
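A minimal runnable sketch of these update rules on a made-up two-state problem (the environment, constants, and variable names are illustrative assumptions, not from the talk):

```python
import numpy as np

# Tabular Q-learning with Boltzmann action selection on a toy 2-state,
# 2-action problem (the dynamics and reward below are made up).
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, beta, gamma = 0.1, 2.0, 0.9    # learning rate, inverse temperature, discount

def step(s, a):
    """Hypothetical dynamics: the action sets the next state; reward for action 1 in state 1."""
    return a, (1.0 if (s == 1 and a == 1) else 0.0)

def boltzmann(q_row):
    """P(a|s) ∝ exp[β Q(s,a)] (max subtracted for numerical stability)."""
    p = np.exp(beta * (q_row - q_row.max()))
    return p / p.sum()

s = 0
for t in range(5000):
    a = np.random.choice(n_actions, p=boltzmann(Q[s]))
    s_next, r = step(s, a)
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD error δ(t)
    Q[s, a] += alpha * delta                         # ΔQ(s(t),a(t)) = α δ(t)
    s = s_next

print(Q)
```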
Dynamic Programming and RL
Dynamic Programming: model-based, off-line
solve the Bellman equation
V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γV(s')]
Reinforcement Learning: model-free, on-line
learn by TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
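For the model-based side, a minimal value-iteration sketch of the Bellman equation above (the random transition model P and reward r are made-up stand-ins):

```python
import numpy as np

# Value iteration on a toy MDP: V(s) = max_a Σ_s' P(s'|s,a) [r(s,a,s') + γ V(s')].
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition probabilities
r = rng.random((nS, nA, nS))                    # r(s, a, s'): made-up rewards

V = np.zeros(nS)
for _ in range(200):
    # Expected backup over s', then a greedy max over actions
    V = np.max(np.einsum('ijk,ijk->ij', P, r + gamma * V[None, None, :]), axis=1)

print(V)
```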
Discrete vs. Continuous RL (Doya, 2000)
Discrete time:
V(s) = E[r(t) + γr(t+Δt) + γ²r(t+2Δt) + …]
δ(t) = r(t) + γV(t+Δt) − V(t)
Continuous time:
V(x) = ∫_t^∞ e^−(s−t)/τ r(s) ds
δ(t) = r(t) + V̇(t) − (1/τ)V(t)
Correspondence: τ = Δt/(1−γ), γ = 1 − Δt/τ
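A short sketch of where this correspondence comes from (standard algebra, filling in a step the slide leaves implicit):

```latex
% One step of discrete discounting matched to exponential discounting,
% then expanded to first order for small \Delta t:
\gamma = e^{-\Delta t/\tau} \approx 1 - \frac{\Delta t}{\tau}
\quad\Longleftrightarrow\quad
\tau \approx \frac{\Delta t}{1-\gamma}
```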
Questions
Computational Questions
How to learn:
direct policy P(a|s)
value functions V(s), Q(s,a)
forward models P(s'|s,a)
When to use which method?
Biological Questions
Where in the brain?
How are they represented/updated?
How are they selected/coordinated?
Brain Hierarchy
Forebrain
Cerebral cortex (a)
neocortex
paleocortex: olfactory cortex
archicortex: basal forebrain, hippocampus
Basal nuclei (b)
neostriatum: caudate, putamen
paleostriatum: globus pallidus
archistriatum: amygdala
Diencephalon
thalamus (c)
hypothalamus (d)
Brain stem & Cerebellum
Midbrain (e)
Hindbrain
pons (f)
cerebellum (g)
Medulla (h)
Spinal cord (i)
Just for Motor Control? (Middleton & Strick 1994)
[Diagram: parallel loops linking the prefrontal cortex (area 46) with the basal ganglia (globus pallidus) and the cerebellum (dentate nucleus) via the thalamus; SN: substantia nigra, IO: inferior olive]
Cerebellum: Supervised Learning
[diagram: input → output; the output is compared with a target, and the error (target − output) drives learning]
Basal Ganglia: Reinforcement Learning
[diagram: input → output; a scalar reward signal drives learning]
Cerebral Cortex: Unsupervised Learning
[diagram: input → output; no explicit teaching signal]
Specialization by Learning Algorithms
(Doya, 1999)
Cerebellum
Purkinje cells
~10^5 parallel fibers
single climbing fiber
long-term depression
Supervised learning
perceptron hypothesis
internal models
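A minimal sketch of this supervised-learning idea: delta-rule fitting of a forward model s' ≈ f(s, a), with the prediction error playing the role of the teaching signal (the linear model and toy dynamics are assumptions for illustration):

```python
import numpy as np

# Delta-rule learning of a forward model s' = f(s, a).
# The "true" linear dynamics below are made up for demonstration.
rng = np.random.default_rng(1)
A_true, B_true = 0.8, 0.5             # hidden dynamics: s' = A s + B a
w = np.zeros(2)                        # learned weights [A_hat, B_hat]
eta = 0.05                             # learning rate

for _ in range(2000):
    s, a = rng.standard_normal(2)
    s_next = A_true * s + B_true * a   # teacher signal (the target)
    x = np.array([s, a])
    error = s_next - w @ x             # supervised error, cf. the climbing-fiber signal
    w += eta * error * x               # delta rule

print(w)   # converges to approximately [0.8, 0.5]
```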
Internal Models in the Cerebellum (Imamizu et al., 2000)
Learning to use a 'rotated' mouse
[fMRI panels: early learning vs. after learning]
Motor Imagery (Luft et al. 1998)
Finger movement vs. imagery of movement
Basal Ganglia
Striatum
striosome & matrix
dopamine-dependent plasticity
Dopamine neurons
reward-predictive response
TD learning
[Figure: reward r, dopamine cell activity, and reward prediction V (a) before learning, (b) after learning, (c) with reward omitted]
Dopamine Neurons and TD Error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
before learning
after learning
omit reward
(Schultz et al. 1997)
Reward-predicting Activities of Striatal Neurons
Delayed saccade task (Kawagoe et al., 1998)
Not just actions, but resulting rewards
[Panels: target direction (Right, Up, Left, Down) × rewarded direction (Right, Up, Left, Down, All)]
Cerebral Cortex
Recurrent connections
Hebbian plasticity
Unsupervised learning, e.g., PCA, ICA
Replicating V1 Receptive Fields
(Olshausen & Field, 1996)
Infomax and sparseness
Hebbian plasticity and recurrent inhibition
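A minimal sketch of Hebbian unsupervised learning in this spirit: Oja's rule, whose weight vector converges to the first principal component of its inputs (the toy data and constants are assumptions):

```python
import numpy as np

# Oja's rule: Hebbian term y*x with a normalizing decay y^2*w.
# The anisotropic toy inputs below are made up.
rng = np.random.default_rng(2)
X = rng.standard_normal((5000, 2)) * np.array([3.0, 1.0])   # input variance 9 vs. 1
w = rng.standard_normal(2)
eta = 0.01

for x in X:
    y = w @ x                       # postsynaptic activity
    w += eta * y * (x - y * w)      # Hebbian update with Oja's normalization

print(w / np.linalg.norm(w))        # ≈ ±[1, 0]: the dominant input direction
```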
Specialization by Learning?
Cerebellum: supervised learning
error signal by climbing fibers
forward model s' = f(s,a) and policy a = g(s)
Basal ganglia: reinforcement learning
reward signal by dopamine fibers
value functions V(s) and Q(s,a)
Cerebral cortex: unsupervised learning
Hebbian plasticity and recurrent inhibition
representation of state s and action a
But how are they recruited and combined?
Multiple Action Selection Schemes
Model-free: a = argmax_a Q(s,a)
Model-based: a = argmax_a [r + V(f(s,a))]
forward model: f(s,a)
Encapsulation: a = g(s)
[Diagrams: s → Q → a (model-free); s, a → f → s' → V (model-based); s → g → a (encapsulated policy); see the sketch below]
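A minimal sketch contrasting the two selection rules (all tables below are made up for illustration):

```python
import numpy as np

# Model-free vs. model-based action selection on a toy problem.
nS, nA = 3, 2
Q = np.array([[0.2, 0.5], [0.1, 0.0], [0.3, 0.4]])   # learned action values Q(s, a)
f = np.array([[1, 2], [0, 2], [0, 1]])               # deterministic forward model: s' = f[s, a]
r = np.array([[0.0, 1.0], [0.0, 0.0], [0.5, 0.0]])   # immediate reward r(s, a)
V = np.array([0.1, 0.8, 0.2])                        # learned state values V(s)

s = 0
a_model_free  = np.argmax(Q[s])              # a = argmax_a Q(s, a)
a_model_based = np.argmax(r[s] + V[f[s]])    # a = argmax_a [r + V(f(s, a))]
print(a_model_free, a_model_based)
```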
Lectures at OCNC 2005
Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato
Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O'Doherty, Minoru Kimura, Wolfram Schultz
State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Framework for learning state-action mapping (policy) by exploration and reward feedback
Critic: reward prediction
Actor: action selection
Learning:
external reward r
internal reward δ: difference from prediction
Reinforcement Learning (RL)
environment
reward r
action a
state s
agent
critic
actor
Reinforcement Learning
Predict reward: value function
V(s) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s]
Q(s,a) = E[r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a]
Select action
greedy: a = argmax_a Q(s,a)
Boltzmann: P(a|s) ∝ exp[β Q(s,a)]
Update prediction: TD error
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
ΔV(s(t)) = α δ(t), ΔQ(s(t),a(t)) = α δ(t)
Cyber Rodent Project
Robots with the same constraints as biological agents
What is the origin of rewards?
What should be learned, and what should be evolved?
Self-preservation: capture batteries
Self-reproduction: exchange programs through IR ports
Cyber Rodent: Hardware
camera, range sensor, proximity sensors, gyro
battery latch, two wheels
IR port, speaker, microphones, R/G/B LED
Evolving Robot Colony
Survival: catch battery packs
Reproduction: copy 'genes' through IR ports
Discounting Future Reward
large γ vs. small γ
Setting of Reward Function
Reward r = r_main + r_supp − r_cost
e.g., a supplementary reward for having a battery in view
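A minimal sketch of such a composite reward (the signals and weights are illustrative assumptions):

```python
# Composite reward r = r_main + r_supp - r_cost; all terms below are made up.
def reward(captured_battery: bool, battery_in_view: bool, motor_power: float) -> float:
    r_main = 1.0 if captured_battery else 0.0   # main goal: capturing a battery
    r_supp = 0.1 if battery_in_view else 0.0    # supplementary shaping term
    r_cost = 0.01 * motor_power                 # movement cost
    return r_main + r_supp - r_cost

print(reward(False, True, 2.0))   # 0.08: shaped reward before the goal is reached
```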
Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003)
Fluctuations in the metaparameters correlate with average reward
[Plots: reward and battery level over time; inverse temperature β (0–14) as a function of battery level (0–1)]
Randomness Control by Battery Level
Greedier action at both extremes
Neuromodulators for Metalearning
(Doya, 2002)
Metaparameter tuning is critical in RL.
How does the brain tune them?
Dopamine: TD error δ
Acetylcholine: learning rate α
Noradrenaline: inverse temperature β
Serotonin: discount factor γ
Learning Rate
ΔV(s(t−1)) = α δ(t)
ΔQ(s(t−1), a(t−1)) = α δ(t)
small α: slow learning
large α: unstable learning
Acetylcholine (basal forebrain)
regulates memory update and retention (Hasselmo et al.)
LTP in cortex and hippocampus
top-down and bottom-up information flow
Inverse Temperature
Greediness in action selection
P(a_i|s) ∝ exp[β Q(s,a_i)]
small β: exploration
large β: exploitation
Noradrenaline (locus coeruleus)
correlation with performance accuracy (Aston-Jones et al.)
modulation of cellular I/O gain (Cohen et al.)
[Plot: P(a₁) as a function of Q(s,a₁) − Q(s,a₂) for β = 0, 1, 10]
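A minimal sketch reproducing the shape of those curves (for two actions the Boltzmann rule reduces to a logistic in the value difference; the β values follow the slide):

```python
import numpy as np

# For two actions, P(a1|s) = exp(βQ1) / (exp(βQ1) + exp(βQ2))
#                          = 1 / (1 + exp(-β (Q1 - Q2))).
def p_a1(dq, beta):
    return 1.0 / (1.0 + np.exp(-beta * dq))

dq = np.linspace(-4, 4, 5)               # value difference Q(s,a1) - Q(s,a2)
for beta in (0.0, 1.0, 10.0):            # β = 0: uniformly random; β = 10: near-greedy
    print(beta, np.round(p_a1(dq, beta), 3))
```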
Serotonin (dorsal raphe)
low activity is associated with impulsivity
depression, bipolar disorders
aggression, eating disorders
Discount Factor
[Plots: a reward sequence over time steps 1–10 with a delayed reward; the predicted value is V = −0.093 with γ = 0.5 and V = +0.062 with γ = 0.9]
V(s(t)) = E[r(t+1) + γr(t+2) + γ²r(t+3) + …]
Balance between short- and long-term results
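A minimal sketch of how γ shifts this balance, using an assumed reward sequence (a small cost at each step followed by a delayed reward; the numbers are illustrative, chosen to show the same sign flip as in the figure):

```python
# Effect of the discount factor on the value of a delayed reward.
# Made-up sequence: a running cost of -0.05 for 9 steps, then +1.0.
rewards = [-0.05] * 9 + [1.0]

for gamma in (0.5, 0.9):
    V = sum(gamma**k * r for k, r in enumerate(rewards))
    print(gamma, round(V, 3))   # γ = 0.5: V < 0; γ = 0.9: V > 0
```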
TD Error
δ(t) = r(t) + γV(s(t)) − V(s(t−1))
Global learning signal
reward prediction: ΔV(s(t−1)) = α δ(t)
reinforcement: ΔQ(s(t−1), a(t−1)) = α δ(t)
Dopamine (substantia nigra, VTA)
responds to errors in reward prediction
reinforcement of actions
addiction
TD Model of the Basal Ganglia (Houk et al. 1995, Montague et al. 1996, Schultz et al. 1997, ...)
Striosome: state value V(s)
Matrix: action value Q(s,a)
[Diagram: sensory input → cerebral cortex (state representation s) → striatum (evaluation V(s), action selection Q(s,a)) → SNr/GP → thalamus → action output; dopamine neurons carry the TD signal δ(t), driven by reward r]
SNr/GPi: action selection, Q(s,a) → a
NA? ACh? 5-HT?
Possible Control of Discount Factor
Modulation of TD error
Selection/weighting of parallel networks
[Diagram: parallel striatal networks V₁, V₂, V₃ with discount factors γ₁, γ₂, γ₃; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))]
δ(t) = r(t) + γV(s(t+1)) − V(s(t))
Markov Decision Task (Tanaka et al., 2004)
State transition and reward functions
Stimulus and response
Behavioral Results
All subjects successfully learned optimal behavior
Block-Design Analysis
SHORT vs. NO (p < 0.001 uncorrected)
LONG vs. SHORT (p < 0.0001 uncorrected)
OFC, insula, striatum, cerebellum
cerebellum, striatum, dorsal raphe, DLPFC, VLPFC, IPC, PMd
Different brain areas involved in immediate and future reward prediction
Ventro-Dorsal Difference
Lateral PFC, insula, striatum
Estimate V(t) and δ(t) from subjects' performance data
Regression analysis of fMRI data
Model-based Regressor Analysis
[Diagram: the agent (policy, value function V(s), TD error δ(t)) interacts with the environment through state s(t), action a(t), and reward r(t) (20 yen); model-derived V(t) and δ(t) serve as regressors for the fMRI data]
Explanatory Variables (subject NS)
Reward prediction V(t) at γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
Reward prediction error δ(t) at γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99
[Time courses over trials 1–312]
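A minimal sketch of how such regressors can be built: discounted reward predictions V(t) computed from one reward series at each γ (the reward series and the recursive form are illustrative assumptions, not the study's exact model):

```python
import numpy as np

# V(t) = E[r(t+1) + γ r(t+2) + γ² r(t+3) + ...], computed backward
# through one made-up trial-by-trial reward series.
rng = np.random.default_rng(3)
rewards = rng.choice([0.0, 1.0], size=312, p=[0.8, 0.2])
gammas = [0.0, 0.3, 0.6, 0.8, 0.9, 0.99]

regressors = {}
for g in gammas:
    V = np.zeros_like(rewards)
    for t in range(len(rewards) - 2, -1, -1):
        V[t] = g * (rewards[t + 1] + V[t + 1])   # V(t) = γ [r(t+1) + V(t+1)]
    regressors[g] = V

print({g: np.round(v[:5], 3) for g, v in regressors.items()})
```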
Regression Analysis
Reward prediction V: mPFC (x = −2 mm), insula (x = −42 mm)
Reward prediction error δ: striatum (z = 2 mm)
Tryptophan Depletion/Loading
Tryptophan: precursor of serotonin
Depletion/loading affects central serotonin levels (e.g., Bjork et al. 2001; Luciana et al. 2001)
100 g amino acid drink; experiments after 6 hours
Day 1: Tr− (depletion, no tryptophan)
Day 2: Tr0 (control, 2.3 g of tryptophan)
Day 3: Tr+ (loading, 10.3 g of tryptophan)
Blood Tryptophan Levels
N.D. (< 3.9 μg/ml)
Delayed Reward Choice Task
Sessions | Initial black patches (Yellow, White) | Patches/step (Yellow, White)
1,2,7,8 | 72, 24, 18, 9 | 8, 2, 6, 2
3 | 72, 24, 18, 9 | 8, 2, 14, 2
4 | 72, 24, 18, 9 | 16, 2, 14, 2
5,6 | 72, 24, 18, 9 | 16, 2, 6, 2
yellow: large reward with long delay
white: small reward with short delay
Choice Behaviors
The shift of the indifference line was not consistent among the 12 subjects
Modulation of Striatal Response
[Plots: striatal responses at γ = 0.6, 0.7, 0.8, 0.9, 0.99 under Tr−, Tr0, and Tr+]
Modulation by Tr Levels
Changes in Correlation Coefficient
γ = 0.6 (28, 0, −4): Tr− > Tr+, correlation with V at small γ in ventral putamen
γ = 0.99 (16, 2, 28): Tr− < Tr+, correlation with V at large γ in dorsal putamen
[Bar plots: regression slopes under Tr−, Tr0, Tr+]
ROI (region of interest) analysis
Summary
Immediate reward: lateral OFC
Future reward: parietal, PMd, DLPFC, lateral cerebellum, dorsal raphe
Ventro-dorsal gradient: insula, striatum
Serotonergic modulation
Outline
Introduction
Cerebellum, basal ganglia, and cortex
Meta-learning and neuromodulators
Prediction time scale and serotonin
Collaborators
Kyoto PUM: Minoru Kimura, Yasumasa Ueda
Hiroshima U: Shigeto Yamawaki, Yasumasa Okamoto, Go Okada, Kazutaka Ueda, Shuji Asahi, Kazuhiro Shishida
ATR: Jun Morimoto, Kazuyuki Samejima
CREST: Nicolas Schweighofer, Genci Capi
NAIST: Saori Tanaka
OIST: Eiji Uchibe, Stefan Elfwing