Ramping and State Uncertainty in the Dopamine Signal
John G. Mikhael1,2∗, HyungGoo R. Kim3, Naoshige Uchida3, Samuel J. Gershman4
1Program in Neuroscience, Harvard Medical School, Boston, MA 02115
2MD-PhD Program, Harvard Medical School, Boston, MA 02115
3Department of Molecular and Cellular Biology and Center for Brain Science, Harvard University,
Cambridge, MA 02138
4Department of Psychology and Center for Brain Science, Harvard University, Cambridge, MA 02138
∗Corresponding Author
Correspondence: john [email protected]
bioRxiv preprint first posted online Oct. 16, 2019; doi: http://dx.doi.org/10.1101/805366. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Abstract

Reinforcement learning models of the basal ganglia map the phasic dopamine signal to reward prediction errors (RPEs). Conventional models assert that, when a stimulus reliably predicts a reward with fixed delay, dopamine activity during the delay period and at reward time should converge to baseline through learning. However, recent studies have found that dopamine exhibits a gradual ramp before reward in certain conditions even after extensive learning, such as when animals are trained to run to obtain the reward, thus challenging the conventional RPE models. In this work, we begin with the limitation of temporal uncertainty (animals cannot perfectly estimate time to reward), and show that sensory feedback, which reduces this uncertainty, will cause an unbiased learner to produce RPE ramps. On the other hand, in the absence of feedback, RPEs will be flat after learning. These results reconcile the seemingly conflicting data on dopamine behaviors under the RPE hypothesis.

Keywords: dopamine, ramping, reinforcement learning, reward prediction error, state value, state uncertainty, sensory feedback
Introduction

Perhaps the most successful convergence of reinforcement learning theory with neuroscience has been the insight that the phasic activity of midbrain dopamine (DA) neurons tracks 'reward prediction errors' (RPEs), or the difference between received and expected reward (Schultz et al., 1997; Schultz, 2007a; Glimcher, 2011). In reinforcement learning algorithms, RPEs serve as teaching signals that update an agent's estimate of rewards until those rewards are well-predicted. In a seminal experiment, Schultz et al. (1997) recorded from midbrain DA neurons in primates and found that the neurons responded with a burst of activity when an unexpected reward was delivered. However, if a reward-predicting cue was available, the DA neurons eventually stopped responding to the (now expected) reward and instead began to respond to the cue, much like an RPE (see Results). This finding formed the basis for the RPE hypothesis of DA.

Over the past two decades, a large and compelling body of work has supported the view that phasic DA functions as a teaching signal (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015). In particular, phasic DA activity has been shown to track, remarkably well, the RPE term of temporal difference (TD) learning models, which we review below (Schultz, 2007a). However, recent results have called this model of DA into question. Using fast-scan cyclic voltammetry in rat striatum during a goal-directed spatial navigation task, Howe et al. (2013) observed a ramping phenomenon—
a steady increase in DA over the course of a single trial—that persisted even after extensive training. Since then, DA ramping has been observed during a two-armed bandit task (Hamid et al., 2016) and during the execution of self-initiated action sequences (Collins et al., 2016). At first glance, these findings appear to contradict the RPE hypothesis of DA. Indeed, why would error signals persist (and ramp) after a task has been well-learned? Perhaps, then, instead of reporting an RPE, DA should be reinterpreted as reflecting the value of the animal's current state, such as its position during reward approach (Hamid et al., 2016). Alternatively, perhaps DA signals different quantities in different tasks, e.g., value in operant tasks, in which the animal must act to receive reward, and RPE in classical conditioning tasks, in which the animal need not act to receive reward.
To distinguish among these possibilities, Kim et al. (2019) recently devised an experimental paradigm that dissociates the value and RPE interpretations of DA. As we show in the Methods, RPEs in the experiments considered above can be approximated as the derivative of value under the TD learning framework. By training mice in a virtual reality environment and manipulating various properties of the task—namely, the speed of scene movement, teleports, and temporary pauses at various locations—the authors could dissociate spatial navigation from locomotion and make precise predictions about how value should change vs. how its derivative (RPE) should change. The authors found that mice continued to display ramping DA signals during the task even without locomotion, and that the changes in DA behaviors were consistent with the RPE hypothesis and not with the value interpretation.

The body of experimental work outlined above raises a number of unanswered questions regarding the function of DA: First, why would an error signal persist once an association is well-learned? Second, why would it ramp over the duration of the trial? Third, why would this ramp occur in some tasks but not others? Does value (and thus RPE) take different functional forms in different tasks, and if so, what determines which forms result in a ramp and which do not? Here we address these questions from normative principles.
In this work, we examine the influence of sensory feedback in guiding value estimation. Because of irreducible temporal uncertainty, animals not receiving sensory feedback (and therefore relying only on internal timekeeping mechanisms) will have corrupted value estimates regardless of how well a task is learned. In this case, value functions will be "blurred" in proportion to the uncertainty at each point. Sensory feedback, however, reduces this blurring as each new timepoint is approached. Beginning with the normative principle that animals seek to best learn the value of each state, we show that unbiased learning, in the presence of feedback, requires RPEs that ramp. These ramps scale with the informativeness of the feedback (i.e., the reduction in uncertainty), and at the extreme, absence of feedback leads to flat RPEs. Thus we show that
differences in a task's feedback profile explain the puzzling collection of DA behaviors described above.

We will begin the next section with a review of the TD learning algorithm, then examine the effect of state uncertainty on value learning. We will then show how, by reducing state uncertainty without biasing learning, sensory feedback causes the RPE to reproduce the experimentally observed behaviors of DA.
Results

Temporal Difference Learning
In TD learning, an agent transitions through a sequence of states according to a Markov process (Sutton, 1988). The value associated with each state is defined as the expected discounted future return:

$$V_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right], \tag{1}$$

where $t$ denotes time and indexes states, $r_t$ denotes the reward delivered at time $t$, and $\gamma \in (0,1)$ is a discount factor. In the experiments we will examine, a single reward is presented at the end of each trial. For these cases, Equation (1) can be written simply as:

$$V_t = \gamma^{T-t} r, \tag{2}$$

for all $t \in [0,T]$, where $r$ is the magnitude of reward delivered at time $T$. In words, value increases exponentially as reward time $T$ is approached, peaking at a value of $r$ at $T$ (Figure 1B,D). Additionally, note that exponential functions are convex: The convex shape of the value function will be important in subsequent sections (see Kim et al. (2019) for experimental verification).
How does the agent learn this value function? Under the Markov property, Equation (1) can be rewritten as:

$$V_t = r_t + \gamma V_{t+1}, \tag{3}$$

which is referred to as the Bellman equation (Bellman, 1957). The agent approximates $V_t$ with an estimate $\hat{V}_t$, which is updated in the event of a mismatch between the estimated value and the reward actually received. By analogy with Equation (3), this mismatch (the RPE) can be written as:

$$\delta_t = r_t + \gamma \hat{V}_{t+1} - \hat{V}_t. \tag{4}$$

When $\delta_t$ is zero, Equation (3) has been well-approximated. However, when $\delta_t$ is positive or negative, $\hat{V}_t$ must be increased or decreased, respectively:

$$\hat{V}_t^{(n+1)} = \hat{V}_t^{(n)} + \alpha \delta_t^{(n)}, \tag{5}$$

where $\alpha \in (0,1)$ denotes the learning rate, and the superscript $(n)$ denotes the learning step. Learning will progress until $\delta_t = 0$ on average. After this point, $\hat{V}_t = \gamma^{T-t} r$ on average, which is precisely the true value. (See the Methods for a more general description of TD learning and its neural implementation.)
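The convergence claim above is easy to verify numerically. The following is a minimal sketch of tabular TD(0) for a single-reward trial, not the authors' simulation code; the trial length, reward magnitude, and learning parameters are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch of TD(0) learning (Equations 2-5) for a trial in which a
# single reward r arrives at time T. Parameter values are illustrative.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1

V = np.zeros(T + 2)                    # value estimates; V[T+1] is terminal (0)
for episode in range(5000):            # repeat trials until learning converges
    for t in range(T + 1):
        reward = r if t == T else 0.0  # reward is delivered only at time T
        delta = reward + gamma * V[t + 1] - V[t]   # RPE, Equation (4)
        V[t] += alpha * delta                      # update, Equation (5)

# After learning, delta ≈ 0 at every state and V[t] ≈ γ^(T−t)·r (Equation 2)
true_V = gamma ** (T - np.arange(T + 1)) * r
print(float(np.max(np.abs(V[:T + 1] - true_V))))
```

Running the loop drives the RPE to zero at every state, so the learned estimates match the exponential value function of Equation (2).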
Having described TD learning in the simplified case where the agent has a perfect internal clock and thus no state uncertainty, let us now examine how state uncertainty affects learning, and how this uncertainty is reduced with sensory feedback.
Value Learning Under State Uncertainty
Because animals do not have perfect internal clocks, they do not have complete access to the true time $t$ (Gibbon, 1977; Church and Meck, 2003; Staddon, 1965). Instead, $t$ is a latent state corrupted by timing noise, often modeled as follows:

$$\tau \sim \mathcal{N}(t, \sigma_t^2), \tag{6}$$

where $\tau$ is subjective (internal) time, drawn from a distribution centered on objective time $t$, with some standard deviation $\sigma_t$. We take this distribution to be Gaussian for simplicity (an assumption we relax in the Methods). Thus the subjective estimate of value $\hat{V}_\tau$ is an average over the estimated values $\hat{V}_t$ of each state $t$:

$$\hat{V}_\tau = \sum_t p(t|\tau)\, \hat{V}_t, \tag{7}$$

where $p(t|\tau)$ denotes the probability that $t$ is the true state given the subjective measurement $\tau$, and thus represents state uncertainty. We refer to this quantity as the uncertainty kernel (Figure 1A,C). Intuitively, $\hat{V}_\tau$ is the result of blurring $\hat{V}_t$ in proportion to the uncertainty kernel (Methods).
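The blurring of Equation (7) can be demonstrated directly. The sketch below (illustrative parameters, not the paper's Methods code) applies a Gaussian uncertainty kernel to the convex value function of Equation (2): blurring inflates the value estimate, and wider kernels inflate it more.

```python
import numpy as np

# Blur the convex value function V_t = γ^(T−t)·r with a Gaussian
# uncertainty kernel p(t|τ) (Equation 7). Because the function is convex,
# the kernel-weighted average overshoots the true value at τ, and wider
# kernels overshoot more. Parameters are illustrative.
T, r, gamma = 50, 1.0, 0.9
t = np.arange(T + 1)
V_true = gamma ** (T - t) * r                 # Equation (2)

def blurred_value(tau, sigma):
    p = np.exp(-0.5 * ((t - tau) / sigma) ** 2)
    p /= p.sum()                              # normalized uncertainty kernel
    return float(np.sum(p * V_true))          # Equation (7)

tau = 25
print(V_true[tau], blurred_value(tau, 2.0), blurred_value(tau, 5.0))
# true value < narrow-kernel estimate < wide-kernel estimate
```

For a Gaussian kernel of standard deviation $\sigma$, the overshoot works out to a factor of roughly $\exp[(\ln\gamma)^2\sigma^2/2]$, which is the quantity that reappears in the correction term of Equation (12).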
Figure 1: Sensory Feedback Biases Value Learning. (A) Illustration of state uncertainty in the absence of sensory feedback. Each row includes the uncertainty kernels at the current state and the next state (solid curves). Lighter gray curves represent uncertainty kernels for later states. Thus, similarly colored kernels on different rows represent uncertainty kernels for the same state, but evaluated at different timepoints (e.g., dashed box). In the absence of feedback, state uncertainty for a single state does not acutely change across time (compare with C). (B) Without feedback, value is unbiased on average. Red curves represent the predicted increase in value between the current state and the next state (10 and 20 for light red; 20 and 30 for red; 30 and 40 for brick red). After learning, this roughly equals an increase by $\gamma^{-1}$ on average. (C) Sensory feedback reduces state uncertainty. Three instances of feedback are shown for illustration (S.F.; arrows). Note here that the kernels used to estimate value at the same state have different widths depending on whether they were evaluated before or after feedback. This results in different value estimates being used to compute the RPE at the current state and at the next state (Equations (8) and (9)). (D) As a result of sensory feedback, value at each state will be estimated based on an inflated version of value at the next state. Hence, after learning (when RPE is zero on average), estimated value will be systematically larger than true value. Red curves represent the predicted increase in value between the current state and the next state. After learning, this roughly equals an increase by $\gamma^{-1}$ on average. See Methods for simulation details.
After learning (i.e., when the RPE is zero on average), the estimated value at every state will be roughly the estimated value at the next state, discounted by $\gamma$, on average (black curve in Figure 1B). A key requirement for this unbiased learning can be discovered by writing the RPE equations for two successive states:

$$\delta_\tau = r_\tau + \gamma \hat{V}_{\tau+1} - \hat{V}_\tau \tag{8}$$

$$\delta_{\tau+1} = r_{\tau+1} + \gamma \hat{V}_{\tau+2} - \hat{V}_{\tau+1}. \tag{9}$$

Notice here that $\hat{V}_{\tau+1}$ is represented in both equations. Thus, for value to be well-learned, a requirement
is that $\hat{V}_{\tau+1}$ not acutely change during the interval after computing $\delta_\tau$ and before computing $\delta_{\tau+1}$. This requirement extends to changes in the uncertainty kernels: By Equation (7), if the kernel $p(t|\tau+1)$ were to be acutely updated due to information available at $\tau+1$ but not at $\tau$, then $\hat{V}_{\tau+1}$ will acutely change as well. This means that $\hat{V}_\tau$ will be discounted based on $\hat{V}_{\tau+1}$ before feedback (i.e., as estimated at $\tau$; red curves in Figure 1D) rather than $\hat{V}_{\tau+1}$ after feedback (i.e., as estimated at $\tau+1$; black curve). In the next section, we examine this effect more precisely.
Value Learning in the Presence of Sensory Feedback
How is value learning affected by sensory feedback? As each time $\tau$ is approached, state uncertainty is reduced due to sensory feedback (arrows in Figure 1C). This is because at timepoints preceding $\tau$, the estimate of what value will be at $\tau$ is corrupted by both temporal noise and the lower-resolution stimuli associated with $\tau$. Approaching $\tau$ in the presence of sensory feedback reduces this corruption. This, however, means that $\hat{V}_{\tau+1}$ will be estimated differently while computing $\delta_\tau$ and $\delta_{\tau+1}$ (Equations (8) and (9); compare widths of similarly colored kernels beneath each arrow in Figure 1C), which in turn results in biased value learning.

To examine the nature of this bias, we note that averaging over a convex value function results in overestimation of value. Intuitively, convex functions are steeper on the right (larger values) and shallower on the left (smaller values), so averaging results in a bias toward larger values. Furthermore, wider kernels result in greater overestimation (Methods). Thus upon entering each new state, the reduction of uncertainty via sensory feedback will acutely mitigate this overestimation, resulting in different estimates $\hat{V}_{\tau+1}$ being used for $\delta_\tau$ and $\delta_{\tau+1}$. Left uncorrected, the value estimate will be systematically biased, and in particular, value will be overestimated at every point (Figure 2A; Methods). An intuitive way to see this is as follows: The objective of the TD algorithm (in this simplified task setting) is for the value at each state $\tau$ to be $\gamma$ times smaller than the value at $\tau+1$ by the time the RPE converges to zero (Equation (2)). If an animal systematically overestimates value at the next state, then it will overestimate value at the current state as well (even if sensory feedback subsequently diminishes the next state's overestimation). Thus the "wrong" value function is learned (Figure 2A,B).
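This compounding overestimation can be sketched in simulation. In the toy example below (a sketch with made-up numbers, not the paper's Methods code), each TD update uses an inflated estimate of the next state's value, standing in for the wide-kernel, pre-feedback estimate; the learned value function then overshoots the true one at every state, and by more at states farther from reward.

```python
import numpy as np

# Uncorrected learning: the RPE at each state is computed with an inflated
# (pre-feedback) estimate of the next state's value. The inflation factor
# here is an illustrative stand-in for exp[(ln γ)^2 (l^2 − s^2)/2];
# all parameter values are made up for this sketch.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1
inflate = 1.05                         # pre-feedback overestimation factor

V = np.zeros(T + 2)                    # V[T+1] is terminal (0)
for episode in range(20000):
    for tau in range(T + 1):
        reward = r if tau == T else 0.0
        # next-state value as estimated *before* feedback (inflated):
        delta = reward + gamma * inflate * V[tau + 1] - V[tau]
        V[tau] += alpha * delta        # plain TD update, no correction

true_V = gamma ** (T - np.arange(T + 1)) * r
print(float(V[0]), float(true_V[0]))   # learned value overshoots true value
```

At convergence each state satisfies $\hat{V}_\tau = \gamma \cdot \text{inflate} \cdot \hat{V}_{\tau+1}$ rather than $\hat{V}_\tau = \gamma \hat{V}_{\tau+1}$, so the overshoot compounds multiplicatively with distance from reward.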
Figure 2: Unbiased Learning in the Presence of Feedback Leads to RPE Ramps. (A) Value at each state is learned according to an overestimated version of value at the next state. Thus, a biased value function is learned (see Figure 1D). (B) After learning, the RPE converges to zero. (C) With a correction term, the correct value function is learned. (D) The cost of forcing an unbiased learning of value is a persistent RPE. Intuitively, value at the current state is not influenced by the overestimated version of value at the next state (compare with A,B). By Equation (13), this results in RPEs that ramp. See Methods for simulation details.
To overcome this bias, an optimal agent must correct the just-computed RPE as sensory feedback becomes available. In the Methods, we show that this correction can simply be written as:

$$\hat{V}_t^{(n+1)} = \hat{V}_t^{(n)} + \alpha \delta_\tau^{(n)} p(t|\tau) - \beta \hat{V}_\tau^{(n)} p(t|\tau) \tag{10}$$

$$\simeq \hat{V}_t^{(n)} + \alpha \delta_\tau^{(n)} p(t|\tau) - \beta \hat{V}_t^{(n)}, \tag{11}$$

where the approximate equality holds for sufficient reductions in state uncertainty due to feedback, and

$$\beta = \alpha \left( \exp\left[ \frac{(\ln \gamma)^2 (l^2 - s^2)}{2} \right] - 1 \right). \tag{12}$$

Here, the uncertainty kernel of $\hat{V}_{\tau+1}$ has some standard deviation $l$ at $\tau$ and a smaller standard deviation $s$ at $\tau+1$. In words, as the animal gains an improved estimate of $\hat{V}_{\tau+1}$, it corrects the previously computed $\delta_\tau$ with a feedback term to ensure unbiased learning of value (Figure 2C). Notice here that the correction term is a function of the reduction in variance ($l^2 - s^2$) due to sensory feedback. In the absence of feedback,
the reduction in variance is zero (the uncertainty kernel for $\tau+1$ cannot be reduced during the transition from $\tau$ to $\tau+1$), which means $\beta = 0$.

How does this correction affect the RPE? By Equation (10), the RPE will converge to:

$$\delta_\tau = \frac{\beta}{\alpha} \hat{V}_\tau. \tag{13}$$

Therefore, with sensory feedback, the RPE ramps and tracks $\hat{V}_\tau$ in shape (Figure 2D). In the absence of feedback, $\beta = 0$; thus, there is no ramp.

In summary, when feedback is provided with new states, value learning becomes miscalibrated, as each value point will be learned according to an overestimated version of the next (Figure 2A). With a subsequent correction of this bias, the agent will continue to overestimate the RPEs at each point (RPEs will ramp; Figure 2D), in exchange for learning the correct value function (Figure 2C).
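The corrected update can be simulated end to end. The sketch below is an idealized, tabular rendering of Equations (11)-(13) under simplifying assumptions made here for illustration: the kernel is taken to be narrow enough that $p(t|\tau) \approx 1$ at the current state, the pre-feedback inflation is the Gaussian factor $\exp[(\ln\gamma)^2(l^2 - s^2)/2]$, and the terminal reward is observed directly (so no correction is applied at that step). Parameter values are arbitrary.

```python
import numpy as np

# Corrected TD learning in the presence of sensory feedback (Equation 11):
# the RPE is computed with the inflated pre-feedback estimate of the next
# state's value, and the β term (Equation 12) corrects the bias. At
# convergence, value is learned correctly and the residual RPE ramps,
# δ_τ = (β/α) V_τ (Equation 13). An illustrative sketch, not the paper's code.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1
l, s = 3.0, 1.0                          # kernel std before/after feedback
inflate = np.exp((np.log(gamma) ** 2) * (l ** 2 - s ** 2) / 2)
beta = alpha * (inflate - 1)             # Equation (12)

V = np.zeros(T + 2)                      # V[T+1] is terminal (0)
for episode in range(20000):
    for tau in range(T + 1):
        if tau == T:
            V[tau] += alpha * (r - V[tau])   # reward observed directly
        else:
            delta = gamma * inflate * V[tau + 1] - V[tau]  # pre-feedback RPE
            V[tau] += alpha * delta - beta * V[tau]        # Equation (11)

true_V = gamma ** (T - np.arange(T + 1)) * r
rpe = np.array([gamma * inflate * V[tau + 1] - V[tau] for tau in range(T)])
print(float(np.max(np.abs(V[:T + 1] - true_V))))   # value is unbiased
print(rpe[:3], rpe[-3:])                            # RPE ramps toward reward
```

Setting $l = s$ (no feedback) gives inflate $= 1$ and $\beta = 0$, and the residual RPE vanishes, recovering the flat-RPE case described above.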
Relationship with Experimental Data
In classical conditioning tasks without sensory feedback, DA ramping is not observed (Schultz et al., 1997; Kobayashi and Schultz, 2008; Stuber et al., 2008; Flagel et al., 2011; Cohen et al., 2012; Hart et al., 2014; Eshel et al., 2015; Menegas et al., 2015, 2017; Babayan et al., 2018) (Figure 3A). On the other hand, in goal-directed navigation tasks, characterized by sensory feedback in the form of salient visual cues as well as locomotive cues (e.g., joint movement), DA ramping is present (Howe et al., 2013) (Figure 3C). DA ramping is also present in classical conditioning tasks that do not involve locomotion but that include either spatial or non-spatial feedback (Kim et al., 2019), as well as in two-armed bandit tasks (Hamid et al., 2016) and when executing self-initiated action sequences (Wassum et al., 2012; Collins et al., 2016).
[Figure 3: (A) Classical conditioning task without sensory feedback, in which DA ramping is not observed (panel reproduced from Schultz et al., 1997). (C) Goal-directed spatial navigation task with sensory feedback, in which DA ramping is present (panels reproduced from Howe et al., 2013). Text extracted from the reproduced source figures omitted.]
Figure 2 | Ramping dopamine signals proximityto distant rewards. a, Distribution of trial times(from warning click to goal-reaching, n 5 3,933trials). b, c, Dopamine release modelled as afunction of time elapsed since maze-running onset(b) and as a function of spatial proximity to visitedgoal (c) for short (purple) and long (orange) trials(see Methods). Vertical lines indicate trial start(red) and end (purple and orange) times. d, Peakdopamine concentration versus trial time for allramping trials (n 5 2,273, Pearson’s R 5 0.0004,P 5 0.98). e, Experimentally recorded dopaminerelease (mean 6 s.e.m.) in short (n 5 327, purple)and long (n 5 423, orange) trials. Dopamine peaksat equivalent levels, as in the proximity model inc. f, Normalized peak dopamine levels(mean 6 s.e.m.) predicted by time-elapsed (red)and proximity (light blue) models, and measuredexperimental data (dark blue).
RESEARCH LETTER
5 7 6 | N A T U R E | V O L 5 0 0 | 2 9 A U G U S T 2 0 1 3
Macmillan Publishers Limited. All rights reserved©2013
CS RTime
0
0.05
0.1
RPE
A
C
B
D
Click GoalTime
0
0.05
0.1
RPE
Figure 3: Differences in Feedback Result in Different RPE Behaviors. (A) Schultz et al. (1997) have found that after learning, phasic DA responses to a predicted reward (R) diminish, and instead begin to appear at the earliest reward-predicting cue (conditioned stimulus; CS). Figure from Schultz et al. (1997). (B) Our derivations recapitulate this result. In the absence of sensory feedback, RPEs converge to zero. (C) Howe et al. (2013) have found that the DA signal ramps over the course of a single trial during a well-learned navigation task. Figure from Howe et al. (2013). (D) Our derivations recapitulate this result. In the presence of sensory feedback, RPEs track the shape of the estimated value function. See Methods for simulation details.
As described in the previous section, sensory feedback—due to external cues or to the animal's own movement—can reconcile both types of DA behaviors with the RPE hypothesis: In the absence of feedback, there is no reduction in state uncertainty upon entering each new state (β = 0), and therefore no ramps (Equation (13); Figure 3B). On the other hand, when state uncertainty is reduced as each state is entered, ramps will occur (Figure 3D).

More generally, our results demonstrate that a measured DA signal whose shape tracks with estimated value need not be evidence against the RPE hypothesis of DA, contrary to some claims (Hamid et al., 2016; Berke, 2018): Indeed, in the presence of sensory feedback, δ_τ and V_τ have the same shape. Thus, our derivation is conceptually compatible with the value interpretation of DA under certain circumstances, but importantly, this derivation captures the experimental findings in other circumstances in which the value interpretation fails.
Discussion

The role of DA in reinforcement learning has long been studied. While a large body of work has established phasic DA as an error signal (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015), more recent work has questioned this view (Wassum et al., 2012; Howe et al., 2013; Hamid et al., 2016; Collins et al., 2016). Indeed, in light of persistent DA ramps occurring in certain tasks even after extensive learning, some authors have proposed that DA may instead communicate value itself in these tasks (Hamid et al., 2016). However, the determinants of DA ramps have remained unclear: Ramps are observed during goal-directed navigation, in which animals must run to receive reward (operant tasks; Howe et al., 2013), but can also be elicited in virtual reality tasks in which animals do not need to run for reward (classical conditioning tasks; Kim et al., 2019). Within classical conditioning, DA ramps can occur in the presence of navigational or non-navigational stimuli indicating time to reward (Kim et al., 2019). Within operant tasks, ramps can be observed in the period preceding the action (Totah et al., 2013) as well as during the action itself (Howe et al., 2013). Furthermore, these ramps are not specific to particular experimental techniques and measurements: they can be observed in cell-body activity, in axonal calcium signals, and in DA concentrations (Kim et al., 2019).
We have shown in this work that under the RPE hypothesis of DA, sensory feedback may control the different observed DA behaviors: In the presence of sensory feedback, RPEs track the estimated value in shape (ramps), but they remain flat in the absence of feedback (no ramps). Thus, DA ramps and phasic responses follow from common computational principles and may be generated by common neurobiological mechanisms.
Our derivation makes a number of testable predictions. In particular, our results predict that any type of information that reduces state uncertainty—for example, an auditory tone whose frequency reflects time to reward, or a moving visual stimulus whose position reflects time to reward—will result in a DA ramp, and furthermore, that the magnitude of the ramp will increase with the informativeness of the stimulus (i.e., with a greater reduction in state uncertainty; Equations (12) and (13)). Therefore, in trials where the change in the tone's frequency is less apparent, or the contrast of the visual stimulus is lower, the ramp will be blunted. At the extreme, when the tone's frequency does not change with time or the contrast is minimal, no ramp will be observed. At this point, the task is indistinguishable from the classical conditioning experiments of Schultz et al. (1997) discussed in the Introduction.
Our work takes inspiration from previous studies that examined the role of state uncertainty in DA responses (Kobayashi and Schultz, 2008; Fiorillo et al., 2008; de Lafuente and Romo, 2011; Starkweather et al., 2017; Lak et al., 2017). For instance, temporal uncertainty increases with longer durations (Staddon, 1965; Gibbon, 1977; Church and Meck, 2003). This means that in a classical conditioning task, DA bursts at reward time will not be completely diminished, and will be larger for longer durations, as Kobayashi and Schultz (2008) and Fiorillo et al. (2008) have observed. Similarly, Starkweather et al. (2017) have found that in tasks with uncertainty both in whether reward will be delivered and in when it is delivered, DA exhibits a prolonged dip (i.e., a negative ramp) leading up to reward delivery. Here, value initially increases as the expected reward time approaches, but then begins to slowly decrease as reward delivery during the present trial becomes less and less likely, resulting in persistently negative prediction errors (see also Starkweather et al., 2018; Babayan et al., 2018). As the authors of these studies note, both results are fully predicted by the RPE hypothesis of DA. Hence, state uncertainty, due to noise either in the internal circuitry or in the external environment, is reflected in the DA signal.
A number of questions arise from our analysis. First, is there any evidence to support the benefits of learning the 'true' value function as written in Equation (2) (Figure 2C) over the biased version of value (Figure 2A)? We note here that under the normative account, the agent seeks to learn some value function that maximizes its well-being. Our key result is that this function—regardless of its exact shape—will not be learned well if feedback is delivered during learning, unless correction ensues. While we have chosen the exponential shape in Equation (2) following conventional TD models, our results extend to any convex value function. Second, due to this presumed exponential shape, the ramping behaviors resulting from our analysis also look exponential, rather than linear (compare with experimental results). We have nonetheless chosen to remain close to conventional TD models and purely exponential value functions for ease of comparison with the existing theoretical literature. Perhaps equally important, the relationship between the RPE and its neural correlate need only be monotonic, not necessarily an equality. In other words, a measured linear signal does not necessarily imply a linear RPE, and a convex neural signal need not communicate convex information.
Third, while we have derived RPE ramping from normative principles, it is important to note that biases in value learning may also produce ramping. For instance, one earlier proposal by Gershman (2014) was that value may take a fixed convex shape in spatial navigation tasks; the mismatch between this shape and the exponential shape in Equation (2) produces a ramp (see Methods for a general derivation of the conditions for a ramp). Morita and Kato (2014), on the other hand, posited that value updating involves a decay term. Assuming such a decay term results in a relationship qualitatively similar to that in Equation (10), and thus in RPE ramping (see also implementations in Mikhael and Bogacz, 2016; Cinotti et al., 2019). Ramping can similarly be explained by assuming a temporal or spatial bias that decreases as the animal approaches the reward, by modulating the temporal discount term during task execution, or by other mechanisms (see Supplemental Information for derivations). In each of these proposals, ramps emerge as a 'bug' in the implementation, rather than as an optimal strategy for unbiased learning. These proposals furthermore do not explain the different DA patterns that emerge under different paradigms. Finally, it should be noted that we have not assumed any modality- or task-driven differences in learning (any differences in the shape of the RPE follow solely from the sensory feedback profile), although in principle, different value functions may certainly be learned in different types of tasks.
Alternative accounts of DA ramping that deviate more significantly from our framework have also been proposed. In particular, Lloyd and Dayan (2015) have provided three compelling theoretical accounts of ramping. In the first account, the authors show that within an actor-critic framework, uncertainty in the information communicated between actor and critic regarding the timing of action execution may result in a monotonically increasing RPE leading up to the action. In the second account, ramping modulates gain control for value accumulation within a drift-diffusion model (e.g., by modulating neuronal excitability; Nicola et al., 2000). Under this framework, fluctuations in tonic and phasic DA produce ramping on average. The third account extends the average reward rate model of tonic DA proposed by Niv et al. (2007). In this extended view, ramping constitutes a 'quasi-tonic' signal that reflects discounted vigor. The authors show that the discounted average reward rate follows (1 − γ)V, and hence takes the shape of the value function in TD learning models. Finally, Howe et al. (2013) have proposed that these ramps may be necessary for sustained motivation in the operant tasks considered. Indeed, the notion that DA may serve multiple functions beyond the communication of RPEs is well motivated and deeply ingrained (Schultz, 2007b, 2010; Berridge, 2007; Frank et al., 2007; Gardner et al., 2018). Our work does not necessarily invalidate these alternative interpretations, but rather shows how a single RPE interpretation can embrace a range of apparently inconsistent phenomena.
Methods

Temporal Difference Learning and Its Neural Correlates
Under TD learning, each state is determined by task-relevant contextual cues, referred to as features, that predict future rewards. For instance, a state might be determined by an internal estimate of time or perceived distance from a reward. We model the agent as approximating V_t by taking a linear combination of the features (Schultz et al., 1997; Ludvig et al., 2008, 2012):

V_t = ∑_d w_d x_{d,t},    (14)
where V_t denotes the estimated value at time t, and x_{d,t} denotes the dth feature at t. The learned relevance of each feature x_d is reflected in its weight w_d, and the weights are updated in the event of a mismatch between the estimated value and the rewards actually received. The update occurs in proportion to each weight's contribution to the value estimate at t:

w_d^{(t+1)} = w_d^{(t)} + α δ_t^{(t)} x_{d,t},    (15)
where α ∈ (0, 1) denotes the learning rate, and the superscript denotes the learning step. In words, when a feature x_d does not contribute to the value estimate at t (x_{d,t} = 0), its weight is not updated. On the other hand, weights corresponding to features that do contribute to V_t will be updated in proportion to their activations at that time. This update rule is referred to as gradient ascent (x_{d,t} is equal to the gradient of V_t with respect to the weight w_d), and it implements a form of credit assignment, in which the features most activated at t undergo the greatest modification to their weights.
In this formulation, the basal ganglia implements the TD algorithm termwise: Cortical inputs to striatum encode the features x_{d,t}, corticostriatal synaptic strengths encode the weights w_d (Houk et al., 1995; Montague et al., 1996), phasic activity of midbrain dopamine neurons encodes the error signal δ_t (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015), and the output nuclei of the basal ganglia (substantia nigra pars reticulata and internal globus pallidus) encode the estimated value V_t (Ratcliff and Frank, 2012).

We have implicitly assumed in the Results a maximally flexible feature set, the complete serial compound representation (Moore et al., 1989; Sutton and Barto, 1990; Montague et al., 1996; Schultz et al., 1997), in which every time step following trial onset is represented as a separate feature. In other words, the feature x_{d,t} is 1 when t = d and 0 otherwise. In this case, value at each timepoint is updated independently of the other timepoints, and each has its own weight. It follows that V_t = w_t, and we can write Equation (15) directly in terms of V_t, as in Equation (5).
Value Learning Under State Uncertainty
Animals only have access to subjective time, and must infer objective time given the corruption in Equation (6). The RPE is then:

δ_τ = r_τ + γV_{τ+1} − V_τ,    (16)

and this error signal is used to update the value estimates at each point t in proportion to its posterior probability p(t|τ):

V_t^{(t+1)} = V_t^{(t)} + α δ_τ^{(t)} p(t|τ).    (17)
Said differently, the effect of state uncertainty is that when the error signal δ_τ is computed, it updates the value estimate at a number of timepoints, in proportion to the uncertainty kernel.
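Equation (17) spreads a single RPE across neighboring timepoints in proportion to the posterior p(t|τ). A sketch of one such update with a discretized Gaussian uncertainty kernel (Python; the kernel width, RPE value, and grid size are hypothetical):

```python
import numpy as np

def uncertainty_kernel(tau, sigma, n):
    """Discretized Gaussian posterior p(t | tau) over n timepoints."""
    t = np.arange(n)
    p = np.exp(-0.5 * ((t - tau) / sigma) ** 2)
    return p / p.sum()

n, alpha = 50, 0.1
V = np.zeros(n)                  # value estimates, one per timepoint
tau, sigma = 20, 3.0             # subjective time and kernel width
delta = 1.0                      # an RPE computed at subjective time tau

posterior = uncertainty_kernel(tau, sigma, n)
V += alpha * delta * posterior   # Eq. (17): credit spreads around tau
```

Timepoints near τ receive most of the credit; a narrower kernel (smaller σ) concentrates the update, recovering the tabular case in the limit.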
Acute Changes in State Uncertainty Result in Biased Value Learning
Averaging over a convex value function results in overestimation of value. For an exponential value function, we can derive this result analytically in the continuous domain:

∫_t γ^{T−t} N(t; τ, σ_t²) dt = γ^{T−τ} exp[(ln γ)² σ_t² / 2],    (18)
where the second factor on the right-hand side is greater than one. Intuitively, because the function is steeper on the right side and shallower on the left side, the average will be overestimated. Importantly, however, the estimate will be a multiple of the true value, with a scaling factor that depends on the width of the kernel (the second factor on the right-hand side of Equation (18); note also that while we have assumed a Gaussian distribution, our results hold for any distribution that results in overestimation of value). Thus, with sensory feedback that modifies the width of the kernel upon transitioning from one state (τ) to the next (τ + 1), there will be a mismatch in the value estimate when computing each RPE. More precisely, the learning rules are:
V_τ = ∑_t p(t|τ, σ_t = s) V_t,    (19)

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = l) V_t,    (20)

δ_τ = r_τ + γV_{τ+1} − V_τ,    (21)

V_t^{(t+1)} = V_t^{(t)} + α δ_τ^{(t)} p(t|τ, σ_t = s).    (22)
Notice that V_{τ+1} takes different values depending on the state: When computing δ_τ,

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = l) V_t.    (23)

On the other hand, when computing δ_{τ+1},

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = s) V_t.    (24)
How does this mismatch affect the learned value estimate? If averages taken with kernels of different standard deviations can each be written as a multiple of the true value, then they can be written as multiples of each other. The RPE is then

δ_τ = r_τ + γ(aV_{τ+1,s}) − V_{τ,s},    (25)

where we use the comma notation to denote that the two value estimates are evaluated with the same kernel width s, and a is a constant. By analogy with Equations (2) and (4), estimated value converges to V_τ = (aγ)^{T−τ} r. Here, a > 1, so value is systematically overestimated. By the learning rules in Equations (19) to (22), this is because δ_τ is inflated by

γ ∑_t p(t|τ+1, σ_{t+1} = l) V_t − γ ∑_t p(t|τ+1, σ_{t+1} = s) V_t = βV_τ,    (26)

where β is defined in Equation (12).
An optimal agent will use the available sensory feedback to overcome this biased learning. Because averaging with a kernel of width l is simply a multiple of averaging with width s, it follows that a simple subtraction can achieve this correction (Equations (10) and (11)). Hence, sensory feedback can improve value learning with a correction term. It should be noted that with a complete correction to s as derived above, the bias is fully extinguished. For corrections to intermediate widths between s and l, the bias will be partially corrected but not eliminated. In both cases, because β > 0, ramps will occur.
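The inflation factor in Equation (18) can be checked numerically: averaging the exponential value function under a Gaussian kernel reproduces the true value times exp[(ln γ)² σ_t²/2] > 1 (Python; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

gamma, T, tau, sigma = 0.9, 48.0, 20.0, 3.0

# Left-hand side of Eq. (18): Gaussian-weighted average of gamma**(T - t),
# computed by a fine Riemann sum over essentially the kernel's full support.
t = np.linspace(tau - 12 * sigma, tau + 12 * sigma, 400001)
dt = t[1] - t[0]
kernel = np.exp(-0.5 * ((t - tau) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
lhs = np.sum(gamma ** (T - t) * kernel) * dt

# Right-hand side: true value scaled by a width-dependent factor a > 1.
a = np.exp((np.log(gamma) ** 2) * sigma ** 2 / 2)
rhs = gamma ** (T - tau) * a
```

This factor is the constant a of Equation (25): widening the kernel multiplies the value estimate, which is why a subtraction that depends on the kernel widths can undo the bias.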
RPEs Are Approximately the Derivative of Value

Consider the formula for RPEs in Equation (4). In tasks where a single reward is delivered at T, r_t = 0 for all t < T (no rewards delivered before T). Because γ ≈ 1, the RPE can be approximated as

δ_t ≈ (V_{t+1} − V_t) / ((t + 1) − t),    (27)
which is the slope of the estimated value. To examine the relationship between value and RPEs more precisely, we can extend our analysis to the continuous domain:

δ(t) = lim_{Δt→0} [γ^{Δt} V(t + Δt) − V(t)] / Δt
     = V̇(t) lim_{Δt→0} γ^{Δt} + V(t) lim_{Δt→0} (γ^{Δt} − 1) / Δt
     = V̇(t) lim_{Δt→0} γ^{Δt} + V(t)(ln γ) lim_{Δt→0} γ^{Δt}
     = V̇(t) + V(t) ln γ,    (28)
where V̇(t) is the time derivative of V(t), and the third equality follows from L'Hôpital's rule. Here, ln γ has units of inverse time. Because ln γ ≈ 0, the RPE is approximately the derivative of value.
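For the exponential value function of Equation (2), the continuous-domain RPE in Equation (28) is exactly zero, consistent with flat RPEs after learning without feedback. A quick numerical check (Python; the values of γ, T, and t are illustrative):

```python
import numpy as np

gamma, T, r = 0.9, 10.0, 1.0
lng = np.log(gamma)

def V(t):
    return gamma ** (T - t) * r      # learned value, the shape of Eq. (2)

t, dt = 4.0, 1e-4
# Discrete RPE over a small step (the quantity inside the limit in Eq. 28):
delta_fd = (gamma ** dt * V(t + dt) - V(t)) / dt

# Closed form: delta(t) = dV/dt + V(t) ln(gamma). For the exponential value,
# dV/dt = -ln(gamma) V(t), so the two terms cancel and the RPE vanishes.
delta_cf = -lng * V(t) + V(t) * lng
```

The finite-difference estimate is tiny (it vanishes as dt → 0), and the closed form is identically zero.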
Sensory Feedback in Continuous Time

In the complete absence of sensory feedback, σ_t is not constant, but rather increases linearly with time, a phenomenon referred to as scalar variability, a manifestation of Weber's law in the domain of timing (Gibbon, 1977; Church and Meck, 2003; Staddon, 1965). In this case, we can write the standard deviation as σ_t = wt, where w is the Weber fraction, which is constant over the duration of the trial.
Set l = w(τ + Δτ) and s = wτ. Following the steps in the previous section,

δ(τ) = lim_{Δτ→0} [γ^{Δτ} e^{((ln γ)²/2) w² ((τ+Δτ)² − τ²)} V(τ + Δτ) − V(τ)] / Δτ
     = V̇(τ) + V(τ) ln γ + V(τ)(ln γ)² w² τ
     > V̇(τ) + V(τ) ln γ.    (29)

Hence, as derived for the discrete case, RPEs are inflated, and value is systematically overestimated.
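Equation (29) can likewise be checked by evaluating the pre-limit expression at a small Δτ: with Weber-scaled kernel widths, the extra term V(τ)(ln γ)² w² τ makes the RPE strictly positive even for the exponential value function (Python; illustrative parameter values):

```python
import numpy as np

gamma, T, r, w = 0.9, 48.0, 1.0, 0.15
lng = np.log(gamma)

def V(t):
    return gamma ** (T - t) * r

tau, d = 20.0, 1e-5
# Pre-limit expression of Eq. (29): the next state's value is read out with a
# wider Weber kernel (sigma = w * t), inflating it by the factor below.
inflation = np.exp(0.5 * lng ** 2 * w ** 2 * ((tau + d) ** 2 - tau ** 2))
delta_fd = (gamma ** d * inflation * V(tau + d) - V(tau)) / d

# Closed form: Vdot + V ln(gamma) + V (ln gamma)^2 w^2 tau. The first two
# terms cancel for the exponential V, leaving a strictly positive residual.
delta_cf = V(tau) * lng ** 2 * w ** 2 * tau
```

The residual grows with τ (through both V(τ) and the w²τ factor), which is the source of the systematic overestimation.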
RPE Ramps Result From Sufficiently Convex Value Functions

By Equation (28), the condition for ramping is δ̇(t) > 0; i.e., the estimated shape of the value function at any given point, before feedback, must obey

V̈(t) + V̇(t) ln γ > 0,    (30)

where V̈(t) is the second derivative of V(t) with respect to time. For an intuition of this relation, note that when γ ≈ 1, the inequality can be approximated as V̈(t) > 0, which holds for any convex function. The exact inequality, however, places a tighter requirement on V(t): Since V̇(t) ln γ < 0 for all t, ramping will only be observed if the contribution from V̈(t) (i.e., the convexity) outweighs the quantity V̇(t) ln γ (the scaled slope). For example, the function in Equation (2) does not satisfy the strict inequality even though it is convex, and therefore with this choice of V(t), the RPE does not ramp. In other words, for the RPE to ramp, V(t) has to be 'sufficiently' convex.
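As a concrete check of Equation (30): the standard TD value function V(t) = γ^{T−t} r sits exactly on the boundary (the condition evaluates to zero), while a more sharply curved value function, e.g. V(t) = γ^{2(T−t)} r, satisfies the strict inequality and therefore ramps (Python; the factor of 2 is an arbitrary illustrative choice, not from the paper):

```python
import numpy as np

gamma, T, r, t = 0.9, 10.0, 1.0, 4.0
lng = np.log(gamma)

def ramp_condition(Vdot, Vddot):
    """Equation (30): the RPE ramps iff V'' + V' * ln(gamma) > 0."""
    return Vddot + Vdot * lng

# Standard exponential value: V' = -ln(gamma) V and V'' = (ln gamma)^2 V,
# so the condition evaluates to exactly zero (boundary case: no ramp).
V1 = gamma ** (T - t) * r
cond1 = ramp_condition(-lng * V1, lng ** 2 * V1)

# A more convex value, V = gamma**(2(T - t)): V' = -2 ln(gamma) V and
# V'' = 4 (ln gamma)^2 V, leaving a positive margin of 2 (ln gamma)^2 V.
V2 = gamma ** (2 * (T - t)) * r
cond2 = ramp_condition(-2 * lng * V2, 4 * lng ** 2 * V2)
```

The first case confirms that Equation (2) is convex yet not 'sufficiently' convex; the second shows what a ramp-producing shape looks like.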
Simulation Details

Value Learning Under State Uncertainty (Figure 1): For our TD learning model, we have chosen γ = 0.9, α = 0.1, n = 50 states, and T = 48. In the absence of feedback, uncertainty kernels are determined by the Weber fraction, arbitrarily set to w = 0.15. In the presence of feedback, uncertainty kernels have a standard deviation of l = 3 before feedback and s = 0.1 after feedback. For the purposes of averaging with uncertainty kernels, value peaks at T and remains at its peak value after T, and the standard deviation at the last 4 states in the presence of feedback is fixed to 0.1. Intuitively, the animal expects reward to be delivered, and attributes any lack of reward delivery at τ = T to noise in its timing mechanism (uncertainty kernels have nonzero width) rather than to a reward omission. The learning rules were iterated 1000 times.
Value Learning in the Presence of Sensory Feedback (Figure 2): For our TD learning model, we have chosen γ = 0.9, α = 0.1, n = 50 states, and T = 48. The learning rules were iterated 1000 times.
Relationship with Experimental Data (Figure 3): For our TD learning model, we have chosen γ = 0.8, α = 0.1, and Weber fraction w = 0.05. For the navigation task, kernels have standard deviation l = 3 before feedback and s = 0.1 after feedback. In the experimental paradigms, trial durations were approximately 1.5 seconds (Schultz et al., 1997) and over 5 seconds (Howe et al., 2013). Thus for (B) and (D), we have arbitrarily set n = 10 and 25 states, respectively, between trial start and reward. The learning rules were iterated 1000 times.
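The contrast simulated in Figure 3 can be sketched compactly: each RPE reads the upcoming state's value through a wide (pre-feedback) kernel and the current state's value through a narrow (post-feedback) kernel, while the learning update applies the bias-corrected error (Equations (19)–(22) with the subtraction of Equations (10)–(11)). The sketch below is an illustrative reimplementation, not the authors' code; the grid size, horizon, and kernel widths are chosen for clarity, boundary states are handled naively, and only interior states are examined.

```python
import numpy as np

def kernel(mu, sigma, n):
    """Discretized Gaussian uncertainty kernel over n timepoints."""
    t = np.arange(n)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return p / p.sum()

def learn(sigma_pre, sigma_post, n=60, T=40, gamma=0.9, alpha=0.1, iters=2000):
    V = np.zeros(n)
    r = np.zeros(n)
    r[T] = 1.0
    for _ in range(iters):
        for tau in range(T + 1):
            k_now = kernel(tau, sigma_post, n)
            v_now = k_now @ V
            v_next_narrow = kernel(tau + 1, sigma_post, n) @ V
            # Bias-corrected update: both value readouts use the narrow width.
            corrected = r[tau] + gamma * v_next_narrow - v_now
            V += alpha * corrected * k_now
    # The measured (DA-like) RPE still reads the next state with the wide,
    # pre-feedback kernel, leaving a positive residual that tracks value.
    rpe = np.array([r[tau] + gamma * (kernel(tau + 1, sigma_pre, n) @ V)
                    - kernel(tau, sigma_post, n) @ V for tau in range(T + 1)])
    return V, rpe

# With feedback (wide -> narrow kernels), the RPE ramps toward reward;
# with no width reduction, it stays flat at zero.
_, rpe_ramp = learn(sigma_pre=3.0, sigma_post=0.1)
_, rpe_flat = learn(sigma_pre=0.1, sigma_post=0.1)
```

In the feedback condition the RPE grows with proximity to reward (it is approximately β times the learned value), whereas the matched-width condition recovers the classical flat RPE of Figure 3B.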
Full implementations can be found at www.github.com/jgmikhael/ramping.353
Acknowledgments

The project described was supported by National Institutes of Health grants T32GM007753 and T32MH020017 (JGM), R01 MH110404 and MH095953 (NU), U19 NS113201-01 (SJG and NU), the Simons Collaboration on the Global Brain (NU), and a research fellowship from the Alfred P. Sloan Foundation (SJG). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Simons Collaboration on the Global Brain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Contributions

J.G.M. and S.J.G. developed the model. H.R.K. and N.U. conceived that the structure of state uncertainty may influence the shape of estimated value functions and thus RPEs. J.G.M., H.R.K., N.U., and S.J.G. contributed to the writing of the paper. J.G.M. analyzed and simulated the model, and wrote the first draft.
Declaration of Interests

The authors declare no competing interests.
Data and Code Availability

Source code for all simulations can be found at www.github.com/jgmikhael/ramping.
Supplemental Information

1 Alternative Causes of Ramping

In the main text, we argue that ramping follows from normative principles. In this section, we illustrate that various types of biases ('bugs' in the implementation) may also lead to RPE ramps.
Ramping Due to State-Dependent Bias

Assume the animal persistently overestimates the amount of time or distance remaining to reach its reward (or, equivalently, that it underestimates the time elapsed or the distance traversed so far), and that this overestimation decreases as the animal approaches the reward. For instance, since the receptive fields of place cells decrease as the animal approaches reward (O'Keefe and Burgess, 1996), the contribution of place cells immediately behind the approaching animal to its estimate of value may outweigh that from the place cells in front of it. It will simplify our analysis to set T = 0 without loss of generality, and to allow time to progress from the negative domain (t < 0) toward T = 0. In the continuous domain and for the simple case of linear overestimation, we can write this as

V(t) = γ^{−ηt} r,    (31)

where η > 1 is our overestimation factor. Therefore, by Equation (28),

δ(t) = V̇(t) + V(t) ln γ = (ln γ)(1 − η)γ^{−ηt} r,    (32)
which is monotonically increasing. Hence, the RPE should ramp. Equivalently, in the discrete domain,383
δt = γVt+1 − Vt
= γγ−η(t+1)r − γ−ηtr
= γ−ηt(γ1−η − 1)r. (33)
Here, δt+1 > δt. Hence, the RPE should ramp.384
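The discrete-domain expressions above are easy to verify directly (Python; the value of η and the time range are arbitrary illustrative choices):

```python
import numpy as np

gamma, eta, r = 0.9, 1.5, 1.0          # eta > 1: persistent overestimation
t = np.arange(-20, 0)                  # time runs from t = -20 up to T = 0

V = gamma ** (-eta * t) * r            # Equation (31)
V_next = gamma ** (-eta * (t + 1)) * r
delta = gamma * V_next - V             # discrete RPE (no reward before T)

# Closed form, Equation (33):
delta_cf = gamma ** (-eta * t) * (gamma ** (1 - eta) - 1) * r
```

Since γ^{1−η} > 1 for η > 1, every δ_t is positive, and δ_t grows as t approaches the reward: a ramp.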
Ramping Due to State-Dependent Discounting of Estimated Value

Assume the animal underestimates V(t) by directly decreasing the temporal discount term γ. Then if V(t) = (ηγ)^{T−t} r, with η ∈ (0, 1), we can write in the continuous domain:

δ(t) = V̇(t) + V(t) ln γ = (−ln η)(ηγ)^{T−t} r,    (34)

which is monotonically increasing. Hence, the RPE should ramp. Equivalently, in the discrete domain, if V_t = (ηγ)^{T−t} r with η ∈ (0, 1), we can write

δ_t = (ηγ)^{T−t} (1/η − 1) r,    (35)

and

δ_{t+1} = (ηγ)^{−1} δ_t.    (36)

Here, δ_{t+1} > δ_t. Hence, the RPE should ramp.
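The same direct check for the state-dependent discounting account (Python; η ∈ (0, 1) chosen arbitrarily for illustration):

```python
import numpy as np

gamma, eta, r, T = 0.9, 0.8, 1.0, 20   # eta in (0, 1): shrunken discount
t = np.arange(T)

V = (eta * gamma) ** (T - t) * r       # underestimated value function
delta = gamma * (eta * gamma) ** (T - (t + 1)) * r - V   # discrete RPE

# Closed form, Equation (35):
delta_cf = (eta * gamma) ** (T - t) * (1 / eta - 1) * r
```

Each step multiplies the RPE by (ηγ)^{−1} > 1, exactly as in Equation (36), so the RPE ramps up to reward.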
References

Babayan, B. M., Uchida, N., and Gershman, S. J. (2018). Belief state representation in the dopamine system. Nature Communications, 9(1):1891.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.

Berke, J. D. (2018). What does dopamine mean? Nature Neuroscience, page 1.

Berridge, K. C. (2007). The debate over dopamine's role in reward: the case for incentive salience. Psychopharmacology, 191(3):391–431.

Church, R. M. and Meck, W. (2003). A concise introduction to scalar timing theory. Functional and Neural Mechanisms of Interval Timing, pages 3–22.

Cinotti, F., Fresno, V., Aklil, N., Coutureau, E., Girard, B., Marchand, A. R., and Khamassi, M. (2019). Dopamine blockade impairs the exploration-exploitation trade-off in rats. Scientific Reports, 9(1):6770.

Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B., and Uchida, N. (2012). Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature, 482(7383):85–88.

Collins, A. L., Greenfield, V. Y., Bye, J. K., Linker, K. E., Wang, A. S., and Wassum, K. M. (2016). Dynamic mesolimbic dopamine signaling during action sequence learning and expectation violation. Scientific Reports, 6.

de Lafuente, V. and Romo, R. (2011). Dopamine neurons code subjective sensory experience and uncertainty of perceptual decisions. Proceedings of the National Academy of Sciences, 108(49):19767–19771.

Eshel, N., Bukwich, M., Rao, V., Hemmelder, V., Tian, J., and Uchida, N. (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525:243–246.

Fiorillo, C. D., Newsome, W. T., and Schultz, W. (2008). The temporal precision of reward prediction in dopamine neurons. Nature Neuroscience, 11(8):966.

Flagel, S. B., Clark, J. J., Robinson, T. E., Mayo, L., Czuj, A., Willuhn, I., Akers, C. A., Clinton, S. M., Phillips, P. E., and Akil, H. (2011). A selective role for dopamine in stimulus–reward learning. Nature, 469(7328):53.

Frank, M. J., Moustafa, A. A., Haughey, H. M., Curran, T., and Hutchison, K. E. (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences, 104(41):16311–16316.
Gardner, M. P., Schoenbaum, G., and Gershman, S. J. (2018). Rethinking dopamine as generalized prediction error. Proceedings of the Royal Society B, 285(1891):20181645.

Gershman, S. J. (2014). Dopamine ramps are a consequence of reward prediction errors. Neural Computation, 26(3):467–471.

Gibbon, J. (1977). Scalar expectancy theory and Weber's law in animal timing. Psychological Review, 84(3):279.

Glimcher, P. W. (2011). Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654.

Hamid, A. A., Pettibone, J. R., Mabrouk, O. S., Hetrick, V. L., Schmidt, R., Vander Weele, C. M., Kennedy, R. T., Aragona, B. J., and Berke, J. D. (2016). Mesolimbic dopamine signals the value of work. Nature Neuroscience, 19:117–126.

Hart, A. S., Rutledge, R. B., Glimcher, P. W., and Phillips, P. E. (2014). Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. Journal of Neuroscience, 34(3):698–704.

Houk, J. C., Adams, J. L., and Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia. MIT Press, Cambridge.

Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E., and Graybiel, A. M. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464):575.

Kim, H. R., Malik, A. N., Mikhael, J. G., Bech, P., Tsutsui-Kimura, I., Sun, F., Zhang, Y., Li, Y., Watabe-Uchida, M., Gershman, S. J., and Uchida, N. (2019). A unified framework for dopamine signals across timescales. bioRxiv.

Kobayashi, S. and Schultz, W. (2008). Influence of reward delays on responses of dopamine neurons. Journal of Neuroscience, 28(31):7837–7846.

Lak, A., Nomoto, K., Keramati, M., Sakagami, M., and Kepecs, A. (2017). Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Current Biology, 27(6):821–832.

Lloyd, K. and Dayan, P. (2015). Tamping ramping: Algorithmic, implementational, and computational explanations of phasic dopamine signals in the accumbens. PLoS Computational Biology, 11(12):e1004622.
Ludvig, E., Sutton, R. S., Kehoe, E. J., et al. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system.

Ludvig, E. A., Sutton, R. S., and Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning & Behavior, 40(3):305–319.

Menegas, W., Babayan, B. M., Uchida, N., and Watabe-Uchida, M. (2017). Opposite initialization to novel cues in dopamine signaling in ventral and posterior striatum in mice. eLife, 6:e21886.

Menegas, W., Bergan, J. F., Ogawa, S. K., Isogai, Y., Venkataraju, K. U., Osten, P., Uchida, N., and Watabe-Uchida, M. (2015). Dopamine neurons projecting to the posterior striatum form an anatomically distinct subclass. eLife, 4:e10032.

Mikhael, J. G. and Bogacz, R. (2016). Learning reward uncertainty in the basal ganglia. PLoS Computational Biology, 12(9):e1005062.

Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16(5):1936–1947.

Moore, J., Desmond, J., and Berthier, N. (1989). Adaptively timed conditioned responses and the cerebellum: a neural network approach. Biological Cybernetics, 62(1):17–28.

Morita, K. and Kato, A. (2014). Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits. Frontiers in Neural Circuits, 8:36.

Nicola, S. M., Surmeier, D. J., and Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23(1):185–215.

Niv, Y., Daw, N. D., Joel, D., and Dayan, P. (2007). Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 191(3):507–520.

Niv, Y. and Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7):265–272.

O'Keefe, J. and Burgess, N. (1996). Geometric determinants of the place fields of hippocampal neurons. Nature, 381(6581):425.

Ratcliff, R. and Frank, M. J. (2012). Reinforcement-based decision making in corticostriatal circuits: mutual constraints by neurocomputational and diffusion models. Neural Computation, 24(5):1186–1229.

Schultz, W. (2007a). Behavioral dopamine signals. Trends in Neurosciences, 30(5):203–210.
Schultz, W. (2007b). Multiple dopamine functions at different time courses. Annual Review of Neuroscience, 30:259–288.

Schultz, W. (2010). Dopamine signals for reward value and risk: basic and recent data. Behavioral and Brain Functions, 6:24.

Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306):1593–1599.

Staddon, J. (1965). Some properties of spaced responding in pigeons. Journal of the Experimental Analysis of Behavior, 8(1):19–28.

Starkweather, C. K., Babayan, B. M., Uchida, N., and Gershman, S. J. (2017). Dopamine reward prediction errors reflect hidden-state inference across time. Nature Neuroscience, 20(4):581–589.

Starkweather, C. K., Gershman, S. J., and Uchida, N. (2018). The medial prefrontal cortex shapes dopamine reward prediction errors under state uncertainty. Neuron, 98:616–629.

Steinberg, E. E., Keiflin, R., Boivin, J. R., Witten, I. B., Deisseroth, K., and Janak, P. H. (2013). A causal link between prediction errors, dopamine neurons and learning. Nature Neuroscience, 16(7):966–973.

Stuber, G. D., Klanker, M., de Ridder, B., Bowers, M. S., Joosten, R. N., Feenstra, M. G., and Bonci, A. (2008). Reward-predictive cues enhance excitatory synaptic strength onto midbrain dopamine neurons. Science, 321(5896):1690–1692.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement.

Totah, N. K., Kim, Y., and Moghaddam, B. (2013). Distinct prestimulus and poststimulus activation of VTA neurons correlates with stimulus detection. Journal of Neurophysiology, 110(1):75–85.

Wassum, K. M., Ostlund, S. B., and Maidment, N. T. (2012). Phasic mesolimbic dopamine signaling precedes and predicts performance of a self-initiated action sequence task. Biological Psychiatry, 71(10):846–854.