Ramping and State Uncertainty in the Dopamine Signal
John G. Mikhael1,2∗, HyungGoo R. Kim3, Naoshige Uchida3, Samuel J. Gershman4
1Program in Neuroscience, Harvard Medical School, Boston, MA 02115
2MD-PhD Program, Harvard Medical School, Boston, MA 02115
3Department of Molecular and Cellular Biology and Center for Brain Science, Harvard University,
Cambridge, MA 02138
4Department of Psychology and Center for Brain Science, Harvard University, Cambridge, MA 02138
∗Corresponding Author
Correspondence: john [email protected]
bioRxiv preprint first posted online Oct. 16, 2019; doi: http://dx.doi.org/10.1101/805366. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Abstract

Reinforcement learning models of the basal ganglia map the phasic dopamine signal to reward prediction errors (RPEs). Conventional models assert that, when a stimulus reliably predicts a reward with fixed delay, dopamine activity during the delay period and at reward time should converge to baseline through learning. However, recent studies have found that dopamine exhibits a gradual ramp before reward in certain conditions even after extensive learning, such as when animals are trained to run to obtain the reward, thus challenging the conventional RPE models. In this work, we begin with the limitation of temporal uncertainty (animals cannot perfectly estimate time to reward), and show that sensory feedback, which reduces this uncertainty, will cause an unbiased learner to produce RPE ramps. On the other hand, in the absence of feedback, RPEs will be flat after learning. These results reconcile the seemingly conflicting data on dopamine behaviors under the RPE hypothesis.

Keywords: dopamine, ramping, reinforcement learning, reward prediction error, state value, state uncertainty, sensory feedback
Introduction

Perhaps the most successful convergence of reinforcement learning theory with neuroscience has been the insight that the phasic activity of midbrain dopamine (DA) neurons tracks 'reward prediction errors' (RPEs), or the difference between received and expected reward (Schultz et al., 1997; Schultz, 2007a; Glimcher, 2011). In reinforcement learning algorithms, RPEs serve as teaching signals that update an agent's estimate of rewards until those rewards are well-predicted. In a seminal experiment, Schultz et al. (1997) recorded from midbrain DA neurons in primates and found that the neurons responded with a burst of activity when an unexpected reward was delivered. However, if a reward-predicting cue was available, the DA neurons eventually stopped responding to the (now expected) reward and instead began to respond to the cue, much like an RPE (see Results). This finding formed the basis for the RPE hypothesis of DA.

Over the past two decades, a large and compelling body of work has supported the view that phasic DA functions as a teaching signal (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015). In particular, phasic DA activity has been shown to track, remarkably well, the RPE term of temporal difference (TD) learning models, which we review below (Schultz, 2007a). However, recent results have called this model of DA into question. Using fast-scan cyclic voltammetry in rat striatum during a goal-directed spatial navigation task, Howe et al. (2013) observed a ramping phenomenon—
a steady increase in DA over the course of a single trial—that persisted even after extensive training. Since then, DA ramping has been observed during a two-armed bandit task (Hamid et al., 2016) and during the execution of self-initiated action sequences (Collins et al., 2016). At first glance, these findings appear to contradict the RPE hypothesis of DA. Indeed, why would error signals persist (and ramp) after a task has been well-learned? Perhaps, then, instead of reporting an RPE, DA should be reinterpreted as reflecting the value of the animal's current state, such as its position during reward approach (Hamid et al., 2016). Alternatively, perhaps DA signals different quantities in different tasks, e.g., value in operant tasks, in which the animal must act to receive reward, and RPE in classical conditioning tasks, in which the animal need not act to receive reward.
To distinguish among these possibilities, Kim et al. (2019) recently devised an experimental paradigm that dissociates the value and RPE interpretations of DA. As we show in the Methods, RPEs in the experiments considered above can be approximated as the derivative of value under the TD learning framework. By training mice in a virtual reality environment and manipulating various properties of the task—namely, the speed of scene movement, teleports, and temporary pauses at various locations—the authors could dissociate spatial navigation from locomotion and make precise predictions about how value should change vs. how its derivative (RPE) should change. The authors found that mice continued to display ramping DA signals during the task even without locomotion, and that the changes in DA behaviors were consistent with the RPE hypothesis and not with the value interpretation.

The body of experimental work outlined above raises a number of unanswered questions regarding the function of DA: First, why would an error signal persist once an association is well-learned? Second, why would it ramp over the duration of the trial? Third, why would this ramp occur in some tasks but not others? Does value (and thus RPE) take different functional forms in different tasks, and if so, what determines which forms result in a ramp and which do not? Here we address these questions from normative principles.
In this work, we examine the influence of sensory feedback in guiding value estimation. Because of irreducible temporal uncertainty, animals not receiving sensory feedback (and therefore relying only on internal timekeeping mechanisms) will have corrupted value estimates regardless of how well a task is learned. In this case, value functions will be "blurred" in proportion to the uncertainty at each point. Sensory feedback, however, reduces this blurring as each new timepoint is approached. Beginning with the normative principle that animals seek to best learn the value of each state, we show that unbiased learning, in the presence of feedback, requires RPEs that ramp. These ramps scale with the informativeness of the feedback (i.e., the reduction in uncertainty), and at the extreme, absence of feedback leads to flat RPEs. Thus we show that
differences in a task's feedback profile explain the puzzling collection of DA behaviors described above.

We will begin the next section with a review of the TD learning algorithm, then examine the effect of state uncertainty on value learning. We will then show how, by reducing state uncertainty without biasing learning, sensory feedback causes the RPE to reproduce the experimentally observed behaviors of DA.
Results

Temporal Difference Learning
In TD learning, an agent transitions through a sequence of states according to a Markov process (Sutton, 1988). The value associated with each state is defined as the expected discounted future return:

$$V_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right], \tag{1}$$

where $t$ denotes time and indexes states, $r_t$ denotes the reward delivered at time $t$, and $\gamma \in (0,1)$ is a discount factor. In the experiments we will examine, a single reward is presented at the end of each trial. For these cases, Equation (1) can be written simply as:

$$V_t = \gamma^{T-t} r, \tag{2}$$

for all $t \in [0,T]$, where $r$ is the magnitude of reward delivered at time $T$. In words, value increases exponentially as reward time $T$ is approached, peaking at a value of $r$ at $T$ (Figure 1B,D). Additionally, note that exponential functions are convex: The convex shape of the value function will be important in subsequent sections (see Kim et al. (2019) for experimental verification).
How does the agent learn this value function? Under the Markov property, Equation (1) can be rewritten as:

$$V_t = r_t + \gamma V_{t+1}, \tag{3}$$

which is referred to as the Bellman equation (Bellman, 1957). The agent approximates $V_t$ with an estimate $\hat{V}_t$, which is updated in the event of a mismatch between the estimated value and the reward actually received. By analogy with Equation (3), this mismatch (the RPE) can be written as:

$$\delta_t = r_t + \gamma \hat{V}_{t+1} - \hat{V}_t. \tag{4}$$

When $\delta_t$ is zero, Equation (3) has been well-approximated. However, when $\delta_t$ is positive or negative, $\hat{V}_t$ must be increased or decreased, respectively:

$$\hat{V}_t^{(n+1)} = \hat{V}_t^{(n)} + \alpha \delta_t^{(n)}, \tag{5}$$

where $\alpha \in (0,1)$ denotes the learning rate, and the superscript $(n)$ denotes the learning step. Learning will progress until $\delta_t = 0$ on average. After this point, $\hat{V}_t = \gamma^{T-t} r$ on average, which is precisely the true value. (See the Methods for a more general description of TD learning and its neural implementation.)
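The convergence claim above is easy to verify numerically. The following is a minimal sketch of tabular TD(0) for a single-reward trial, not the authors' simulation code; the trial length, reward magnitude, and learning parameters are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch of TD(0) learning (Equations 2-5) for a trial in which a
# single reward r arrives at time T. Parameter values are illustrative.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1

V = np.zeros(T + 2)                    # value estimates; V[T+1] is terminal (0)
for episode in range(5000):            # repeat trials until learning converges
    for t in range(T + 1):
        reward = r if t == T else 0.0  # reward is delivered only at time T
        delta = reward + gamma * V[t + 1] - V[t]   # RPE, Equation (4)
        V[t] += alpha * delta                      # update, Equation (5)

# After learning, delta ≈ 0 at every state and V[t] ≈ γ^(T−t)·r (Equation 2)
true_V = gamma ** (T - np.arange(T + 1)) * r
print(float(np.max(np.abs(V[:T + 1] - true_V))))
```

Running the loop drives the RPE to zero at every state, so the learned estimates match the exponential value function of Equation (2).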
Having described TD learning in the simplified case where the agent has a perfect internal clock and thus no state uncertainty, let us now examine how state uncertainty affects learning, and how this uncertainty is reduced with sensory feedback.
Value Learning Under State Uncertainty
Because animals do not have perfect internal clocks, they do not have complete access to the true time $t$ (Gibbon, 1977; Church and Meck, 2003; Staddon, 1965). Instead, $t$ is a latent state corrupted by timing noise, often modeled as follows:

$$\tau \sim \mathcal{N}(t, \sigma_t^2), \tag{6}$$

where $\tau$ is subjective (internal) time, drawn from a distribution centered on objective time $t$, with some standard deviation $\sigma_t$. We take this distribution to be Gaussian for simplicity (an assumption we relax in the Methods). Thus the subjective estimate of value $\hat{V}_\tau$ is an average over the estimated values $\hat{V}_t$ of each state $t$:

$$\hat{V}_\tau = \sum_t p(t|\tau)\, \hat{V}_t, \tag{7}$$

where $p(t|\tau)$ denotes the probability that $t$ is the true state given the subjective measurement $\tau$, and thus represents state uncertainty. We refer to this quantity as the uncertainty kernel (Figure 1A,C). Intuitively, $\hat{V}_\tau$ is the result of blurring $\hat{V}_t$ in proportion to the uncertainty kernel (Methods).
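The blurring of Equation (7) can be demonstrated directly. The sketch below (illustrative parameters, not the paper's Methods code) applies a Gaussian uncertainty kernel to the convex value function of Equation (2): blurring inflates the value estimate, and wider kernels inflate it more.

```python
import numpy as np

# Blur the convex value function V_t = γ^(T−t)·r with a Gaussian
# uncertainty kernel p(t|τ) (Equation 7). Because the function is convex,
# the kernel-weighted average overshoots the true value at τ, and wider
# kernels overshoot more. Parameters are illustrative.
T, r, gamma = 50, 1.0, 0.9
t = np.arange(T + 1)
V_true = gamma ** (T - t) * r                 # Equation (2)

def blurred_value(tau, sigma):
    p = np.exp(-0.5 * ((t - tau) / sigma) ** 2)
    p /= p.sum()                              # normalized uncertainty kernel
    return float(np.sum(p * V_true))          # Equation (7)

tau = 25
print(V_true[tau], blurred_value(tau, 2.0), blurred_value(tau, 5.0))
# true value < narrow-kernel estimate < wide-kernel estimate
```

For a Gaussian kernel of standard deviation $\sigma$, the overshoot works out to a factor of roughly $\exp[(\ln\gamma)^2\sigma^2/2]$, which is the quantity that reappears in the correction term of Equation (12).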
Figure 1: Sensory Feedback Biases Value Learning. (A) Illustration of state uncertainty in the absence of sensory feedback. Each row includes the uncertainty kernels at the current state and the next state (solid curves). Lighter gray curves represent uncertainty kernels for later states. Thus, similarly colored kernels on different rows represent uncertainty kernels for the same state, but evaluated at different timepoints (e.g., dashed box). In the absence of feedback, state uncertainty for a single state does not acutely change across time (compare with C). (B) Without feedback, value is unbiased on average. Red curves represent the predicted increase in value between the current state and the next state (10 and 20 for light red; 20 and 30 for red; 30 and 40 for brick red). After learning, this roughly equals an increase by $\gamma^{-1}$ on average. (C) Sensory feedback reduces state uncertainty. Three instances of feedback are shown for illustration (S.F.; arrows). Note here that the kernels used to estimate value at the same state have different widths depending on whether they were evaluated before or after feedback. This results in different value estimates being used to compute the RPE at the current state and at the next state (Equations (8) and (9)). (D) As a result of sensory feedback, value at each state will be estimated based on an inflated version of value at the next state. Hence, after learning (when RPE is zero on average), estimated value will be systematically larger than true value. Red curves represent the predicted increase in value between the current state and the next state. After learning, this roughly equals an increase by $\gamma^{-1}$ on average. See Methods for simulation details.
After learning (i.e., when the RPE is zero on average), the estimated value at every state will be roughly the estimated value at the next state, discounted by $\gamma$, on average (black curve in Figure 1B). A key requirement for this unbiased learning can be discovered by writing the RPE equations for two successive states:

$$\delta_\tau = r_\tau + \gamma \hat{V}_{\tau+1} - \hat{V}_\tau \tag{8}$$

$$\delta_{\tau+1} = r_{\tau+1} + \gamma \hat{V}_{\tau+2} - \hat{V}_{\tau+1}. \tag{9}$$

Notice here that $\hat{V}_{\tau+1}$ is represented in both equations. Thus, for value to be well-learned, a requirement
is that $\hat{V}_{\tau+1}$ not acutely change during the interval after computing $\delta_\tau$ and before computing $\delta_{\tau+1}$. This requirement extends to changes in the uncertainty kernels: By Equation (7), if the kernel $p(t|\tau+1)$ were to be acutely updated due to information available at $\tau+1$ but not at $\tau$, then $\hat{V}_{\tau+1}$ will acutely change as well. This means that $\hat{V}_\tau$ will be discounted based on $\hat{V}_{\tau+1}$ before feedback (i.e., as estimated at $\tau$; red curves in Figure 1D) rather than $\hat{V}_{\tau+1}$ after feedback (i.e., as estimated at $\tau+1$; black curve). In the next section, we examine this effect more precisely.
Value Learning in the Presence of Sensory Feedback
How is value learning affected by sensory feedback? As each time $\tau$ is approached, state uncertainty is reduced due to sensory feedback (arrows in Figure 1C). This is because at timepoints preceding $\tau$, the estimate of what value will be at $\tau$ is corrupted by both temporal noise and the lower-resolution stimuli associated with $\tau$. Approaching $\tau$ in the presence of sensory feedback reduces this corruption. This, however, means that $\hat{V}_{\tau+1}$ will be estimated differently while computing $\delta_\tau$ and $\delta_{\tau+1}$ (Equations (8) and (9); compare widths of similarly colored kernels beneath each arrow in Figure 1C), which in turn results in biased value learning.

To examine the nature of this bias, we note that averaging over a convex value function results in overestimation of value. Intuitively, convex functions are steeper on the right (larger values) and shallower on the left (smaller values), so averaging results in a bias toward larger values. Furthermore, wider kernels result in greater overestimation (Methods). Thus upon entering each new state, the reduction of uncertainty via sensory feedback will acutely mitigate this overestimation, resulting in different estimates $\hat{V}_{\tau+1}$ being used for $\delta_\tau$ and $\delta_{\tau+1}$. Left uncorrected, the value estimate will be systematically biased, and in particular, value will be overestimated at every point (Figure 2A; Methods). An intuitive way to see this is as follows: The objective of the TD algorithm (in this simplified task setting) is for the value at each state $\tau$ to be $\gamma$ times smaller than the value at $\tau+1$ by the time the RPE converges to zero (Equation (2)). If an animal systematically overestimates value at the next state, then it will overestimate value at the current state as well (even if sensory feedback subsequently diminishes the next state's overestimation). Thus the "wrong" value function is learned (Figure 2A,B).
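This compounding overestimation can be sketched in simulation. In the toy example below (a sketch with made-up numbers, not the paper's Methods code), each TD update uses an inflated estimate of the next state's value, standing in for the wide-kernel, pre-feedback estimate; the learned value function then overshoots the true one at every state, and by more at states farther from reward.

```python
import numpy as np

# Uncorrected learning: the RPE at each state is computed with an inflated
# (pre-feedback) estimate of the next state's value. The inflation factor
# here is an illustrative stand-in for exp[(ln γ)^2 (l^2 − s^2)/2];
# all parameter values are made up for this sketch.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1
inflate = 1.05                         # pre-feedback overestimation factor

V = np.zeros(T + 2)                    # V[T+1] is terminal (0)
for episode in range(20000):
    for tau in range(T + 1):
        reward = r if tau == T else 0.0
        # next-state value as estimated *before* feedback (inflated):
        delta = reward + gamma * inflate * V[tau + 1] - V[tau]
        V[tau] += alpha * delta        # plain TD update, no correction

true_V = gamma ** (T - np.arange(T + 1)) * r
print(float(V[0]), float(true_V[0]))   # learned value overshoots true value
```

At convergence each state satisfies $\hat{V}_\tau = \gamma \cdot \text{inflate} \cdot \hat{V}_{\tau+1}$ rather than $\hat{V}_\tau = \gamma \hat{V}_{\tau+1}$, so the overshoot compounds multiplicatively with distance from reward.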
Figure 2: Unbiased Learning in the Presence of Feedback Leads to RPE Ramps. (A) Value at each state is learned according to an overestimated version of value at the next state. Thus, a biased value function is learned (see Figure 1D). (B) After learning, the RPE converges to zero. (C) With a correction term, the correct value function is learned. (D) The cost of forcing an unbiased learning of value is a persistent RPE. Intuitively, value at the current state is not influenced by the overestimated version of value at the next state (compare with A,B). By Equation (13), this results in RPEs that ramp. See Methods for simulation details.
To overcome this bias, an optimal agent must correct the just-computed RPE as sensory feedback becomes available. In the Methods, we show that this correction can simply be written as:

$$\hat{V}_t^{(n+1)} = \hat{V}_t^{(n)} + \alpha \delta_\tau^{(n)} p(t|\tau) - \beta \hat{V}_\tau^{(n)} p(t|\tau) \tag{10}$$

$$\simeq \hat{V}_t^{(n)} + \alpha \delta_\tau^{(n)} p(t|\tau) - \beta \hat{V}_t^{(n)}, \tag{11}$$

where the approximate equality holds for sufficient reductions in state uncertainty due to feedback, and

$$\beta = \alpha \left( \exp\left[ \frac{(\ln \gamma)^2 (l^2 - s^2)}{2} \right] - 1 \right). \tag{12}$$

Here, the uncertainty kernel of $\hat{V}_{\tau+1}$ has some standard deviation $l$ at $\tau$ and a smaller standard deviation $s$ at $\tau+1$. In words, as the animal gains an improved estimate of $\hat{V}_{\tau+1}$, it corrects the previously computed $\delta_\tau$ with a feedback term to ensure unbiased learning of value (Figure 2C). Notice here that the correction term is a function of the reduction in variance ($l^2 - s^2$) due to sensory feedback. In the absence of feedback,
the reduction in variance is zero (the uncertainty kernel for $\tau+1$ cannot be reduced during the transition from $\tau$ to $\tau+1$), which means $\beta = 0$.

How does this correction affect the RPE? By Equation (10), the RPE will converge to:

$$\delta_\tau = \frac{\beta}{\alpha} \hat{V}_\tau. \tag{13}$$

Therefore, with sensory feedback, the RPE ramps and tracks $\hat{V}_\tau$ in shape (Figure 2D). In the absence of feedback, $\beta = 0$; thus, there is no ramp.

In summary, when feedback is provided with new states, value learning becomes miscalibrated, as each value point will be learned according to an overestimated version of the next (Figure 2A). With a subsequent correction of this bias, the agent will continue to overestimate the RPEs at each point (RPEs will ramp; Figure 2D), in exchange for learning the correct value function (Figure 2C).
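The corrected update can be simulated end to end. The sketch below is an idealized, tabular rendering of Equations (11)-(13) under simplifying assumptions made here for illustration: the kernel is taken to be narrow enough that $p(t|\tau) \approx 1$ at the current state, the pre-feedback inflation is the Gaussian factor $\exp[(\ln\gamma)^2(l^2 - s^2)/2]$, and the terminal reward is observed directly (so no correction is applied at that step). Parameter values are arbitrary.

```python
import numpy as np

# Corrected TD learning in the presence of sensory feedback (Equation 11):
# the RPE is computed with the inflated pre-feedback estimate of the next
# state's value, and the β term (Equation 12) corrects the bias. At
# convergence, value is learned correctly and the residual RPE ramps,
# δ_τ = (β/α) V_τ (Equation 13). An illustrative sketch, not the paper's code.
T, r, gamma, alpha = 20, 1.0, 0.9, 0.1
l, s = 3.0, 1.0                          # kernel std before/after feedback
inflate = np.exp((np.log(gamma) ** 2) * (l ** 2 - s ** 2) / 2)
beta = alpha * (inflate - 1)             # Equation (12)

V = np.zeros(T + 2)                      # V[T+1] is terminal (0)
for episode in range(20000):
    for tau in range(T + 1):
        if tau == T:
            V[tau] += alpha * (r - V[tau])   # reward observed directly
        else:
            delta = gamma * inflate * V[tau + 1] - V[tau]  # pre-feedback RPE
            V[tau] += alpha * delta - beta * V[tau]        # Equation (11)

true_V = gamma ** (T - np.arange(T + 1)) * r
rpe = np.array([gamma * inflate * V[tau + 1] - V[tau] for tau in range(T)])
print(float(np.max(np.abs(V[:T + 1] - true_V))))   # value is unbiased
print(rpe[:3], rpe[-3:])                            # RPE ramps toward reward
```

Setting $l = s$ (no feedback) gives inflate $= 1$ and $\beta = 0$, and the residual RPE vanishes, recovering the flat-RPE case described above.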
Relationship with Experimental Data
In classical conditioning tasks without sensory feedback, DA ramping is not observed (Schultz et al., 1997; Kobayashi and Schultz, 2008; Stuber et al., 2008; Flagel et al., 2011; Cohen et al., 2012; Hart et al., 2014; Eshel et al., 2015; Menegas et al., 2015, 2017; Babayan et al., 2018) (Figure 3A). On the other hand, in goal-directed navigation tasks, characterized by sensory feedback in the form of salient visual cues as well as locomotive cues (e.g., joint movement), DA ramping is present (Howe et al., 2013) (Figure 3C). DA ramping is also present in classical conditioning tasks that do not involve locomotion but that include either spatial or non-spatial feedback (Kim et al., 2019), as well as in two-armed bandit tasks (Hamid et al., 2016) and when executing self-initiated action sequences (Wassum et al., 2012; Collins et al., 2016).
[Figure 3: (A) Classical conditioning task without sensory feedback, in which DA ramping is not observed (panel reproduced from Schultz et al., 1997). (C) Goal-directed spatial navigation task with sensory feedback, in which DA ramping is present (panels reproduced from Howe et al., 2013). Text extracted from the reproduced source figures omitted.]
Figure 2 | Ramping dopamine signals proximityto distant rewards. a, Distribution of trial times(from warning click to goal-reaching, n 5 3,933trials). b, c, Dopamine release modelled as afunction of time elapsed since maze-running onset(b) and as a function of spatial proximity to visitedgoal (c) for short (purple) and long (orange) trials(see Methods). Vertical lines indicate trial start(red) and end (purple and orange) times. d, Peakdopamine concentration versus trial time for allramping trials (n 5 2,273, Pearson’s R 5 0.0004,P 5 0.98). e, Experimentally recorded dopaminerelease (mean 6 s.e.m.) in short (n 5 327, purple)and long (n 5 423, orange) trials. Dopamine peaksat equivalent levels, as in the proximity model inc. f, Normalized peak dopamine levels(mean 6 s.e.m.) predicted by time-elapsed (red)and proximity (light blue) models, and measuredexperimental data (dark blue).
RESEARCH LETTER
5 7 6 | N A T U R E | V O L 5 0 0 | 2 9 A U G U S T 2 0 1 3
Macmillan Publishers Limited. All rights reserved©2013
CS RTime
0
0.05
0.1
RPE
A
C
B
D
Click GoalTime
0
0.05
0.1
RPE
Figure 3: Differences in Feedback Result in Different RPE Behaviors. (A) Schultz et al. (1997) have found that after learning, phasic DA responses to a predicted reward (R) diminish, and instead begin to appear at the earliest reward-predicting cue (conditioned stimulus; CS). Figure from Schultz et al. (1997). (B) Our derivations recapitulate this result. In the absence of sensory feedback, RPEs converge to zero. (C) Howe et al. (2013) have found that the DA signal ramps over the course of a single trial during a well-learned navigation task. Figure from Howe et al. (2013). (D) Our derivations recapitulate this result. In the presence of sensory feedback, RPEs track the shape of the estimated value function. See Methods for simulation details.
As described in the previous section, sensory feedback—due to external cues or to the animal's own movement—can reconcile both types of DA behaviors with the RPE hypothesis: In the absence of feedback, there is no reduction in state uncertainty upon entering each new state (β = 0), and therefore no ramps (Equation (13); Figure 3B). On the other hand, when state uncertainty is reduced as each state is entered, ramps will occur (Figure 3D).

More generally, our results demonstrate that a measured DA signal whose shape tracks with estimated value need not be evidence against the RPE hypothesis of DA, contrary to some claims (Hamid et al., 2016; Berke, 2018): Indeed, in the presence of sensory feedback, δ_τ and V_τ have the same shape. Thus, our derivation is conceptually compatible with the value interpretation of DA under certain circumstances, but importantly, this derivation captures the experimental findings in other circumstances in which the value interpretation fails.
Discussion

The role of DA in reinforcement learning has long been studied. While a large body of work has established phasic DA as an error signal (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015), more recent work has questioned this view (Wassum et al., 2012; Howe et al., 2013; Hamid et al., 2016; Collins et al., 2016). Indeed, in light of persistent DA ramps occurring in certain tasks even after extensive learning, some authors have proposed that DA may instead communicate value itself in these tasks (Hamid et al., 2016). However, the determinants of DA ramps have remained unclear: Ramps are observed during goal-directed navigation, in which animals must run to receive reward (operant tasks; Howe et al., 2013), but can also be elicited in virtual reality tasks in which animals do not need to run for reward (classical conditioning tasks; Kim et al., 2019). Within classical conditioning, DA ramps can occur in the presence of navigational or non-navigational stimuli indicating time to reward (Kim et al., 2019). Within operant tasks, ramps can be observed in the period preceding the action (Totah et al., 2013) as well as during the action itself (Howe et al., 2013). Furthermore, these ramps are not specific to particular experimental techniques and measurements: they can be observed in cell-body activity, in axonal calcium signals, and in DA concentrations (Kim et al., 2019).
We have shown in this work that under the RPE hypothesis of DA, sensory feedback may control the different observed DA behaviors: In the presence of sensory feedback, RPEs track the estimated value in shape (ramps), but they remain flat in the absence of feedback (no ramps). Thus, DA ramps and phasic responses follow from common computational principles and may be generated by common neurobiological mechanisms.
Our derivation makes a number of testable predictions. In particular, our results predict that any type of information that reduces state uncertainty—for example, an auditory tone whose frequency reflects time to reward, or a moving visual stimulus whose position reflects time to reward—will result in a DA ramp, and furthermore, that the magnitude of the ramp will increase with the informativeness of the stimulus (i.e., with a greater reduction in state uncertainty; Equations (12) and (13)). Therefore, in trials where the change in the tone's frequency is less apparent, or the contrast of the visual stimulus is lower, the ramp will be blunted. At the extreme, when the tone's frequency does not change with time or the contrast is minimal, no ramp will be observed. At this point, the task is indistinguishable from the classical conditioning experiments of Schultz et al. (1997) discussed in the Introduction.
Our work takes inspiration from previous studies that examined the role of state uncertainty in DA responses (Kobayashi and Schultz, 2008; Fiorillo et al., 2008; de Lafuente and Romo, 2011; Starkweather et al., 2017; Lak et al., 2017). For instance, temporal uncertainty increases with longer durations (Staddon, 1965; Gibbon, 1977; Church and Meck, 2003). This means that in a classical conditioning task, DA bursts at reward time will not be completely diminished, and will be larger for longer durations, as Kobayashi and Schultz (2008) and Fiorillo et al. (2008) have observed. Similarly, Starkweather et al. (2017) have found that in tasks with uncertainty both in whether reward will be delivered and in when it is delivered, DA exhibits a prolonged dip (i.e., a negative ramp) leading up to reward delivery. Here, value initially increases as the expected reward time approaches, but then begins to slowly decrease as reward delivery during the present trial becomes less and less likely, resulting in persistently negative prediction errors (see also Starkweather et al., 2018; Babayan et al., 2018). As the authors of these studies note, both results are fully predicted by the RPE hypothesis of DA. Hence, state uncertainty, due to noise either in the internal circuitry or in the external environment, is reflected in the DA signal.
A number of questions arise from our analysis. First, is there any evidence to support the benefits of learning the 'true' value function as written in Equation (2) (Figure 2C) over the biased version of value (Figure 2A)? We note here that under the normative account, the agent seeks to learn some value function that maximizes its well-being. Our key result is that this function—regardless of its exact shape—will not be learned well if feedback is delivered during learning, unless correction ensues. While we have chosen the exponential shape in Equation (2) following conventional TD models, our results extend to any convex value function. Second, due to this presumed exponential shape, the ramping behaviors resulting from our analysis also look exponential, rather than linear (compare with experimental results). We have nonetheless chosen to remain close to conventional TD models and purely exponential value functions for ease of comparison with the existing theoretical literature. Perhaps equally important, the relationship between the RPE and its neural correlate need only be monotonic, not necessarily an equality. In other words, a measured linear signal does not necessarily imply a linear RPE, and a convex neural signal need not communicate convex information.
Third, while we have derived RPE ramping from normative principles, it is important to note that biases in value learning may also produce ramping. For instance, one earlier proposal by Gershman (2014) was that value may take a fixed convex shape in spatial navigation tasks; the mismatch between this shape and the exponential shape in Equation (2) produces a ramp (see Methods for a general derivation of the conditions for a ramp). Morita and Kato (2014), on the other hand, posited that value updating involves a decay term. Assuming such a decay term results in a relationship qualitatively similar to that in Equation (10), and thus in RPE ramping (see also implementations in Mikhael and Bogacz, 2016; Cinotti et al., 2019). Ramping can similarly be explained by assuming a temporal or spatial bias that decreases as the animal approaches the reward, by modulating the temporal discount term during task execution, or by other mechanisms (see Supplemental Information for derivations). In each of these proposals, ramps emerge as a 'bug' in the implementation, rather than as an optimal strategy for unbiased learning. These proposals furthermore do not explain the different DA patterns that emerge under different paradigms. Finally, it should be noted that we have not assumed any modality- or task-driven differences in learning (any differences in the shape of the RPE follow solely from the sensory feedback profile), although in principle, different value functions may certainly be learned in different types of tasks.
Alternative accounts of DA ramping that deviate more significantly from our framework have also been proposed. In particular, Lloyd and Dayan (2015) have provided three compelling theoretical accounts of ramping. In the first account, the authors show that within an actor-critic framework, uncertainty in the information communicated between actor and critic regarding the timing of action execution may result in a monotonically increasing RPE leading up to the action. In the second account, ramping modulates gain control for value accumulation within a drift-diffusion model (e.g., by modulating neuronal excitability; Nicola et al., 2000). Under this framework, fluctuations in tonic and phasic DA produce ramping on average. The third account extends the average reward rate model of tonic DA proposed by Niv et al. (2007). In this extended view, ramping constitutes a 'quasi-tonic' signal that reflects discounted vigor. The authors show that the discounted average reward rate follows (1 − γ)V, and hence takes the shape of the value function in TD learning models. Finally, Howe et al. (2013) have proposed that these ramps may be necessary for sustained motivation in the operant tasks considered. Indeed, the notion that DA may serve multiple functions beyond the communication of RPEs is well motivated and deeply ingrained (Schultz, 2007b, 2010; Berridge, 2007; Frank et al., 2007; Gardner et al., 2018). Our work does not necessarily invalidate these alternative interpretations, but rather shows how a single RPE interpretation can embrace a range of apparently inconsistent phenomena.
Methods

Temporal Difference Learning and Its Neural Correlates
Under TD learning, each state is determined by task-relevant contextual cues, referred to as features, that predict future rewards. For instance, a state might be determined by an internal estimate of time or perceived distance from a reward. We model the agent as approximating V_t by taking a linear combination of the features (Schultz et al., 1997; Ludvig et al., 2008, 2012):

V_t = ∑_d w_d x_{d,t},    (14)
where V_t denotes the estimated value at time t, and x_{d,t} denotes the dth feature at t. The learned relevance of each feature x_d is reflected in its weight w_d, and the weights are updated in the event of a mismatch between the estimated value and the rewards actually received. The update occurs in proportion to each weight's contribution to the value estimate at t:

w_d^{(t+1)} = w_d^{(t)} + α δ_t^{(t)} x_{d,t},    (15)
where α ∈ (0, 1) denotes the learning rate, and the superscript denotes the learning step. In words, when a feature x_d does not contribute to the value estimate at t (x_{d,t} = 0), its weight is not updated. On the other hand, weights corresponding to features that do contribute to V_t will be updated in proportion to their activations at that time. This update rule is referred to as gradient ascent (x_{d,t} is equal to the gradient of V_t with respect to the weight w_d), and it implements a form of credit assignment, in which the features most activated at t undergo the greatest modification to their weights.
In this formulation, the basal ganglia implements the TD algorithm termwise: Cortical inputs to striatum encode the features x_{d,t}, corticostriatal synaptic strengths encode the weights w_d (Houk et al., 1995; Montague et al., 1996), phasic activity of midbrain dopamine neurons encodes the error signal δ_t (Schultz et al., 1997; Niv and Schoenbaum, 2008; Glimcher, 2011; Steinberg et al., 2013; Eshel et al., 2015), and the output nuclei of the basal ganglia (substantia nigra pars reticulata and internal globus pallidus) encode the estimated value V_t (Ratcliff and Frank, 2012).

We have implicitly assumed in the Results a maximally flexible feature set, the complete serial compound representation (Moore et al., 1989; Sutton and Barto, 1990; Montague et al., 1996; Schultz et al., 1997), in which every time step following trial onset is represented as a separate feature. In other words, the feature x_{d,t} is 1 when t = d and 0 otherwise. In this case, value at each timepoint is updated independently of the other timepoints, and each has its own weight. It follows that V_t = w_t, and we can write Equation (15) directly in terms of V_t, as in Equation (5).
Value Learning Under State Uncertainty
Animals only have access to subjective time, and must infer objective time given the corruption in Equation (6). The RPE is then:

δ_τ = r_τ + γV_{τ+1} − V_τ,    (16)

and this error signal is used to update the value estimates at each point t in proportion to its posterior probability p(t|τ):

V_t^{(t+1)} = V_t^{(t)} + α δ_τ^{(t)} p(t|τ).    (17)
Said differently, the effect of state uncertainty is that when the error signal δ_τ is computed, it updates the value estimate at a number of timepoints, in proportion to the uncertainty kernel.
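Equation (17) spreads a single RPE across neighboring timepoints in proportion to the posterior p(t|τ). A sketch of one such update with a discretized Gaussian uncertainty kernel (Python; the kernel width, RPE value, and grid size are hypothetical):

```python
import numpy as np

def uncertainty_kernel(tau, sigma, n):
    """Discretized Gaussian posterior p(t | tau) over n timepoints."""
    t = np.arange(n)
    p = np.exp(-0.5 * ((t - tau) / sigma) ** 2)
    return p / p.sum()

n, alpha = 50, 0.1
V = np.zeros(n)                  # value estimates, one per timepoint
tau, sigma = 20, 3.0             # subjective time and kernel width
delta = 1.0                      # an RPE computed at subjective time tau

posterior = uncertainty_kernel(tau, sigma, n)
V += alpha * delta * posterior   # Eq. (17): credit spreads around tau
```

Timepoints near τ receive most of the credit; a narrower kernel (smaller σ) concentrates the update, recovering the tabular case in the limit.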
Acute Changes in State Uncertainty Result in Biased Value Learning
Averaging over a convex value function results in overestimation of value. For an exponential value function, we can derive this result analytically in the continuous domain:

∫_t γ^{T−t} N(t; τ, σ_t²) dt = γ^{T−τ} exp[(ln γ)² σ_t² / 2],    (18)
where the second factor on the right-hand side is greater than one. Intuitively, because the function is steeper on the right side and shallower on the left side, the average will be overestimated. Importantly, however, the estimate will be a multiple of the true value, with a scaling factor that depends on the width of the kernel (the second factor on the right-hand side of Equation (18); note also that while we have assumed a Gaussian distribution, our results hold for any distribution that results in overestimation of value). Thus, with sensory feedback that modifies the width of the kernel upon transitioning from one state (τ) to the next (τ + 1), there will be a mismatch in the value estimate when computing each RPE. More precisely, the learning rules are:
V_τ = ∑_t p(t|τ, σ_t = s) V_t,    (19)

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = l) V_t,    (20)

δ_τ = r_τ + γV_{τ+1} − V_τ,    (21)

V_t^{(t+1)} = V_t^{(t)} + α δ_τ^{(t)} p(t|τ, σ_t = s).    (22)
Notice that V_{τ+1} takes different values depending on the state: When computing δ_τ,

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = l) V_t.    (23)

On the other hand, when computing δ_{τ+1},

V_{τ+1} = ∑_t p(t|τ+1, σ_{t+1} = s) V_t.    (24)
How does this mismatch affect the learned value estimate? If averages taken with kernels of different standard deviations can each be written as a multiple of the true value, then they can be written as multiples of each other. The RPE is then

δ_τ = r_τ + γ(aV_{τ+1,s}) − V_{τ,s},    (25)

where we use the comma notation to denote that the two value estimates are evaluated with the same kernel width s, and a is a constant. By analogy with Equations (2) and (4), estimated value converges to V_τ = (aγ)^{T−τ} r. Here, a > 1, so value is systematically overestimated. By the learning rules in Equations (19) to (22), this is because δ_τ is inflated by

γ ∑_t p(t|τ+1, σ_{t+1} = l) V_t − γ ∑_t p(t|τ+1, σ_{t+1} = s) V_t = βV_τ,    (26)

where β is defined in Equation (12).
An optimal agent will use the available sensory feedback to overcome this biased learning. Because averaging with a kernel of width l is simply a multiple of averaging with width s, it follows that a simple subtraction can achieve this correction (Equations (10) and (11)). Hence, sensory feedback can improve value learning with a correction term. It should be noted that with a complete correction to s as derived above, the bias is fully extinguished. For corrections to intermediate widths between s and l, the bias will be partially corrected but not eliminated. In both cases, because β > 0, ramps will occur.
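The inflation factor in Equation (18) can be checked numerically: averaging the exponential value function under a Gaussian kernel reproduces the true value times exp[(ln γ)² σ_t²/2] > 1 (Python; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

gamma, T, tau, sigma = 0.9, 48.0, 20.0, 3.0

# Left-hand side of Eq. (18): Gaussian-weighted average of gamma**(T - t),
# computed by a fine Riemann sum over essentially the kernel's full support.
t = np.linspace(tau - 12 * sigma, tau + 12 * sigma, 400001)
dt = t[1] - t[0]
kernel = np.exp(-0.5 * ((t - tau) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
lhs = np.sum(gamma ** (T - t) * kernel) * dt

# Right-hand side: true value scaled by a width-dependent factor a > 1.
a = np.exp((np.log(gamma) ** 2) * sigma ** 2 / 2)
rhs = gamma ** (T - tau) * a
```

This factor is the constant a of Equation (25): widening the kernel multiplies the value estimate, which is why a subtraction that depends on the kernel widths can undo the bias.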
RPEs Are Approximately the Derivative of Value

Consider the formula for RPEs in Equation (4). In tasks where a single reward is delivered at T, r_t = 0 for all t < T (no rewards delivered before T). Because γ ≈ 1, the RPE can be approximated as

δ_t ≈ (V_{t+1} − V_t) / ((t + 1) − t),    (27)
which is the slope of the estimated value. To examine the relationship between value and RPEs more precisely, we can extend our analysis to the continuous domain:

δ(t) = lim_{Δt→0} [γ^{Δt} V(t + Δt) − V(t)] / Δt
     = V̇(t) lim_{Δt→0} γ^{Δt} + V(t) lim_{Δt→0} (γ^{Δt} − 1) / Δt
     = V̇(t) lim_{Δt→0} γ^{Δt} + V(t)(ln γ) lim_{Δt→0} γ^{Δt}
     = V̇(t) + V(t) ln γ,    (28)
where V̇(t) is the time derivative of V(t), and the third equality follows from L'Hôpital's rule. Here, ln γ has units of inverse time. Because ln γ ≈ 0, the RPE is approximately the derivative of value.
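For the exponential value function of Equation (2), the continuous-domain RPE in Equation (28) is exactly zero, consistent with flat RPEs after learning without feedback. A quick numerical check (Python; the values of γ, T, and t are illustrative):

```python
import numpy as np

gamma, T, r = 0.9, 10.0, 1.0
lng = np.log(gamma)

def V(t):
    return gamma ** (T - t) * r      # learned value, the shape of Eq. (2)

t, dt = 4.0, 1e-4
# Discrete RPE over a small step (the quantity inside the limit in Eq. 28):
delta_fd = (gamma ** dt * V(t + dt) - V(t)) / dt

# Closed form: delta(t) = dV/dt + V(t) ln(gamma). For the exponential value,
# dV/dt = -ln(gamma) V(t), so the two terms cancel and the RPE vanishes.
delta_cf = -lng * V(t) + V(t) * lng
```

The finite-difference estimate is tiny (it vanishes as dt → 0), and the closed form is identically zero.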
Sensory Feedback in Continuous Time

In the complete absence of sensory feedback, σ_t is not constant, but rather increases linearly with time, a phenomenon referred to as scalar variability, a manifestation of Weber's law in the domain of timing (Gibbon, 1977; Church and Meck, 2003; Staddon, 1965). In this case, we can write the standard deviation as σ_t = wt, where w is the Weber fraction, which is constant over the duration of the trial.
Set l = w(τ + Δτ) and s = wτ. Following the steps in the previous section,

δ(τ) = lim_{Δτ→0} [γ^{Δτ} e^{((ln γ)²/2) w² ((τ+Δτ)² − τ²)} V(τ + Δτ) − V(τ)] / Δτ
     = V̇(τ) + V(τ) ln γ + V(τ)(ln γ)² w² τ
     > V̇(τ) + V(τ) ln γ.    (29)

Hence, as derived for the discrete case, RPEs are inflated, and value is systematically overestimated.
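Equation (29) can likewise be checked by evaluating the pre-limit expression at a small Δτ: with Weber-scaled kernel widths, the extra term V(τ)(ln γ)² w² τ makes the RPE strictly positive even for the exponential value function (Python; illustrative parameter values):

```python
import numpy as np

gamma, T, r, w = 0.9, 48.0, 1.0, 0.15
lng = np.log(gamma)

def V(t):
    return gamma ** (T - t) * r

tau, d = 20.0, 1e-5
# Pre-limit expression of Eq. (29): the next state's value is read out with a
# wider Weber kernel (sigma = w * t), inflating it by the factor below.
inflation = np.exp(0.5 * lng ** 2 * w ** 2 * ((tau + d) ** 2 - tau ** 2))
delta_fd = (gamma ** d * inflation * V(tau + d) - V(tau)) / d

# Closed form: Vdot + V ln(gamma) + V (ln gamma)^2 w^2 tau. The first two
# terms cancel for the exponential V, leaving a strictly positive residual.
delta_cf = V(tau) * lng ** 2 * w ** 2 * tau
```

The residual grows with τ (through both V(τ) and the w²τ factor), which is the source of the systematic overestimation.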
RPE Ramps Result From Sufficiently Convex Value Functions

By Equation (28), the condition for ramping is δ̇(t) > 0; i.e., the estimated shape of the value function at any given point, before feedback, must obey

V̈(t) + V̇(t) ln γ > 0,    (30)

where V̈(t) is the second derivative of V(t) with respect to time. For an intuition of this relation, note that when γ ≈ 1, the inequality can be approximated as V̈(t) > 0, which holds for any convex function. The exact inequality, however, places a tighter requirement on V(t): Since V̇(t) ln γ < 0 for all t, ramping will only be observed if the contribution from V̈(t) (i.e., the convexity) outweighs the quantity V̇(t) ln γ (the scaled slope). For example, the function in Equation (2) does not satisfy the strict inequality even though it is convex, and therefore with this choice of V(t), the RPE does not ramp. In other words, for the RPE to ramp, V(t) has to be 'sufficiently' convex.
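As a concrete check of Equation (30): the standard TD value function V(t) = γ^{T−t} r sits exactly on the boundary (the condition evaluates to zero), while a more sharply curved value function, e.g. V(t) = γ^{2(T−t)} r, satisfies the strict inequality and therefore ramps (Python; the factor of 2 is an arbitrary illustrative choice, not from the paper):

```python
import numpy as np

gamma, T, r, t = 0.9, 10.0, 1.0, 4.0
lng = np.log(gamma)

def ramp_condition(Vdot, Vddot):
    """Equation (30): the RPE ramps iff V'' + V' * ln(gamma) > 0."""
    return Vddot + Vdot * lng

# Standard exponential value: V' = -ln(gamma) V and V'' = (ln gamma)^2 V,
# so the condition evaluates to exactly zero (boundary case: no ramp).
V1 = gamma ** (T - t) * r
cond1 = ramp_condition(-lng * V1, lng ** 2 * V1)

# A more convex value, V = gamma**(2(T - t)): V' = -2 ln(gamma) V and
# V'' = 4 (ln gamma)^2 V, leaving a positive margin of 2 (ln gamma)^2 V.
V2 = gamma ** (2 * (T - t)) * r
cond2 = ramp_condition(-2 * lng * V2, 4 * lng ** 2 * V2)
```

The first case confirms that Equation (2) is convex yet not 'sufficiently' convex; the second shows what a ramp-producing shape looks like.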
Simulation Details

Value Learning Under State Uncertainty (Figure 1): For our TD learning model, we have chosen γ = 0.9, α = 0.1, n = 50 states, and T = 48. In the absence of feedback, uncertainty kernels are determined by the Weber fraction, arbitrarily set to w = 0.15. In the presence of feedback, uncertainty kernels have a standard deviation of l = 3 before feedback and s = 0.1 after feedback. For the purposes of averaging with uncertainty kernels, value peaks at T and remains at its peak value after T, and the standard deviation at the last 4 states in the presence of feedback is fixed to 0.1. Intuitively, the animal expects reward to be delivered, and attributes any lack of reward delivery at τ = T to noise in its timing mechanism (uncertainty kernels have nonzero width) rather than to a reward omission. The learning rules were iterated 1000 times.
Value Learning in the Presence of Sensory Feedback (Figure 2): For our TD learning model, we have chosen γ = 0.9, α = 0.1, n = 50 states, and T = 48. The learning rules were iterated 1000 times.
Relationship with Experimental Data (Figure 3): For our TD learning model, we have chosen γ = 0.8, α = 0.1, and Weber fraction w = 0.05. For the navigation task, kernels have standard deviation l = 3 before feedback and s = 0.1 after feedback. In the experimental paradigms, trial durations were approximately 1.5 seconds (Schultz et al., 1997) and over 5 seconds (Howe et al., 2013). Thus for (B) and (D), we have arbitrarily set n = 10 and 25 states, respectively, between trial start and reward. The learning rules were iterated 1000 times.
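The contrast simulated in Figure 3 can be sketched compactly: each RPE reads the upcoming state's value through a wide (pre-feedback) kernel and the current state's value through a narrow (post-feedback) kernel, while the learning update applies the bias-corrected error (Equations (19)–(22) with the subtraction of Equations (10)–(11)). The sketch below is an illustrative reimplementation, not the authors' code; the grid size, horizon, and kernel widths are chosen for clarity, boundary states are handled naively, and only interior states are examined.

```python
import numpy as np

def kernel(mu, sigma, n):
    """Discretized Gaussian uncertainty kernel over n timepoints."""
    t = np.arange(n)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return p / p.sum()

def learn(sigma_pre, sigma_post, n=60, T=40, gamma=0.9, alpha=0.1, iters=2000):
    V = np.zeros(n)
    r = np.zeros(n)
    r[T] = 1.0
    for _ in range(iters):
        for tau in range(T + 1):
            k_now = kernel(tau, sigma_post, n)
            v_now = k_now @ V
            v_next_narrow = kernel(tau + 1, sigma_post, n) @ V
            # Bias-corrected update: both value readouts use the narrow width.
            corrected = r[tau] + gamma * v_next_narrow - v_now
            V += alpha * corrected * k_now
    # The measured (DA-like) RPE still reads the next state with the wide,
    # pre-feedback kernel, leaving a positive residual that tracks value.
    rpe = np.array([r[tau] + gamma * (kernel(tau + 1, sigma_pre, n) @ V)
                    - kernel(tau, sigma_post, n) @ V for tau in range(T + 1)])
    return V, rpe

# With feedback (wide -> narrow kernels), the RPE ramps toward reward;
# with no width reduction, it stays flat at zero.
_, rpe_ramp = learn(sigma_pre=3.0, sigma_post=0.1)
_, rpe_flat = learn(sigma_pre=0.1, sigma_post=0.1)
```

In the feedback condition the RPE grows with proximity to reward (it is approximately β times the learned value), whereas the matched-width condition recovers the classical flat RPE of Figure 3B.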
Full implementations can be found at www.github.com/jgmikhael/ramping.353
Acknowledgments

The project described was supported by National Institutes of Health grants T32GM007753 and T32MH020017 (JGM), R01 MH110404 and MH095953 (NU), U19 NS113201-01 (SJG and NU), the Simons Collaboration on the Global Brain (NU), and a research fellowship from the Alfred P. Sloan Foundation (SJG). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Simons Collaboration on the Global Brain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Contributions

J.G.M. and S.J.G. developed the model. H.R.K. and N.U. conceived that the structure of state uncertainty may influence the shape of estimated value functions and thus RPEs. J.G.M., H.R.K., N.U., and S.J.G. contributed to the writing of the paper. J.G.M. analyzed and simulated the model, and wrote the first draft.
Declaration of Interests

The authors declare no competing interests.
Data and Code Availability

Source code for all simulations can be found at www.github.com/jgmikhael/ramping.
Supplemental Information

1 Alternative Causes of Ramping

In the main text, we argue that ramping follows from normative principles. In this section, we illustrate that various types of biases ('bugs' in the implementation) may also lead to RPE ramps.
Ramping Due to State-Dependent Bias

Assume the animal persistently overestimates the amount of time or distance remaining to reach its reward (or, equivalently, that it underestimates the time elapsed or the distance traversed so far), and that this overestimation decreases as the animal approaches the reward. For instance, since the receptive fields of place cells decrease as the animal approaches reward (O'Keefe and Burgess, 1996), the contribution of place cells immediately behind the approaching animal to its estimate of value may outweigh that from the place cells in front of it. It will simplify our analysis to set T = 0 without loss of generality, and to allow time to progress from the negative domain (t < 0) toward T = 0. In the continuous domain and for the simple case of linear overestimation, we can write this as

V(t) = γ^{−ηt} r,    (31)

where η > 1 is our overestimation factor. Therefore, by Equation (28),

δ(t) = V̇(t) + V(t) ln γ = (ln γ)(1 − η)γ^{−ηt} r,    (32)
which is monotonically increasing. Hence, the RPE should ramp. Equivalently, in the discrete domain,383
δt = γVt+1 − Vt
= γγ−η(t+1)r − γ−ηtr
= γ−ηt(γ1−η − 1)r. (33)
Here, δt+1 > δt. Hence, the RPE should ramp.384
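The discrete-domain expressions above are easy to verify directly (Python; the value of η and the time range are arbitrary illustrative choices):

```python
import numpy as np

gamma, eta, r = 0.9, 1.5, 1.0          # eta > 1: persistent overestimation
t = np.arange(-20, 0)                  # time runs from t = -20 up to T = 0

V = gamma ** (-eta * t) * r            # Equation (31)
V_next = gamma ** (-eta * (t + 1)) * r
delta = gamma * V_next - V             # discrete RPE (no reward before T)

# Closed form, Equation (33):
delta_cf = gamma ** (-eta * t) * (gamma ** (1 - eta) - 1) * r
```

Since γ^{1−η} > 1 for η > 1, every δ_t is positive, and δ_t grows as t approaches the reward: a ramp.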
Ramping Due to State-Dependent Discounting of Estimated Value

Assume the animal underestimates V(t) by directly decreasing the temporal discount term γ. Then if V(t) = (ηγ)^{T−t} r, with η ∈ (0, 1), we can write in the continuous domain:

δ(t) = V̇(t) + V(t) ln γ = (−ln η)(ηγ)^{T−t} r,    (34)

which is monotonically increasing. Hence, the RPE should ramp. Equivalently, in the discrete domain, if V_t = (ηγ)^{T−t} r with η ∈ (0, 1), we can write

δ_t = (ηγ)^{T−t} (1/η − 1) r,    (35)

and

δ_{t+1} = (ηγ)^{−1} δ_t.    (36)

Here, δ_{t+1} > δ_t. Hence, the RPE should ramp.
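The same direct check for the state-dependent discounting account (Python; η ∈ (0, 1) chosen arbitrarily for illustration):

```python
import numpy as np

gamma, eta, r, T = 0.9, 0.8, 1.0, 20   # eta in (0, 1): shrunken discount
t = np.arange(T)

V = (eta * gamma) ** (T - t) * r       # underestimated value function
delta = gamma * (eta * gamma) ** (T - (t + 1)) * r - V   # discrete RPE

# Closed form, Equation (35):
delta_cf = (eta * gamma) ** (T - t) * (1 / eta - 1) * r
```

Each step multiplies the RPE by (ηγ)^{−1} > 1, exactly as in Equation (36), so the RPE ramps up to reward.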
References

Babayan, B. M., Uchida, N., and Gershman, S. J. (2018). Belief state representation in the dopamine system. Nature Communications, 9(1):1891.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.

Berke, J. D. (2018). What does dopamine mean? Nature Neuroscience, page 1.

Berridge, K. C. (2007). The debate over dopamine's role in reward: the case for incentive salience. Psychopharmacology, 191(3):391–431.

Church, R. M. and Meck, W. (2003). A concise introduction to scalar timing theory. Functional and Neural Mechanisms of Interval Timing, pages 3–22.

Cinotti, F., Fresno, V., Aklil, N., Coutureau, E., Girard, B., Marchand, A. R., and Khamassi, M. (2019). Dopamine blockade impairs the exploration-exploitation trade-off in rats. Scientific Reports, 9(1):6770.

Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B., and Uchida, N. (2012). Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature, 482(7383):85–88.

Collins, A. L., Greenfield, V. Y., Bye, J. K., Linker, K. E., Wang, A. S., and Wassum, K. M. (2016). Dynamic mesolimbic dopamine signaling during action sequence learning and expectation violation. Scientific Reports, 6.

de Lafuente, V. and Romo, R. (2011). Dopamine neurons code subjective sensory experience and uncertainty of perceptual decisions. Proceedings of the National Academy of Sciences, 108(49):19767–19771.

Eshel, N., Bukwich, M., Rao, V., Hemmelder, V., Tian, J., and Uchida, N. (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525:243–246.

Fiorillo, C. D., Newsome, W. T., and Schultz, W. (2008). The temporal precision of reward prediction in dopamine neurons. Nature Neuroscience, 11(8):966.

Flagel, S. B., Clark, J. J., Robinson, T. E., Mayo, L., Czuj, A., Willuhn, I., Akers, C. A., Clinton, S. M., Phillips, P. E., and Akil, H. (2011). A selective role for dopamine in stimulus–reward learning. Nature, 469(7328):53.

Frank, M. J., Moustafa, A. A., Haughey, H. M., Curran, T., and Hutchison, K. E. (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences, 104(41):16311–16316.
Gardner, M. P., Schoenbaum, G., and Gershman, S. J. (2018). Rethinking dopamine as generalized prediction error. Proceedings of the Royal Society B, 285(1891):20181645.

Gershman, S. J. (2014). Dopamine ramps are a consequence of reward prediction errors. Neural Computation, 26(3):467–471.

Gibbon, J. (1977). Scalar expectancy theory and Weber's law in animal timing. Psychological Review, 84(3):279.

Glimcher, P. W. (2011). Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654.

Hamid, A. A., Pettibone, J. R., Mabrouk, O. S., Hetrick, V. L., Schmidt, R., Vander Weele, C. M., Kennedy, R. T., Aragona, B. J., and Berke, J. D. (2016). Mesolimbic dopamine signals the value of work. Nature Neuroscience, 19:117–126.

Hart, A. S., Rutledge, R. B., Glimcher, P. W., and Phillips, P. E. (2014). Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. Journal of Neuroscience, 34(3):698–704.

Houk, J. C., Adams, J. L., and Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia. MIT Press, Cambridge.

Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E., and Graybiel, A. M. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464):575.

Kim, H. R., Malik, A. N., Mikhael, J. G., Bech, P., Tsutsui-Kimura, I., Sun, F., Zhang, Y., Li, Y., Watabe-Uchida, M., Gershman, S. J., and Uchida, N. (2019). A unified framework for dopamine signals across timescales. bioRxiv.

Kobayashi, S. and Schultz, W. (2008). Influence of reward delays on responses of dopamine neurons. Journal of Neuroscience, 28(31):7837–7846.

Lak, A., Nomoto, K., Keramati, M., Sakagami, M., and Kepecs, A. (2017). Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Current Biology, 27(6):821–832.

Lloyd, K. and Dayan, P. (2015). Tamping ramping: Algorithmic, implementational, and computational explanations of phasic dopamine signals in the accumbens. PLoS Computational Biology, 11(12):e1004622.
Ludvig, E., Sutton, R. S., Kehoe, E. J., et al. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system.

Ludvig, E. A., Sutton, R. S., and Kehoe, E. J. (2012). Evaluating the TD model of classical conditioning. Learning & Behavior, 40(3):305–319.

Menegas, W., Babayan, B. M., Uchida, N., and Watabe-Uchida, M. (2017). Opposite initialization to novel cues in dopamine signaling in ventral and posterior striatum in mice. eLife, 6:e21886.

Menegas, W., Bergan, J. F., Ogawa, S. K., Isogai, Y., Venkataraju, K. U., Osten, P., Uchida, N., and Watabe-Uchida, M. (2015). Dopamine neurons projecting to the posterior striatum form an anatomically distinct subclass. eLife, 4:e10032.

Mikhael, J. G. and Bogacz, R. (2016). Learning reward uncertainty in the basal ganglia. PLoS Computational Biology, 12(9):e1005062.

Montague, P. R., Dayan, P., and Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16(5):1936–1947.

Moore, J., Desmond, J., and Berthier, N. (1989). Adaptively timed conditioned responses and the cerebellum: a neural network approach. Biological Cybernetics, 62(1):17–28.

Morita, K. and Kato, A. (2014). Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits. Frontiers in Neural Circuits, 8:36.

Nicola, S. M., Surmeier, D. J., and Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23(1):185–215.

Niv, Y., Daw, N. D., Joel, D., and Dayan, P. (2007). Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 191(3):507–520.

Niv, Y. and Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7):265–272.

O'Keefe, J. and Burgess, N. (1996). Geometric determinants of the place fields of hippocampal neurons. Nature, 381(6581):425.

Ratcliff, R. and Frank, M. J. (2012). Reinforcement-based decision making in corticostriatal circuits: mutual constraints by neurocomputational and diffusion models. Neural Computation, 24(5):1186–1229.

Schultz, W. (2007a). Behavioral dopamine signals. Trends in Neurosciences, 30(5):203–210.
Schultz, W. (2007b). Multiple dopamine functions at different time courses. Annual Review of Neuroscience, 30:259–288.

Schultz, W. (2010). Dopamine signals for reward value and risk: basic and recent data. Behavioral and Brain Functions, 6:24.

Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306):1593–1599.

Staddon, J. (1965). Some properties of spaced responding in pigeons. Journal of the Experimental Analysis of Behavior, 8(1):19–28.

Starkweather, C. K., Babayan, B. M., Uchida, N., and Gershman, S. J. (2017). Dopamine reward prediction errors reflect hidden-state inference across time. Nature Neuroscience, 20(4):581–589.

Starkweather, C. K., Gershman, S. J., and Uchida, N. (2018). The medial prefrontal cortex shapes dopamine reward prediction errors under state uncertainty. Neuron, 98:616–629.

Steinberg, E. E., Keiflin, R., Boivin, J. R., Witten, I. B., Deisseroth, K., and Janak, P. H. (2013). A causal link between prediction errors, dopamine neurons and learning. Nature Neuroscience, 16(7):966–973.

Stuber, G. D., Klanker, M., de Ridder, B., Bowers, M. S., Joosten, R. N., Feenstra, M. G., and Bonci, A. (2008). Reward-predictive cues enhance excitatory synaptic strength onto midbrain dopamine neurons. Science, 321(5896):1690–1692.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement.

Totah, N. K., Kim, Y., and Moghaddam, B. (2013). Distinct prestimulus and poststimulus activation of VTA neurons correlates with stimulus detection. Journal of Neurophysiology, 110(1):75–85.

Wassum, K. M., Ostlund, S. B., and Maidment, N. T. (2012). Phasic mesolimbic dopamine signaling precedes and predicts performance of a self-initiated action sequence task. Biological Psychiatry, 71(10):846–854.