
Addressing Function Approximation Error in Actor-Critic Methods

Scott Fujimoto¹  Herke van Hoof¹  David Meger¹

Abstract

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and critic. Our algorithm takes the minimum value between a pair of critics to restrict overestimation and delays policy updates to reduce per-update error. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.

1. Introduction

In reinforcement learning problems with discrete action spaces, value overestimation as a result of function approximation errors is a well-studied problem. However, similar problems with actor-critic methods in continuous control domains have been largely left unstudied. In this paper, we show that overestimation bias and the accumulation of error in temporal difference methods are present in an actor-critic setting. Our proposed method works in consideration of these challenges, and greatly outperforms the current state of the art.

In temporal difference learning (Sutton, 1988) an estimate of the value function is updated using the estimate of a subsequent state. If this subsequent estimate is far from the true value, then the update might be poor. Furthermore, each update leaves some residual error which is then repeatedly propagated throughout the state space by the nature of the temporal difference update. This accumulated error can cause arbitrarily bad states to be estimated as high value, resulting in sub-optimal policy updates and divergent behavior. In the discrete action setting, this property is further exaggerated by updates with a greedy policy which selects actions with the maximum approximate value, which causes a consistent overestimation (Thrun & Schwartz, 1993).

¹McGill University, Montreal, Canada. Correspondence to: Scott Fujimoto <[email protected]>.

We demonstrate that this overestimation property is also present in deterministic policy gradients, and furthermore that a naïve generalization of the most successful solution, Double DQN (Van Hasselt et al., 2016), to the actor-critic setting is not effective. To deal with this concern, we adapt a Q-learning variant, Double Q-learning (Van Hasselt, 2010), to an actor-critic format, by using a pair of critics for unbiased value estimation. However, an unbiased estimate with high variance can still lead to overestimations in local regions of state space, which in turn can negatively affect the global policy. To address this concern, we propose a clipped Double Q-learning variant which favors underestimations by bounding situations where Double Q-learning does poorly. This situation is preferable, as underestimations do not tend to be propagated during learning, since states with a low value estimate are avoided by the policy.

Given the role of variance in overestimation bias, this paper contains a number of components that address variance reduction. First, we show that target networks, a common approach in deep Q-learning methods, are critical for variance reduction. Second, to address the coupling of value and policy, we propose delaying policy updates until each target has converged. Finally, we introduce a novel regularization strategy, where a SARSA-style update bootstraps similar action estimates to further reduce variance.

Our modifications are applied to the state-of-the-art actor-critic method for continuous control, deep deterministic policy gradients (Lillicrap et al., 2015), to form TD3, Twin Delayed Deep Deterministic policy gradients, an actor-critic algorithm which considers the interplay between function approximation error in both policy and value updates. We evaluate our algorithm on seven continuous control domains from OpenAI gym (Brockman et al., 2016), where we outperform the state of the art by a wide margin.

Given the recent concerns about reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform a detailed ablation across each contribution, and open source both our code and learning curves¹. Videos of our algorithm can be found on our GitHub.

¹https://github.com/sfujim/TD3

arXiv:1802.09477v1 [cs.AI] 26 Feb 2018


2. Related Work

Function approximation error and its effect on bias and variance in reinforcement learning algorithms have been studied in prior works (Pendrith et al., 1997; Mannor et al., 2007). Our work focuses on two outcomes that occur from estimation error, namely overestimation bias and a high variance build-up.

Several approaches exist to reduce the effects of overestimation bias due to function approximation and policy optimization in Q-learning. For example, Double Q-learning uses two independent estimators to make unbiased value estimates (Van Hasselt, 2010; Van Hasselt et al., 2016). Other approaches have focused directly on reducing the variance (Anschel et al., 2017), minimizing over-fitting to early high variance estimates (Fox et al., 2016), or through corrective terms (Lee et al., 2013). Further, the variance of the value estimate has been considered directly (Osband et al., 2016; O'Donoghue et al., 2017) to estimate the uncertainty of the agent as an exploration method.

The accumulation of error in temporal difference learning has been largely dealt with by either minimizing the size of errors at each time step or mixing off-policy and Monte-Carlo returns. Methods with multi-step returns, which trade off environment and policy variance against accumulated estimation error and bias, have been shown to be an effective approach, through importance sampling (Munos et al., 2016; Mahmood et al., 2017), distributed methods (Mnih et al., 2016), and approximate bounds (He et al., 2016). However, rather than provide a direct solution to accumulating error, these methods circumvent the problem by considering a longer horizon. Another approach is a reduction of the discount factor (Petrik & Scherrer, 2009), reducing the contribution of each error.

Our method builds on Deterministic Policy Gradients (DPG) (Silver et al., 2014), an actor-critic method which uses a learned value estimate to train a deterministic policy. An extension to deep reinforcement learning, DDPG (Lillicrap et al., 2015), has been shown to produce state-of-the-art results with an efficient number of iterations. Recent improvements to DDPG include distributed methods (Popov et al., 2017), along with multi-step returns and prioritized experience replay (Schaul et al., 2016; Horgan et al., 2018), and distributional methods (Bellemare et al., 2017; Barth-Maron et al., 2018). These improvements are orthogonal to our approach. In our experiments, we focus on the single-step return case, but the improvements we propose are compatible with the use of multi-step returns. In contrast with our method, work has been done on adapting deep Q-learning methods for continuous actions, which replace selection of the action with the highest estimated value with various approximations (Gu et al., 2016; Metz et al., 2017; Tavakoli et al., 2017).

3. Background

Reinforcement learning considers the paradigm of an agent interacting with its environment with the aim of learning reward-maximizing behavior. At each discrete time step t, with a given state s ∈ S, the agent selects an action a ∈ A with respect to its policy π : S → A, receiving some reward r and the new state of the environment s′. The return is defined as the discounted sum of rewards Rt = Σ_{i=t}^{T} γ^{i−t} r(si, ai), where γ is a discount factor determining the priority of short-term rewards. In reinforcement learning, the objective is to find the optimal policy π with parameters φ which maximizes the expected return J(φ) = Eφ[R0].

Policies may be learned by maximizing a learned estimate of expected cumulative reward (Sutton, 1988; Watkins, 1989), known as a value function. The value function Qπ of a policy π is defined as the expected return when performing action a in state s and following π after:

Qπ(s, a) = E [Rt|s, a] . (1)

The Bellman equation (Bellman, 1957) is a fundamental idea in value function learning. It relates the value of a state-action pair (s, a) to the value of the subsequent state-action pair (s′, a′):

Qπ(s, a) = r + γEs′,a′ [Qπ(s′, a′)] , (2)

where a′ ∼ π(s′). The optimal value function Q∗(s, a) is defined as the value function associated with the optimal policy, which selects the action leading to the highest long-term rewards.

With a discrete action space, a greedy policy that selects the highest valued action is often used. However, in a continuous action space, the value-maximizing action cannot usually be found analytically. Instead, actor-critic methods can be used, where the policy, known as the actor, is trained to maximize the estimated expected return defined by the value estimate, known as the critic. The policy can be updated through the deterministic policy gradient theorem (Silver et al., 2014):

∇φJ(φ) = E[ ∇aQπ(s, a)|a=π(s) ∇φπφ(s) ].  (3)
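To make this update concrete, the following is a minimal PyTorch sketch of one deterministic policy gradient step for the actor (Equation 3). The `actor`, `critic` and `actor_optimizer` objects are illustrative assumptions (any differentiable modules with these signatures), not the specific implementation used in this paper.

```python
import torch

def dpg_actor_update(actor, critic, actor_optimizer, state_batch):
    # Deterministic policy gradient (Eq. 3): ascend E[Q(s, a)|a=pi(s)] by
    # backpropagating through the critic into the actor parameters.
    actor_loss = -critic(state_batch, actor(state_batch)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```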

For a large state space, the value function can be estimated with a differentiable function approximator Qθ(s, a), with parameters θ. For deep Q-learning (Mnih et al., 2015) this requires the use of target networks, where a secondary frozen network Qθtarget(s, a) is used to maintain a fixed objective y over multiple updates:

y = r + γQθtarget(s′, a′). (4)

For discrete actions, a′ = argmaxa Qθtarget(s′, a), while similarly, a′ ∼ πφtarget(s′) for continuous actions, where πφtarget is an additional target actor network. A target network is either updated periodically to exactly match the weights of the current network, or by some proportion τ at each time step, θtarget ← τθ + (1 − τ)θtarget. This update can be applied in an off-policy fashion, sampling random mini-batches of experiences, following the current or previous policies, from an experience replay buffer (Lin, 1992).
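As an illustration, here is a minimal PyTorch sketch of the fixed learning target in Equation 4 and the soft target update described above; the `critic_target` and `actor_target` networks are assumed placeholders, and τ = 0.005 follows the value reported in Section 6.1.

```python
import torch

def ddpg_target(critic_target, actor_target, reward, next_state, discount=0.99):
    # Fixed objective y = r + gamma * Q_target(s', a'), a' from the target actor (Eq. 4).
    with torch.no_grad():
        next_action = actor_target(next_state)
        return reward + discount * critic_target(next_state, next_action)

def soft_update(net, target_net, tau=0.005):
    # theta_target <- tau * theta + (1 - tau) * theta_target
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```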

4. Overestimation Bias

In Q-learning, estimation error can lead to overly optimistic value estimates, i.e., overestimation bias. In the discrete action setting the value estimate is updated with a greedy target r + γ maxa Q(s′, a), but if the target is susceptible to error ε, then E[maxa(Q(s, a) + ε)] ≥ maxa Q(s, a). As a result, initially random function approximation error is transformed into a consistent overestimation bias, which is then propagated through the Bellman equation.
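A small numerical sketch of this effect (illustrative numbers only): even zero-mean noise on the value estimates biases the maximum over actions upward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)       # ten actions, all with true value 0
samples = 100000

# E[max_a(Q(s, a) + eps)] over zero-mean noisy estimates vs. max_a Q(s, a) = 0.
noisy_max = np.max(true_q + rng.normal(0.0, 1.0, size=(samples, 10)), axis=1)
print(noisy_max.mean())     # roughly 1.54 > 0: a consistent overestimation
```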

While in the discrete action setting overestimation bias is an obvious artifact of the analytical maximization, the presence and effects of overestimation bias are less clear in an actor-critic setting. We begin by proving that the value estimate in deterministic policy gradients will be an overestimation under some basic assumptions in Section 4.1, and then propose a clipped variant of Double Q-learning in an actor-critic setting to reduce overestimation bias in Section 4.2.

4.1. Overestimation Bias in Actor-Critic

In actor-critic methods the policy is updated with respect to the value estimates of an approximate critic. In this section we assume the policy is updated using the deterministic policy gradient theorem, and show that the update induces overestimation in the value estimate. Given current policy parameters φ, let φapprox define the parameters from the actor update induced by the maximization of the approximate critic Qθ(s, a), and φtrue the parameters from the hypothetical actor update with respect to the true value function Qπ(s, a) (which is not known during learning):

φapprox = φ + (α/Z1) Es∼pπ[ ∇φπφ(s) ∇aQθ(s, a)|a=πφ(s) ]
φtrue = φ + (α/Z2) Es∼pπ[ ∇φπφ(s) ∇aQπ(s, a)|a=πφ(s) ],  (5)

where we assume Z1 and Z2 are chosen to normalize the gradient, i.e., such that Z⁻¹||E[·]|| = 1. Without normalized gradients, overestimation bias is still guaranteed to occur under slightly stricter conditions; we examine this case further in the supplementary material (B).

As the gradient direction is a local maximizer, there exists ε1 sufficiently small such that if α ≤ ε1 then the approximate value of πφapprox will be bounded below by the approximate value of πφtrue:

E[Qθ(s, πφapprox(s))] ≥ E[Qθ(s, πφtrue(s))].  (6)

Conversely, there exists ε2 sufficiently small such that if α ≤ ε2 then the true value of πφapprox will be bounded above by the true value of πφtrue:

E[Qπ(s, πφtrue(s))] ≥ E[Qπ(s, πφapprox(s))].  (7)

If in expectation the value estimate is at least as large as the true value with respect to φtrue, E[Qθ(s, πφtrue(s))] ≥ E[Qπ(s, πφtrue(s))], then Equations 6 and 7 imply that if α < min(ε1, ε2), the value estimate will be overestimated:

E[Qθ(s, πφapprox(s))] ≥ E[Qπ(s, πφapprox(s))].  (8)

Although this overestimation will be minimal with each update, it raises two concerns. First, the overestimation will develop into a significant bias over many updates if left unchecked. Second, an inaccurate value estimate will lead to repeated poor policy updates. This is particularly problematic because a feedback loop is created, in which sub-optimal actions might be highly rated by the sub-optimal critic, reinforcing the sub-optimal action in the next policy update.

Does this theoretical overestimation occur in practice for state-of-the-art methods? We answer this question by plotting the value estimate of DDPG (Lillicrap et al., 2015) over time while it learns on the MuJoCo Hopper-v1 and Walker2d-v1 environments (Todorov et al., 2012). In Figure 1, we graph the average value estimate over 10000 states and compare it to an estimate of the true value. The true value is estimated using the average discounted return over 1000 evaluations following the current policy, starting from states sampled from the replay buffer. A very clear overestimation bias arises from the learning procedure, which contrasts with the novel method that we describe later, TD3, which greatly reduces overestimation by the critic.

4.2. Clipped Double Q-Learning for Actor-Critic

Following Section 4.1, the learning procedure of actor-critic algorithms will induce overestimation of the value of actions taken by a sub-optimal policy. While several approaches to reducing overestimation bias have been proposed, we find them ineffective in an actor-critic setting. This section introduces a novel clipped variant of Double Q-learning (Van Hasselt, 2010) which can effectively act as the critic.

Figure 1. Measuring overestimation bias in the value estimates of DDPG and our proposed method, TD3, on the MuJoCo environments (a) Hopper-v1 and (b) Walker2d-v1, over 1 million time steps and 3 trials.

In Double Q-learning (Van Hasselt, 2010) the greedy update is disentangled from the value function by maintaining two separate value estimates, each of which is used to update the other. As a result, the target values will be unbiased estimates of the value of the actions selected by each policy. In Double DQN (Van Hasselt et al., 2016) the authors propose using the target network as one of the value estimates, and obtain a policy by greedy maximization of the current value network rather than the target network. In an actor-critic setting, an analogous update uses the current policy rather than the target policy in the learning target:

y = r + γQθtarget(s′, πφ(s′)). (9)

In practice, however, we found that with the slow-changing policy in actor-critic, the current and target networks were too similar to make an independent estimation, and offered little improvement over the original learning target. We show this effect empirically in Section 6.1. Instead, the original Double Q-learning formulation can be used, with a pair of actors (πφ1, πφ2) and critics (Qθ1, Qθ2), where πφ1 is optimized with respect to Qθ1 and πφ2 with respect to Qθ2:

y1 = r + γQθtarget,2(s′, πφ1(s′))
y2 = r + γQθtarget,1(s′, πφ2(s′)).  (10)

As πφ1 optimizes with respect to Qθ1, using Qθ2 in the target update of Qθ1 avoids the bias introduced by the policy update. While Qθ2(s, πφ1(s)) is an unbiased estimate of Qπ(s, πφ1(s)), for some s we will occasionally have Qθ2(s, πφ1(s)) > Qθ1(s, πφ1(s)). This is problematic because E[Qθ1(s, πφ1(s))] > E[Qπ(s, πφ1(s))], meaning that Qθ1 will generally overestimate, and in certain areas of the state space the overestimation will be further exaggerated.

To address this problem, we propose to upper-bound the value estimate by Qθ1, to give the target update of our Clipped Double Q-learning algorithm:

y1 = r + γ min_{i=1,2} Qθtarget,i(s′, πφ1(s′)).  (11)

With Clipped Double Q-learning, our worst case becomes the standard Q-learning target, and we restrict any possible overestimation introduced into the value estimate. Intuitively, Qθ2 acts as an unbiased estimate, while Qθ1 acts as an approximate upper-bound to the value estimate. While this update rule may induce an underestimation bias, this is far preferable to overestimation bias, as unlike overestimated actions, the value of underestimated actions will not be explicitly propagated through the policy update. Note that overestimation due to function approximation is not a good measure of uncertainty; thus, using an overestimating critic is not an effective way to implement optimism in the face of uncertainty (Kaelbling et al., 1996), and can lead to suboptimal policies (Thrun & Schwartz, 1993).
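A minimal PyTorch sketch of the clipped target in Equation 11, assuming two hypothetical target critics and the actor πφ1 (the full TD3 target in Section 6 additionally replaces the actor with the target actor and adds clipped noise):

```python
import torch

def clipped_double_q_target(critic1_target, critic2_target, actor,
                            reward, next_state, discount=0.99):
    # y1 = r + gamma * min_i Q_target_i(s', pi_phi1(s'))  (Eq. 11)
    with torch.no_grad():
        next_action = actor(next_state)
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        return reward + discount * torch.min(q1, q2)
```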

In implementation, computational costs can be reduced by using a single actor optimized with respect to Qθ1. We then use the same target y2 = y1 for Qθ2. If Qθ2 > Qθ1 then the update is identical to the standard update and induces no additional bias. If Qθ2 < Qθ1, this suggests overestimation has occurred and the value is reduced, similarly to Double Q-learning. A proof of convergence in the finite MDP setting follows from this intuition. We provide formal details and justification in the supplementary material (A).

By treating the function approximation error as a random variable, we can see that the minimum operator should assign higher value to states with lower-variance estimation error, as the expected minimum of a set of random variables decreases as the variance of the random variables increases. This effect means that the minimization in Equation 11 will lead to a preference for states with low-variance value estimates, leading to safer policy updates with stable learning targets.

5. Addressing Variance

In a function approximation setting the Bellman equation is never exactly satisfied, and each update leaves some amount of residual TD-error δ(s, a):

Qθ(s, a) = r + γQθ(s′, a′)− δ(s, a). (12)

In this section we emphasize the importance of minimizing error at each update, build the connection between target networks and estimation error, and propose modifications to the learning procedure of actor-critic for variance reduction.

5.1. Accumulating Error

Due to the temporal difference update, where an estimate of the value function is built from an estimate of a subsequent state, there is a build-up of function approximation error. While it is reasonable to expect small TD-errors for a given Bellman update, estimation errors can accumulate, resulting in the potential for large overestimation bias and suboptimal policy updates. Following Equation 12, and denoting y(s, a) = r + γQθ(s′, a′), it can be shown that rather than learning an estimate of the expected return, the value estimate approximates the expected return minus the expected discounted sum of future TD-errors:

Qθ(st, at) = rt + γE[Qθ(st+1, at+1)] − δt
           = rt + γE[rt+1 + γE[Qθ(st+2, at+2) − δt+1]] − δt
           ...
           = Es∼pπ[ Σ_{i=t}^{T} γ^{i−t} ri ] − Es∼pπ[ Σ_{i=t}^{T} γ^{i−t} δi ].  (13)

It then follows that the variance of our estimate will be proportional to the variance of future rewards and TD-errors. Given a large discount factor γ, the variance can grow rapidly with each update if the error from each update is not tamed. Furthermore, each gradient update only reduces error with respect to a small mini-batch, which gives little guarantee about the size of errors of value estimates outside the mini-batch.
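As a rough numerical illustration of Equation 13 (illustrative numbers only), accumulating independent zero-mean per-step TD-errors produces a residual term whose spread grows sharply as γ approaches 1:

```python
import numpy as np

rng = np.random.default_rng(0)
horizon, trials = 1000, 10000

for gamma in (0.9, 0.99):
    deltas = rng.normal(0.0, 0.1, size=(trials, horizon))   # per-step TD-errors
    discounts = gamma ** np.arange(horizon)
    accumulated = deltas @ discounts                         # sum_i gamma^i * delta_i
    print(gamma, accumulated.std())  # std of the residual grows from ~0.23 to ~0.71
```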

High-variance estimates are problematic as they introduce overestimation bias (Thrun & Schwartz, 1993; Anschel et al., 2017) and provide a noisy gradient for the policy update. This greatly increases the likelihood of poor local minima, a primary concern for methods using gradient-based policies.

5.2. Target Networks and Delayed Policy Updates

In this section we examine the relationship between target networks and function approximation error, and show that the use of a stable target reduces the growth of error noted above. This insight allows us to consider the interplay between high-variance estimates and policy performance, and we suggest that delaying policy updates is an effective strategy to improve convergence.

Target networks are a critical tool to achieve stability in deep reinforcement learning, and value estimates quickly diverge without them. As deep function approximators require multiple gradient updates to converge, target networks provide a stable objective in the learning procedure, and allow a greater coverage of the training data. Without a fixed target, each update may leave residual error which will begin to accumulate.

To provide some intuition, we examine the learning behavior with and without target networks in Figure 2, where we graph the value, in a similar manner to Figure 1, in the MuJoCo Hopper-v1 environment (Todorov et al., 2012). In (a) we compare the behavior with a fixed policy, and in (b) we examine the value estimates with a policy that continues to learn, trained with the current value estimate. The target network uses a slow-moving update rate, parametrized by τ.

Figure 2. Average estimated value of a randomly selected state on Hopper-v1 without a target network (τ = 1) and with a slow-updating target network (τ = 0.1, 0.01), with (a) a fixed policy and (b) a learned policy. (b) uses a logarithmic scale, where the true value is denoted by circles in the corresponding color.

While updating the value estimate without a target network (τ = 1) increases the volatility, all update rates result in similar convergent behavior when considering a fixed policy. However, when the policy is trained with the current value estimate, the use of quickly updating target networks results in highly divergent behavior. The worst case of this behavior is with τ = 0.1, which updates at a high enough rate to allow for error build-up but results in slower corrections than τ = 1. These results suggest that the divergence that occurs without the target network is the product of the policy update with a high-variance value estimate.

When are actor-critic methods divergent? Figure 2, as well as Section 4, suggests failures occur because of an interplay between the actor and critic updates. Value estimates diverge through overestimation when the policy is poor, yet the policy will only become poor if the value estimate itself is inaccurate.

If target networks can be used to reduce the error over multiple updates, and policy updates on high-error states cause divergence, then the policy network should be updated at a lower frequency than the value network, to first minimize error before introducing a policy update. Using a slow-moving target network helps in this regard, but we propose to delay policy updates until the value error is as small as possible. The proposed modification is to freeze the target network for some fixed number of updates, then update the policy when the error is minimized, and then finally update the target network slowly, θtarget ← τθ + (1 − τ)θtarget, so that the TD-error remains small.

By sufficiently delaying the policy updates we limit the likelihood of repeated updates with respect to similar critics. Even if the value estimate has high variance, the average estimated gradient with respect to the critic will be closer to the true gradient. The less frequent policy updates that do occur will use a lower-variance critic, and in principle result in higher-quality policy updates. This is captured by our empirical results presented in Section 6.1.
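A minimal sketch of this update schedule, with the critic and policy/target update steps passed in as hypothetical callables and d = 2 as in Section 6.1:

```python
from typing import Callable

def delayed_update_step(t: int,
                        update_critics: Callable[[], None],
                        update_actor_and_targets: Callable[[], None],
                        d: int = 2) -> None:
    # The critic is trained every iteration; the policy and the slowly-moving
    # target networks are only updated every d iterations, so each policy step
    # uses a value estimate whose per-update error has had time to shrink.
    update_critics()
    if t % d == 0:
        update_actor_and_targets()
```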


5.3. Target Policy Noise Regularization

The variance of function approximation error can be further reduced through regularization. We introduce a regularization strategy for deep value learning, target policy noise, in which we enforce the notion that similar actions should have similar value. While the function approximation does this implicitly, this relationship can be forced explicitly by modifying the training procedure. Ideally, rather than fitting the exact target value, fitting the value of a small area around the target action,

y = r + E_{a′∼π(s′)+ε}[ Qθtarget(s′, a′) ],  (14)

would have the benefit of smoothing the value estimate by bootstrapping off of similar state-action value estimates. This reduces the variance of the value estimate, and thus function approximation error, at the cost of introducing a negative bias. In practice, we can approximate this expectation over actions by adding a small amount of random noise to the target policy and averaging over mini-batches. This makes our modified target update:

y = r + γQθtarget(s′, a′ + ε),
ε ∼ clip(N(0, σ), −c, c),  (15)

where the added noise is clipped to keep the target in a small range. This results in an algorithm reminiscent of SARSA (Sutton & Barto, 1998) and Expected SARSA (Van Seijen et al., 2009), where the value estimate is instead learned off-policy and the target policy noise is chosen independently of the exploration policy. The value estimate learned is with respect to a noisy policy defined by the parameter σ. Intuitively, it is known that policies derived from SARSA value estimates tend to be safer, as they provide higher value to actions resistant to perturbations. Thus, this style of update can additionally lead to improvement in stochastic domains with failure cases.
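A minimal PyTorch sketch of the noisy target in Equation 15, assuming hypothetical target actor and target critic networks; σ = 0.2 and c = 0.5 follow the values reported in Section 6.1.

```python
import torch

def smoothed_target(critic_target, actor_target, reward, next_state,
                    discount=0.99, sigma=0.2, noise_clip=0.5):
    # y = r + gamma * Q_target(s', a' + eps), eps ~ clip(N(0, sigma), -c, c)  (Eq. 15)
    with torch.no_grad():
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * sigma).clamp(-noise_clip, noise_clip)
        return reward + discount * critic_target(next_state, next_action + noise)
```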

6. Experiments

We present Twin Delayed Deep Deterministic policy gradients (TD3), which builds on deep deterministic policy gradients (Lillicrap et al., 2015) by applying the modifications described in Sections 4.2, 5.2 and 5.3 to increase stability and performance with consideration of function approximation error. TD3 maintains a pair of critics along with a single actor. For each time step, we update the pair of critics towards the minimum target value of actions selected by the target policy:

y = r + γ min_{i=1,2} Qθtarget,i(s′, πφtarget(s′) + ε),
ε ∼ clip(N(0, σ), −c, c).  (16)

Every d iterations, the policy is updated with respect to Qθ1 following the deterministic policy gradient theorem (Silver et al., 2014). TD3 is summarized in Algorithm 1.

Algorithm 1 TD3

Initialize critic networks Qθi and actor network πφ with random parameters θi, φ
Initialize target networks θi,target ← θi, φtarget ← φ
Initialize replay buffer B
for t = 1 to T do
    Select action a ∼ πφ(s) + N(0, ε), observe reward r and new state s′
    Store transition tuple (s, a, r, s′) in B

    Sample mini-batch of N transitions (s, a, r, s′) from B
    a ← πφtarget(s′) + clip(N(0, σ), −c, c)
    y ← r + γ min_i Qθi,target(s′, a)
    Update θi to minimize N⁻¹ Σ (y − Qθi(s, a))²
    if t mod d = 0 then
        Update φ by the deterministic policy gradient:
            ∇φJ(φ) = N⁻¹ Σ ∇aQθ1(s, a)|a=πφ(s) ∇φπφ(s)
        Update target networks:
            θi,target ← τθi + (1 − τ)θi,target
            φtarget ← τφ + (1 − τ)φtarget
    end if
end for
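The following is a minimal PyTorch sketch of one TD3 training iteration along the lines of Algorithm 1, combining the clipped double-Q target, target policy noise and delayed policy updates. The network, optimizer and replay-buffer objects, the action bound `max_action` and the omission of terminal-state handling are simplifying assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def td3_train_step(t, replay_buffer, actor, actor_target,
                   critic1, critic2, critic1_target, critic2_target,
                   actor_opt, critic_opt, batch_size=100, discount=0.99,
                   tau=0.005, policy_noise=0.2, noise_clip=0.5,
                   policy_delay=2, max_action=1.0):
    # Sample a mini-batch of transitions (s, a, r, s') from the replay buffer.
    state, action, reward, next_state = replay_buffer.sample(batch_size)

    with torch.no_grad():
        # Target policy noise: a' = pi_target(s') + clipped Gaussian noise (Eq. 15).
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        # Clipped Double Q-learning target shared by both critics (Eq. 16).
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        y = reward + discount * target_q

    # Update both critics towards the shared target y.
    critic_loss = F.mse_loss(critic1(state, action), y) + F.mse_loss(critic2(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy and target-network updates, every policy_delay iterations.
    if t % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()  # deterministic policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        for net, target in ((critic1, critic1_target), (critic2, critic2_target),
                            (actor, actor_target)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
```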

Figure 3. Example MuJoCo environments: (a) HalfCheetah-v1, (b) Hopper-v1, (c) Walker2d-v1, (d) Ant-v1.

6.1. Evaluation

To evaluate our algorithm, we measure its performance on the suite of MuJoCo continuous control tasks (Todorov et al., 2012), interfaced through OpenAI Gym (Brockman et al., 2016) (Figure 3). To allow for reproducible comparison, we use Brockman et al.'s original set of tasks with no modifications to the environment or reward.

For our implementation of DDPG (Lillicrap et al., 2015), we use a two-layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. Unlike the original DDPG, both networks receive the state and action as input to the first layer. Both networks' parameters are updated using Adam (Kingma & Ba, 2014) with a learning rate of 10⁻³ and without L2 regularization. After each time step the networks are trained with a mini-batch of 100 transitions, sampled uniformly from an experience replay buffer containing the entire history of the agent.

Figure 4. Learning curves on the OpenAI gym continuous control tasks (a) HalfCheetah-v1, (b) Hopper-v1, (c) Walker2d-v1, (d) Ant-v1, (e) InvertedPendulum-v1, (f) InvertedDoublePendulum-v1 and (g) Reacher-v1, comparing TD3, DDPG, our DDPG, PPO, TRPO and ACKTR. The shaded region represents half a standard deviation of the average evaluation over 10 trials. Some graphs are cropped to display the interesting regions.

Table 1. Max Average Return over 10 trials of 1 million time steps. Maximum value for each task is bolded.

Environment     TD3                 DDPG      Our DDPG   PPO       TRPO      ACKTR
HalfCheetah     9751.01 ± 1326.48   3305.60   8577.29    1795.43   -15.57    1450.46
Hopper          3567.60 ± 127.37    2020.46   1860.02    2164.70   2471.30   2428.39
Walker2d        4565.89 ± 529.36    1843.85   3098.11    3317.69   2321.47   1216.70
Ant             4765.85 ± 737.48    1005.30   888.77     1083.20   -75.85    1821.94
Reacher         -3.70 ± 0.51        -6.51     -4.01      -6.18     -111.43   -4.26
InvPendulum     1000.00 ± 0         1000.00   1000.00    1000.00   985.40    1000.00
InvDPendulum    9337.47 ± 14.96     9355.52   8369.95    8977.94   205.85    9081.92

The target policy noise is implemented by adding ε ∼ N(0, 0.2) to the actions chosen by the target actor network, clipped to (−0.5, 0.5). Delayed policy updates consist of only updating the actor and target critic networks every d iterations, with d = 2. For fair comparison to DDPG the critics are only trained once per time step; while a larger d would result in a larger benefit, training the actor for too few iterations would cripple learning. Both target networks were updated with τ = 0.005.

To remove the dependency on the initial parameters of the policy, we use a purely exploratory policy for the first 10000 time steps of stable-length environments (HalfCheetah-v1 and Ant-v1) or the first 1000 time steps for the remaining environments. Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. Unlike the original implementation of DDPG, we used uncorrelated noise for exploration, as we found noise drawn from the Ornstein-Uhlenbeck (Uhlenbeck & Ornstein, 1930) process offered no performance benefits.
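A minimal sketch of this action-selection scheme, assuming the initial exploratory policy is uniform over a symmetric action range [-max_action, max_action] (the exact exploratory policy is not specified above) and that actions are clipped to that range:

```python
import numpy as np

def select_action(actor, state, t, action_dim, max_action=1.0,
                  start_timesteps=10000, expl_noise=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if t < start_timesteps:
        # Purely exploratory phase: ignore the policy entirely.
        return rng.uniform(-max_action, max_action, size=action_dim)
    # Afterwards: deterministic policy plus Gaussian exploration noise N(0, 0.1).
    action = actor(state) + rng.normal(0.0, expl_noise, size=action_dim)
    return np.clip(action, -max_action, max_action)
```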

Each task is run for 1 million time steps with evaluations every 5000 time steps, where each evaluation reports the average reward over 10 episodes with no exploration noise. Our results are reported over 10 random seeds of the Gym simulator and the network initialization. We compare our algorithm against DDPG (Lillicrap et al., 2015) as well as the state-of-the-art policy gradient algorithms PPO (Schulman et al., 2017), ACKTR (Wu et al., 2017) and TRPO (Schulman et al., 2015), as implemented by OpenAI's baselines repository (Dhariwal et al., 2017). Additionally, we compare our method with our re-tuned version of DDPG, which includes all architecture and hyper-parameter modifications to DDPG without the adjustments to the target and policy updates. A full comparison between the implementation of our algorithm and the baseline DDPG is provided in the supplementary material (C).

Our results are presented in Table 1 and learning curves in Figure 4. TD3 matches or outperforms all other algorithms in both final performance and learning speed across all tasks.


Figure 5. Comparison of TD3, TD3 without Clipped Double Q-learning and the Double-1 and Double-2 variants on (a) HalfCheetah-v1, (b) Hopper-v1, (c) Walker2d-v1 and (d) Ant-v1.

6.2. Ablation Studies

We perform ablation studies to understand the contribution of each individual component: Clipped Double Q-learning (Section 4.2), delayed policy updates (Section 5.2) and target policy noise (Section 5.3). We present our results in Table 2, in which we compare the performance of removing each component from TD3, along with our modifications to architecture and hyper-parameters and the two actor-critic variants of Double Q-learning (Van Hasselt, 2010; Van Hasselt et al., 2016) described in Section 4.2 for DDPG, denoted Double-1 for the single-critic variant and Double-2 for the double-critic variant. For fairness in comparison, these methods also benefited from delayed policy updates and target policy noise, and use our architecture and hyper-parameters. Full learning curves can be found in the supplementary material (D).

Table 2. Average return over the last 10 evaluations over 10 trials of 1 million time steps, comparing ablation over delayed policy updates (DP), target policy noise (TPN), Clipped Double Q-learning (CDQ) and our architecture, hyper-parameters and exploration (AHE). Maximum value for each task is bolded.

Method       HCheetah    Hopper     Walker2d   Ant
TD3          9551.41     3353.60    4456.42    4653.71
DDPG         3162.50     1731.94    1520.90    816.35
+ AHE        8401.02     1512.03    2362.12    564.07
AHE + DP     7588.64     1465.11    2459.53    896.13
AHE + TPN    9023.40     907.56     2961.36    872.17
AHE + CDQ    6470.20     1134.14    3979.21    3818.71
TD3 - DP     9720.88     2170.80    4497.50    4382.01
TD3 - TPN    8987.69     2392.59    4033.67    4155.24
TD3 - CDQ    10280.90    1543.17    2773.63    933.91
Double-1     10723.53    2087.33    2278.33    1093.73
Double-2     10413.73    1799.57    3494.05    2935.28

The significance of each component varies from task to task. In the HalfCheetah-v1 environment we see that the architecture contributes the largest improvement, while in Ant-v1 the improvements are almost entirely due to the CDQ target. Although the actor is trained for only half the number of iterations, the inclusion of delayed policy updates causes no loss in performance across all environments except for HalfCheetah-v1. The full algorithm outperforms every other method in most tasks (and performs competitively on the only exception, the HalfCheetah-v1 task). Ablation of all proposed components performed the worst in every task. While the addition of only a single component causes an insignificant improvement in most cases, the addition of combinations performs at a much higher level.

We compare the effectiveness of the Double Q-learning variants, Double-1 and Double-2, in Figure 5 and Table 2. Due to the slow-changing nature of actor-critic, the target networks of Double-1 remain too close to the current networks to capture independence. As a result, Double-1 generally causes an insignificant change in results from TD3 with our Clipped Double Q-learning, while Double-2 shows a notable improvement in Walker2d-v1 and Ant-v1. These problems are alleviated in Double-2; however, it is still greatly outperformed by the full TD3 in most tasks. This suggests that subduing the overestimations from the unbiased estimator is an effective measure to improve performance.

7. Conclusion

We have shown that overestimation bias occurs in actor-critic methods and that its presence worsens empirical performance. With this in consideration, we develop a novel clipped variant of Double Q-learning for actor-critic methods. We justify the use of the target network in deep value learning and its role in function approximation error reduction. Finally, we introduce a SARSA-style target regularizer to bootstrap off similar state-action pairs.

We show that our final method, TD3, reduces overestimation bias and improves learning speed and performance in a number of challenging continuous control tasks, exceeding the performance of numerous state-of-the-art algorithms. As our modifications are simple to implement and maintain the off-policy nature of Q-learning, they can be easily added to deep Q-learning alongside numerous other improvements such as those of Schaul et al. (2016), Wang et al. (2016) and Bellemare et al. (2017).


References

Anschel, Oron, Baram, Nir, and Shimkin, Nahum. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185, 2017.

Barth-Maron, Gabriel, Hoffman, Matthew W, Budden, David, Dabney, Will, Horgan, Dan, TB, Dhruva, Muldal, Alistair, Heess, Nicolas, and Lillicrap, Timothy. Distributional policy gradients. International Conference on Learning Representations, 2018.

Bellemare, Marc G, Dabney, Will, and Munos, Remi. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, Richard. Dynamic Programming. Princeton University Press, 1957.

Bertsekas, Dimitri P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym, 2016.

Dhariwal, Prafulla, Hesse, Christopher, Plappert, Matthias, Radford, Alec, Schulman, John, Sidor, Szymon, and Wu, Yuhuai. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Fox, Roy, Pakman, Ari, and Tishby, Naftali. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.

Gu, Shixiang, Lillicrap, Timothy, Sutskever, Ilya, and Levine, Sergey. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.

He, Frank S, Liu, Yang, Schwing, Alexander G, and Peng, Jian. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina, and Meger, David. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Horgan, Dan, Quan, John, Budden, David, Barth-Maron, Gabriel, Hessel, Matteo, van Hasselt, Hado, and Silver, David. Distributed prioritized experience replay. International Conference on Learning Representations, 2018.

Kaelbling, Leslie Pack, Littman, Michael L, and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lee, Donghun, Defourny, Boris, and Powell, Warren B. Bias-corrected Q-learning to control max-operator bias in Q-learning. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, pp. 93–99. IEEE, 2013.

Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Lin, Long-Ji. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Mahmood, Ashique Rupam, Yu, Huizhen, and Sutton, Richard S. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.

Mannor, Shie, Simester, Duncan, Sun, Peng, and Tsitsiklis, John N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.

Metz, Luke, Ibarz, Julian, Jaitly, Navdeep, and Davidson, James. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Munos, Remi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.

O'Donoghue, Brendan, Osband, Ian, Munos, Remi, and Mnih, Volodymyr. The uncertainty Bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Pendrith, Mark D, Ryan, Malcolm RK, et al. Estimator variance in reinforcement learning: Theoretical problems and practical solutions. University of New South Wales, School of Computer Science and Engineering, 1997.

Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.

Popov, Ivaylo, Heess, Nicolas, Lillicrap, Timothy, Hafner, Roland, Barth-Maron, Gabriel, Vecerik, Matej, Lampe, Thomas, Tassa, Yuval, Erez, Tom, and Riedmiller, Martin. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.

Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, David, Lever, Guy, Heess, Nicolas, Degris, Thomas, Wierstra, Daan, and Riedmiller, Martin. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.

Singh, Satinder, Jaakkola, Tommi, Littman, Michael L, and Szepesvari, Csaba. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.

Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Tavakoli, Arash, Pardo, Fabio, and Kormushev, Petar. Action branching architectures for deep reinforcement learning. arXiv preprint arXiv:1711.08946, 2017.

Thrun, Sebastian and Schwartz, Anton. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

Uhlenbeck, George E and Ornstein, Leonard S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

Van Hasselt, Hado. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.

Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. In AAAI, pp. 2094–2100, 2016.

Van Seijen, Harm, Van Hasselt, Hado, Whiteson, Shimon, and Wiering, Marco. A theoretical and empirical analysis of expected SARSA. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL '09. IEEE Symposium on, pp. 177–184. IEEE, 2009.

Wang, Ziyu, Schaul, Tom, Hessel, Matteo, Hasselt, Hado, Lanctot, Marc, and Freitas, Nando. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003, 2016.

Watkins, Christopher John Cornish Hellaby. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

Wu, Yuhuai, Mansimov, Elman, Grosse, Roger B, Liao, Shun, and Ba, Jimmy. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285–5294, 2017.


Supplementary Material

A. Proof of Convergence of Clipped Double Q-Learning

In the finite MDP setting, the tabular version of Clipped Double Q-learning maintains two value estimates QA and QB. At each time step we select the action a∗ = argmaxa QA(s, a) and then perform an update by setting the target y:

a∗ = argmaxa QA(s′, a)
y = r + γ min(QA(s′, a∗), QB(s′, a∗)),  (17)

and updating the value estimates with respect to the target and learning rate αt(s, a):

QA(s, a) = QA(s, a) + αt(s, a)(y − QA(s, a))
QB(s, a) = QB(s, a) + αt(s, a)(y − QB(s, a)).  (18)

In a finite MDP setting Double Q-learning is often used to deal with noise induced by random rewards or state transitions, and so either QA or QB is updated randomly. However, in a function approximation setting the interest may be more towards the approximation error; thus we can update both QA and QB at each iteration. The proof extends naturally to updating either randomly.
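A minimal NumPy sketch of the tabular update in Equations 17 and 18, updating both tables towards the shared clipped target at every step as discussed above; the learning rate and discount values are illustrative.

```python
import numpy as np

def clipped_double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # a* = argmax_a QA(s', a); y = r + gamma * min(QA(s', a*), QB(s', a*))  (Eq. 17)
    a_star = np.argmax(QA[s_next])
    y = r + gamma * min(QA[s_next, a_star], QB[s_next, a_star])
    # Both tables move towards the shared clipped target (Eq. 18).
    QA[s, a] += alpha * (y - QA[s, a])
    QB[s, a] += alpha * (y - QB[s, a])
```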

The proof borrows heavily from the proof of convergence of SARSA (Singh et al., 2000) as well as Double Q-learning (Van Hasselt, 2010). The proof of Lemma 1 can be found in Singh et al. (2000), building on a proposition from Bertsekas (1995).

Lemma 1. Consider a stochastic process (ζt, ∆t, Ft), t ≥ 0, where ζt, ∆t, Ft : X → R satisfy the equation

∆t+1(xt) = (1 − ζt(xt))∆t(xt) + ζt(xt)Ft(xt),  (19)

where xt ∈ X and t = 0, 1, 2, .... Let Pt be a sequence of increasing σ-fields such that ζ0 and ∆0 are P0-measurable and ζt, ∆t and Ft−1 are Pt-measurable, t = 1, 2, .... Assume that the following hold:

1. The set X is finite.
2. ζt(xt) ∈ [0, 1], Σt ζt(xt) = ∞, Σt (ζt(xt))² < ∞ with probability 1, and ∀x ≠ xt: ζt(x) = 0.
3. ||E[Ft|Pt]|| ≤ κ||∆t|| + ct, where κ ∈ [0, 1) and ct converges to 0 with probability 1.
4. Var[Ft(xt)|Pt] ≤ K(1 + κ||∆t||)², where K is some constant,

where || · || denotes the maximum norm. Then ∆t converges to 0 with probability 1.

Theorem 1. Given the following conditions:

1. Each state-action pair is sampled an infinite number of times.
2. The MDP is finite.
3. γ ∈ [0, 1).
4. Q values are stored in a lookup table.
5. Both QA and QB receive an infinite number of updates.
6. The learning rates satisfy αt(s, a) ∈ [0, 1], Σt αt(s, a) = ∞, Σt (αt(s, a))² < ∞ with probability 1, and αt(s, a) = 0 ∀(s, a) ≠ (st, at).
7. Var[r(s, a)] < ∞ ∀s, a.

Then Clipped Double Q-learning will converge to the optimal value function Q∗, as defined by the Bellman optimality equation, with probability 1.

Proof of Theorem 1. We apply Lemma 1 with Pt = {QA0, QB0, s0, a0, α0, r1, s1, ..., st, at}, X = S × A, ∆t = QAt − Q∗, ζt = αt.

First note that conditions 1 and 4 of the lemma hold by conditions 2 and 7 of the theorem, respectively. Lemma condition 2 holds by theorem condition 6, along with our selection of ζt = αt.

Defining a∗ = argmaxa QAt(st+1, a), we have

∆t+1(st, at) = (1 − αt(st, at))(QAt(st, at) − Q∗(st, at)) + αt(st, at)(rt + γ min(QAt(st+1, a∗), QBt(st+1, a∗)) − Q∗(st, at))
             = (1 − αt(st, at))∆t(st, at) + αt(st, at)Ft(st, at),  (20)

where we have defined Ft(st, at) as:

Ft(st, at) = rt + γ min(QAt(st+1, a∗), QBt(st+1, a∗)) − Q∗(st, at)
           = rt + γ min(QAt(st+1, a∗), QBt(st+1, a∗)) − Q∗(st, at) + γQAt(st+1, a∗) − γQAt(st+1, a∗)
           = FQt(st, at) + ct,  (21)

where FQt = rt + γQAt(st+1, a∗) − Q∗(st, at) denotes the value of Ft under standard Q-learning and ct = γ min(QAt(st+1, a∗), QBt(st+1, a∗)) − γQAt(st+1, a∗). As E[FQt | Pt] ≤ γ||∆t|| is a well-known result, condition 3 of Lemma 1 holds if it can be shown that ct converges to 0 with probability 1.

Let y = rt + γ min(QBt(st+1, a∗), QAt(st+1, a∗)) and ∆BAt(st, at) = QBt(st, at) − QAt(st, at), where ct converges to 0 if ∆BA converges to 0. The update of ∆BAt at time t is the sum of the updates of QA and QB:

∆BAt+1(st, at) = ∆BAt(st, at) + αt(st, at)((y − QBt(st, at)) − (y − QAt(st, at)))
              = ∆BAt(st, at) + αt(st, at)(QAt(st, at) − QBt(st, at))
              = (1 − αt(st, at))∆BAt(st, at).  (22)

Clearly ∆BAt will converge to 0, which then shows we have satisfied condition 3 of Lemma 1, implying that QA(st, at) converges to Q∗(st, at). Similarly, we get convergence of QB(st, at) to the optimal value function by choosing ∆t = QBt − Q∗ and repeating the same arguments, thus proving Theorem 1.

B. Overestimation Bias in Deterministic Policy Gradients

If the gradients from the deterministic policy gradient update are unnormalized, this overestimation is still guaranteed to occur under a slightly stronger condition on the expectation of the value estimate. Assume the approximate value function is equal to the true value function, in expectation over the steady-state distribution, with respect to policy parameters between the original policy and the direction of the true policy update:

Es∼π[Qθ(s, πφnew(s))] = Es∼π[Qπ(s, πφnew(s))]
∀φnew ∈ [φ, φ + β(φtrue − φ)] such that β > 0.  (23)

Then, noting that φtrue maximizes the rate of change of the true value ∆Qπ(s, πφ(s)), by the given condition (23) the maximal rate of change of the approximate value must be at least this great, ∆Qθ(s, πφ(s)) ≥ ∆Qπ(s, πφ(s)). Summing this change with condition (23) then shows an overestimation of the value function.


Table 3. A complete comparison of hyper-parameter choices between our DDPG and the OpenAI baselines implementation (Dhariwal et al., 2017).

Hyper-parameter            Ours         DDPG
Critic Learning Rate       10⁻³         10⁻³
Critic Regularization      None         10⁻² · ||θ||²
Actor Learning Rate        10⁻³         10⁻⁴
Actor Regularization       None         None
Optimizer                  Adam         Adam
Target Update Rate (τ)     5 · 10⁻³     10⁻³
Batch Size                 100          64
Iterations per time step   1            1
Discount Factor            0.99         0.99
Reward Scaling             1.0          1.0
Normalized Observations    False        True
Gradient Clipping          False        False
Exploration Policy         N(0, 0.1)    OU, θ = 0.15, µ = 0, σ = 0.2

C. Network and Hyper-parameter Comparison

DDPG Critic Architecture
(state dim, 400)
ReLU
(action dim + 400, 300)
ReLU
(300, 1)

DDPG Actor Architecture
(state dim, 400)
ReLU
(400, 300)
ReLU
(300, 1)
tanh

Our Critic Architecture
(state dim + action dim, 400)
ReLU
(action dim + 400, 300)
ReLU
(300, 1)

Our Actor Architecture
(state dim, 400)
ReLU
(400, 300)
ReLU
(300, 1)
tanh
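A sketch of these architectures in PyTorch, under two assumptions not fixed by the listing above: the critic's second layer is read literally as taking the action concatenated with the 400 hidden units, and the actor's final layer outputs one unit per action dimension rather than the single unit shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    # "Our" critic: state and action enter the first layer; the action is
    # concatenated again with the 400 hidden units for the second layer.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(action_dim + 400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        h = F.relu(self.l1(torch.cat([state, action], dim=1)))
        h = F.relu(self.l2(torch.cat([h, action], dim=1)))
        return self.l3(h)

class Actor(nn.Module):
    # "Our" actor: 400 and 300 hidden units with ReLU, tanh on the output.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)  # assumed action-dim output
    def forward(self, state):
        h = F.relu(self.l1(state))
        h = F.relu(self.l2(h))
        return torch.tanh(self.l3(h))
```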

D. Learning Curves

Figure 6. Ablation over the varying modifications to our DDPG (+ AHE), comparing no delayed policy (- DP), no target policy noise (- TPN) and no clipped Double-Q (- CDQ), on (a) HalfCheetah-v1, (b) Hopper-v1, (c) Walker2d-v1 and (d) Ant-v1.

Figure 7. Ablation over the varying modifications to our DDPG (+ AHE), comparing the addition of delayed policy (AHE + DP), target policy noise (AHE + TPN) and clipped Double-Q (AHE + CDQ), on (a) HalfCheetah-v1, (b) Hopper-v1, (c) Walker2d-v1 and (d) Ant-v1.