Variance Reduction for Evolutionary Strategies via Structured Control Variates

Yunhao Tang (Columbia University), Krzysztof Choromanski (Google Robotics), Alp Kucukelbir (Fero Labs & Columbia University)

Abstract

Evolution strategies (es) are a powerful class of blackbox optimization techniques that recently became a competitive alternative to state-of-the-art policy gradient (pg) algorithms for reinforcement learning (rl). We propose a new method for improving the accuracy of es algorithms that, in contrast to recent approaches exploiting only the Monte Carlo structure of the gradient estimator, takes advantage of the underlying Markov decision process (mdp) structure to reduce variance. We observe that the gradient estimator of the es objective can alternatively be computed using reparameterization and pg estimators, which leads to new control variate techniques for gradient estimation in es optimization. We provide theoretical insights and show through extensive experiments that this rl-specific variance reduction approach outperforms general-purpose variance reduction methods.

1 Introduction

Evolution strategies (es) have regained popularity through their successful application to modern reinforcement learning (rl). es are a powerful alternative to policy gradient (pg) methods. Instead of leveraging the Markov decision process (mdp) structure of a given rl problem, es cast the rl problem as a blackbox optimization. To carry out this optimization, es use gradient estimators based on randomized finite difference methods. This presents a trade-off: es are better at handling long horizons and sparse rewards than pg methods, but the es gradient estimator may exhibit prohibitively large variance.


Variance reduction techniques can make both methods more practical. Control variates (also known as baseline functions) that leverage Markovian (Mnih et al., 2016; Schulman et al., 2015b, 2017) and factorized policy structures (Gu et al., 2016; Liu et al., 2017; Grathwohl et al., 2018; Wu et al., 2018) help to improve pg methods. In contrast to these structured approaches, variance reduction for es has focused on general-purpose Monte Carlo (mc) techniques, such as antithetic sampling (Salimans et al., 2017; Mania et al., 2018), orthogonalization (Choromanski et al., 2018), optimal couplings (Rowland et al., 2018), and quasi-mc sampling (Choromanski et al., 2018; Rowland et al., 2018).

Main idea. We propose a variance reduction technique for es that leverages the underlying mdp structure of rl problems. We begin with a simple reparameterization of the problem that uses pg estimators computed via backpropagation. We then construct a control variate from the difference of two gradient estimators of the same objective. The result is an rl-specific variance reduction technique for es that achieves better performance across a wide variety of rl problems.

Figure 1 summarizes the performance of our proposal over 16 rl benchmark tasks. Our method consistently improves over vanilla es baselines and other state-of-the-art, general-purpose mc variance reduction methods. Moreover, we provide theoretical insight into why our algorithm achieves more variance reduction than orthogonal es (Choromanski et al., 2018) when the policy itself is highly stochastic. (Section 3 presents a detailed analysis.)

Related Work. Control variates are commonly used to reduce the variance of mc-based estimators (Ross, 2002). In black-box variational inference, carefully designed control variates can reduce the variance of mc gradient updates, leading to faster convergence (Paisley et al., 2012; Ranganath et al., 2014). In rl, pg methods apply state-dependent baseline functions as control variates (Mnih et al., 2016; Schulman et al., 2015b, 2017).


Figure 1: Percentile improvement over 16 benchmark tasks. The metric is calculated for each task as $(r_{cv} - r_{random})/(r_{es} - r_{random})$, where $r_{cv}$, $r_{es}$, $r_{random}$ are the final rewards obtained with our control variate, vanilla es, and a random policy, respectively. Our proposal consistently improves over vanilla es on all tasks, and over all compared variance reduction methods on 10 out of 16 tasks. Section 4 provides additional details.

While action-dependent control variates (Gu et al., 2017; Liu et al., 2017; Grathwohl et al., 2018; Wu et al., 2018) have been proposed to achieve further variance reduction, Tucker et al. (2018) recently showed that the reported performance gains may be due to subtle implementation details rather than better baseline functions.

To leverage the mdp structure in developing an rl-specific control variate for es, we derive a gradient estimator for the es objective based on reparameterization (Kingma and Welling, 2014) and the pg estimator. The control variate is constructed as the difference between two alternative gradient estimators. Our approach is related to a control variate technique developed for discrete latent variable models in variational inference (Tucker et al., 2017). The idea is to relax the discrete model into a differentiable one and construct the control variate as the difference between the score function gradient estimator and the reparameterized gradient estimator of the relaxed model. We expand on this connection in Section 3.3.

2 Policy Optimization in Reinforcement Learning

Sequential decision making problems are often formulated as mdps. Consider an episode indexed by time. At any given time t ≥ 0, an agent is in a state s_t ∈ S. The agent then takes an action a_t ∈ A, receives an instant reward r_t = r(s_t, a_t) ∈ R, and transitions to the next state s_{t+1} ∼ p(· | s_t, a_t), where p is a distribution determining transition probabilities. Define the policy π : S → P(A) as a conditional distribution over actions A given a state s ∈ S. rl seeks to maximize the expected cumulative reward over a given time horizon T,

$J_\gamma(\pi) = \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big]$,  (1)

where γ ∈ (0, 1] is a discount factor and the expectation is with respect to the randomized environment and the outputs of the policy π.

Ideally, we would like to work with an infinite horizon and no discount factor. In practice, the horizon T is bounded by sample collection (Brockman et al., 2016), while directly optimizing the undiscounted objective J_1(π) admits unusably high-variance gradient estimators (Schulman et al., 2015b). As a result, modern rl algorithms tackle the problem through a discount factor γ < 1, which reduces the variance of the gradient estimators but introduces bias (Schulman et al., 2015b, 2017; Mnih et al., 2016). At evaluation time, the algorithms are evaluated with finite horizons T < ∞ and undiscounted returns (γ = 1) (Schulman et al., 2015b, 2017; Mnih et al., 2016). We follow this setup here as well.

Consider parameterizing the policy as π_θ where θ ∈ R^d. The goal is to optimize Equation (1) with respect to the policy parameters. A natural approach is to use exact gradient methods. Regrettably, this objective function does not admit an analytic gradient. Thus, we turn to stochastic gradient techniques (Robbins and Monro, 1951) and seek to construct approximations to the true gradient $g^\gamma_\theta = \nabla_\theta J_\gamma(\pi_\theta)$.

2.1 Evolution Strategies for Policy Optimization

Evolution strategies (es) (Salimans et al., 2017) take a black-box optimization approach to maximizing Equation (1). To do so, the first step is to ensure that the objective function is differentiable with respect to the policy parameters. To this end, es begin by convolving the original objective J_γ(π_θ) with a multivariate isotropic Gaussian distribution with mean θ and variance σ²:

$F_{\sigma,\gamma}(\theta) = \mathbb{E}_{\theta' \sim \mathcal{N}(\theta, \sigma^2 I)}\big[J_\gamma(\pi_{\theta'})\big]$.  (2)


es maximize this smooth objective as a proxy for maximizing the original objective J_γ(π_θ). The convolved objective F enjoys the advantage of being differentiable with respect to the policy parameters. In the limit σ → 0, an optimal point of F_{σ,γ}(θ) is also optimal with respect to J_γ(π_θ). The next step is to derive a gradient of Equation (2). Consider the score function gradient,

$\nabla_\theta F_{\sigma,\gamma}(\theta) = \mathbb{E}_{\theta' \sim \mathcal{N}(\theta,\sigma^2 I)}\Big[J_\gamma(\pi_{\theta'})\,\frac{\theta' - \theta}{\sigma^2}\Big]$.  (3)

This gradient can be computed by sampling θ'_i ∼ N(θ, σ²I) and computing unbiased estimates of each J_γ(π_{θ'_i}) using a single roll-out trajectory of π_{θ'_i} in the environment. The resulting score function gradient estimator has the following form:

$g^{es,\gamma}_\theta = \frac{1}{N}\sum_{i=1}^{N} J_\gamma(\pi_{\theta'_i})\,\frac{\theta'_i - \theta}{\sigma^2}$.  (4)

This gradient estimator is biased with respect to the original objective. However, in practice this bias does not hinder optimization; on the contrary, the smoothed objective landscape is often more amenable to gradient-based optimization (Leordeanu and Hebert, 2008). We also note that, although the es gradient is defined for any γ ∈ (0, 1], in practice the parameters are updated with the gradient of the undiscounted objective $g^{es,1}_\theta$ (Salimans et al., 2017; Mania et al., 2018; Choromanski et al., 2018).
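For concreteness, the following minimal NumPy sketch computes the estimator in Equation (4) from N perturbations. The function `rollout_return` (assumed to return a single-rollout estimate of J_γ for a given parameter vector) is a hypothetical placeholder, not code from the paper.

```python
import numpy as np

def es_gradient(theta, rollout_return, sigma=0.02, n_samples=5, rng=None):
    """Score-function (vanilla ES) estimate of grad F_{sigma,gamma}(theta), cf. Eq. (4).

    `theta` is a flat 1-D parameter vector; `rollout_return(theta_prime)` is a
    placeholder returning a single-rollout estimate of J_gamma(pi_{theta_prime}).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.shape[0]
    grad = np.zeros(d)
    for _ in range(n_samples):
        eps = rng.standard_normal(d)              # theta' = theta + sigma * eps
        ret = rollout_return(theta + sigma * eps)
        grad += ret * eps / sigma                 # J(pi_{theta'}) * (theta' - theta) / sigma^2
    return grad / n_samples
```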

2.2 Policy Gradient Methods for Policy Optimization

Policy gradient (pg) methods (Sutton et al., 2000) take a different approach. Instead of deriving the gradient through parameter-level perturbations as in es, the core idea of pg is to leverage the randomness in the policy itself. Using a standard procedure from stochastic computation graphs (Schulman et al., 2015a), we compute the gradient of Equation (1) as follows:

$\nabla_\theta J_\gamma(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1}\Big(\sum_{t'=t}^{T-1} r_{t'}\Big)\gamma^t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$.  (5)

Unbiased estimators $g^{pg,\gamma}_\theta$ of this gradient can be computed by sampling, as above for the es method. In practice, the sample estimate of Equation (5) often has large variance, which destabilizes the updates. To alleviate this issue, one convenient choice is to set γ < 1 so that the long-term effects of actions are weighted down exponentially. This reduces the variance of the estimator, but introduces bias with respect to the original undiscounted objective J_1(π_θ).
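A minimal single-trajectory sketch of the estimator in Equation (5) is shown below. The trajectory format and the helper `grad_log_prob` (returning ∇_θ log π_θ(a|s)) are hypothetical placeholders standing in for the reader's own environment and policy code.

```python
import numpy as np

def pg_estimate(theta, trajectory, grad_log_prob, gamma=0.99):
    """Single-trajectory estimate of Eq. (5).

    `trajectory` is a list of (state, action, reward) tuples;
    `grad_log_prob(theta, s, a)` returns grad_theta log pi_theta(a|s).
    Both are placeholders for user-supplied code.
    """
    rewards = np.array([r for (_, _, r) in trajectory], dtype=float)
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(trajectory):
        reward_to_go = rewards[t:].sum()                     # sum_{t'=t}^{T-1} r_{t'}
        grad += (gamma ** t) * reward_to_go * grad_log_prob(theta, s, a)
    return grad
```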

es and pg are two alternative methods for deriving gradient estimators with respect to the policy parameters. Intuitively, the two methods complement each other for variance reduction: pg leverages the mdp structure and achieves lower variance when the policy is stochastic; es derives the gradient by injecting noise directly into the parameter space and achieves lower variance when the policy itself is near-deterministic. Our goal in this paper is to develop a single estimator that benefits from both approaches. We formalize this intuition in the next section.

3 Variance Reduction via Structured Control Variates

We seek a control variate for the es gradient estimator in Equation (4). Recall that this gradient is with respect to a smoothed objective F_{σ,γ}(θ) with γ ∈ (0, 1].

3.1 Reparameterized Gradients of the Smoothed Objective

The es gradient estimator in Equation (4) leverages the derivative of the logarithm. We can also apply the reparameterization technique (Kingma and Welling, 2014) to the distribution θ' ∼ N(θ, σ²I) to obtain:

$\nabla_\theta F_{\sigma,\gamma}(\theta) = \nabla_\theta \mathbb{E}_{\theta' \sim \mathcal{N}(\theta,\sigma^2 I)}[J_\gamma(\pi_{\theta'})] = \nabla_\theta \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[J_\gamma(\pi_{\theta + \varepsilon\sigma})] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[\nabla_{\theta + \varepsilon\sigma} J_\gamma(\pi_{\theta + \varepsilon\sigma})]$,  (6)

where $\nabla_{\theta+\varepsilon\sigma} J_\gamma(\pi_{\theta+\varepsilon\sigma})$ can be computed by pg estimators for the discounted objective (5). To estimate (6), we sample ε_i ∼ N(0, I) and construct perturbed policies θ'_i = θ + ε_i · σ. An unbiased estimate $g^{pg,\gamma}_{\theta+\varepsilon_i\sigma}$ of the policy gradient $\nabla_{\theta+\varepsilon_i\sigma} J_\gamma(\pi_{\theta+\varepsilon_i\sigma})$ can then be computed from a single rollout trajectory of π_{θ+ε_i·σ}. Finally, the reparameterized gradient is computed by averaging:

$g^{re,\gamma}_\theta = \frac{1}{N}\sum_{i=1}^{N} g^{pg,\gamma}_{\theta + \varepsilon_i \sigma}$.  (7)
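A minimal sketch of the reparameterized estimator in Equation (7) follows. The helper `pg_estimate_at` (assumed to return a single-rollout policy-gradient estimate at a given perturbed parameter vector, e.g. via the sketch above) is a hypothetical placeholder.

```python
import numpy as np

def reparam_es_gradient(theta, pg_estimate_at, sigma=0.02, n_samples=5, rng=None):
    """Reparameterized estimate of grad F_{sigma,gamma}(theta), cf. Eq. (7).

    `pg_estimate_at(theta_prime)` is a placeholder returning a single-rollout
    policy-gradient estimate g^{pg,gamma} at theta_prime.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.shape[0]
    grad = np.zeros(d)
    for _ in range(n_samples):
        eps = rng.standard_normal(d)
        grad += pg_estimate_at(theta + sigma * eps)   # g^{pg,gamma}_{theta + sigma*eps}
    return grad / n_samples
```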

3.2 Evolution Strategies with Structured Control Variates

For the discounted objective F_{σ,γ}(θ) we thus have two alternative gradient estimators. One is the score function gradient estimator (Equation (4)). The other uses the reparameterization technique along with policy gradient estimators (Equation (7)). Combining these two estimators with the vanilla es gradient for the undiscounted objective $g^{es,1}_\theta$, we get:

$g^{cv}_\theta = g^{es,1}_\theta + \eta \odot (g^{es,\gamma}_\theta - g^{re,\gamma}_\theta)$,  (8)

where η is a vector of the same dimension as θ and ⊙ denotes the element-wise product. The scaling parameter η controls the relative importance of the two terms in (8). As discussed below, we can adapt the discount factor γ and the scaling parameter η over time to minimize the variance.
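For concreteness, a minimal NumPy sketch of one iteration of the estimator in Equation (8) is given below. The `rollout` function, assumed to return the undiscounted return, the discounted return, and a policy-gradient estimate computed from the same trajectory, is a hypothetical placeholder.

```python
import numpy as np

def cv_gradient(theta, rollout, sigma=0.02, gamma=0.99, eta=None, n_samples=5, rng=None):
    """One iteration of the structured control-variate estimator, cf. Eq. (8).

    `rollout(theta_prime, gamma)` is a placeholder returning (J_1, J_gamma, g_pg):
    undiscounted return, discounted return, and a policy-gradient estimate at
    theta_prime, all computed from the same trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.shape[0]
    eta = np.zeros(d) if eta is None else eta          # eta = 0 recovers vanilla es
    g_es_1 = np.zeros(d)
    g_es_gamma = np.zeros(d)
    g_re_gamma = np.zeros(d)
    for _ in range(n_samples):
        eps = rng.standard_normal(d)
        J_1, J_gamma, g_pg = rollout(theta + sigma * eps, gamma)
        g_es_1 += J_1 * eps / sigma                    # undiscounted ES term
        g_es_gamma += J_gamma * eps / sigma            # discounted ES term
        g_re_gamma += g_pg                             # reparameterized (PG) term
    g_es_1, g_es_gamma, g_re_gamma = (g / n_samples for g in (g_es_1, g_es_gamma, g_re_gamma))
    return g_es_1 + eta * (g_es_gamma - g_re_gamma), (g_es_1, g_es_gamma, g_re_gamma)
```

The returned components can be reused to adapt γ and η as described next.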

Discount factor γ. As in Choromanski et al. (2018), for a vector g ∈ R^n we define its variance as the sum of its component variances, $\mathbb{V}[g] := \sum_{i=1}^{n} \mathbb{V}[g_i]$. We then adapt the discount factor via γ ← γ − α_γ ∇_γ V[g^{cv}_θ] for some learning rate α_γ > 0. Since E[g^{cv}_θ] does not depend on γ, we equivalently have ∇_γ V[g^{cv}_θ] = ∇_γ E[(g^{cv}_θ)²]. The gradient ∇_γ E[(g^{cv}_θ)²] can itself be estimated using backpropagation on mini-batches, but this tends to be unstable because each term in (5) involves γ^t. Alternatively, we build a more robust estimator of ∇_γ E[(g^{cv}_θ)²] using es: we sample ε_i ∼ N(0, 1) and let v_i be the evaluation of E[(g^{cv}_θ)²] under γ + σ_γ ε_i for some σ_γ > 0. The gradient estimator for γ is $g_\gamma = \frac{1}{N}\sum_{i=1}^{N} v_i \frac{\varepsilon_i}{\sigma_\gamma}$.

Although the full estimator (8) is defined for all discount factors γ ∈ (0, 1], we generally find it better to set γ < 1 to stabilize the pg components of the control variate.

Coefficient η. Since η is a vector with the same dimensionality as θ, we can update each component of η to reduce the variance of each component of g^{cv}_θ. Begin by computing ∇_η V[g^{cv}_θ] as follows:

$\nabla_\eta \mathbb{V}[g^{cv}_\theta] = 2\eta \odot \mathbb{E}\big[(g^{es,\gamma}_\theta - g^{re,\gamma}_\theta)^2\big] + 2\,\mathbb{E}\big[(g^{es,\gamma}_\theta - g^{re,\gamma}_\theta) \odot g^{es,1}_\theta\big]$.  (9)

Then, estimate this gradient using mc sampling. Finally, adapt η by running online gradient descent: η ← η − α_η ∇_η V[g^{cv}_θ] with some α_η > 0.
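A minimal sketch of this online update, where the expectations in Equation (9) are replaced by a one-sample Monte Carlo estimate computed from the current batch of gradient estimates:

```python
import numpy as np

def update_eta(eta, g_es_1, g_es_gamma, g_re_gamma, lr=1e-4):
    """Online update of the per-coordinate scaling eta, cf. Eq. (9)."""
    diff = g_es_gamma - g_re_gamma
    # One-sample MC estimate of grad_eta V[g_cv]
    grad_var = 2.0 * eta * diff ** 2 + 2.0 * diff * g_es_1
    return eta - lr * grad_var
```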

Practical considerations. Certain practical techniques can be applied to stabilize the es optimization procedure. For example, Salimans et al. (2017) apply a centered rank transformation to the estimated returns J_γ(π_{θ'_i}) when computing the gradient estimator in Equation (4). This transformation is compatible with our proposal: the construction becomes $g^{cv}_\theta = g_\theta + \eta \odot (g^{es,\gamma}_\theta - g^{re,\gamma}_\theta)$, where $g_\theta$ is computed through the rank transformation.
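One common variant of this centered rank transformation maps the batch of returns to evenly spaced values in [−0.5, 0.5] according to their ranks; the sketch below implements that variant (assuming a batch of at least two returns) and is illustrative rather than the exact transformation used in the paper's experiments.

```python
import numpy as np

def centered_ranks(returns):
    """Map a batch of returns to evenly spaced values in [-0.5, 0.5] by rank."""
    returns = np.asarray(returns, dtype=float)
    ranks = np.empty(len(returns), dtype=float)
    ranks[returns.argsort()] = np.arange(len(returns))   # rank 0 = smallest return
    return ranks / (len(returns) - 1) - 0.5
```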

Stochastic policies. While many prior works (Salimans et al., 2017; Mania et al., 2018; Choromanski et al., 2018; Rowland et al., 2018) focus on deterministic policies for continuous action spaces, our method targets stochastic policies, as required by the pg computation. Estimating pg for a deterministic policy requires training critic functions and is in general biased (Silver et al., 2014). We leave the investigation of deterministic policies to future work.

3.3 Relationship to REBAR

REBAR (Tucker et al., 2017) considers variance reduction of gradient estimators for probabilistic models with discrete latent variables. The discrete latent variable model has a relaxed version, in which the discrete sampling procedure is replaced by a differentiable function with reparameterized noise. This relaxed model usually has a temperature parameter τ such that as τ → 0 the relaxed model converges to the original discrete model. To optimize the discrete model, the baseline approach is to use the score function gradient estimator, which is unbiased but has high variance. Alternatively, one could use the reparameterized gradient through the relaxed model, which has lower variance but is biased for finite τ > 0 (Jang et al., 2017; Maddison et al., 2017). The bias and variance of the gradient through the relaxed model are controlled by τ. REBAR proposes to use the difference between the score function gradient and the reparameterized gradient of the relaxed model as a control variate. This difference has expectation zero and should be highly correlated with the reinforced gradient of the original discrete model, leading to potentially large variance reduction.

A similar connection arises in the es-for-rl context. We can interpret the undiscounted objective, namely F_{σ,γ}(θ) with γ = 1, as the original model whose gradient we seek to estimate. When γ < 1, we have the relaxed model, whose gradient becomes biased but has lower variance (with respect to the undiscounted objective). As in REBAR, our proposal is to construct the score function gradient (es estimator) $g^{es,\gamma}_\theta$ and the reparameterized gradient (pg estimator) $g^{re,\gamma}_\theta$ of the discounted objective with γ < 1 (the relaxed model), such that their difference $g^{es,\gamma}_\theta - g^{re,\gamma}_\theta$ serves as a control variate. The variance reduction argument from REBAR applies here: $g^{es,\gamma}_\theta - g^{re,\gamma}_\theta$ should be highly correlated with $g^{es,1}_\theta$, which leads to effective variance reduction.

3.4 How much variance reduction is possible?

How does the variance reduction provided by the control variate compare to that of general-purpose methods such as orthogonalization (Choromanski et al., 2018)? In this section, we build on a simple example to illustrate the different variance reduction properties of these approaches. Recall that we define the variance of a vector g ∈ R^d as the sum of the variances of its components, $\mathbb{V}[g] = \sum_{i=1}^{d} \mathbb{V}[g_i]$, following notation from prior literature (Choromanski et al., 2018; Rowland et al., 2018).

Consider a one-step mdp where the agent takes only one action a and receives a reward r(a) = α^T a for some α ∈ R^d. We choose the reward function to be linear in a, as a local approximation to a potentially nonlinear reward landscape. Let the policy be a Gaussian with mean µ and diagonal covariance matrix Σ = σ_2^2 I with fixed σ_2. Here the policy parameter contains only the mean, θ = µ. The rl objective is J(π_µ) = E_{a ∼ N(µ,Σ)}[r(a)]. To compute the gradient, es smooths the objective with the Gaussian N(µ, σ_1^2 I) for a fixed σ_1. While vanilla es generates i.i.d. perturbations ε_i, orthogonal es couples the perturbations such that ε'_i{}^T ε'_j = 0 for i ≠ j.

Denote by d the dimensionality of the parameter space µ ∈ R^d and let ρ = σ_2/σ_1. In the estimators below, J(·) denotes a one-sample noisy estimate of the corresponding objective. Recall that the es gradient takes the form $g^{es}_\mu = \frac{1}{N}\sum_{i=1}^{N} J(\pi_{\mu + \sigma_1 \varepsilon_i})\,\varepsilon_i/\sigma_1$ (Equation (4)). The orthogonal es gradient takes the same form but with orthogonal perturbations, $g^{ort}_\mu = \frac{1}{N}\sum_{i=1}^{N} J(\pi_{\mu + \sigma_1 \varepsilon'_i})\,\varepsilon'_i/\sigma_1$. Finally, es with the control variate produces an estimator of the form (8). We are now ready to state the following theoretical result.

Theorem 1. In the one-step mdp described above, the ratio of the variance of orthogonal es to the variance of vanilla es, and the corresponding ratio for control variate es, satisfy respectively:

$\frac{\mathbb{V}[g^{ort}_\mu]}{\mathbb{V}[g^{es}_\mu]} = 1 - \frac{N-1}{(1+\rho^2)d + 1}, \qquad \frac{\mathbb{V}[g^{cv}_\mu]}{\mathbb{V}[g^{es}_\mu]} \le 1 - \frac{\rho^2\,[d(1+\rho^2) - 4]}{[(1+\rho^2)d + 1](1+\rho^2)}.$

As a result, there exists a threshold ρ_0 such that when ρ ≥ ρ_0, we always have V[g^{cv}_µ] ≤ V[g^{ort}_µ] (see the Appendix for details). Importantly, when d is large enough, we have ρ_0 → √(N/d).

Some implications of the above theorem: (1) For orthogonal es, the variance reduction depends explicitly on the sample size N. In cases where N is small, the variance gain over vanilla es is not significant. On the other hand, V[g^{cv}_µ] depends implicitly on N, because in practice η* is approximated via gradient descent and a larger sample size N leads to more stable updates. (2) The threshold ρ_0 is useful in practice. In high-dimensional applications where sample efficiency is important, we have large d and small N. This implies that over a large range of the ratio ρ = σ_2/σ_1 ≥ ρ_0, we can expect to achieve more variance reduction than orthogonal es. (3) The above derivation is based on a simplification of the general multi-step mdp problem. The practical performance of the control variate is also influenced by how well η* is estimated. Nevertheless, we expect this example to provide a guideline as to how the variance reduction of es with a control variate depends on ρ and N, in contrast to orthogonal es. (4) The theoretical guarantee for the variance reduction of orthogonal es (Choromanski et al., 2018) relies on the assumption that J(π_{θ'}) can be simulated without noise¹, which does not hold in practice. In fact, in rl the noise of the reward estimate depends heavily on the policy π_{θ'} (intuitively, the more random the policy, the more noise in the estimate). es with a control variate depends less on such assumptions and instead relies on finding the proper scaling η using gradient descent. We will see in the experiments that this latter approach reliably improves upon the es baseline.

4 Experiments

In the experiments, we aim to address the following questions: (1) Does the control variate improve downstream training through variance reduction? (2) How does it compare with other recent variance reduction techniques for es?

To address these questions, we compare our control variate to general-purpose variance reduction techniques on a wide range of high-dimensional rl tasks with continuous action spaces.

Antithetic sampling: See Salimans et al. (2017). Let ε_i, 1 ≤ i ≤ N, be the set of perturbation directions. Antithetic sampling perturbs the policy parameters with a set of antithetic pairs (ε_i, ε'_i), 1 ≤ i ≤ N, where ε'_i = −ε_i.

Orthogonal directions (ORTHO): See Choromanski et al. (2018). The perturbations ε_i, 1 ≤ i ≤ N, applied to the policy parameters are generated such that they have the same marginal Gaussian distributions but are orthogonal to each other: ε_i^T ε_j = 0 for i ≠ j.

Geometrically coupled MC sampling (GCMC): See Rowland et al. (2018). For each antithetic pair (ε_i, ε'_i), GCMC couples their lengths such that F_R(‖ε_i‖) + F_R(‖ε'_i‖) = 1, where F_R is the CDF of the norm of a standard Gaussian with the same dimension as ε_i.

Quasi Monte-Carlo (QMC): See Choromanski et al. (2018); Rowland et al. (2018). QMC first generates a low-discrepancy Halton sequence {r_i}_{i=1}^{N} in [0, 1]^d with N elements, where d is the dimension of the parameter θ. The inverse CDF F_g^{-1} of a standard univariate Gaussian is then applied elementwise to the sequence, ε_i = F_g^{-1}(r_i), to generate the perturbation vectors.

¹The variance reduction proof can be extended to cases where J(π_{θ'}) has the same level of independent noise for all θ'.
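The sketch below illustrates how the antithetic, orthogonal, and QMC perturbations described above can be generated with NumPy/SciPy (SciPy ≥ 1.7 is assumed for the Halton generator). It follows common constructions for these coupling schemes and is not the exact implementation used in the cited works; in particular, rescaling orthonormalized directions by the norms of independent Gaussian vectors is one standard way to preserve Gaussian-like marginals.

```python
import numpy as np
from scipy.stats import norm, qmc

def antithetic(n, d, rng):
    """n i.i.d. Gaussian directions together with their negations (2n total)."""
    eps = rng.standard_normal((n, d))
    return np.concatenate([eps, -eps], axis=0)

def orthogonal(n, d, rng):
    """n <= d exactly orthogonal directions with approximately Gaussian marginals."""
    assert n <= d
    q, _ = np.linalg.qr(rng.standard_normal((d, n)))        # d x n orthonormal columns
    lengths = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return (q * lengths).T                                   # n x d, rows orthogonal

def quasi_mc(n, d, seed=0):
    """Halton sequence mapped through the inverse Gaussian CDF (requires scipy >= 1.7)."""
    u = qmc.Halton(d=d, seed=seed).random(n)                 # low-discrepancy points in [0, 1]^d
    return norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))              # elementwise inverse CDF
```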

We find that our rl-specific control variate outperforms these general-purpose variance reduction techniques. In addition to the above baseline comparison, we also compare CV with more advanced policy gradient algorithms such as TRPO and PPO (Schulman et al., 2015b, 2017), as well as a deterministic policy + es baseline (Mania et al., 2018).

Implementation details. Since antithetic sampling is the most commonly applied variance reduction method, we combine it with the control variate, ORTHO, GCMC, and QMC. The policy π_θ is parameterized as a neural network with 2 hidden layers of 32 units each and relu activations. The output is a vector µ_θ(s) ∈ R^K used as the Gaussian mean, with a separately parameterized diagonal vector σ_0 ∈ R^K independent of the state. The action is sampled from the Gaussian a ∼ N(µ_θ(s), diag(σ_0^2)). The backpropagation pipeline is implemented with Chainer (Tokui et al., 2015). The learning rate is α = 0.01 with the Adam optimizer (Kingma and Ba, 2014), and the perturbation standard deviation is σ = 0.02. At each iteration we use N = 5 distinct perturbations ε_i (2N samples in total due to antithetic sampling). For the control variate (8), the discount factor is initialized to γ = 0.99 and updated with es; we give the details in the Appendix. The control variate scaling factor η is updated with a learning rate selected from α_η ∈ {10^{-3}, 10^{-4}, 10^{-5}}. As is common practice in prior works (Salimans et al., 2017; Mania et al., 2018; Choromanski et al., 2018), returns are normalized before being used to compute the gradients, which makes the updates less sensitive to the reward scale. We adopt this technique and discuss the details in the Appendix. Importantly, we do not normalize the observations (as explained in Mania et al. (2018)) to avoid additional biasing of the gradients.
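As an illustration of the policy architecture described above, here is a minimal NumPy sketch of a Gaussian MLP policy with two 32-unit relu hidden layers and a state-independent diagonal standard deviation. The initialization scheme and the log-parameterization of the standard deviation are illustrative choices, not details taken from the paper.

```python
import numpy as np

def init_policy(obs_dim, act_dim, hidden=32, rng=None):
    """Parameters of a 2-layer (32-unit, relu) Gaussian policy."""
    rng = np.random.default_rng() if rng is None else rng
    def layer(n_in, n_out):
        return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out)
    return {
        "l1": layer(obs_dim, hidden),
        "l2": layer(hidden, hidden),
        "mu": layer(hidden, act_dim),
        "log_sigma": np.zeros(act_dim),    # state-independent diagonal log-std
    }

def sample_action(params, obs, rng):
    """a ~ N(mu_theta(s), diag(sigma^2)) with sigma independent of the state."""
    h = np.maximum(obs @ params["l1"][0] + params["l1"][1], 0.0)   # relu
    h = np.maximum(h @ params["l2"][0] + params["l2"][1], 0.0)
    mu = h @ params["mu"][0] + params["mu"][1]
    sigma = np.exp(params["log_sigma"])
    return mu + sigma * rng.standard_normal(mu.shape)
```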

Benchmark tasks and baselines. To evaluate how variance reduction impacts downstream policy optimization, we train neural network policies over a wide range of high-dimensional continuous control tasks taken from OpenAI gym (Brockman et al., 2016), Roboschool (Schulman et al., 2017), and the DeepMind Control Suite (Tassa et al., 2018). We also include an LQR task suggested in Mania et al. (2018) to test the stability of the gradient update for long horizons (T = 2000). Details of the tasks are in the Appendix. The policies are trained with five variance reduction settings: the vanilla es baseline; es with orthogonalization (ORTHO); es with GCMC (GCMC); es with quasi-mc (QMC); and finally our proposed es with control variate (CV).

Results. In each subplot of Figure 2, we present the learning curves for each variance reduction technique, showing average performance over 5 random seeds; the shaded areas indicate standard deviations.

We make several observations regarding each variance reduction technique: (1) Though ORTHO and GCMC significantly improve the learning progress over the baseline es under certain settings (e.g., ORTHO in Swimmer and GCMC), their improvement is not very consistent. In certain cases, adding such techniques even makes the performance worse than the baseline es. We speculate that this is because the variance reduction conditions required by these methods are not satisfied, e.g., the assumption of noiseless estimates of the returns. Overall, ORTHO is more stable than GCMC and QMC. (2) QMC performs poorly on most tasks. We note that similar results have been reported in Rowland et al. (2018), where a navigation policy trained with QMC does not learn at all. We speculate that this is because the variance reduction achieved by QMC is not worth the bias it introduces in rl contexts. (3) No variance reduction technique performs uniformly best across all tasks. However, CV performs the most consistently and achieves stable gains over the vanilla es baseline, while the other methods can underperform the vanilla es baseline.

To clarify the comparison of final performance, we report the final performance (mean ± std) of all methods in Table 1. The best results for each task are highlighted in bold. For a fair presentation of the results, where multiple methods achieve statistically similar performance, we highlight all such methods. CV consistently achieves top results across all reported tasks.

Comparison with Policy Gradients. A natural question is what happens when we update the policy based on the pg estimator alone. We show the complete comparison in Table 1, where we find pure pg to be mostly outperformed by the es baselines. We speculate that this is because vanilla pg estimators are themselves quite unstable, as commonly observed in prior works on pg, which aim to alleviate the instability by introducing bias in exchange for smaller variance (Schulman et al., 2015b; Mnih et al., 2016; Schulman et al., 2017). We provide a detailed review in the Appendix. This comparison also implies that a careful control variate scheme can extract the benefit of pg estimators for variance reduction in es, rather than relying on pg entirely.

Figure 2: Training performance on continuous control benchmarks: (a) LQR, (b) Swimmer, (c) HalfCheetah, (d) Walker, (e) RoboschoolPong, (f) RoboschoolCheetah, (g) BipedalWalker, (h) Swingup (DM), (i) TwoPoles (DM), (j) CheetahRun (DM), (k) HopperStand (DM), (l) AntEscape (DM). Tasks marked with DM are from the DeepMind Control Suite. We compare five alternatives: baseline es (orange), CV (blue, ours), ORTHO (maroon), GCMC (green), and QMC (yellow). Each task is trained for 2·10^7 time steps (LQR for 4·10^6 steps) and the training curves show the mean ± std of cumulative rewards across 5 random seeds.

To assess the impact of our method, we also compare with trust-region based on-policy pg algorithms: TRPO (Schulman et al., 2015b) and PPO (Schulman et al., 2017). Instead of a batched implementation as in the original papers, we adopt fully online pg updates to align with the way the es baselines are implemented. We list detailed hyperparameters and implementations in Appendix C. Due to space limits, the results for selected tasks are presented in Table 2 and we leave a more comprehensive set of results to the Appendix.

We see that the trust-region variants lead to generally more stable performance than vanilla pg. Nevertheless, the performance of these algorithms does not match that of the es baselines. There are several enhancements one can make to improve these pg baselines, all generally relying on biasing the gradients in exchange for smaller variance, e.g., normalizing the observations and considering a biased objective (Dhariwal et al., 2017). While we consider only the 'unbiased' pg methods here, we discuss these enhancements in the Appendix.

Comparison with es baselines. As mentioned in Section 3, the CV is applicable only when the policy is stochastic. However, prior works consider mainly deterministic policies for continuous control (Mania et al., 2018; Choromanski et al., 2018). To assess the effect of the policy choice, we run the same es baseline pipeline with deterministic policies. The results of deterministic policies are reported in Table 2. We make several observations: (1) Deterministic policies generally perform better than stochastic policies when using the es baselines. We speculate this is because the additional noise introduced by a stochastic policy is not worth the additional variance in the gradient estimation. Nevertheless, CV can leverage the variance reduction enabled by the stochastic policy and maintains generally the best performance across tasks. This comparison illustrates that it is beneficial


Table 1: Final performance on benchmark tasks. The policy is trained for a fixed number of steps on each task. Results are mean ± std across 5 random seeds. The best results are highlighted in bold; multiple methods are highlighted if their results cannot be separated (mean ± std overlap). CV (ours) achieves consistent gains over the baseline and the other variance reduction methods. We also include a pg baseline.

Tasks | Vanilla ES | Orthogonal | GCMC | QMC | CV (Ours) | PG
LQR | −176 ± 12 | −1337 ± 573 | −1246 ± 502 | −5634 ± 1059 | −143 ± 4 | −7243 ± 275
Swimmer | 141 ± 20 | 171 ± 47 | 94 ± 19 | 16 ± 2 | 237 ± 33 | −132 ± 5
HalfCheetah | 1339 ± 178 | 1185 ± 76 | 1375 ± 58 | −3466 ± 338 | 1897 ± 232 | −180 ± 4
Walker | 1155 ± 34 | 1087 ± 1 | 360 ± 4 | 6 ± 0 | 1476 ± 112 | 282 ± 25
Pong(R) | −5.0 ± 0.8 | −5.5 ± 0.3 | −10.6 ± 0.4 | −15.6 ± 0.3 | −3.0 ± 0.3 | −17 ± 0.2
HalfCheetah(R) | 595 ± 42 | 685 ± 34 | 68 ± 8 | 11 ± 2 | 709 ± 16 | 12 ± 0
BipedalWalker | 25 ± 9 | 107 ± 31 | −19 ± 5 | −70 ± 3 | 105 ± 40 | −82 ± 12
Cheetah(DM) | 281 ± 15 | 217 ± 15 | 129 ± 4 | 18 ± 5 | 296 ± 15 | 25 ± 6
Pendulum(DM) | 20 ± 3 | 54 ± 17 | 25 ± 8 | 11 ± 2 | 43 ± 1 | 3 ± 1
TwoPoles(DM) | 159 ± 13 | 158 ± 2 | 196 ± 12 | 62 ± 12 | 245 ± 29 | 14 ± 1
Swingup(DM) | 394 ± 15 | 369 ± 22 | 414 ± 31 | 67 ± 14 | 406 ± 26 | 55 ± 10
Balance(DM) | 692 ± 57 | 771 ± 41 | 995 ± 1 | 223 ± 32 | 847 ± 71 | 401 ± 12
HopperHop(DM) | 5.7 ± 2.1 | 6.8 ± 0.7 | 0.3 ± 0.1 | 0.0 ± 0.0 | 6.5 ± 1.5 | 0.1 ± 0.0
Stand(DM) | 21 ± 5 | 36 ± 10 | 54 ± 4 | 1.0 ± 0.2 | 60 ± 11 | 0.5 ± 0.1
AntWalk(DM) | 200 ± 19 | 234 ± 10 | 82 ± 11 | 133 ± 9 | 239 ± 10 | 100 ± 11
AntEscape(DM) | 47 ± 3 | 52 ± 3 | 8 ± 2 | 10 ± 1 | 51 ± 2 | 6 ± 1

Table 2: Final performance on benchmark tasks. The setup is the same as in Table 1. CV (ours) achieves consistent gains over deterministic policy + es as well as over more advanced baselines such as TRPO and PPO. We also include a pg baseline for easy comparison. 'Det' denotes deterministic policies.

Tasks | PPO | TRPO | Det | CV (Ours) | PG
Swimmer | 11 ± 7 | 11 ± 7 | 191 ± 100 | 237 ± 33 | −132 ± 5
HalfCheetah | −175 ± 20 | −174 ± 16 | 1645 ± 1298 | 1897 ± 232 | −180 ± 4
Walker | 657 ± 291 | 657 ± 292 | 1588 ± 744 | 1476 ± 112 | 282 ± 25
Pong(R) | −17.1 ± 0.4 | −15.0 ± 2.6 | −10.4 ± 5.4 | −3.0 ± 0.3 | −17 ± 0.2
HalfCheetah(R) | 14 ± 2 | 13 ± 3 | 502 ± 199 | 709 ± 16 | 12 ± 0
BipedalWalker | −66 ± 39 | −66 ± 38 | 1 ± 2 | 105 ± 40 | −82 ± 12
Pendulum(DM) | 0.6 ± 0.5 | 7.5 ± 4.8 | 40 ± 16 | 43 ± 1 | 3 ± 1
Balance(DM) | 264 ± 46 | 515 ± 42 | 692 ± 250 | 847 ± 71 | 401 ± 12
HopperHop(DM) | 0.2 ± 0.2 | 0.0 ± 0.0 | 4.6 ± 6.2 | 6.5 ± 1.5 | 0.1 ± 0.0
AntWalk(DM) | 96 ± 36 | 180 ± 41 | 192 ± 20 | 239 ± 10 | 100 ± 11

to combine stochastic policies with es methods as longas there is proper variance reduction.

Practical Scalability. In practical applications, the scalability of algorithms over large computational architectures is critical. Since CV blends ideas from es and pg, we require both distributed sample collection (Salimans et al., 2017; Mania et al., 2018) and distributed gradient computation (Dhariwal et al., 2017). Since both components can be implemented efficiently on large architectures, we expect CV to scale properly. A more detailed discussion is in the Appendix.

5 Conclusion

We constructed a control variate for es that takes advantage of the mdp structure of rl problems to improve on state-of-the-art variance reduction methods for es algorithms. Training algorithms using our control variate outperform those applying general-purpose mc methods for variance reduction. We provided theoretical insight into the effectiveness of our algorithm as well as an exhaustive comparison of its performance with other methods on a set of 16 rl benchmark tasks. In principle, this control variate can be combined with other variance reduction techniques, which may lead to further performance gains.

We leave for future work the study of how similar structured control variates can be applied to improve the performance of state-of-the-art pg algorithms, in particular in cases where the gradients have already been deliberately biased to achieve better performance, as in Dhariwal et al. (2017).


References

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

Choromanski, K., Rowland, M., Sindhwani, V., Turner, R., and Weller, A. (2018). Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 970–978. PMLR.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2017). OpenAI Baselines. https://github.com/openai/baselines.

Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. (2018). Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations.

Gu, S., Levine, S., Sutskever, I., and Mnih, A. (2016). MuProp: Unbiased backpropagation for stochastic neural networks. In 4th International Conference on Learning Representations (ICLR).

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations (ICLR).

Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Leordeanu, M. and Hebert, M. (2008). Smoothing-based optimization. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE.

Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., and Liu, Q. (2017). Action-dependent control variates for policy optimization via Stein's identity. arXiv preprint arXiv:1710.11198.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.

Mania, H., Guy, A., and Recht, B. (2018). Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 1800–1809. Curran Associates, Inc.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In ICML.

Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.

Ross, S. (2002). Simulation.

Rowland, M., Choromanski, K. M., Chalus, F., Pacchiano, A., Sarlos, T., Turner, R. E., and Weller, A. (2018). Geometrically coupled Monte Carlo sampling. In Advances in Neural Information Processing Systems, pages 195–206.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2015a). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015b). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR).

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.

Tokui, S., Oono, K., Hido, S., and Clayton, J. (2015). Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS), volume 5, pages 1–6.

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z., and Levine, S. (2018). The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5015–5024. PMLR.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636.

Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. (2018). Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations.


A How Much Variance Reduction is Possible?

Recall that in the main paper we consider a one-step MDP with action a ∈ R^d. The reward function is α^T a for some α ∈ R^d. Consider a Gaussian policy with mean parameter µ and fixed covariance matrix Σ = σ_2^2 I. The action is sampled as a ∼ N(µ, Σ). ES convolves the reward objective with a Gaussian with covariance matrix σ_1^2 I. Letting ε_1^{(i)}, ε_2^{(i)} ∼ N(0, I), 1 ≤ i ≤ N, be independent reparameterized noise, we can write the vanilla ES estimator as

$g^{es}_\mu = \frac{1}{N}\sum_{i=1}^{N} g^{es}_{\mu,i} = \frac{1}{N}\sum_{i=1}^{N} \alpha^T\big(\mu + \sigma_1\varepsilon_1^{(i)} + \sigma_2\varepsilon_2^{(i)}\big)\,\frac{\varepsilon_2^{(i)}}{\sigma_2}.$

The orthogonal estimator is constructed from N perturbations ε_{2,ort}^{(i)} such that ⟨ε_{2,ort}^{(i)}, ε_{2,ort}^{(j)}⟩ = 0 for i ≠ j, with each ε_{2,ort}^{(i)} still marginally a d-dimensional Gaussian. The orthogonal estimator is

$g^{ort}_\mu = \frac{1}{N}\sum_{i=1}^{N} g^{ort}_{\mu,i} = \frac{1}{N}\sum_{i=1}^{N} \alpha^T\big(\mu + \sigma_1\varepsilon_1^{(i)} + \sigma_2\varepsilon_{2,ort}^{(i)}\big)\,\frac{\varepsilon_{2,ort}^{(i)}}{\sigma_2}.$

Finally, we consider the ES gradient estimator with control variate. In particular, we have the reparameterized gradient as

$g^{re}_\mu = \frac{1}{N}\sum_{i=1}^{N} g^{re}_{\mu,i} = \frac{1}{N}\sum_{i=1}^{N} \alpha^T\big(\mu + \sigma_1\varepsilon_1^{(i)} + \sigma_2\varepsilon_2^{(i)}\big)\,\frac{\varepsilon_1^{(i)}}{\sigma_1}.$

The general gradient estimator with control variate is

$g^{cv}_\mu = g^{es}_\mu + \eta \odot (g^{re}_\mu - g^{es}_\mu),$

where η ∈ R^d. Since η can be chosen independently across dimensions, the maximal variance reduction is achieved by setting $\eta_i = -\frac{\mathrm{cov}(X_i, Y_i)}{\mathbb{V}[Y_i]}$, where X = g^{es}_µ and Y = g^{re}_µ − g^{es}_µ.

Recall that for a vector g of dimension d, its variance is defined as the sum of the variances of its components, $\mathbb{V}[g] = \sum_{i=1}^{d}\mathbb{V}[g_i]$. For simplicity, let ρ = σ_2/σ_1. We derive the variance of each estimator below.

Vanilla ES. For the vanilla ES gradient estimator, the variance is $\mathbb{V}[g^{es}_\mu] = \frac{d+1}{N}\|\alpha\|_2^2$.

Orthogonal ES. For the orthogonal ES gradient estimator, the variance is $\mathbb{V}[g^{ort}_\mu] = \frac{(1+\rho^2)d + 2 - N}{N}\|\alpha\|_2^2$.

ES with Control Variate. For the ES gradient estimator with control variate, recall the notation X = g^{es}_µ, Y = g^{re}_µ − g^{es}_µ. We first compute $\rho(X_p, Y_p)^2 = \frac{\mathrm{cov}^2(X_p, Y_p)}{\mathbb{V}[X_p]\,\mathbb{V}[Y_p]}$ for each component p. Let X_{p,i}, Y_{p,i} be the p-th components of g^{es}_{µ,i} and g^{re}_{µ,i} − g^{es}_{µ,i}, respectively. We detail how to compute cov(X_p, Y_p) and V[Y_p] in the next section. With these components in hand, we have the final variance upper bound

$\mathbb{V}[g^{cv}_\mu] \le \mathbb{V}[g^{es}_\mu]\left\{1 - \frac{(1+\rho^2)\,[d(1+\rho^2) - 4]}{[(1+\rho^2)d + 1]\,\big(2 + \rho^2 + \frac{1}{\rho^2}\big)}\right\}.$

B Derivation Details

Recall that for a vector g of dimension d, we define its variance as $\mathbb{V}[g] = \sum_{i=1}^{d}\mathbb{V}[g_i]$. For simplicity, recall that ρ = σ_2/σ_1.


B.1 Variance of Orthogonal ES

We derive the variance of orthogonal ES based on the formula in the appendix of Choromanski et al. (2018). In particular, for the i-th sample estimate of the p-th component, $X_{i,p} = [g^{ort}_{\mu,i}]_p$, we can compute

$\mathbb{E}[X_{i,p}^2] = (1+\rho^2)\|\alpha\|_2^2 + \alpha_p^2.$

Hence the variance can be calculated as

$\mathbb{V}[g^{ort}_\mu] = \mathbb{V}[X] = \frac{(1+\rho^2)d + 2 - N}{N}\|\alpha\|_2^2.$

B.2 Variance of Vanilla ES

When we account for the cross-product terms as in Choromanski et al. (2018), we can derive

$\mathbb{V}[g^{es}_\mu] = \mathbb{V}[X] = \frac{(1+\rho^2)d + 1}{N}\|\alpha\|_2^2.$

We can also derive the per-component variance $\mathbb{V}[X_p] = \frac{1}{N}\big((1+\rho^2)\|\alpha\|_2^2 + \alpha_p^2\big)$.

B.3 Variance of ES with Control Variate

Recall the definitions X_p = X^T e_p and Y_p = Y^T e_p, where e_p is a one-hot vector with [e_p]_i = δ_{ip}. For simplicity, we fix p and denote x_i = X_{p,i}, y_i = X_{p,i} − Y_{p,i}.

Step 1: Calculate cov(X_p, Y_p). The notation produces the covariance

$\mathrm{cov}(X_p, Y_p) = \mathrm{cov}\Big(\frac{1}{N}\sum_{i=1}^{N} x_i,\; \frac{1}{N}\sum_{i=1}^{N} (x_i - y_i)\Big) = \frac{1}{N^2}\,\mathbb{E}\Big[\sum_{i,j} x_i x_j - x_i y_j\Big]. \qquad (10)$

We identify some necessary components. For i ≠ j,

$\mathbb{E}[x_i^2] = \mathbb{E}\Big[\Big(\alpha^T(\sigma_1\varepsilon_1 + \sigma_2\varepsilon_2)\,\frac{\varepsilon_{1,p}}{\sigma_1}\Big)^2\Big] = \mathbb{E}\big[(\alpha^T\varepsilon_1)^2\varepsilon_{1,p}^2 + (\alpha^T\varepsilon_2)^2\rho^2\big] = (1+\rho^2)\|\alpha\|_2^2 + 2\alpha_p^2,$

$\mathbb{E}[x_i x_j] = \mathbb{E}[x_i y_j] = \alpha_p^2,$

$\mathbb{E}[x_i y_i] = \mathbb{E}\Big[\big(\alpha^T(\sigma_1\varepsilon_1 + \sigma_2\varepsilon_2)\big)^2\,\frac{\varepsilon_{1,p}\varepsilon_{2,p}}{\sigma_1\sigma_2}\Big] = \mathbb{E}\big[2\,\alpha^T\varepsilon_1\,\alpha^T\varepsilon_2\,\varepsilon_{1,p}\varepsilon_{2,p}\big] = 2\alpha_p^2. \qquad (11)$

We can hence derive

$\mathrm{cov}(X_p, Y_p) = \frac{1}{N^2}\Big[\sum_{i=1}^{N}\mathbb{E}[x_i^2 - x_i y_i] + \sum_{i \ne j}\mathbb{E}[x_i x_j - x_i y_j]\Big] = \frac{1}{N}\big[(1+\rho^2)\|\alpha\|_2^2 - 2\alpha_p^2\big].$

Step 2: Calculate V[Y_p]. We only need to derive $\mathbb{E}[Y_{p,i}^2] = \mathbb{E}\Big[\Big(\alpha^T(\sigma_1\varepsilon_1 + \sigma_2\varepsilon_2)\Big(\frac{\varepsilon_{1,p}}{\sigma_1} - \frac{\varepsilon_{2,p}}{\sigma_2}\Big)\Big)^2\Big]$. After expanding all the terms, we can calculate

$\mathbb{E}\Big[\Big(\alpha^T(\sigma_1\varepsilon_1 + \sigma_2\varepsilon_2)\Big(\frac{\varepsilon_{1,p}}{\sigma_1} - \frac{\varepsilon_{2,p}}{\sigma_2}\Big)\Big)^2\Big] = \Big(2 + \rho^2 + \frac{1}{\rho^2}\Big)\|\alpha\|_2^2.$


Step 3: Combine all components. We finally combine all the previous elements into the main result on variance reduction. Assuming that the scaling factor η of the control variate is optimally set, the maximum variance reduction leads to the following variance of component p. Using the above notation,

$\mathbb{V}\big[[g^{cv}_\mu]_p\big] = \mathbb{V}[X_p] - \frac{\mathrm{cov}^2(X_p, Y_p)}{\mathbb{V}[Y_p]} = \frac{(1+\rho^2)\|\alpha\|_2^2 + \alpha_p^2}{N} - \frac{1}{N}\,\frac{\big[(1+\rho^2)\|\alpha\|_2^2 - 2\alpha_p^2\big]^2}{\big(2 + \rho^2 + \frac{1}{\rho^2}\big)\|\alpha\|_2^2}.$

We can lower bound the subtracted term on the right-hand side and sum over the d dimensions,

$\mathbb{V}[g^{cv}_\mu] = \sum_{p=1}^{d}\mathbb{V}\big[[g^{cv}_\mu]_p\big] \le \mathbb{V}[X] - \frac{d}{N}\,\frac{(1+\rho^2)^2}{2 + \rho^2 + \frac{1}{\rho^2}}\,\|\alpha\|_2^2 + \frac{4}{N}\,\frac{1+\rho^2}{2 + \rho^2 + \frac{1}{\rho^2}}\,\|\alpha\|_2^2.$

Finally, we plug in V[X] and calculate the variance ratio with respect to vanilla ES:

$\frac{\mathbb{V}[g^{cv}_\mu]}{\mathbb{V}[g^{es}_\mu]} \le 1 - \frac{\rho^2\,[d(1+\rho^2) - 4]}{[(1+\rho^2)d + 1](1+\rho^2)}.$

As a comparison, we can calculate the variance ratio of the orthogonal ES:

$\frac{\mathbb{V}[g^{ort}_\mu]}{\mathbb{V}[g^{es}_\mu]} = \frac{(1+\rho^2)d + 2 - N}{(1+\rho^2)d + 1}.$

When does the control variate achieve lower variance? We set the inequality V[g^{ort}_µ] ≥ V[g^{cv}_µ] and calculate the following condition:

$\rho \ge \rho_0 = \sqrt{\frac{N + 3 - d + \sqrt{(d - N - 3)^2 + 4(N-1)d}}{2d}}. \qquad (12)$

The expression (12) looks formidable. To simplify it, consider the limit N → ∞ while maintaining N/d ∈ [0, 1]. Taking this limit allows us to drop certain constant terms on the right-hand side, which produces $\rho_0 = \sqrt{N/d}$.

C Additional Experiment Details

C.1 Updating Discount Factor γ

The discount factor γ is updated using es methods. Specifically, at each iteration t we maintain a current γ_t. The aim is to update γ_t such that the empirical estimate of V[g^{cv}_θ] is minimized. For each value of γ we can calculate a g^{cv}_θ(γ); here we explicitly note its dependency on γ. For this iteration, we set up a blackbox function $F(\gamma) = \sum_{i=1}^{d}\big[g^{cv}_\theta(\gamma)\big]_i^2$, where d is the dimension of the parameter θ. The es update for γ is then γ_{t+1} = γ_t − α_γ g_γ, where

$g_\gamma = \frac{1}{N_\gamma}\sum_{i=1}^{N_\gamma} F(\gamma_t + \sigma_\gamma \varepsilon_i)\,\frac{\varepsilon_i}{\sigma_\gamma}, \qquad \varepsilon_i \sim \mathcal{N}(0, 1). \qquad (13)$

As mentioned in the paper, to ensure γ ∈ (0, 1] we parameterize γ = 1 − exp(φ) in our implementation. We optimize φ using the same es scheme, with hyperparameters α_φ, σ_φ, N_φ in place of α_γ, σ_γ, N_γ. In practice we use α_φ ∈ {10^{-4}, 10^{-5}, 10^{-6}}, σ_φ = 0.02, and N_φ = 10. Sometimes we find it effective to simply keep γ fixed at its initial value, because the control variate scaling η is also adjusted to minimize the variance.
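A minimal sketch of this es update on φ is given below. The helper `variance_proxy(gamma)` is a hypothetical placeholder that evaluates F(γ) on the current mini-batch, and φ is assumed to be negative so that γ = 1 − exp(φ) stays in (0, 1); this is an illustrative implementation of Equation (13), not the code used in the experiments.

```python
import numpy as np

def es_update_phi(phi, variance_proxy, lr=1e-5, sigma_phi=0.02, n=10, rng=None):
    """ES update of phi (gamma = 1 - exp(phi)) to reduce V[g_cv], cf. Eq. (13).

    `variance_proxy(gamma)` is a placeholder returning the empirical
    F(gamma) = sum_i [g_cv(gamma)]_i^2 on the current batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = 0.0
    for _ in range(n):
        eps = rng.standard_normal()
        gamma = 1.0 - np.exp(phi + sigma_phi * eps)   # perturb phi, map to gamma
        grad += variance_proxy(gamma) * eps / sigma_phi
    phi = phi - lr * grad / n
    return phi, 1.0 - np.exp(phi)
```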


C.2 Normalizing the gradients

Vanilla stochastic gradient updates are sensitive to the scaling of the objective function, which in our case is the reward function. Recall the vanilla es estimator $g^{es}_\theta = \frac{1}{N}\sum_{i=1}^{N} J(\pi_{\theta + \sigma\varepsilon_i})\,\frac{\varepsilon_i}{\sigma}$, where J(π_θ) is the return estimate under policy π_θ. To ensure that the gradient is properly normalized, the common practice (Salimans et al., 2017; Mania et al., 2018; Choromanski et al., 2018) is to normalize the returns as $J \leftarrow \frac{J - \bar{J}}{\sigma(J)}$, where $\bar{J}$ and σ(J) are the mean and standard deviation of the estimated returns over the batch of N samples. The normalized returns are used in place of the original returns when calculating the gradient estimators. In fact, we can interpret the normalization as subtracting a moving-average baseline (for variance reduction) and dividing by a rough estimate of the local Lipschitz constant of the objective landscape. These techniques tend to make the algorithms more stable under reward functions with very different scales.

The same technique can be applied to the control variate. We divide the original control variate by σ(J) toproperly scale the estimators.
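A minimal sketch of the return normalization; the small constant added to the denominator is an illustrative safeguard against zero standard deviation and is not mentioned in the paper.

```python
import numpy as np

def normalize_returns(returns, eps=1e-8):
    """Normalize a batch of rollout returns: J <- (J - mean(J)) / std(J)."""
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)
```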

Details on policy gradient methods. pg methods are implemented following the general procedure: collect data using the previous policy π_{θ_{t−1}}, construct a loss function, and update the policy parameters with a gradient descent step to arrive at π_{θ_t}; then iterate. We consider three baselines: vanilla pg, TRPO, and PPO. In our implementation, the three algorithms differ in the construction of the loss function. For vanilla pg, the loss function is $L = -\mathbb{E}_{s,a}\big[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\,A(s, a)\big]$, where A(s, a) is the advantage estimate and θ_old is the prior policy iterate. For PPO, the loss function is $L = -\mathbb{E}_{s,a}\big[\mathrm{clip}\big\{\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)},\, 1-\epsilon,\, 1+\epsilon\big\}\,A(s, a)\big]$, where clip{x, a, b} clips x between a and b and ε = 0.2. In practice, we go one step further and implement the action-dependent clipping strategy as in Schulman et al. (2017). For TRPO, we implement the dual optimization objective of the original formulation (Schulman et al., 2015b) and set the loss function to $L = -\mathbb{E}_{s,a}\big[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\,A(s, a)\big] + \eta\,\mathbb{E}_s\big[\mathrm{KL}[\pi_\theta(\cdot|s)\,\|\,\pi_{\theta_{old}}(\cdot|s)]\big]$, where we select η ∈ {0.1, 1.0}.
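To make the three surrogate objectives above explicit, the following sketch writes them as NumPy functions of precomputed probability ratios π_θ/π_{θ_old}, advantage estimates, and per-state KL values. This is purely illustrative: an actual implementation would express these losses in an autodiff framework, and the standard PPO objective additionally takes a minimum with the unclipped term, whereas the text above describes only the clipped ratio.

```python
import numpy as np

def pg_loss(ratio, adv):
    """Vanilla pg surrogate: L = -E[(pi_theta / pi_old) * A]."""
    return -np.mean(ratio * adv)

def ppo_loss(ratio, adv, clip_eps=0.2):
    """Clipped surrogate as described above.

    Note: the full PPO objective of Schulman et al. (2017) also takes
    min(ratio * adv, clipped * adv) rather than the clipped term alone.
    """
    return -np.mean(np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)

def trpo_dual_loss(ratio, adv, kl, penalty=0.1):
    """Penalized (dual) TRPO objective: -E[ratio * A] + penalty * E[KL]."""
    return -np.mean(ratio * adv) + penalty * np.mean(kl)
```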

When collecting data, we collect 2N full episodic trajectories, as with the es baselines. This differs from a typical implementation (Dhariwal et al., 2017), where pg algorithms collect a fixed number of samples per iteration. Also, we take only one gradient descent step on the surrogate objective per iteration, as opposed to multiple updates. This reduces the policy optimization procedure to a fully online scheme, matching the es methods.

D Additional Experiments

D.1 Additional Baseline Comparison

Due to space limits, we omitted the baseline comparison results on four control tasks from the main paper. We show their results in Figure 3. The setup is exactly the same as in the main paper.

Figure 3: Training performance on the remaining continuous control benchmarks: (a) Balance (DM), (b) Pendulum (DM), (c) HopperHop (DM), (d) AntWalk (DM). Tasks marked with DM are from the DeepMind Control Suite. We compare five alternatives: baseline es (orange), CV (blue, ours), ORTHO (maroon), GCMC (green), and QMC (yellow). Each task is trained for 2·10^7 time steps and the training curves show the mean ± std of cumulative rewards across 5 random seeds.


D.2 Discussion on Policy Gradient Methods

Instability of pg. It has been observed that vanilla pg estimators are not stable. Even when a discount factor γ < 1 is introduced to reduce the variance, vanilla pg estimators can still have large variance due to the long horizon. As a result, in practice the original form of the pg (5) is rarely used. Instead, prior works and practical implementations tend to introduce bias into the estimators in exchange for lower variance: e.g., averaging across states instead of trajectories (Dhariwal et al., 2017), clipping-based trust regions (Schulman et al., 2017), and biased advantage estimation (Schulman et al., 2016). These techniques stabilize the estimator and lead to state-of-the-art performance; however, their theoretical properties are less clear (due to the bias). We leave the study of combining such biased estimators with es as future work.

D.3 Additional Results with TRPO/PPO and Deterministic Policies

We provide additional comparison results against TRPO/PPO and deterministic policies in Table 3. We see that TRPO/PPO generally achieve better performance than vanilla pg, yet are still outperformed by CV. On the other hand, though deterministic policies tend to outperform stochastic policies when combined with the es baselines, they are not as good as stochastic policies with CV.

Comparison of implementation variations of pg algorithms. Note that for a fair comparison, we removed certain functionalities of a full-fledged pg implementation as in Dhariwal et al. (2017). We identify important differences between our baseline and the full-fledged implementation and evaluate their effects on performance in Table 4. Our analysis focuses on PPO. The important implementation differences are:

• S: Normalization of observations.

• A: Normalization of advantages.

• M: Multiple gradient updates (in particular, 10 gradient updates), instead of 1 for our baseline.

To understand the results in Table 4, we first explain the naming of the PPO implementation variants. PPO denotes our baseline for comparison; PPO+S denotes our baseline with observation normalization. Similarly, we have PPO+A and PPO+M. Notations such as PPO+S+A denote the composition of these implementation techniques. We evaluate these implementation variations on a subset of tasks and compare their performance in Table 4.

We make several observations: (1) Multiple updates bring the most significant improvement for pg. While conventional es implementations only consider one gradient update per iteration (Salimans et al., 2017), it is likely that they could also benefit from this modification. (2) Even when considering these improvements, our proposed CV method still outperforms the pg baselines on a number of tasks.

E Discussion on Scalability

es easily scales to large distributed architectures for trajectory collection (Salimans et al., 2017; Mania et al., 2018). Indeed, the bottleneck of most es applications seems to be the sample collection, which can be conveniently distributed across multiple cores (machines). Fortunately, many open-source implementations provide high-quality packages with efficient inter-process communication and various techniques to reduce overhead (Salimans et al., 2017; Mania et al., 2018).

On the other hand, distributing pg requires more complicated software design. Indeed, distributed pg involves both distributed sample collection and distributed gradient computation. The challenge lies in how to construct gradients across multiple processes and combine them into a single update without introducing costly overhead. Open-source implementations such as Dhariwal et al. (2017) achieve synchronized distributed gradient computation via MPI.

To scale CV, we need to combine both the distributed sample collection of es and the distributed gradient computation of pg. Since both components have been implemented with high quality in open-source packages, we do not expect scaling CV to be a major issue, in particular combining scalable gradient computation with es.


Table 3: Final performance on benchmark tasks. The policy is trained for a fixed number of steps on each task. Results are mean ± std across 5 random seeds. The best results are highlighted in bold; multiple methods are highlighted if their results cannot be separated (mean ± std overlap). CV (ours) achieves consistent gains over the deterministic policy + es baseline ('Det') as well as over TRPO and PPO. We also include a pg baseline.

Tasks | PPO | TRPO | Det | CV (Ours) | PG
Swimmer | 11 ± 7 | 11 ± 7 | 191 ± 100 | 237 ± 33 | −132 ± 5
HalfCheetah | −175 ± 20 | −174 ± 16 | 1645 ± 1298 | 1897 ± 232 | −180 ± 4
Walker | 657 ± 291 | 657 ± 292 | 1588 ± 744 | 1476 ± 112 | 282 ± 25
Pong(R) | −17.1 ± 0.4 | −15.0 ± 2.6 | −10.4 ± 5.4 | −3.0 ± 0.3 | −17 ± 0.2
HalfCheetah(R) | 14 ± 2 | 13 ± 3 | 502 ± 199 | 709 ± 16 | 12 ± 0
BipedalWalker | −66 ± 39 | −66 ± 38 | 1 ± 2 | 105 ± 40 | −82 ± 12
Cheetah(DM) | 20 ± 20 | 36 ± 32 | 241 ± 47 | 296 ± 15 | 25 ± 6
Pendulum(DM) | 0.6 ± 0.5 | 7.5 ± 4.8 | 40 ± 16 | 43 ± 1 | 3 ± 1
TwoPoles(DM) | 264 ± 46 | 42 ± 52 | 190 ± 58 | 245 ± 29 | 14 ± 1
Balance(DM) | 264 ± 46 | 515 ± 42 | 692 ± 250 | 847 ± 71 | 401 ± 12
HopperHop(DM) | 0.2 ± 0.2 | 0.0 ± 0.0 | 4.6 ± 6.2 | 6.5 ± 1.5 | 0.1 ± 0.0
AntWalk(DM) | 96 ± 36 | 180 ± 41 | 192 ± 20 | 239 ± 10 | 100 ± 11
AntEscape(DM) | 6 ± 3 | 10 ± 1 | 13 ± 8 | 51 ± 2 | 6 ± 1

Table 4: Evaluation of implementation variations of pg methods, with comparison to CV. We show the final performance on benchmark tasks; the policy is trained for a fixed number of steps on each task and the results show the mean performance averaged across 3 random seeds. The best results are highlighted in bold. Note that, unlike the other tables, this table is transposed for space: tasks are columns and methods are rows.

Tasks | Swimmer | HalfCheetah | Pendulum | Balance | Hopper | AntWalk
PPO | 11 | −175 | 0.6 | 264 | 0.2 | 96
PPO+S | 30 | 126 | 2 | 308 | 0.5 | 80
PPO+A | 33 | 67 | 26 | 278 | 0.4 | 117
PPO+S+A | 31 | 192 | 1 | 386 | 0.6 | 106
PPO+S+M | 110 | 1579 | 36 | 652 | 0.3 | 180
PPO+A+M | 51 | 2414 | 223 | 771 | 46 | 173
PPO+S+A+M | 106 | 1528 | 17 | 750 | 46.4 | 144
CV (Ours) | 237 | 1897 | 43 | 847 | 6.5 | 239

However, this requires additional effort, as these two parts have never been organically combined before and there has so far been little (open-source) engineering effort in this domain. We leave this as future work.