
Implicit Generative Modeling for Efficient Exploration

Neale Ratzlaff¹  Qinxun Bai²  Li Fuxin¹  Wei Xu²

Abstract

Efficient exploration remains a challenging problem in reinforcement learning, especially for those tasks where rewards from environments are sparse. In this work, we introduce an exploration approach based on a novel implicit generative modeling algorithm to estimate a Bayesian uncertainty of the agent's belief of the environment dynamics. Each random draw from our generative model is a neural network that instantiates the dynamic function, hence multiple draws would approximate the posterior, and the variance in the predictions based on this posterior is used as an intrinsic reward for exploration. We design a training algorithm for our generative model based on amortized Stein Variational Gradient Descent. In experiments, we demonstrate the effectiveness of this exploration algorithm in both pure exploration tasks and a downstream task, comparing with state-of-the-art intrinsic reward-based exploration approaches, including two recent approaches based on an ensemble of dynamic models. In challenging exploration tasks, our implicit generative model consistently outperforms competing approaches regarding data efficiency in exploration.

1. Introduction

Deep Reinforcement Learning (RL) has enjoyed recent success in a variety of applications, including super-human performance in Atari games (Mnih et al., 2013), robotic control (Lillicrap et al., 2015), image-based control tasks (Hafner et al., 2019), and playing the game of Go (Silver et al., 2016). Despite these achievements, many recent deep RL techniques still suffer from poor sample efficiency. Agents are often trained for millions, or even billions, of simulation steps before achieving a reasonable performance (Burda et al., 2018a).

¹Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA. ²Horizon Robotics, Cupertino, CA, USA. Correspondence to: Neale Ratzlaff <[email protected]>.

This lack of statistical efficiency makes it difficult to apply deep RL to real-world tasks, as the cost of acting in the real world is far greater than in a simulator. It is therefore a problem of utmost importance to design agents that make efficient use of collected data. In this work, we focus on efficient exploration, which is widely considered to be one of the three key aspects in building a data-efficient agent (Sutton & Barto, 2018).

In particular, we focus on challenging environments with sparse external rewards. In such environments, it is important for an effective agent to methodically explore a significant portion of the state space, since there may not be enough signal to indicate where the reward might be. Previous work usually utilizes some form of intrinsic reward driven by the uncertainty in the agent's belief of the environment state (Osband et al., 2018). Intuitively, agents should explore more in states where they are uncertain whether acting could lead to a previously unknown consequence – which could be an unexpected extrinsic reward. However, uncertainty modeling with a deep network has proven to be difficult, with no approach proven to be universally applicable (Snoek et al., 2019).

In this work, we introduce a new framework of Bayesian uncertainty modeling for intrinsic reward-based exploration in deep RL. Our framework characterizes the uncertainty in the agent's belief of the environment dynamics in a non-parametric manner to enable flexibility and expressiveness. The main component of our framework is a network generator, each draw of which is a neural network that serves as the dynamic function for RL. Multiple draws then approximate a posterior of the dynamic model, and the variance in future-state prediction based on this posterior is used as an intrinsic reward for exploration. In doing so, we avoid restrictive distributional assumptions on the likelihood or posterior, and explore a significantly larger model space than previous approaches. It has recently been shown (Ratzlaff & Fuxin, 2019) that such generators can be trained in classification problems, and that the resulting draws of networks represent a rich distribution of diverse networks that perform approximately equally well on the classification task.

For our goal of training this generator for the dynamic function, we propose a new algorithm to optimize the KL divergence between the implicit distribution (represented by draws from the generator) and the true posterior of the dynamic model (given the agent's experience) via amortized Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016; Feng et al., 2017). Amortized SVGD allows direct minimization of the KL divergence between the implicit posterior and the true posterior without any parametric assumptions or Evidence Lower Bound (ELBO) approximations, and further projects the functional update to a finite-dimensional parameter update.

Compared with recent work (Pathak et al., 2019; Shyam et al., 2019) that maintains an ensemble of dynamic models and uses the divergence or disagreement among them as an intrinsic reward for exploration, our implicit modeling of the posterior has two major advantages. First, it is a more flexible framework for approximating the model posterior than ensemble-based approximation: after one training episode, it can provide an unlimited number of draws, whereas for an ensemble each draw would require separate training. Second, amortized SVGD (Feng et al., 2017) allows direct nonparametric minimization of the KL divergence, in contrast with existing ensemble-based methods that rely on random initialization and/or bootstrapped experience sampling, which does not necessarily approximate the posterior.

In experiments, we compare our approach with several state-of-the-art intrinsic reward-based exploration approaches, including two recent approaches that also leverage the uncertainty in dynamic models. Experiments show that our implementation consistently outperforms competing methods regarding data efficiency in exploration.

In summary, our contributions are:

• We propose a generative framework for implicitly approximating the posterior of network parameters, where the uncertainty of the network function can be used as an intrinsic reward for efficient exploration in deep RL.

• We design an amortized SVGD-based training algorithm for the proposed framework and apply it to approximate the implicit distribution of the dynamic model of the environment.

• We test our implementation on three challenging exploration tasks and compare with three state-of-the-art intrinsic reward-based methods, two of which are also based on uncertainty in dynamic models. The consistent superior performance of our method demonstrates the effectiveness of the proposed framework in estimating the Bayesian uncertainty in the dynamic model for efficient exploration. We also apply our method on a dense-reward task to further demonstrate its potential for improving downstream tasks.

2. Problem Setup and Background

Consider a Markov Decision Process (MDP) represented as (S, A, P, r, ρ0), where S is the state space and A is the action space. P : S × A × S → [0, 1] is the unknown dynamics model, specifying the probability of transitioning to the next state s′ from the current state s by taking the action a, as P(s′|s, a). r : S × A → R is the reward function, and ρ0 : S → [0, 1] is the distribution of initial states. A policy is a function π : S × A → [0, 1], which outputs a distribution over the action space for a given state s.

2.1. Exploration in Reinforcement Learning

In online decision-making problems, such as multi-armed bandits and reinforcement learning, a fundamental dilemma in an agent's choice is exploitation versus exploration. Exploitation refers to making the best decision given current information, while exploration refers to gathering more information about the environment. In the standard reinforcement learning setting, where the agent receives an external reward for each transition step, common recipes for the exploration/exploitation trade-off include naive methods such as ε-greedy (Sutton & Barto, 2018) and optimistic initialization (Lai & Robbins, 1985), and posterior-guided methods such as upper confidence bounds (Auer, 2002; Dani et al., 2008) and Thompson sampling (Thompson, 1933). In the situation we focus on, where external rewards are sparse or disregarded, the above trade-off narrows down to the pure exploration problem of efficiently accumulating information about the environment. The common approach is to explore in a task-agnostic way under some "intrinsic" reward; an exploration policy can then be trained with standard RL. Existing methods construct intrinsic rewards from the visitation frequency of states (Bellemare et al., 2016), the prediction error of the dynamic model as "curiosity" (Pathak et al., 2017), the diversity of visited states (Eysenbach et al., 2018), etc.

2.2. Dynamic Model Uncertainty as Intrinsic Reward

Following the guiding principle of modeling Bayesian uncertainty in online decision making, two recent methods (Pathak et al., 2019; Shyam et al., 2019) train an ensemble of dynamic models and use the variation/information gain as an intrinsic reward for exploration. In this work, we follow the similar idea of exploiting the uncertainty in the dynamic model, but emphasize implicit posterior modeling, in contrast with directly training an ensemble of dynamic models.

Let f : S × A → S denote a model of the environment dynamics (usually represented by a neural network) that we want to learn based on the agent's experience D. We design a generator module G which takes a random draw from the normal distribution and outputs a sample vector of parameters θ that determines f (denoted as fθ).

Figure 1. Architecture of the layer-wise generator of the dynamic model. Independent noise samples {z1, · · · , zN} are drawn from a standard Gaussian with diagonal covariance, and input to layer-wise generators {G1, · · · , GN}. Each generator Gj outputs parameters θj for the corresponding j-th layer of the neural network representing the dynamic model.

If samples from G represent the posterior distribution p(fθ|D), then given (st, at), the uncertainty in the output of the dynamics model can be computed by the following variance among a set of samples {θi}, i = 1, …, m, from G, and used as an intrinsic reward r^in for learning an exploration policy,

$$r^{in}_t = \frac{1}{m}\sum_{i=1}^{m}\left\| f_{\theta_i}(s_t,a_t) - \frac{1}{m}\sum_{\ell=1}^{m} f_{\theta_\ell}(s_t,a_t) \right\|^2. \qquad (1)$$

In learning the exploration policy, this intrinsic reward can be computed with either actual rollouts in the environment or simulated rollouts generated by the estimated dynamic model.
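
As a concrete illustration, the intrinsic reward of Eq. (1) can be computed from a set of sampled dynamic models as in the sketch below. The callable interface f(state, action) and the use of NumPy are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def intrinsic_reward(sampled_models, state, action):
    """Eq. (1): average squared deviation of each sampled model's
    next-state prediction from the mean prediction of the sample.

    sampled_models: list of m callables f(state, action) -> next-state array,
    e.g. dynamic networks drawn from the generator G (interface assumed).
    """
    preds = np.stack([f(state, action) for f in sampled_models])  # (m, state_dim)
    mean_pred = preds.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum((preds - mean_pred) ** 2, axis=-1)))
```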

3. Posterior Approximation via Amortized SVGD

In this section, we introduce the core component of our exploration agent, the dynamic model generator G. In the following subsections, we first introduce the design of this generator and then describe its training algorithm in detail. A summary of our algorithm is given in the last subsection.

3.1. Implicit Posterior Generator

As shown in Fig. 1, the dynamic model is defined as an N-layer neural network function fθ(s, a), with input (state, action) pair (s, a) and model parameters θ = (θ1, · · · , θN), where θj represents the network parameters of the j-th layer. The generator module G consists of exactly N layer-wise generators, {G1, · · · , GN}, where each Gj takes a random noise vector zj ∈ R^d and outputs the corresponding parameter vector θj = Gj(zj; ηj), where ηj are the parameters of Gj. Note that the zj are generated independently from a d-dimensional standard normal distribution, rather than jointly.

As mentioned in Section 1, this framework has advantages in flexibility and efficiency compared with ensemble-based methods (Shyam et al., 2019; Pathak et al., 2019), since it maintains only the parameters of the N generators, i.e., η = (η1, · · · , ηN), and enables drawing an arbitrary number of sample networks to approximate the posterior of the dynamic model.
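
A minimal PyTorch sketch of the layer-wise generator is given below. The hidden sizes inside the generators and the fully connected form of the dynamic model are illustrative assumptions; only the overall structure (one generator per layer, an independent Gaussian noise vector per layer) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerGenerator(nn.Module):
    """G_j: maps a noise vector z_j in R^d to the parameters theta_j
    (weight and bias) of the j-th layer of the dynamic model."""
    def __init__(self, noise_dim, in_dim, out_dim, hidden=64):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim * out_dim + out_dim),  # flattened weight + bias
        )

    def forward(self, z):
        theta = self.net(z)
        w = theta[: self.in_dim * self.out_dim].reshape(self.out_dim, self.in_dim)
        b = theta[self.in_dim * self.out_dim:]
        return w, b

def sample_dynamic_model(generators, noise_dim):
    """Draws one dynamic model f_theta by sampling an independent z_j for each
    layer generator, and returns it as a closure f(s, a) -> predicted next state."""
    params = [g(torch.randn(noise_dim)) for g in generators]

    def f(s, a):
        x = torch.cat([s, a], dim=-1)
        for j, (w, b) in enumerate(params):
            x = F.linear(x, w, b)
            if j < len(params) - 1:
                x = torch.relu(x)
        return x
    return f
```

With the dynamic-model widths used in the supplementary material (4 hidden layers, 512 units), one would build one LayerGenerator per layer and call sample_dynamic_model once per desired ensemble member.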

3.2. Training with Amortized SVGD

We now introduce the training algorithm for the generator module G. Denote by p(f|D) the true posterior of the dynamic model given the agent's experience D, and by q(fθ) the implicit distribution of fθ captured by G. We want q(fθ) to be as close as possible to p(f|D); such closeness is commonly measured by the KL divergence DKL[q(fθ)‖p(f|D)]. The traditional approach for finding a q that minimizes DKL[q(fθ)‖p(f|D)] is variational inference, where an ELBO is maximized (Blei et al., 2017). Recently, a nonparametric variational inference framework, Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016), was proposed, which represents q with a set of particles rather than making any parametric assumptions, and approximates functional gradient descent w.r.t. DKL[q(fθ)‖p(f|D)] by iteratively evolving the particles. We apply SVGD to our sampled network functions, and follow the idea of amortized SVGD (Feng et al., 2017) to project the functional gradients to the parameter space of η by back-propagation through the generators.

Given a set of dynamic functions {fθi}, i = 1, …, m, sampled from G, SVGD updates each function by

$$f_{\theta_i} \leftarrow f_{\theta_i} + \epsilon\,\phi^*(f_{\theta_i}), \quad i = 1, \cdots, m,$$

where ε is a step size, and φ* is the function in the unit ball of a reproducing kernel Hilbert space (RKHS) H that maximally decreases the KL divergence between the distribution q represented by {fθi} and the target posterior p,

$$\phi^* = \max_{\phi \in \mathcal{H}} \left\{ -\frac{d}{d\epsilon} D_{KL}(q\,\|\,p), \;\; \text{s.t.} \; \|\phi\|_{\mathcal{H}} \le 1 \right\}.$$

This optimization problem has a closed-form solution,

$$\phi^*(f_{\theta_i}) = \mathbb{E}_{f \sim q}\left[ \nabla_\theta \log p(f)\, k(f, f_{\theta_i}) + \nabla_\theta k(f, f_{\theta_i}) \right], \qquad (2)$$

where k(·, ·) is the positive definite kernel associated with the RKHS. The log-likelihood term for fθ corresponds to the negation of the regression loss of future-state prediction over all transitions in D, i.e., $\log p(f_\theta) = -\sum_{(s,a,s') \in \mathcal{D}} \mathcal{L}(f_\theta(s,a), s')$. Given that each θi is generated by G(z; η), the update rule for η can be obtained by the chain rule,

$$\eta \leftarrow \eta + \epsilon \sum_{i=1}^{m} \nabla_\eta G(z_i; \eta)\, \phi^*(\theta_i), \qquad (3)$$

where φ*(θi) can be computed from (2) using the empirical expectation over the sampled batch {θi}, i = 1, …, m,

$$\phi^*(\theta_i) = \frac{1}{m} \sum_{\ell=1}^{m} \left\{ -\left[ \sum_{(s,a,s') \in \mathcal{D}} \nabla_{\theta_\ell} \mathcal{L}\big(f_{\theta_\ell}(s,a), s'\big) \right] \cdot k\big(f_{\theta_\ell}(s,a), f_{\theta_i}(s,a)\big) + \nabla_{\theta_\ell} k\big(f_{\theta_\ell}(s,a), f_{\theta_i}(s,a)\big) \right\}, \qquad (4)$$

where k(·, ·) is the Gaussian kernel evaluated at the function outputs, i.e., in the state space.
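
For concreteness, the sketch below performs one amortized SVGD update of the generator parameters η using autograd, following Eqs. (2)–(4). The interfaces are assumptions made for illustration: generator(z) maps a batch of noise vectors to flattened dynamic-model parameters (rather than the layer-wise outputs of Fig. 1), f_apply(theta, s, a) evaluates the dynamic model functionally, and the kernel bandwidth uses the common median heuristic. The kernel is evaluated on the models' next-state predictions, as stated above; this is a sketch, not the authors' implementation.

```python
import math
import torch

def rbf_kernel(X, Y):
    """Gaussian kernel matrix between rows of X and Y, with a median-heuristic
    bandwidth (a common SVGD choice; the exact bandwidth rule is an assumption)."""
    sq = torch.cdist(X, Y) ** 2
    h = sq.detach().median() / math.log(X.shape[0] + 1)
    return torch.exp(-sq / (h + 1e-8))

def amortized_svgd_step(generator, optimizer, f_apply, noise_dim, m, s, a, s_next):
    """One update of the generator parameters eta via Eqs. (2)-(4)."""
    z = torch.randn(m, noise_dim)
    theta = generator(z)                             # (m, P), differentiable in eta
    particles = theta.detach().requires_grad_(True)  # SVGD particles

    preds = torch.stack([f_apply(particles[i], s, a) for i in range(m)])  # (m, B, S)
    feats = preds.reshape(m, -1)                     # kernel on function outputs

    # Driving term of Eq. (4): gradient of log-likelihood = negative regression loss.
    loglik = -((preds - s_next.unsqueeze(0)) ** 2).sum(dim=(1, 2))        # (m,)
    grad_loglik = torch.autograd.grad(loglik.sum(), particles, retain_graph=True)[0]

    K = rbf_kernel(feats, feats.detach())            # K[l, i] = k(f_l, f_i)
    phi = torch.zeros_like(particles)                # phi[i] = phi*(theta_i)
    for i in range(m):
        # Repulsive term: sum_l grad_{theta_l} k(f_l, f_i), computed via autograd.
        rep = torch.autograd.grad(K[:, i].sum(), particles, retain_graph=True)[0]
        phi[i] = (K[:, i].detach().unsqueeze(1) * grad_loglik + rep).sum(0) / m

    # Eq. (3): chain rule into eta -- ascend sum_i <theta_i, phi*(theta_i)>;
    # the optimizer's learning rate plays the role of the step size epsilon.
    optimizer.zero_grad()
    (-(theta * phi.detach()).sum()).backward()
    optimizer.step()
```

Only η is stored across iterations; the particles themselves are discarded after each update, which is what allows sampling an arbitrary number of models later.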

3.3. Summary of the Exploration Algorithm

To condense what we have proposed so far, we summarize in Algorithm 1 the procedure used to train the generator of dynamic models and the exploration policies.

Our algorithm starts with a buffer D of random transitions and explores for some fixed number of episodes. For each episode, our algorithm samples a set of dynamic models fΘ = {fθi} from the generator G and updates the generator parameters η using amortized SVGD, (3) and (4). For the policy update, the intrinsic reward (1) is evaluated on the actual experience D and the simulated experience D̂ generated by the fθi. The exploration policy is then updated using a model-free RL algorithm on the collected experience Dπ and intrinsic rewards Rπ. The updated exploration policy is then used to roll out in the environment for T steps, so that new transitions are collected and added to the buffer D for subsequent iterations. We repeat the process until the episode is done.

Algorithm 1 Exploration with an Implicit Distribution

  Initialize generator G_η, parameters T, m
  Initialize policy π, experience buffer D
  while True do
    while episode not done do
      f_Θ ← G(z; η), z ∼ N(0, I_d)
      η ← evaluate (3), (4) on D
      D_π ← D ∪ D̂,  D̂ ∼ MDP(f_Θ)
      R_π ← r^in(f_θ, s, a | (s, a) ∼ D_π) by (1)
      π ← update policy on (D_π, R_π)
      D_T ← rollout π for T steps
      D ← D ∪ D_T
    end
  end
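
To make the control flow concrete, here is a compact Python rendering of one outer iteration of Algorithm 1. Every component (the buffer, the policy object with an update method, the simulate and rollout helpers) is a placeholder assumption standing in for machinery the paper implements with SAC and model-based rollouts; only the ordering of the steps mirrors the pseudocode above.

```python
def explore_one_episode(env, generator, policy, buffer,
                        sample_models, svgd_update, simulate,
                        intrinsic_reward, rollout, T, m):
    """One outer iteration of Algorithm 1, with every component passed in
    as a callable (all interfaces here are illustrative assumptions)."""
    done = False
    while not done:
        models = sample_models(generator, m)             # f_Theta ~ G(z; eta)
        svgd_update(generator, buffer)                   # eta <- Eqs. (3), (4) on D
        simulated = simulate(models, policy, buffer)     # D_hat ~ MDP(f_Theta)
        batch = list(buffer) + list(simulated)           # D_pi = D U D_hat
        rewards = [intrinsic_reward(models, s, a)        # R_pi by Eq. (1)
                   for (s, a, _s_next) in batch]
        policy.update(batch, rewards)                    # model-free RL step (SAC in the paper)
        new_transitions, done = rollout(env, policy, T)  # act in the environment for T steps
        buffer.extend(new_transitions)                   # D <- D U D_T
```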

4. Experimental Results

In this section we conduct experiments to compare our approach to the existing state-of-the-art in efficient exploration with intrinsic reward, to show the following:

• An agent with an implicit posterior over dynamic models explores more effectively and efficiently than agents using a single model or a static ensemble.

• Agents seeking external reward find better policies when initialized from powerful exploration policies. Our ablation study shows that the better the exploration policy used as initialization, the better the downstream task policy learns.

To isolate our main claim of the superior exploration efficiency of the proposed method, we first consider exploration tasks agnostic of any external reward. In this setting, the agent explores the environment irrespective of any downstream task. Then, to further investigate the potential of our exploration policies, we consider transferring the learned exploration policy to downstream task policies where a dense external reward is provided. Note that both cases are important for understanding and applying exploration policies. In sparse reward settings, such as a maze, the reward could occur at any location, without informative hints accessible at other locations. Therefore an effective agent must be able to efficiently explore the entire state space in order to consistently find rewards under different task settings. In dense reward settings, the trade-off between exploration and exploitation plays a central role in efficient policy learning. Our experiments show that even for a state-of-the-art model-free algorithm like Soft Actor-Critic (SAC) (Haarnoja et al., 2018), which already incorporates a strong exploration mechanism (the maximum entropy framework), spending some initial rollouts to learn a powerful exploration policy as an initialization of the task policy still considerably improves the learning efficiency.

4.1. Toy Task: NChain

Figure 2. NChain task.

As a sanity check, we first follow MAX (Shyam et al., 2019) and evaluate our method on a stochastic version of the toy environment NChain. As shown in Fig. 2, the chain is a finite sequence of N states. Each episode starts from state 1 and lasts for N + 9 steps. At each step, the agent can move forward to the next state in the chain or backward to the previous state. Attempting to move off the edge of the chain results in the agent staying still. Reward is only afforded to the agent at the edge states: 0.01 for reaching state 0, and 1.0 for reaching state N − 1. In addition, there is uncertainty built into the environment: each state is designated as a flip-state with probability 0.5. When acting from a flip-state, the agent's actions are reversed, i.e., moving forward will result in movement backward, and vice versa. Given the (initially) random dynamics and a sufficiently long chain, we expect an agent using an ε-greedy exploration strategy to exploit only the small reward of state 0. In contrast, agents with exploration policies that actively reduce uncertainty can efficiently discover all states in the chain. Fig. 8 shows that our agent navigates the chain in fewer than 15 episodes, while the ε-greedy agent (double DQN) does not make meaningful progress. We show a full evaluation on the chain environment, including all methods reported in Sec. 4.2, in the supplementary material.
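
Under the description above, a minimal sketch of the modified chain environment might look as follows. The gym-style reset/step interface and the choice to resample flip-states at every reset are our assumptions; the experiments use a 40-state chain.

```python
import numpy as np

class StochasticNChain:
    """Sketch of the modified NChain environment described above:
    N states, episodes of N + 9 steps, rewards only at the two ends,
    and flip-states that reverse the agent's action."""
    def __init__(self, n_states=40, seed=None):
        self.n = n_states
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = 1
        self.t = 0
        # Each state is designated a flip-state with probability 0.5
        # (resampled here at reset; this timing is an assumption).
        self.flip = self.rng.random(self.n) < 0.5
        return self.state

    def step(self, action):            # action: 1 = forward, 0 = backward
        move = 1 if action == 1 else -1
        if self.flip[self.state]:      # flip-states reverse the action
            move = -move
        # Moving off either edge leaves the agent in place.
        self.state = int(np.clip(self.state + move, 0, self.n - 1))
        reward = 0.01 if self.state == 0 else (1.0 if self.state == self.n - 1 else 0.0)
        self.t += 1
        done = self.t >= self.n + 9    # episodes last N + 9 steps
        return self.state, reward, done, {}
```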

Figure 3. Results on the 40-link chain environment. Each line is the mean of three runs, with the shaded regions corresponding to ±1 standard deviation. Both our method and MAX actively reduce uncertainty in the chain, and therefore are able to quickly explore to the end of the chain. ε-greedy DQN fails to explore more than 40% of the chain.

4.2. Pure Exploration Results

For pure exploration experiments, we consider three challenging continuous control tasks in which efficient exploration is known to be difficult. In each environment, the dynamics are nonlinear and cannot be solved with simpler (efficient) tabular approaches. As explained in the beginning of Section 4, the external reward is completely removed; the agent is motivated purely by the uncertainty in its belief of the environment.

Experimental setup

To validate the effectiveness of our method, we compare with several state-of-the-art formulations of intrinsic reward. Specifically, we conduct experiments comparing the following methods:

• (Ours) The proposed intrinsic reward, using the estimated variance of an implicit distribution of the dynamic model.

• (Random) Random exploration as a naive baseline.

• (ICM) Error between predicted next state and observed next state (Pathak et al., 2017).

• (Disagreement) Variance of predictions from an ensemble of dynamic models (Pathak et al., 2019).

• (MAX) Jensen-Renyi information gain of the dynamic function (Shyam et al., 2019).

Implementation details

Given that our goal is to compare performance across different intrinsic rewards, we fix the model architecture, training pipeline, and hyper-parameters across all methods;¹ shared hyper-parameters follow the MAX default settings. For the purpose of computing the information gain, dynamic models for MAX predict both the mean and variance of the next state, while for the other methods, dynamic models predict only the mean. Since our method trains a generator of dynamic models instead of a fixed-size ensemble, we fix the number of models we sample from the generator at m = 32, which equals the ensemble size for MAX, ICM, and Disagreement. For all experiments, we use SAC as the model-free RL algorithm to train the exploration policies.

4.2.1. ACROBOT CONTROL

Our first environment is a modified, continuous-control version of the Acrobot. As shown in Figure 4, the Acrobot environment begins with a hanging-down pendulum which consists of two links connected by an actuated joint. Normally, a discrete action a ∈ {−1, 0, 1} either applies a unit force on the joint in the left or right direction (a = ±1), or does not (a = 0). We modify the environment such that a continuous action a ∈ [−1, 1] applies a force F = |a| in the corresponding direction.

Figure 4. Performance of each method on the Acrobot environment (average of five seeds), with error bars representing ±1 standard deviation. The length of each horizontal bar indicates the number of environment steps each agent/method takes to swing the acrobot to fully horizontal in both (left and right) directions.

¹We use the codebase of MAX as a basis and implement Ours, ICM, and Disagreement intrinsic rewards under the same framework. The full Disagreement method includes a differentiable reward function, which we compare with separately in the supplementary material.


To focus on efficient exploration, we test the ability of each exploration method to sweep the entire lower hemisphere: positioning the acrobot completely horizontal in both (left and right) directions. Given that this is a relatively simple task which can be solved by random exploration, as shown in Figure 4, all four intrinsic reward methods solve it within just hundreds of steps, and our method is the most efficient one. The takeaway here is that in relatively simple environments, where there might be little room for improvement over the state of the art, our method still achieves better performance due to its flexibility and efficiency in approximating the model posterior. As we will see in subsequent experiments, this observation scales well with the increasing difficulty of the environments.

4.2.2. ANT MAZE NAVIGATION

Next, we evaluate on the Ant Maze environment. In the Ant control task, the agent provides torques to each of the 8 joints of the ant. The provided observation contains the pose of the torso as well as the angles and velocities of each joint. For the purpose of exploration, we place the Ant in a U-shaped maze, where the goal is to reach the end of the maze, discovering all the states. We show an image of the maze environment in the supplementary material. The agent's performance is measured by the percentage of the maze explored during evaluation. Figure 5(a) shows the result of each method over 5 seeds. Our agent consistently navigates to the end of the maze faster than the other competing methods. To show that our agent fully explores the maze environment, we include state visitation diagrams in the supplementary material.

While MAX (Shyam et al., 2019) also navigates the maze, the implicit uncertainty modeling scheme in our method allows our agent to better estimate state novelty, which leads to considerably faster exploration. To provide a more intuitive understanding of the effect of an intrinsic reward and how it might correlate with performance, we also plot in Figure 5(b) the intrinsic reward observed by our agent at each exploration step, compared with that observed by the MAX agent. For a fair comparison, we plot the intrinsic reward from Eq. (1) for both methods. We can see that after step 2,000, predictions from the MAX ensemble start to become increasingly similar, leading to a decline in intrinsic reward (Fig. 5(b)) as well as a slow-down in exploration speed (Fig. 5(a)). We hypothesize that this is because in a regular ensemble, all members update their gradients on the same experiences without an explicit term to match the real posterior, eventually leading all members to converge to the same model. By contrast, our intrinsic reward keeps increasing around step 2,000 and remains high as we continue to quickly explore new states in the maze, only starting to decline once we have solved the maze at approximately step 5,000.


Figure 5. Figure (a) shows the performance of each method with mean and ±1 standard deviation (shaded region) over five seeds. The x-axis is the number of steps the ant has moved; the y-axis is the percentage of the U-shaped maze that has been explored. Figure (b) shows the proposed intrinsic reward magnitude for each step in the environment, calculated for both our method and MAX.

4.2.3. ROBOTIC MANIPULATION

The final task is an exploration task in a robotic manipulation environment, HandManipulateBlock. As shown in Figure 6(a), a robotic hand is given a palm-sized block for manipulation. The agent has actuation control of the 20 joints that make up the hand, and its exploration performance is measured by the percentage of possible rotations of the cube that the agent performs. This is different from the original goal of this environment, since we want to evaluate task-agnostic exploration rather than goal-based policies. In particular, the state of the cube is represented by Cartesian coordinates along with a quaternion to represent the rotation. We transform the quaternion to Euler angles and discretize the resulting state space by 45-degree intervals. The agent is evaluated based on how many of the 512 total states are visited.
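
As an illustration of this evaluation metric, the sketch below maps a block orientation to one of the 8 × 8 × 8 = 512 discrete rotation states by converting the quaternion to Euler angles and binning at 45-degree intervals. The quaternion ordering, Euler convention, and bin indexing are assumptions on our part; the paper does not specify them.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_bin(quat_xyzw, bin_deg=45):
    """Maps a block orientation (quaternion, scalar-last order assumed) to a
    discrete rotation state index in [0, 512) by binning its Euler angles."""
    euler = Rotation.from_quat(quat_xyzw).as_euler('xyz', degrees=True)  # 3 angles
    n_bins = 360 // bin_deg                                              # 8 per angle
    bins = ((euler + 180.0) // bin_deg).astype(int) % n_bins
    return int(bins[0] * n_bins * n_bins + bins[1] * n_bins + bins[2])
```

Counting the distinct indices visited over an evaluation rollout then gives the exploration score reported in Figure 6(b).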

This task is far more challenging than the previous tasks, having a larger state space and action space. Additionally, states are more difficult to reach than in the Ant Maze environment, requiring manipulation of 20 joints instead of 8. In order to explore in this environment, an agent must also learn how to rotate the block without dropping it. Figure 6(b) shows the performance of each method over 5 seeds.


Figure 6. (a) The Robotic Hand task in motion. (b) Performance of each method with mean and ±1 standard deviation (shaded region) over five seeds. The x-axis is the number of manipulation steps; the y-axis is the number of rotation states of the block that have been explored. Our method (red) explores clearly faster than all other methods.

This environment proved very challenging for all methods; none succeeded in exploring more than half of the state space. Still, our method performs the best by a clear margin.

4.3. Policy Transfer Experiments

So far, we have demonstrated that the proposed implicit generative modeling of the posterior over dynamic models leads to more effective and efficient pure exploration policies. While the efficiency of pure exploration is important under sparse reward settings, a natural follow-up question is whether a strong pure exploration policy would also be beneficial for downstream tasks where dense rewards are available. We give an answer to this question by performing the following experiments in the widely-used HalfCheetah environment.

We first train a task-agnostic exploration policy following Algorithm 1 for 10,000 environment steps. The trained policy is then used to initialize the (downstream) task policy, followed by standard training of the task policy using external rewards. In particular, we use SAC for both the exploration policy and the task policy. In our comparison experiments, we initialize the task policy using exploration policies trained by MAX, ICM, Disagreement, and Ours, respectively; we also include standard SAC (without exploration policy initialization) as a baseline. For the training of the task policy, we follow the recommended settings for HalfCheetah given in the original SAC method. We detail the settings used, as well as additional results, in the supplementary material.
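
The transfer step itself is simple; a sketch under assumed attribute names is shown below. The paper specifies only that the downstream SAC task policy is initialized from the trained exploration policy, so the .actor/.critic names and the choice to copy the critic along with the actor are our assumptions.

```python
def init_task_policy_from_exploration(task_agent, exploration_agent):
    """Initializes the downstream SAC task agent from the trained exploration
    agent before continuing training with external rewards. Assumes both
    agents expose PyTorch modules with identical architectures."""
    task_agent.actor.load_state_dict(exploration_agent.actor.state_dict())
    task_agent.critic.load_state_dict(exploration_agent.critic.state_dict())
```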

Figure 7(a) shows the performance of all compared methods on HalfCheetah-v2. We can see that the comparatively small number of initial steps spent on pure exploration pays off when the agent switches to the downstream task.

Even though SAC is widely regarded as a strong baseline with a maximum entropy-based exploration mechanism, all intrinsic reward methods are able to improve on the baseline to some degree by introducing a pure exploration stage before standard SAC training. We also observe that the stronger the pure exploration policy is, the more it improves the training efficiency of the downstream task. Task policies initialized with our exploration policy (Ours) still perform the best by a clear margin.

We also conduct an ablation study to better understand the relation between the number of initial exploration steps and the improvement it brings to downstream task training. We compare multiple variants of Ours with different numbers of initial exploration steps: 2,000, 4,000, 6,000, 8,000, and 10,000 steps. As shown in Figure 7(b), with more than 4,000 steps of initial exploration, our approach is able to improve upon the SAC baseline in the downstream task. The longer our exploration policy is trained, the more beneficial it is to the training of the downstream task. 2,000 steps of initial exploration, on the other hand, harms the downstream task training, since it may be too short to obtain a reasonable exploration policy.

Figure 7. Policy transfer results. Figure (a) shows the performance of the SAC baseline (SAC) and four SAC variants initialized from a 10,000-step exploration policy trained with different intrinsic reward methods: ICM, Disagreement, MAX, and our proposed method (Ours), respectively. Figure (b) shows the performance of downstream SAC policy training initialized with our proposed exploration policy under different settings of exploration length. Init-Nk refers to the SAC agent initialized from our exploration policy which has been trained for N-thousand exploration steps.

5. Related Work

Efficient exploration remains a major challenge in deep reinforcement learning (Fortunato et al., 2017; Burda et al., 2018b; Eysenbach et al., 2018; Burda et al., 2018a), and there is no consensus on the correct way to explore an environment. One practical guiding principle for efficient exploration is the reduction of the agent's epistemic uncertainty of the environment (Chaloner & Verdinelli, 1995; Osband et al., 2017). Osband et al. (2016) use a bootstrap ensemble of DQNs, where the predictions of the ensemble are used as an estimate of the agent's uncertainty over the value function. Osband et al. (2018) proposed to augment the predictions of a DQN agent by adding the contribution from a prior to the value estimate. In contrast to our method, these approaches seek to estimate the uncertainty in the value function, while we focus on exploration with intrinsic reward by estimating the uncertainty of the dynamic model. Fortunato et al. (2017) add parameterized noise to the agent's weights to induce state-dependent exploration beyond ε-greedy or an entropy bonus.

Methods for constructing intrinsic rewards for exploration have become the subject of increased study. One well-known approach is to use the prediction error of an inverse dynamics model as an intrinsic reward (Pathak et al., 2017; Schmidhuber, 1991). Schmidhuber (1991) and Sun et al. (2011) proposed using the learning progress of the agent as an intrinsic reward. Count-based methods (Bellemare et al., 2016; Ostrovski et al., 2017) give a reward proportional to the visitation count of a state. Houthooft et al. (2016) formulate exploration as a variational inference problem, and use Bayesian neural networks (BNN) to maintain the agent's belief over the transition dynamics. The BNN predictions are used to estimate a form of Bayesian information gain called compression improvement. The variational approach is also explored in Mohamed & Rezende (2015); Gregor et al. (2016); Salge et al. (2014), who proposed using intrinsic rewards based on a variational lower bound on empowerment: the mutual information between an action and the induced next state. This reward is used to learn a set of discriminative low-level skills. The works most closely related to ours are two recent methods (Pathak et al., 2019; Shyam et al., 2019) that compute intrinsic rewards from an ensemble of dynamic models. Disagreement among the ensemble members in next-state predictions is computed as an intrinsic reward. Shyam et al. (2019) also use active exploration (Schmidhuber, 2003; Chua et al., 2018), in which the agent is trained in a surrogate MDP to maximize intrinsic reward before acting in the real environment. Our method follows the similar idea of exploiting the uncertainty in the dynamic model, but instead suggests implicit generative modeling of the posterior of the dynamic function, which enables a more flexible approximation of the posterior uncertainty with better sample efficiency.

There has been a wealth of research on nonparametric particle-based variational inference methods (Liu & Wang, 2016; Dai et al., 2016; Ambrogioni et al., 2018), where particles are maintained to represent the variational distribution and updated by solving an optimization problem within an RKHS. Notably, we use amortized SVGD (Feng et al., 2017) to optimize our generator for approximately sampling from the posterior of the dynamic model.

6. Conclusion

In this work, we introduced a new method for representing the agent's uncertainty about the environment dynamics. Utilizing amortized SVGD, we learned an approximate posterior over dynamics models. We use this approximate posterior to formulate an intrinsic reward based on the uncertainty estimated from samples of this distribution, enabling efficient exploration in difficult environments. Future work includes investigating the efficacy of learning an approximate posterior of the agent's value or policy model, as well as more efficient sampling techniques to reduce the computational cost inherent to many model-based algorithms. We would also investigate principled methods of combining intrinsic and external rewards, and how different exploration policies influence downstream task performance.

References

Luca Ambrogioni, Umut Guclu, Yagmur Gucluturk, and Marcel van Gerven. Wasserstein variational gradient descent: From semi-discrete optimal transport to ensemble variational inference. arXiv preprint arXiv:1811.02827, 2018.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018a.

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018b.

Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pp. 273–304, 1995.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.

Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, pp. 985–994, 2016.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein variational gradient descent. arXiv preprint arXiv:1707.06626, 2017.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565, 2019.

Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 1109–1117. Curran Associates, Inc., 2016.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125–2133, 2015.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Ian Osband, Benjamin Van Roy, Daniel Russo, and Zheng Wen. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.

Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8617–8629, 2018.

Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Remi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, pp. 2721–2730. JMLR.org, 2017.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.

Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International Conference on Machine Learning, pp. 5062–5071, 2019.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Neale Ratzlaff and Li Fuxin. HyperGAN: A generative model for diverse, performant neural networks. arXiv preprint arXiv:1901.11058, 2019.

Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment – an introduction. In Guided Self-Organization: Inception, pp. 67–114. Springer, 2014.

Juergen Schmidhuber. Exploring the predictable. In Advances in Evolutionary Computing, pp. 579–612. Springer, 2003.

Jurgen Schmidhuber. Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, pp. 1458–1463, 1991.

Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. Model-based active exploration. In International Conference on Machine Learning, pp. 5779–5788, 2019.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980, 2019.

Yi Sun, Faustino Gomez, and Jurgen Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In International Conference on Artificial General Intelligence, pp. 41–51. Springer, 2011.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.


A. Supplementary Material

A.1. Exploration Environment Implementation Details

Here we describe in more detail the various implementation choices we used for our method as well as for the baselines.

Toy Chain Environment

The chain environment is implemented based on the NChain-v0 gym environment. We alter NChain-v0 to contain 40 states instead of 10, to reduce the possibility of solving the environment with random actions. We also modify the stochastic 'slipping' state behavior by fixing the behavior of each state with respect to reversing an action. For both our method and MAX, we use ensembles of 5 deterministic neural networks with 4 layers, each 256 units wide with tanh nonlinearities. As usual, our ensembles are sampled from the generator at each timestep, while MAX uses a static ensemble. We generate each layer in the target network with generators composed of two hidden layers, 64 units each, with ReLU nonlinearities. Both models are trained by minimizing the regression loss on the observed data. We optimize using Adam with a learning rate of 10⁻⁴ and weight decay of 10⁻⁶. We use Monte Carlo Tree Search (MCTS) to find exploration policies for use in the environment. We build the tree with 25 iterations of 10 random trajectories, and UCB-1 as the selection criterion. Crucially, when building the tree, we query the dynamic models instead of the simulator, and we compute the corresponding intrinsic reward. For intrinsic rewards, MAX uses the Jensen-Shannon divergence while our method uses the variance of the predictions within the ensemble. After building the tree, we take an action in the real environment according to our selection criterion. There is a small discrepancy between the numbers reported here and in the MAX paper for the chain environment; this is due to using UCB-1 as the selection criterion instead of the Thompson sampling used in MAX. We take actions in the environment based on the child with the highest value. The tree is then discarded after one step, after which the dynamic models are fit for 10 additional epochs.

Continuous Control Environments

For each method, where applicable, we use the method-specific hyperparameters given by the authors. Because we experiment on potentially different environments, we search for a suitable learning rate which works best for each method across all tasks. The common details of each exploration method are as follows. Each method uses (or samples) an ensemble of dynamic models to approximate the environment dynamics. An ensemble consists of 32 networks with 4 hidden layers, 512 units wide, with ReLU nonlinearities, except for MAX, which uses swish.² ICM, Disagreement, and our method use ensembles of deterministic models, while MAX uses probabilistic networks which output a Gaussian distribution over next states. The approximate dynamic models (ensembles/generators) are optimized with Adam, using a minibatch size of 256, a learning rate of 10⁻⁴, and weight decay of 10⁻⁵.

For our dynamic model, each layer generator is composed of two hidden layers, 64 units wide, with ReLU nonlinearities. The output dimensionality of each generator is equal to the product of the input and output dimensionality of the corresponding layer in the dynamic model. To sample one dynamic model, each generator takes as input an independent draw z ∼ Z, where Z = N(0₃₂, I₃₂). We sample ensembles of a given size m by instead providing a batch {z_i}, i = 1, …, m, as input. To train the generator such that we can sample accurate transition models, we update according to equation (4) in the main text; we compute the regression error on the data, as well as the repulsive term using an appropriate kernel. For all experiments we use a standard Gaussian kernel $k(\theta, \theta_i) = \exp\left(-\frac{1}{h}\|\theta - \theta_i\|_2^2\right)$, where h is the median of the pairwise distances between the sampled particles {θ_i}, i = 1, …, m. Because we sample functions fθ instead of data points, the pairwise distance is computed using the likelihood of the data x under the model: log fθ(x).
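
A small sketch of this last detail is given below, assuming a Gaussian observation model so that the log-likelihood of a transition reduces to a negative squared prediction error; the exact likelihood form and feature layout are not specified in the paper.

```python
import torch

def likelihood_features(preds, targets):
    """Represents each sampled model f_theta by its per-transition data
    log-likelihood (here -squared error, a Gaussian-likelihood assumption).
    preds: (m, B, S) next-state predictions of m sampled models.
    targets: (B, S) observed next states."""
    return -((preds - targets.unsqueeze(0)) ** 2).sum(dim=-1)   # (m, B)

def pairwise_sq_dists(feats):
    """Squared pairwise distances between particles in likelihood space,
    whose median gives the kernel bandwidth h described above."""
    return torch.cdist(feats, feats) ** 2
```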

For MAX, we use the code provided by Shyam et al. (2019).³ Each member in the ensemble of dynamic models is a probabilistic neural network that predicts a Gaussian distribution (with diagonal covariance) over the next state. The exploration policy is trained with SAC, given an experience buffer of rollouts D = {s, a, s′} ∪ Rπ performed by the dynamic models, where Rπ is the intrinsic reward: the Jensen-Renyi divergence between the next-state predictions of the dynamic models. The policy trained with SAC acts in the environment to maximize the intrinsic reward, and in doing so collects additional transitions that serve as training data for the dynamic models in the subsequent training phase.

For Disagreement (Pathak et al., 2019), we implement this method under the MAX codebase, following the implementation given by the authors.⁴ The intrinsic reward is formulated as the predictive variance of the dynamic models, where the models are represented by a bootstrap ensemble. In this work, we report results using two versions of this method. The proposed intrinsic reward is formulated in a manner quite similar to our own; however, a fixed ensemble is used instead of a distribution for the approximate posterior. In Section 4 of the main text we report results of Disagreement using only its intrinsic reward, rather than the full method, which makes use of a differentiable reward function and treats the reward as a supervised learning signal. We examine these methods separately because we are testing the effects of intrinsic rewards, as well as the form of the approximate dynamic model, e.g., sampled vs. fixed ensembles; the differentiable reward function is orthogonal to this effort. Nonetheless, in the next section (A.2) we report results using the full Disagreement method on each continuous control experiment.

²Swish refers to the nonlinearity proposed by Ramachandran et al. (2017), which is expressed as a scaled sigmoid function: y = x · sigmoid(βx).

³https://github.com/nnaisense/max

⁴https://github.com/pathak22/exploration-by-disagreement

A.2. Additional Exploration Results

Here we report additional results on NChain and Ant Maze, as well as additional comparisons with Disagreement, including the original policy optimization method with a differentiable reward function (Pathak et al., 2019).

A.2.1. EXTENDED NCHAIN RESULTS

Here we show the results of each method on the NChain task. We used the same experimental setup as in Section 4.1 for our experiments. For both ICM and Disagreement we use MCTS with UCB-1 to find exploration policies, with the same hyperparameters used for our method and MAX. We find that reducing uncertainty is a crucial component of being able to explore the chain. ICM only reduces prediction error, so it is not interested in states it can predict correctly. The only source of reward for ICM is in the flipped directions of some states, so it is possible to initialize chains where ICM does not find novel states, causing high variance between runs. We found that Disagreement performs better than ICM but not as well as MAX or our method. We suspect that this is due to a weakness in ensemble diversity, lessening the effect of the intrinsic reward for this task. The ensemble can easily overfit to a toy environment like NChain, reducing its ability to explore. MAX may avoid overfitting to the chain due to its use of ensembles of stochastic neural networks. Our method uses amortized SVGD to promote model diversity, and we use implicit generative modeling of the uncertainty in our dynamic model to explore new states.

Figure 8. Results on the 40-link chain environment. Each line is the mean of three runs, with the shaded regions corresponding to ±1 standard deviation. Both our method and MAX actively reduce uncertainty in the chain, and therefore are able to quickly explore to the end of the chain. ε-greedy DDQN fails to explore more than 40% of the chain. Both ICM and Disagreement perform better than DDQN but explore less efficiently compared with MAX and our method.

A.2.2. EXTENDED DISAGREEMENT RESULTS

Here we repeat our pure exploration experiments, comparing our method to both Disagreement used purely as an intrinsic reward and the full method using the differentiable reward function for policy optimization. Figures 9(a), 9(b), and 9(c) show results on the Acrobot, Ant Maze, and Block Manipulation environments, respectively. In each figure, lines correspond to the mean of three seeds, and shaded regions denote ± one standard deviation. In each experiment, we can see

4https://github.com/pathak22/exploration-by-disagreement



that treating the intrinsic reward as a supervised loss (gray) improves on the baseline scalar-valued disagreement intrinsic reward (green). However, our method (red) remains the most sample efficient in these experiments.


Figure 9. Results for the full Disagreement method, including the differentiable reward function, on the Acrobot (a), Ant Maze (b), and HandManipulateBlock (c) environments.

A.2.3. ANT STATE VISITATION DIAGRAMS

Figure 10. Ant Maze environment.

When developing exploration algorithms based on an intrinsic reward, there is a question of whether the intrinsic reward is motivating the agent to explore in a sensible way. Taking the Ant Maze environment as an example, the agent should carve out many trajectories through the maze, instead of merely retracing its steps in each episode. Retracing exact trajectories would show some limitation in the exploration policy, the intrinsic reward, or both.

We want the agent to explore many trajectories through the maze without often retracing its steps. To show that our intrinsic reward induces sensible exploration, we plot state visitation diagrams of our agent in the Ant Maze in figures 11(a)-11(d). We can see that the agent explores many paths through the maze and has not left any large portion of the maze unexplored by the end of the trial. For reference, we show a bird's-eye view of the environment in figure 10.
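A minimal sketch of how such a visitation diagram can be produced is shown below, assuming a (T, 2) array of the agent's (x, y) maze positions; the colormap, marker size, and file output are illustrative choices, not the exact plotting code used for Figure 11.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_state_visitation(xy_positions, out_path="visitation.png"):
    # xy_positions: (T, 2) array of the agent's (x, y) maze coordinates over
    # one stage of training. Points are color-coded by time step, from blue
    # (start of the episode) to red (end), mirroring Figure 11.
    xy_positions = np.asarray(xy_positions)
    t = np.linspace(0.0, 1.0, len(xy_positions))
    plt.figure(figsize=(4, 4))
    plt.scatter(xy_positions[:, 0], xy_positions[:, 1], c=t, cmap="coolwarm", s=4)
    plt.xlabel("x position")
    plt.ylabel("y position")
    plt.savefig(out_path, dpi=150)
    plt.close()
```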

Figure 11. State visitation diagrams showing the behavior of the agent after 2500 (a), 5000 (b), 7500 (c), and 10000 (d) steps of training. Points are color-coded, with blue points occurring at the beginning of the episode and red points at the end.

A.3. Comparison to other exploration methods

Here we compare our method to two other representative exploration methods. Parameter Space Noise for Exploration (PSNE) (Plappert et al., 2017) adds parametric noise to the weights of an agent, similar to (Fortunato et al., 2017). The noise parameters are learned by gradient descent, and the additional stochasticity in the induced policy is responsible for increased exploration ability. Random Network Distillation (RND) is another well-known method (Burda et al., 2018b) that introduces a randomly initialized function f : S → R^k which maps states s to a k-dimensional vector, similar to (Osband et al., 2018). A second function f̂ : S → R^k is trained to match the predictions given by f. The prediction error ‖f̂(s) − f(s)‖ is used as an exploration bonus to the reward during training, similar to the pseudo-count based exploration



bonus in (Bellemare et al., 2016). RND has been shown to be a strong baseline for both task-specific environments and pure exploration.
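As an illustration of this bonus, here is a minimal PyTorch sketch of an RND-style predictor update, where the per-state squared prediction error doubles as the intrinsic bonus. The network sizes, learning rate, and feature dimension are illustrative assumptions, not the settings used in our experiments or in the original RND code.

```python
import torch
import torch.nn as nn

d_state, k = 17, 64  # illustrative sizes

# Fixed, randomly initialized target network f and a trained predictor f_hat.
f = nn.Sequential(nn.Linear(d_state, 128), nn.ReLU(), nn.Linear(128, k))
f_hat = nn.Sequential(nn.Linear(d_state, 128), nn.ReLU(), nn.Linear(128, k))
for p in f.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(f_hat.parameters(), lr=1e-4)

def rnd_bonus_and_update(states):
    # states: (batch, d_state) tensor. The per-state exploration bonus is the
    # squared prediction error ||f_hat(s) - f(s)||^2; the mean of the same
    # error serves as the training loss for the predictor f_hat.
    target = f(states)
    pred = f_hat(states)
    err = ((pred - target) ** 2).sum(dim=1)
    opt.zero_grad()
    err.mean().backward()
    opt.step()
    return err.detach()

# Example batch of states.
bonus = rnd_bonus_and_update(torch.randn(32, d_state))
```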

In Figure 12(a), we first compare our method with PSNE and RND on the HalfCheetah environment, as both can be used to learn task-specific policies. For both methods, we use the author-provided code to run our experiments. Because RND was initially designed for discrete actions, we modify the policy to handle continuous action spaces. However, we were unable to recover the reported results from PSNE using the provided code5. In Figure 12(b), we further compare with RND in the pure exploration setting. We omit PSNE from this experiment, as PSNE does not have an intrinsic reward or another mechanism that can be directly used for pure exploration in the Ant Maze environment. For Ant Maze, each method runs for 10k steps of pure exploration, without external reward.


Figure 12. Comparison with RND and PSNE on HalfCheetah with external reward (a), and a pure exploration comparison with RND on Ant Maze (b).

These methods lack an explicit model of the environment dynamics. It has been shown many times that model-based methods have a considerable advantage in sample efficiency. RND, in particular, takes hundreds of millions of environment steps to achieve its reported performance. We show in Figures 12(a) and 12(b) that our method enables superior downstream task performance and better sample efficiency in exploration, respectively.

A.4. Policy Transfer Implementation Details

Here we describe in detail the specific settings and design choices used for the policy transfer methods and environments.

The exploration policies for our method, MAX, ICM, and Disagreement were trained exactly as in the pure exploration experiments. We trained each exploration policy for 10k steps, storing the transition buffer for subsequent use. We then initialize a new SAC agent and pretrain it on the stored exploration buffer for 50 episodes. After this initial stage of training, the transition buffer is cleared, and the agent is trained in the standard setting with external reward for one million steps (including the steps already taken during exploration). We choose to keep the hyperparameters for SAC (originally suggested by MAX) fixed during both exploration and task learning. The SAC baseline, on the other hand, is initialized from scratch and simply trained as normal for the full one million steps. For the SAC baseline we use the hyperparameters given by the authors for HalfCheetah. We have also run SAC with the hyperparameters used in our method, but they did not perform better than the hyperparameters given by the authors (Haarnoja et al., 2018). The relevant SAC hyperparameter specifications used for each method can be found in Table 1.
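The following is a minimal sketch of this two-stage transfer procedure, assuming hypothetical SACAgent and ReplayBuffer classes with standard act/update/sample/add methods and a Gym-style environment; the names and the number of updates per pretraining episode are illustrative assumptions, not taken from our released code.

```python
def transfer_train(env, exploration_buffer, total_steps=1_000_000,
                   pretrain_episodes=50, updates_per_episode=1000):
    # SACAgent and ReplayBuffer are hypothetical placeholders for a standard
    # off-policy SAC implementation and replay buffer.
    agent = SACAgent(env.observation_space, env.action_space)

    # Stage 1: pretrain on transitions gathered by the exploration policy.
    for _ in range(pretrain_episodes):
        for _ in range(updates_per_episode):  # updates per "episode" is assumed
            agent.update(exploration_buffer.sample())

    # Stage 2: clear the buffer and train as usual with external reward.
    buffer = ReplayBuffer()
    state = env.reset()
    for _ in range(total_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        agent.update(buffer.sample())
        state = env.reset() if done else next_state
    return agent
```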

A.5. Extended Policy Transfer Results

In section 4.3 of the main paper, we show that we can significantly improve the performance of a model-free SAC agent by pretraining on data gathered from our exploration policy: after training an exploration policy with Algorithm 1, we use the resulting transition buffer to pretrain a new task-specific policy. This pretraining period results in a significant improvement in sample efficiency with regard to learning an effective task-specific policy. In this section we consider the same comparison experiments, but based on a slightly different SAC baseline training strategy that further improves the

5For PSNE we used the code at https://github.com/openai/baselines. There is an open issue regarding the performance of this code; we will update our plot when it is fixed.



                  Ours    MAX     Disagreement   ICM     SAC Baseline
Learning Rate     1e-3    1e-3    1e-3           1e-3    3e-4
Batch Size        4096    4096    4096           4096    256
Alpha             0.02    0.02    0.02           0.02    1
Hidden Size       256     256     256            256     256
Gamma             0.99    0.99    0.99           0.99    0.99
Tau               5e-3    5e-3    5e-3           5e-3    5e-3
Reward Scale      1       1       1              1       5

Table 1. SAC hyperparameters used with each method for the pure exploration and policy transfer experiments

performance of all methods. It is common practice to use a warm-up period when training off-policy agents. The warm-up period takes place before any parameter updates and consists of a fixed number of steps where the agent takes random actions, collecting experience to add to the transition buffer. Collecting this initial experience buffer prevents the agent from performing many updates with respect to the same transitions in the early stage of training.

For the SAC baseline, we perform this warm-up period for ten thousand steps with an SAC agent on the HalfCheetah environment, and we observe a significant improvement over the SAC baseline without warm-up. For the other methods, including ours, we also add this warm-up stage between the pure exploration stage and the task-specific policy training stage, but we utilize the pretrained policy to collect the initial experience buffer for task policy training. As before, we train an exploration policy for ten thousand steps on the HalfCheetah environment, followed by a period of pretraining a new SAC agent on the transitions collected by the exploration policy. For the warm-up stage, instead of sampling random actions, we sample actions from our pretrained policy. Specifically, we compare the following warm-up strategies (a minimal sketch of both follows the list):

• SAC baseline: warm up with uniformly distributed random actions

• Exploration policy initialized SAC: warm up according to the initialized exploration policy
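Below is a minimal sketch contrasting the two warm-up strategies, assuming a hypothetical Gym-style environment, replay buffer, and policy object with an act method; these names are illustrative and not part of our released implementation.

```python
def warm_up(env, buffer, steps, policy=None):
    # Warm-up phase before any SAC parameter updates: collect `steps`
    # transitions into the replay buffer. With policy=None we take uniformly
    # random actions (SAC baseline); otherwise we sample from the
    # exploration-initialized policy. env, buffer, and policy are hypothetical
    # Gym-style objects, not part of our released code.
    state = env.reset()
    for _ in range(steps):
        action = env.action_space.sample() if policy is None else policy.act(state)
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

# Baseline warm-up:                warm_up(env, buffer, steps=10_000)
# Exploration-initialized warm-up: warm_up(env, buffer, steps=10_000, policy=pretrained_policy)
```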

Figure 13. Policy transfer results with policy warm-up on the HalfCheetah environment. We show results for the SAC Baseline (uniform warm-up) and SAC with exploration initialization and warm-up using various initialized policies.

As shown in Figure 13, all intrinsic exploration policy initialized methods are still able to improve on the baseline to some extent, especially in the early stage of learning; this is consistent with our observations in the main paper (Figure 7(a)), where the warm-up stage is not introduced. While the other competing methods are not able to improve much over the new baseline, given that SAC with the warm-up strategy is already very strong, our method still improves on the baseline by a clear margin, further demonstrating the effectiveness and robustness of the proposed method.