


July 8, 2019

On Inductive Biases in Deep Reinforcement Learning
Matteo Hessel*, Hado van Hasselt*, Joseph Modayil, David Silver
DeepMind, London, UK

Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent's objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when algorithms use such biases. Stronger biases can lead to faster learning, but weaker biases can potentially lead to more general algorithms. This trade-off is important because inductive biases are not free; substantial effort may be required to obtain relevant domain knowledge or to tune hyper-parameters effectively. In this paper, we re-examine several domain-specific components that bias the objective and the environmental interface of common deep reinforcement learning agents. We investigated whether the performance deteriorates when these components are replaced with adaptive solutions from the literature. In our experiments, performance sometimes decreased with the adaptive components, as one might expect when comparing to components crafted for the domain, but sometimes the adaptive components performed better. We investigated the main benefit of having fewer domain-specific components, by comparing the learning performance of the two systems on a different set of continuous control problems, without additional tuning of either system. As hypothesized, the system with adaptive components performed better on many of the new tasks.

The deep reinforcement learning (RL) community has demonstrated that well-tuned deep RL algorithms can master a wide range of challenging tasks. Human-level performance has been approached or surpassed on board-games such as Go and Chess (Silver et al., 2017), video-games such as Atari (Mnih et al., 2015; Hessel et al., 2018a; Horgan et al., 2018), and custom 3D navigation tasks (Johnson et al., 2016; Kempka et al., 2016; Beattie et al., 2016; Mnih et al., 2016; Jaderberg et al., 2016; Espeholt et al., 2018). These results are a testament to the generality of the overall approach. At times, however, the excitement for the constant stream of new domains being mastered by suitable RL algorithms may have over-shadowed the dependency of these agents on inductive biases, and the amount of tuning that is often required for them to perform effectively in new domains. A clear example of the benefits of generality is the AlphaZero algorithm (Silver et al., 2017). AlphaZero differs from the earlier AlphaGo algorithm (Silver et al., 2016) by removing all dependencies on Go-specific inductive biases and human data. After removing these biases, AlphaZero did not just achieve higher performance in the game of Go, but could also learn effectively to play Chess and Shogi.

In general, there is a trade-off between generality and performance when we inject inductive biases into our algorithms. Inductive biases take many forms, including domain knowledge and pretuned learning parameters. If applied carefully, such biases can lead to faster and better learning. On the other hand, fewer biases can potentially lead to more general algorithms that work out of the box on a wider class of problems. Crucially, most inductive biases are not free: for instance, substantial effort can be required to attain the relevant domain knowledge or pretune parameters. This cost is often hidden; for instance, one might use hyper-parameters established as good in prior work on the same domain, without knowing how much data or time was spent optimising these, nor how specific the settings are to the given domain. Systematic studies about the impact of different inductive biases are rare and the generality of these different biases is often unclear. Another consideration is that inductive biases may mask the generality of other parts of the system as a whole; if a learning algorithm tuned for a specific domain does not generalize out of the box to a new domain, it can be unclear whether this is due to having the wrong inductive biases or whether the underpinning learning algorithm is lacking something important.

In this paper, we consider several commonly used domain heuristics, and investigate if (and by how much) performance deteriorates when we replace these with more general adaptive components, and we assess the impact of such replacements on the generality of the agent. We consider two broad ways of injecting inductive biases in RL agents: 1) sculpting the agent's objective (e.g., clipping and discounting rewards), 2) sculpting the agent-environment interface (e.g., repeating each selected action a hard-coded fixed number of times, or crafting the learning agent's observation). We first highlight, in a set of carefully constructed toy domains, the limitations of common instantiations of these classes of inductive biases. Next, we show, in the context of the Arcade Learning Environment (Bellemare et al., 2013), that the carefully crafted heuristics commonly used in this domain (and often considered essential for good performance) can be safely replaced with adaptive components, while preserving competitive performance across the benchmark.


Finally, we show that this results in increased generality for an actor-critic agent; the resulting fully adaptive system can be applied with no additional tuning on a separate suite of continuous control tasks. Here, the adaptive system achieved much higher performance than a comparable system using the Atari-tuned heuristics, and also higher performance than an actor-critic agent that was tuned for this benchmark (Tassa et al., 2018).

1. Background

Problem setting: Reinforcement learning is a framework for learning and decision making under uncertainty, where an agent interacts with its environment in a sequence of discrete steps, executing actions $A_t$ and getting observations $O_{t+1}$ and rewards $R_{t+1}$ in return. The behaviour of an agent is specified by a policy $\pi(A_t|H_t)$: a probability distribution over actions conditional on previous observations (the history $H_t = O_{1:t}$). The agent's objective is to maximize the rewards collected in each episode of experience under policy $\pi$, and it must learn such a policy without direct supervision, by trial and error. The amount of reward collected from time $t$ onwards, the return, is a random variable

$$G_t = \sum_{k=0}^{T_{\mathrm{end}}} \gamma^k R_{t+k+1}, \qquad (1)$$

where $T_{\mathrm{end}}$ is the number of steps until episode termination and $\gamma \in [0, 1]$ is a constant discount factor. An optimal policy is one that maximizes the expected returns or values: $v(H_t) = \mathbb{E}_\pi[G_t|H_t]$. In fully observable environments the optimal policy depends on the last observation alone: $\pi^*(A_t|H_t) = \pi^*(A_t|O_t)$. Otherwise, the history may be summarized in an agent state $S_t = f(H_t)$. The agent's objective is then to jointly learn the state representation $f$ and policy $\pi(A_t|S_t)$ to maximize values. The fully observable case is formalized as a Markov Decision Process (Bellman, 1957).
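As a concrete illustration of equation (1), the following minimal Python sketch (ours, not from the paper) accumulates the discounted return of a single episode from its sequence of rewards:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 as in eq. (1): sum_k gamma^k * R_{k+1} over one episode.

    `rewards` is the list [R_1, ..., R_{T_end + 1}] observed after time 0.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards of 1, 0 and 2 with gamma = 0.9 give 1 + 0.9**2 * 2 = 2.62.
assert abs(discounted_return([1.0, 0.0, 2.0], 0.9) - 2.62) < 1e-12
```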

Actor-critic algorithms: Value-based algorithms efficiently learn to approximate values $v_w(s) \approx v_\pi(s) \equiv \mathbb{E}_\pi[G_t|S_t = s]$, under a policy $\pi$, by exploiting the recursive decomposition $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]$, known as the Bellman equation, which is used in temporal difference learning (Sutton, 1988) through sampling and incremental updates:

$$\Delta w_t = \big(R_{t+1} + \gamma v_w(S_{t+1}) - v_w(S_t)\big)\,\nabla_w v_w(S_t). \qquad (2)$$

Policy-based algorithms update a parameterized policy $\pi_\theta(A_t|S_t)$ directly through a stochastic gradient estimate of the direction of steepest ascent in the value (Williams, 1992; Sutton et al., 2000), for instance:

$$\Delta\theta_t = G_t \nabla_\theta \log \pi_\theta(A_t|S_t). \qquad (3)$$

Value-based and policy-based methods are combined in actor-critic algorithms. If a state value estimate is available, the policy updates can be computed from incomplete episodes by using the truncated returns $G^{(n)}_t = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n v_w(S_{t+n})$ that bootstrap on the value estimate at state $S_{t+n}$ according to $v_w$. This can reduce the variance of the updates. The variance can be further reduced using state values as a baseline in policy updates, as in the advantage actor-critic update

$$\Delta\theta_t = \big(G^{(n)}_t - v_w(S_t)\big)\,\nabla_\theta \log \pi_\theta(A_t|S_t). \qquad (4)$$
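To make equations (2) and (4) concrete, here is a minimal tabular sketch of the resulting updates (our own illustration, not the agent used in the paper), using the one-step TD error in place of the n-step advantage; the array shapes and step sizes are illustrative:

```python
import numpy as np

def actor_critic_update(values, logits, transition, gamma=0.99,
                        lr_v=0.1, lr_pi=0.01):
    """Apply eqs. (2) and (4) for one transition in the tabular case.

    values: state values, shape [num_states]
    logits: action preferences, shape [num_states, num_actions]
    transition: (state, action, reward, next_state, done)
    """
    s, a, r, s_next, done = transition
    bootstrap = 0.0 if done else gamma * values[s_next]
    td_error = r + bootstrap - values[s]          # R + gamma*v(S') - v(S)
    values[s] += lr_v * td_error                  # eq. (2); tabular gradient is 1

    # eq. (4): move the logits along grad log pi(a|s), scaled by the
    # advantage (here approximated by the one-step TD error).
    probs = np.exp(logits[s] - logits[s].max())
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                         # d log softmax / d logits
    logits[s] += lr_pi * td_error * grad_log_pi
    return td_error
```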

2. Common inductive biases and corresponding adaptive solutions

We now describe a few commonly used heuristics within the Atari domain, together with the adaptive replacements that we investigated in our experiments.

2.1. Sculpting the agent’s objective

Many current deep RL agents do not directly optimize the true objective that they are evaluated against. Instead, they are tasked with optimizing a different handcrafted objective that incorporates biases to make learning simpler. We consider two popular ways of sculpting the agent's objective: reward clipping, and the use of fixed discounting of future rewards by a factor different from the one used for evaluation.

In many deep RL algorithms, the magnitude of the updates scales linearly with the returns. This makes it difficult to train the same RL agent, with the same hyper-parameters, on multiple domains, because good settings for hyper-parameters such as the learning rate vary across tasks. One common solution is to clip the rewards to a fixed range (Mnih et al., 2015), for instance [−1, 1]. This clipping makes the magnitude of returns and updates more comparable across domains. However, it also radically changes the agent's objective; e.g., if all non-zero rewards are larger than one, then this amounts to maximizing the frequency of positive rewards rather than their cumulative sums. This can simplify the learning problem, and, when it is a good proxy for the true objective, can result in good performance. In other tasks, however, clipping can result in sub-optimal policies because the objective that is optimized is ill-aligned with the true objective.

PopArt (van Hasselt et al., 2016; Hessel et al., 2018b) was introduced as a principled solution to learn effectively irrespective of the magnitude of returns. PopArt works by tracking the mean $\mu$ and standard deviation $\sigma$ of bootstrapped returns $G^{(n)}_t$. Temporal difference errors on value estimates can then be computed in a normalized space, with $n_w(s)$ denoting the normalized value, while the unnormalized values (needed, for instance, for bootstrapping) are recovered by a linear transformation $v_w(s) = \mu + \sigma \cdot n_w(s)$. Doing this naively increases the non-stationarity of learning, since the unnormalized predictions for all states change every time we change the statistics. PopArt therefore combines the adaptive rescaling with an inverse transformation of the weights at the last layer of $n_w(s)$, thereby preserving outputs precisely under any change in statistics $\mu \to \mu'$ and $\sigma \to \sigma'$. This is done exactly by updating weights and biases as $w' = w\sigma/\sigma'$ and $b' = (\sigma b + \mu - \mu')/\sigma'$.
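The following sketch is a simplified, scalar-statistics reading of this mechanism (step sizes and moment tracking are illustrative): update the return statistics, then rescale the last layer so that the unnormalized outputs $v_w(s) = \mu + \sigma\, n_w(s)$ are preserved exactly.

```python
import numpy as np

class PopArt:
    """Minimal sketch of PopArt-style normalization (not the released code)."""

    def __init__(self, num_features, step_size=3e-4):
        self.mu, self.sigma = 0.0, 1.0
        self.step_size = step_size
        self.w = np.zeros(num_features)   # last-layer weights of n_w(s)
        self.b = 0.0

    def normalized_value(self, features):
        return features @ self.w + self.b

    def value(self, features):
        # unnormalized value: v_w(s) = mu + sigma * n_w(s)
        return self.mu + self.sigma * self.normalized_value(features)

    def update_stats(self, target):
        # track first and second moments of the bootstrapped return target
        mu_new = (1 - self.step_size) * self.mu + self.step_size * target
        second = (1 - self.step_size) * (self.sigma**2 + self.mu**2) \
                 + self.step_size * target**2
        sigma_new = np.sqrt(max(second - mu_new**2, 1e-8))
        # preserve outputs exactly: w' = w*sigma/sigma', b' = (sigma*b + mu - mu')/sigma'
        self.w *= self.sigma / sigma_new
        self.b = (self.sigma * self.b + self.mu - mu_new) / sigma_new
        self.mu, self.sigma = mu_new, sigma_new
```

A value update would then regress the normalized prediction $n_w(s)$ toward the normalized target $(G^{(n)}_t - \mu)/\sigma$.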

Discounting is part of the traditional MDP formulation of RL. As such, it is often considered a property of the problem rather than a tunable parameter on the agent's side. Indeed, sometimes the environment does define a natural discounting of future rewards (e.g., inflation in a financial setting). However, even in episodic settings where the agent should maximize the undiscounted return, a constant discount factor is often used to simplify learning (by having the agent focus on a relatively short time horizon). Optimizing this proxy of the true return often results in the agent achieving superior performance even in terms of the undiscounted return (Machado et al., 2017). This benefit comes at the cost of adding a hyper-parameter, and a rather sensitive one: learning might be fast if the discount is small, but the solution may be too myopic.

Instead of tuning the discount manually, we use meta-learning (cf. Sutton, 1992; Bengio, 2000; Finn et al., 2017; Xu et al., 2018) to adapt the discount factor. The meta-gradient algorithm of Xu et al. (2018) uses the insight that the updates in Equations (2) and (4) are differentiable functions of hyper-parameters such as the discount. On the next sample or rollout of experience, using the updated parameters $w + \Delta w(\gamma)$, written here as an explicit function of the discount, the agent then applies a gradient-based actor-critic update, not to the parameters $w$, but to the parameter defining the discount $\gamma$ that is used in the standard (inner) learning update. This approach was shown to improve performance on Atari (Xu et al., 2018), when using a separate hand-tuned discount factor for the meta-update. We instead use the undiscounted returns ($\gamma_m = 1$) to define the meta-gradient updates, to test whether this technique can fully replace the need for manually tuning discounts.
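The sketch below illustrates the meta-gradient idea on a linear TD(0) critic (our simplification, not the full actor-critic meta-update of Xu et al. (2018)): the inner update is a differentiable function of $\gamma$, and the derivative of an undiscounted meta-objective on a held-out transition with respect to $\gamma$ follows from the chain rule; in practice this derivative is obtained by automatic differentiation.

```python
import numpy as np

def meta_grad_discount(w, gamma, batch, meta_batch, lr=0.1):
    """Return d(meta loss)/d(gamma) for one inner TD(0) step on linear values.

    batch and meta_batch are (features, reward, next_features) transitions;
    the meta loss is the squared *undiscounted* TD error (gamma_m = 1) of the
    updated weights on the held-out transition.
    """
    s, r, s_next = batch
    s2, r2, s2_next = meta_batch

    delta = r + gamma * (w @ s_next) - w @ s      # inner TD error
    w_new = w + lr * delta * s                    # eq. (2) with v_w(s) = w.s
    dw_dgamma = lr * (w @ s_next) * s             # d w_new / d gamma

    delta_meta = r2 + w_new @ s2_next - w_new @ s2    # gamma_m = 1
    # d(0.5 * delta_meta^2)/d gamma, differentiating through w_new
    return delta_meta * ((s2_next - s2) @ dw_dgamma)

# usage sketch: gamma -= meta_lr * meta_grad_discount(w, gamma, batch, meta_batch)
```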

A related heuristic, quite specific to Atari, is to track the number of lives that the agent has available (in several Atari games the agent is allowed to die a fixed number of times before the game is over), and to hard-code an episode termination ($\gamma = 0$) when this happens. We ignore the number-of-lives channel exposed by the Arcade Learning Environment in all our experiments.

2.2. Sculpting the agent-environment interface

A common assumption in reinforcement learning is that time progresses in discrete steps with a fixed duration. Although algorithms are typically defined in this native space, learning at the fastest timescale provided by the environment may not be practical or efficient, at least with the current generation of learning algorithms. It is often convenient to have the agent operate at a slower timescale, for instance by repeating each selected action a fixed number of times. The use of fixed action repetitions is a widely used heuristic (e.g., Mnih et al., 2015; van Hasselt et al., 2016; Wang et al., 2016; Mnih et al., 2016) with several advantages. 1) Operating at a slower timescale increases the action gap (Farahmand, 2011), which can lead to more stable learning (Bellemare et al., 2015) because it becomes easier to reliably rank actions when the value estimates are uncertain or noisy. 2) Selecting an action every few steps can save a significant amount of computation. 3) Committing to each action for a longer duration may help exploration, because the diameter of the solution space has effectively been reduced, for instance by removing some often-irrelevant sequences of actions that jitter back and forth at a fast timescale.

A more general approach is for the agent to learn the most appropriate timescale at which to operate. Solving this problem in full generality is one aim of hierarchical reinforcement learning (Dayan and Hinton, 1993; Wiering and Schmidhuber, 1997; Sutton et al., 1998; Bacon et al., 2017). This general problem remains largely unsolved. A simpler, though more limited, approach is for the agent to learn how long to commit to actions (Lakshminarayanan et al., 2017). For instance, at each step $t$, the agent may be allowed to select both an action $A_t$ and a commitment $C_t$, by sampling from two separate policies, both trained with policy gradient. Committing to an action for multiple steps raises the issue of how to handle intermediate observations without missing out on the potential computational savings. Conventional deep RL agents for Atari max-pool multiple image frames into a single observation. In our setting, the agent gets one image frame as an observation after each new action selection. The agent needs to learn to trade off the benefits of action repetition (e.g., lower variance, more directed behaviour) against its disadvantages (e.g., not being able to revise its choices as often, and missing potentially useful intermediate observations).
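A minimal sketch of this interface (our own; `env.step` and its return signature are assumptions): sample an action and a commitment from two softmax heads, then hold the action for that many environment steps, observing only the frame returned after the commitment ends.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def act_with_commitment(action_logits, repeat_logits, env):
    """Sample an action A_t and a commitment C_t from two policy heads and
    hold the action for C_t steps; intermediate frames are not observed."""
    a = int(rng.choice(len(action_logits), p=softmax(action_logits)))
    c = 1 + int(rng.choice(len(repeat_logits), p=softmax(repeat_logits)))
    total_reward, done, obs = 0.0, False, None
    for _ in range(c):
        obs, reward, done = env.step(a)   # assumed (observation, reward, done)
        total_reward += reward
        if done:
            break
    return a, c, obs, total_reward, done
```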

Many state-of-the-art RL agents use non-linear function approximators to represent values, policies, and states. The ability to learn flexible state representations was essential to capitalize on the successes of deep learning, and to scale reinforcement learning algorithms to visually complex domains (Mnih et al., 2015). While the use of deep neural networks to approximate value functions and policies is widespread, their input is often not the raw observations but the result of domain-specific heuristic transformations. In Atari, for instance, most agents rely on down-sampling the observations to an 84×84 grid (down from the original 210×160 resolution), grey-scaling them, and finally concatenating them into a K-Markov representation, with K = 4. We replace this specific preprocessing pipeline with a state representation learned end-to-end. We feed the RGB pixel observations at the native resolution of the Arcade Learning Environment into a convolutional network with 32 and 64 channels (in the first and second layer, respectively), both using 5×5 kernels with a stride of 5. The output is fed to a fully connected layer with 256 hidden units, and then to an LSTM recurrent network (Hochreiter and Schmidhuber, 1997) of the same size. The policies for selecting the action and its commitment are computed as logits coming from two separate linear outputs of the LSTM. The network must then integrate information over time to cope with issues, such as the flickering of the screen, that had motivated the standard heuristic pipeline used by deep RL agents on Atari.
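A sketch of that architecture in PyTorch (ours; the value head, activations, and initialization are assumptions not spelled out in the text):

```python
import torch
from torch import nn

class AdaptiveAtariNet(nn.Module):
    """Raw 210x160 RGB frames -> two 5x5/stride-5 conv layers (32, 64 channels)
    -> 256-unit fully connected layer -> 256-unit LSTM -> separate linear heads
    for the action policy, the commitment policy, and the value."""

    def __init__(self, num_actions, max_repeats):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=5), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=5), nn.ReLU(),
            nn.Flatten(),
        )
        # 210x160 input -> 42x32 -> 8x6 feature maps with 64 channels
        self.fc = nn.Sequential(nn.Linear(64 * 8 * 6, 256), nn.ReLU())
        self.lstm = nn.LSTMCell(256, 256)
        self.action_head = nn.Linear(256, num_actions)
        self.repeat_head = nn.Linear(256, max_repeats)
        self.value_head = nn.Linear(256, 1)

    def forward(self, frame, lstm_state):
        # frame: [batch, 3, 210, 160] float tensor in [0, 1]
        # lstm_state: tuple (h, c), each of shape [batch, 256]
        x = self.fc(self.conv(frame))
        h, c = self.lstm(x, lstm_state)
        return self.action_head(h), self.repeat_head(h), self.value_head(h), (h, c)
```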

3. Experiments

When designing algorithms it is useful to keep in mind what properties we would like the algorithm to satisfy. If the aim is to design an algorithm, or inductive bias, that is general, then in addition to metrics such as asymptotic performance and data efficiency, there are further dimensions that are useful to consider. 1) Does the algorithm require careful reasoning to select an appropriate time horizon for decision making? This is tricky without domain knowledge or tuning. 2) How robust is the algorithm to reward scaling? Rewards can have arbitrary scales that may change by orders of magnitude during training. 3) Can the agent use commitment (e.g., action repetitions, or options) to alleviate the difficulty of learning at the fastest timescale? 4) Does the algorithm scale effectively to large complex problems? 5) Does the algorithm generalize well to problems it was not specifically designed and tuned for?

None of these dimensions is binary, and different algorithms may satisfy each of them to a different degree, but keeping them in mind can help drive research towards more general reinforcement learning solutions. We first discuss the first three in isolation, in the context of simple toy environments, to build an understanding of how adaptive solutions compare to the corresponding heuristics they are intended to replace. We then use the 57 Atari games in the Arcade Learning Environment (Bellemare et al., 2013) to evaluate the performance of the different methods at scale.

Finally, we investigate how well the methods generalize to new domains, using 28 continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018).

3.1. Motivating Examples

We used a simple tabular actor-critic agent (A2C) to investigate in a minimal setup how domain heuristics and adaptive solutions compare with respect to some of these dimensions. We report average reward per step, after 5000 environment steps, for each of 20 replicas of each agent.

First, we investigate the role of discounting for effective learning. Consider a small chain environment with T = 9 states and 2 actions. The agent starts every episode in the middle of the chain. Moving left provides a −1 penalty. Moving right provides a reward of 2d/T, where d is the distance from the left end. When either end of the chain is reached, the episode ends, with an additional reward of T at the far right end. Figure 1a shows a parameter study over a range of values for the discount factor. We found the best performance between 0.5 and 0.9, where learning is quite effective, but observed decreased performance for lower or higher discounts. This shows that it can be difficult to set a suitable discount factor, and that the naive solution of just optimizing the undiscounted return may also perform poorly. Compare this to the same agent, but equipped with the adaptive meta-gradient algorithm discussed in Section 2.1 (in orange in Figure 1a). Even when initializing the discount to the value of 0.95 (which performed poorly in the parameter study), the agent learned to reduce the discount and performed on par with the best tuned fixed discount.
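For reference, a sketch of our reading of this chain environment (details such as the exact placement of the terminal bonus are inferred from the description above, not taken from released code):

```python
class ChainEnv:
    """Toy chain: T states, start in the middle, -1 for moving left,
    2*d/T for moving right, and an extra reward of T at the far right end."""

    def __init__(self, T=9):
        self.T = T
        self.reset()

    def reset(self):
        self.pos = self.T // 2
        return self.pos

    def step(self, action):            # action: 0 = left, 1 = right
        if action == 0:
            self.pos -= 1
            reward = -1.0
        else:
            self.pos += 1
            reward = 2.0 * self.pos / self.T
        done = self.pos <= 0 or self.pos >= self.T - 1
        if done and self.pos >= self.T - 1:
            reward += self.T           # bonus at the far right end
        return self.pos, reward, done
```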

Next, we investigate the impact of reward scaling. We used the same domain, but kept the discount fixed to a value of 0.8 (as it was previously found to work well). We examine instead the performance of the agent when all rewards are scaled by a constant factor. Note that in the plots we report the unscaled rewards to make the results interpretable. Figure 1b shows that the performance of the vanilla A2C agent (in blue) degraded rapidly when the scale of the rewards was significantly smaller or larger than 1. Comparing this to the same agent equipped with PopArt, we observe better performance across multiple orders of magnitude of reward scales. In this tiny problem, learning could also be achieved by tuning the learning rate for each reward scale, but that does not suffice for larger problems. Adaptive optimization algorithms such as Adam (Kingma and Ba, 2014) or RMSProp (Tieleman and Hinton, 2012) can also provide some degree of invariance, but, as we will see in Section 3.2, they are not as effective as PopArt normalization.


Figure 1 | Investigations on the robustness of an A2C agent with respect to discounting, reward scaling and action repetitions. We report the average reward per environment step, after 5000 steps of training, for each of 20 distinct seeds. Each parameter study compares different fixed configurations of a specific hyper-parameter to the corresponding adaptive solution. In all cases the performance of the adaptive solution is competitive with that of the best tuned solution.

Finally, we investigate the role of action repeats. We consider states arranged into a simple cycle of 11 states. The agent starts in state 0, and only moves in one direction using one action; the other action does not move the agent. The reward is 0 everywhere, except if the agent selects the non-moving action in the 11th state: in this case the agent receives a reward of 100 and the episode ends. We compare an A2C agent that learns to choose the number of action repeats (up to 10) to an agent that used a fixed number of repetitions C. Figure 1c shows that the number of fixed action repeats used by the agent is a sensitive hyper-parameter in this domain. Compare this to the adaptive agent that learns how often to repeat actions via policy gradient (in orange in Figure 1c). This agent quickly learned a suitable number of action repeats and thereby performed very well. This is a general issue: in many domains of interest it can be useful to combine fine-grained control in certain states with more coarse and directed behaviour in other parts of the state space.
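And a sketch of our reading of this cycle environment (again inferred from the description, not from released code):

```python
class CycleEnv:
    """11 states in a ring, start in state 0; action 1 advances one state,
    action 0 stays put. Selecting the non-moving action in the last state
    gives +100 and ends the episode; all other rewards are 0."""

    NUM_STATES = 11

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):            # action: 0 = stay, 1 = move
        if action == 1:
            self.state = (self.state + 1) % self.NUM_STATES
            return self.state, 0.0, False
        if self.state == self.NUM_STATES - 1:
            return self.state, 100.0, True
        return self.state, 0.0, False
```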

3.2. Performance on large domains

To evaluate the performance of the different methods on larger problems, we use A2C agents on many Atari games. However, differently from the previous experiments, the agent learns in parallel from multiple copies of the environment, similarly to many state-of-the-art algorithms for reinforcement learning. This configuration increases the throughput of acting and learning, and speeds up the experiments. In parallel training setups, the learning updates may be applied synchronously (Espeholt et al., 2018) or asynchronously (Mnih et al., 2016). Our learning updates are synchronous: the agent's policy takes steps in parallel across 16 copies of the environment to create multi-step learning updates, batched together to compute a single update to the parameters. We train individual agents on each game. Per-game scores are averaged over 8 seeds, and we then track the median human-normalized score across all games. All hyper-parameters for our A2C agents were selected for a generic A2C agent on Atari before the following experiments were performed, with details given in the appendix.

Our experiments measure the performance of a fully adaptive A2C agent with learned action repeats, PopArt normalization, learned discount factors, and an LSTM-based state representation. We compare the performance of this agent to agents with exactly one adaptive component disabled and replaced with one of two fixed components. This fixed component either falls back to the environment-specified task (e.g., learning directly from undiscounted returns) or uses the corresponding fixed heuristic from DQN. These comparisons enable us to investigate how important the original heuristic is for current RL algorithms, as well as how fully an adaptive solution can replace it.

In Figure 2a, we investigate action repeats and their impact on learning. We compare the fully general agent to an agent that used exactly 4 action repetitions (as tuned for Atari (Mnih et al., 2015)), and to an agent that acted and learned at the native frame rate of the environment. The adaptively learned solution performed almost as well as the tuned domain heuristic of always repeating each action 4 times. Interestingly, in the first 100M frames, acting at the fastest rate was also competitive with the agents equipped with action repetition (whether fixed or learned), at least in terms of data efficiency. However, while the agents with action repeats were still improving performance until the very end of the training budget, the agent acting at the fastest timescale appeared to plateau earlier. This performance plateau is observed in multiple games (see the appendix).


Figure 2 | Comparison of inductive biases to adaptive RL solutions. All curves show mean episode return as a function of the number of environment steps. Each plot compares the same fully general agent to two alternatives: a) tuned action repeats, and no action repeats; b) tuned discount factor, and no discounting; c) reward clipping, and learning from raw rewards with no rescaling of the updates; d) learning from the raw observation stream of Atari, and the standard preprocessing.

We speculate that the use of multiple action repetitions may help achieve better exploration. We note that, in wall-clock time, the gap in performance in favour of the agents with action repetitions was even larger, due to the additional compute.

In Figure 2b, we investigate discounting. The agent that used undiscounted returns directly in the updates to policy and values performed very poorly, demonstrating that in complex environments the naive solution of directly optimizing the real objective is problematic with modern deep RL agents. Interestingly, while performance was very poor overall, the agent did demonstrate good performance on a few specific games. For instance, in bowling it achieved a better score than state-of-the-art agents such as Rainbow (Hessel et al., 2018a) and ApeX (Horgan et al., 2018). The agent with a tuned discount and the agent with a discount factor learned through meta-gradient RL performed much better overall. The adaptive solution did slightly better than the heuristic.

In Figure 2c, we investigate the effect of reward scales. We compare the fully adaptive agent to an agent where clipping was used in place of PopArt, and to a naive agent that used the environment reward directly. Again, the naive solution performed very poorly compared to using either the domain heuristic or the learned solution. Note that the naive solution uses RMSProp as an optimizer, in combination with gradient clipping by norm (Pascanu et al., 2012); together these techniques should provide at least some robustness to scaling issues, but in our experiments PopArt provided an additional large increase in performance. In this case, the domain heuristic (reward clipping) retained a significant edge over the adaptive solution.

This suggests that reward clipping might not be helping exclusively with reward scales; the inductive bias of optimizing for a weighted frequency of rewards is a very good heuristic in many Atari games, and the qualitative behaviour resulting from optimizing this proxy objective might result in better learning dynamics. We note, in conclusion, that while clipping was better in aggregate, PopArt yielded significantly improved scores on several games (e.g., centipede) where the clipped agent was stuck in sub-optimal policies.

Finally, in Figure 2d, we compare the fully end-to-end pipeline with a recurrent network to a feedforward neural network with the standard Atari pipeline. The recurrent end-to-end solution performed best, showing that a recurrent network is sufficiently flexible to learn on its own to integrate relevant information over time, despite the Atari-specific features of the observation stream (such as the flickering of the screen) that motivated the more common heuristic approach.

3.3. Generalization to new domains

Our previous analysis shows that learned solutions are mostly quite competitive with the domain heuristics on Atari, but do not uniformly provide additional benefits compared to the well-tuned inductive biases that are commonly used in Atari. To investigate the generality of these different RL solutions, in this section we compare the fully general agent to an agent with all the usual inductive biases, but this time we evaluate them on a completely different benchmark: a collection of 28 continuous control tasks in the DeepMind Control Suite. The tasks represent a wide variety of physical control problems.


Figure 3 | In a separate experiment on the 28 tasks in the DeepMind Control Suite, we compared the general solution agent with an A2C agent using all the domain heuristics previously discussed. Both agents were trained and evaluated on the new domain with no changes to the algorithm nor any additional tuning for this very different set of environments. On average, the general adaptive solutions transfer better to the new domain than the heuristic solution. On the left we plot the average performance across all 28 tasks. On the right we show the learning curves on a selection of 10 tasks.

The dimensionality of the real-valued observation and action vectors differs across the tasks. The environment state can be recovered from the observation in all but one task. The rewards are bounded between 0 and 1, and the tasks are undiscounted.

Again, we use a parallel A2C implementation, with 16 copies of the environment, and we aggregate results by first averaging scores across 8 seeds, and then taking the mean across all 28 tasks. Because all tasks in this benchmark are designed to have episode returns of comparable magnitude, there is no need to normalize the results to meaningfully aggregate them. For both agents we use the exact same solutions that were used in Atari, with no additional tuning. The agents naturally transfer to this new domain with two modifications: 1) we do not use convolutions, since the observations do not have spatial structure; 2) the outputs of the policy head are interpreted as encoding the mean and variance of a Gaussian distribution instead of as the logits of a categorical one.
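A minimal sketch of the second modification (the split of the policy output into a mean and a log-variance per action dimension is our assumption about the parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_continuous_action(policy_out):
    """Interpret the policy head output as a diagonal Gaussian and sample.

    policy_out: array of length 2 * action_dim, holding means and log-variances.
    """
    mean, log_var = np.split(policy_out, 2)
    std = np.exp(0.5 * log_var)
    return mean + std * rng.standard_normal(mean.shape)
```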

Figure 3 shows that the fully general agent performed much better than the heuristic solution, which suggests that the set of inductive biases typically used by deep RL agents on Atari does not generalize as well to this new domain as the set of adaptive solutions considered in this paper. This highlights the importance of being aware of the priors that we incorporate into our learning algorithms, and of their impact on the generality of our agents. On the right side of Figure 3, we report the learning curves on the 10 tasks for which the absolute difference between the performance of the two agents was greatest (details on the full set of 28 tasks can be found in the appendix). The adaptive solutions performed as well as or better than the heuristics on each of these 10 tasks, and the results in the appendix show that performance was rarely worse. The reference horizontal black lines mark the performance of an A3C agent, tuned specifically for this suite of tasks, as reported by Tassa et al. (2018). The adaptive solution was also better, in aggregate, than this well-tuned baseline; note, however, that the tuned A3C agent achieved higher performance on a few tasks.

4. Related Work and Discussion

The present work was partially inspired by the work of Silver et al. (2017) in the context of Go. They demonstrated that certain domain-specific heuristics (e.g., pretraining on human data, the use of handcrafted Go-specific features, and the exploitation of certain symmetries in state space), while originally introduced to simplify learning (Silver et al., 2016), had actually outlived their usefulness: by taking a purer approach, even stronger Go agents could be trained. Importantly, they showed that, after removing these domain heuristics, the same algorithm could master other games, such as Shogi and Chess. In our paper, we adopted a similar philosophy, but investigated the very different set of domain-specific heuristics that are used in more traditional deep reinforcement learning agents.

Our work relates to a broader debate (Marcus, 2018) about priors and innateness. There is evidence that we, as humans, possess specific types of biases, and that these have a role in enabling efficient learning (Spelke and Kinzler, 2007; Dubey et al., 2018); however, it is far from clear to what extent these are essential for intelligent behaviour to arise, what form these priors take, and what their impact is on the generality of the resulting solutions. In this paper, we demonstrate that several heuristics we commonly use in our algorithms already harm the generality of our methods. This does not mean that other inductive biases could not be useful as we progress towards more flexible, intelligent agents; it is, however, a reminder to be careful with the domain knowledge and priors we bake into our solutions, and to be prepared to revise them over time.

We found that existing learned solutions are competitive with well-tuned domain heuristics, even on the domain these heuristics were designed for, and they seem to generalize better to unseen domains. This makes a case for removing these biases in future research on Atari, since they are not essential for competitive performance, and they might hide issues in the core learning algorithm. The only case where we still found a significant gap in favour of the domain heuristic was that of clipping. While, to the best of our knowledge, PopArt does address scaling issues effectively, clipping still seems to help on several games. Changing the reward distribution has many subtle implications for the learning dynamics, besides affecting the magnitude of updates (e.g., exploration, risk propensity). We leave it to future research to investigate what other general solutions could be deployed in our agents to fully recover the observed benefits of clipping.

Several of the biases that we considered have knobs that could be tuned rather than learned (e.g., the discount, the number of repeats, etc.); however, this is not a satisfying solution, for several reasons. Tuning is expensive, and these heuristics interact subtly with each other, thus requiring the exploration of a combinatorial space to find suitable settings. Consider the use of fixed action repeats: when changing the number of repetitions, you also need to change the discount factor, otherwise this will change the effective horizon of the agent, which in turn affects the magnitude of the returns and therefore the learning rate. Also, a fixed tuned value might still not give the full benefits of an adaptive, learned approach that can adjust to the various phases of the training process.

There are two other features of our algorithm that, despite not incorporating quite as much domain knowledge as the heuristics discussed in this paper, also constitute a potential impediment to its generality and scalability. 1) The use of parallel environments is not always feasible in practice, especially in real-world applications (although recent work on robot farms (Levine et al., 2016) shows that it might still be a valid approach when sufficient resources are available). 2) The use of back-propagation through time for training recurrent state representations constrains the length of the temporal relationships that we can learn, since the memory consumption is linear in the length of the rollouts. Further work on overcoming these limitations, successfully learning online from a single stream of experience, is a fruitful direction for future research.

References

P. Bacon, J. Harb, and D. Precup. The option-critic architecture. AAAI Conference on Artificial Intelligence, 2017.

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. DeepMind Lab. CoRR, abs/1612.03801, 2016.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. JAIR, 2013.

M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. CoRR, abs/1512.04860, 2015.

R. Bellman. Dynamic programming. Princeton University Press, 1957.

Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Neural Information Processing Systems, 1993.

Dubey, Agrawal, Pathak, Griffiths, and Efros. Investigating human priors for playing video games. CoRR, abs/1802.10217, 2018. URL http://arxiv.org/abs/1802.10217.

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.

A.-m. Farahmand. Action-gap phenomenon in reinforcement learning. In Neural Information Processing Systems, 2011.

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Hessel, Modayil, van Hasselt, Schaul, Ostrovski, Dabney, Horgan, Piot, Azar, and Silver. Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence, 2018a.

Hessel, Soyer, Espeholt, Czarnecki, Schmitt, and van Hasselt. Multi-task deep reinforcement learning with PopArt. AAAI Conference on Artificial Intelligence, 2018b.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.

M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2016.

M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The Malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247, 2016.

Kempka, Wydmuch, Runc, Toczek, and Jaskowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Conference on Computational Intelligence and Games, 2016.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

Lakshminarayanan, Sharma, and Ravindran. Dynamic action repetition for deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2017.

S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER, 2016.

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. ArXiv e-prints, Sept. 2017.

Marcus. Innateness, AlphaZero, and artificial intelligence. CoRR, abs/1801.05667, 2018.

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.

R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012. URL http://arxiv.org/abs/1211.5063.

Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap, Simonyan, and Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental Science, 10:89–96, 2007.

Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI Conference on Artificial Intelligence, 1992.

Sutton, Precup, and Singh. Intra-option learning about temporally abstract actions. In International Conference on Machine Learning, 1998.

Sutton, McAllester, Singh, and Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, 2000.

Tassa, Doron, Muldal, Erez, Li, L. Casas, Budden, Abdolmaleki, Merel, Lefrancq, Lillicrap, and Riedmiller. DeepMind control suite. 2018.

T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

van Hasselt, Guez, Hessel, Mnih, and Silver. Learning values across many orders of magnitude. In Neural Information Processing Systems, 2016.

H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, pages 2094–2100, 2016.

Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.

M. Wiering and J. Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219–246, 1997.

Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Xu, van Hasselt, and Silver. Meta-gradient reinforcement learning. CoRR, abs/1805.09801, 2018.


Appendix

A. Training Details

We performed very limited tuning on Atari, both due to the cost of running so many comparisons with 8 seeds at scale across 57 games, and because we were interested in generalization to a different domain. We used a learning rate of 1e-3, an entropy cost of 0.01, and a baseline cost of 0.5. The learning rate was selected among 1e-3, 1e-4, and 1e-5 in an early version of the agent without any of the adaptive solutions, and verified to be reasonable as we were adding more components. Similarly, the entropy cost was selected between 0.1 and 0.01. The baseline cost was not tuned, but we used the value of 0.5 that is common in the literature. We used the TensorFlow default settings for the additional parameters of the RMSProp optimizer.

The learning updates were batched across rollouts of 150 agent steps for 16 parallel copies of the environment. All experiments used gradient clipping by norm, with a maximum norm of 5, as in Wang et al. (2016). The adaptive agent could choose to repeat each action up to 6 times. PopArt used a step-size of 3e-4, with bounds on the scale of 1e-4 and 1e6 for numerical stability, as reported by Hessel et al. (2018b). We always used the undiscounted return to compute the meta-gradient updates, and when using the adaptive solutions these would update both the discount γ and the trace λ, as in the experiments by Xu et al. (2018). The meta-updates were computed on smaller rollouts of 15 agent steps, with a meta-learning rate of 1e-3. No additional tuning was performed for any of the experiments on the Control Suite.
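For convenience, the hyper-parameters listed above gathered into a single illustrative configuration (the key names are ours, not from any released code):

```python
# Illustrative summary of the reported training settings.
A2C_CONFIG = {
    "learning_rate": 1e-3,
    "entropy_cost": 0.01,
    "baseline_cost": 0.5,
    "optimizer": "RMSProp (TensorFlow defaults otherwise)",
    "unroll_length": 150,          # agent steps per rollout
    "num_parallel_envs": 16,
    "max_grad_norm": 5.0,          # gradient clipping by norm
    "max_action_repeats": 6,
    "popart_step_size": 3e-4,
    "popart_scale_bounds": (1e-4, 1e6),
    "meta_unroll_length": 15,
    "meta_learning_rate": 1e-3,
    "meta_discount": 1.0,          # undiscounted meta objective
}
```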

B. Experiment Details

In Figure 4 we report the detailed learning curves for all Atari games for three distinct agents: the fully adaptive agent (in red), the agent with fixed action repeats (in green), and the agent acting at the fastest timescale (in blue). It is interesting to observe how, on a large number of games (ice_hockey, jamesbond, robotank, fishing, double_dunk, etc.), the agent with no action repeats seems to learn stably and effectively up to a point, but then plateaus quite abruptly. This suggests that exploration might be a major reason for the overall reduced aggregate performance of this agent in the later stages of training.

In Figures 5 and 6 we show the discount factors and the eligibility traces λ as they were meta-learned over the course of training by the fully adaptive agent. We plot the soft time horizons $(1-\gamma)^{-1}$ and $(1-\lambda)^{-1}$. It is interesting to observe how diverse these are across the various games. Consider for instance the discount factor γ: in games such as robotank, bank_heist, and tutankham the horizon increases up to almost 1000, or even above, while in other games, such as surround and double_dunk, the discount reduces over time. Also for the eligibility trace parameter λ we observe very diverse values in different games. In Figure 7 we report the learning curves for all 28 tasks in the Control Suite (the 10 selected for the main text are those for which the difference in performance between the adaptive agent and the heuristic agent is greatest).


Figure 4 | Performance on all Atari games of the fully adaptive agent (in red), the agent with fixed action repeats (in green), and the agent acting at the fastest timescale (in blue).


Figure 5 | For each Atari game, the associated plot shows the time horizon $T = (1-\gamma)^{-1}$ for the discount factor γ that was adapted across the 200 million training frames.


Figure 6 | For each Atari game, the associated plot shows the time horizon $T = (1-\lambda)^{-1}$ for the eligibility trace λ that was adapted across the 200 million training frames.


Figure 7 | Performance on all Control Suite tasks for the fully adaptive agent (in red), the agent trained with the Atari heuristics (in green), and an A3C agent (black horizontal line) as tuned specifically for the Control Suite by Tassa et al. (2018).
