
Solving Continuous Control with Episodic Memory

Igor Kuznetsov, Andrey Filchenkov
ITMO University

[email protected], [email protected]

Abstract

Episodic memory lets reinforcement learning algorithms remember and exploit promising experience from the past to improve agent performance. Previous works on memory mechanisms show benefits of using episodic-based data structures for discrete action problems in terms of sample efficiency. The application of episodic memory to continuous control with a large action space is not trivial. Our study aims to answer the question: can episodic memory be used to improve an agent’s performance in continuous control? Our proposed algorithm combines episodic memory with an Actor-Critic architecture by modifying the critic’s objective. We further improve performance by introducing episodic-based replay buffer prioritization. We evaluate our algorithm on OpenAI gym domains and show greater sample efficiency compared with state-of-the-art model-free off-policy algorithms.

1 Introduction

The general idea of episodic memory in the reinforcement learning setting is to leverage a long-term memory data structure that stores information about past episodes in order to improve agent performance. Existing works [Blundell et al., 2016; Pritzel et al., 2017] show that episodic control (EC) can be beneficial for the decision-making process. In the context of discrete action problems, episodic memory stores information about states and the corresponding returns in a table-like data structure. With only a few actions, the proposed methods store past experiences in a separate memory buffer per action. During action selection, the estimate of taking each action is reconsidered with respect to the stored memories and may be corrected, taking past experience into account. The motivation for using episodic-like structures is to latch quickly onto rare but promising experiences that slow gradient-based models cannot capture from a small number of samples.

Using episodic memory in continuous control is not trivial. Since the action space may be high-dimensional, methods that operate on a discrete action space are no longer suitable. Another challenge is the complexity of the state space. The study of [Blundell et al., 2016] shows that some discrete-action environments (e.g., Atari) may have a high ratio of repeating states, which is not the case for complex continuous control environments.

Our work proposes an algorithm that leverages episodic memory experience through a modification of the Actor-Critic objective. Our algorithm is based on DDPG, a model-free actor-critic algorithm that operates over a continuous action space. We store representations of state-action pairs in a memory module to perform memory association not only from the environment state but also from the performed action. Our experiments show that the modified objective provides greater sample efficiency compared with off-policy algorithms. We further improve agent performance by introducing a novel form of prioritized experience replay based on episodic experiences. The proposed algorithm, Episodic Memory Actor-Critic (EMAC), exploits episodic memory during training, distilling the Monte-Carlo (MC) discounted return signal through the critic to the actor and resulting in a strong policy with greater sample efficiency than the competing models.

We draw a connection between the proposed objective and the alleviation of Q-value overestimation, a common problem in Actor-Critic methods [Thrun and Schwartz, 1993]. We show that using the retrieved episodic data results in more realistic critic estimates, which in turn provide faster training convergence. In contrast, the proposed prioritized experience replay aims to focus more on optimistic, promising experiences from the past. Frequently exploiting only the high-reward transitions from the whole replay buffer may result in unstable behaviour. However, the stabilizing effect of the new objective allows us to decrease the priority of uninformative low-return transitions without loss of stability.

We evaluate our model on a set of OpenAI gym environments [Dhariwal et al., 2017] (Figure 1) and show that it achieves greater sample efficiency compared with the state-of-the-art off-policy algorithms (TD3, SAC). We open-sourced our algorithm for reproducibility. All the code and learning curves can be accessed at: http://github.com/schatty/EMAC.

2 Related Work

Episodic Memory. First introduced in [Lengyel and Dayan, 2007], episodic control is studied within a regular tree-structured MDP setup. Episodic control aims to memorize highly rewarding experiences and replay sequences of actions that achieved high returns in the past. The successful application of episodic memory to discrete action problems has been studied in [Blundell et al., 2016]. Reflecting a model of hippocampal episodic control, that work shows that table-based structures containing state representations with the corresponding returns can improve model-free algorithms in environments with a high ratio of repeating states. A semi-tabular, differentiable version of memory is proposed in [Pritzel et al., 2017] to store past experiences. The work of [Lin et al., 2018] is reminiscent of the ideas presented in our study: they modify the DQN objective to mimic the relationship between two learning systems of the human brain, and their experiments show that this modification improves sample efficiency on the arcade learning environment [Bellemare et al., 2013]. A framework that associates related experience trajectories in memory to achieve reasoning about effective strategies is proposed in [Vikbladh et al., 2017]. In the context of continuous control, the work of [Zhang et al., 2019] exploits episodic memory to redesign experience replay for asynchronous DDPG. To the best of our knowledge, this is the only prior work that leverages episodic memory for continuous control.

Actor-Critic algorithms. Actor-Critic methods represent a family of algorithms that compute a value function for the policy (the critic) and improve the policy (the actor) using this value function. Using deep approximators for the value function and the actor, [Lillicrap et al., 2015] presents Deep Deterministic Policy Gradient (DDPG), a model-free off-policy algorithm that learns a policy in high-dimensional, continuous action spaces. Recent works propose several modifications that stabilize and improve the performance of the original DDPG. The work of [Fujimoto et al., 2018] addresses Q-value overestimation and proposes the Twin Delayed Deep Deterministic (TD3) algorithm, which greatly improves DDPG learning speed. The proposed modifications include multiple critics to reduce the critic’s over-optimism, additional noise applied in the calculation of the target Q-estimate, and delayed policy updates. In [Haarnoja et al., 2018a; Haarnoja et al., 2018b] the authors study maximum-entropy objectives, which provide state-of-the-art performance on the OpenAI gym benchmark.

3 Background

The reinforcement learning setup consists of an agent that interacts with an environment E. At each discrete time step t = 0, ..., T the agent receives an environment state s_t, performs an action a_t, and receives a reward r_t. The agent’s actions are defined by a policy, which maps a state to a probability distribution over the possible actions, π : S → P(A). The return is defined as the discounted sum of rewards R_t = \sum_{i=t}^{T} γ^{i−t} r(s_i, a_i) with a discount factor γ ∈ [0, 1].

Reinforcement learning aims to find the optimal policy π_θ, with parameters θ, which maximizes the expected return from the initial distribution, J(θ) = E_{s_i ∼ p_π, a_i ∼ π}[R_0]. The Q-function denotes the expected return when performing action a from state s and following the current policy π thereafter,

Q^π(s, a) = E_{s_i ∼ p_π, a_i ∼ π}[R_t | s, a].

For continuous control problems the policy π_θ can be updated by taking the gradient of the expected return ∇_θ J(θ) with the deterministic policy gradient algorithm [Silver et al., 2014]:

∇_θ J(θ) = E_{s ∼ p_π}[∇_a Q^π(s, a)|_{a=π(s)} ∇_θ π_θ(s)].  (1)

In actor-critic methods we operate with two parametrized functions: an actor represents the policy π, and a critic approximates the Q-function. The critic is updated with temporal-difference learning by iteratively minimizing the Bellman error [Watkins and Dayan, 1992]:

J_Q = E[(Q(s_t, a_t) − (r(s_t, a_t) + γ Q(s_{t+1}, a_{t+1})))^2].  (2)

In deep Q-learning, the parameters of the Q-function are updated using an additional frozen target network Q_{θ′}, which is moved by a proportion τ toward the current Q-function, θ′ ← τθ + (1 − τ)θ′:

J_Q = E[(Q(s_t, a_t) − Q′)^2],  (3)

where

Q′ = r(s_t, a_t) + γ Q_{θ′}(s_{t+1}, a′),  a′ ∼ π_{θ′}(s_{t+1}).  (4)

The actor is trained to maximize the current Q-function:

J_π = E[Q(s, π(s))].  (5)

An advantage of actor-critic methods is that they can be applied off-policy, an approach that has proven to have better sample complexity [Lillicrap et al., 2015]. During training, the actor and critic are updated with mini-batches sampled from an experience replay buffer [Lin, 1992].
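For concreteness, the sketch below illustrates one off-policy actor-critic update of the kind described by Eqs. (3)-(5), written in PyTorch. It is only an illustration: the networks, optimizers, and replay buffer (and its sample interface) are hypothetical placeholders, not the authors’ released implementation.

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, replay_buffer,
                batch_size=256, gamma=0.99, tau=0.005):
    # Mini-batch from the experience replay buffer; `done` is a 0/1 float tensor.
    s, a, r, s_next, done = replay_buffer.sample(batch_size)

    # Target Q' = r + gamma * Q_theta'(s', a'), a' ~ pi_theta'(s')   (Eq. 4)
    with torch.no_grad():
        a_next = target_actor(s_next)
        q_target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)

    # Critic loss J_Q = E[(Q(s, a) - Q')^2]                          (Eq. 3)
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor objective J_pi = E[Q(s, pi(s))], maximized by gradient ascent  (Eq. 5)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the frozen targets: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)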

4 Method

We present Episodic Memory Actor-Critic (EMAC), an algorithm that leverages episodic memory during training. Our algorithm builds on Deep Deterministic Policy Gradient [Lillicrap et al., 2015] by modifying the critic’s objective and introducing episodic-based prioritized experience replay.

Figure 1: Examples of OpenAI gym environments. (a) Walker2d-v3 (b) Swimmer-v3 (c) Hopper-v3 (d) InvertedDoublePendulum-v2


4.1 Memory Module

The core of our algorithm is a table-based structure that stores the experience of past episodes. Following the approach of [Blundell et al., 2016], we treat this memory table as a set of key-value pairs. The key is the representation of a concatenated state-action pair, and the value is the true discounted Monte-Carlo return from the environment. We encode the state-action pair into a vector of smaller dimension by the projection operation φ : x → Mx, x ∈ R^v, M ∈ R^{u×v}, where v is the dimension of the concatenated state-action vector and u is the smaller projected dimension. As the Johnson-Lindenstrauss lemma [Johnson and Lindenstrauss, 1984] states, this transformation preserves relative distances from the original space, given that M is a standard Gaussian matrix. The projection matrix is initialized once at the beginning of training.

The memory module supports two operations: add and lookup. Since the discounted return can be computed only at the end of an episode, we perform the add operation, inserting the pairs (φ([s, a])_i, R_i), once the whole episode has been generated by the agent. We note that the true discounted return can be considered reliable only when the episode ends naturally. For the time-step-limit case we perform additional evaluation steps without using these transitions for training; this way, the discounted return of later transitions is obtained from the same number of subsequent steps as that of transitions at the beginning of the episode. Given the complexity and continuous nature of the environments, we assume that an incoming state-action pair has no duplicates among the stored representations. As a result, we do not search for stored duplicates to replace and simply append the incoming key-value pair to the end of the module. Thus, the add operation takes O(1).
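As an illustration, a minimal memory module with the random projection and the O(1) add operation might look as follows. This is a sketch only; the class and attribute names are ours, not those of the released code, and NumPy is assumed.

import numpy as np

class EpisodicMemory:
    """Sketch of a table-based episodic memory with a fixed random projection."""

    def __init__(self, state_dim, action_dim, proj_dim=4, seed=0):
        rng = np.random.RandomState(seed)
        # Gaussian projection matrix M, drawn once at the start of training
        # (Johnson-Lindenstrauss style dimensionality reduction).
        self.M = rng.randn(proj_dim, state_dim + action_dim)
        self.keys = []      # projected state-action representations
        self.returns = []   # corresponding discounted MC returns

    def project(self, state, action):
        x = np.concatenate([state, action])
        return self.M @ x

    def add(self, state, action, mc_return):
        # Append only: no duplicate search, so add is O(1).
        self.keys.append(self.project(state, action))
        self.returns.append(mc_return)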

The lookup operation takes a state-action pair as input, projects it to the smaller dimension, and searches for the k most similar keys, returning the corresponding MC returns. We use the l2 distance as the similarity metric between projected keys,

d(z, z_i) = ‖z − z_i‖_2^2 + ε,  (6)

where ε is a small constant. The episodic MC return for a projected query vector z_q is formulated as a weighted sum over the K closest elements,

Q_M(z_q) = \sum_{k=1}^{K} q_k w_k,  (7)

where q_k is a value stored in the memory module and w_k is a weight proportional to the inverse distance to the query vector,

w_k = e^{−d(z_k, z_q)} / \sum_{t=1}^{K} e^{−d(z_t, z_q)}.  (8)
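Continuing the EpisodicMemory sketch above, the lookup of Eqs. (6)-(8) can be written as a brute-force nearest-neighbour search; again, this is an illustrative sketch rather than the released implementation.

import numpy as np

def lookup(memory, state, action, k=2, eps=1e-3):
    """Return the distance-weighted MC return of the k nearest stored keys.

    `memory` is an instance of the EpisodicMemory sketch above.
    """
    z_q = memory.project(state, action)
    keys = np.stack(memory.keys)                    # (N, u)
    d = np.sum((keys - z_q) ** 2, axis=1) + eps     # squared l2 distances (Eq. 6)

    idx = np.argsort(d)[:k]                         # indices of the k nearest keys
    w = np.exp(-d[idx])
    w /= w.sum()                                    # normalized weights (Eq. 8)
    returns = np.asarray(memory.returns)[idx]
    return float(np.dot(w, returns))                # weighted sum (Eq. 7)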

Figure 2: The EMAC architecture for calculating Q-estimates.

We propose to leverage the stored MC returns as a second, pessimistic estimate of the Q-function. In contrast to the approach of [Pritzel et al., 2017; Lin et al., 2018], we use a weighted sum of all nearby returns rather than taking their maximum. Given that we can calculate the MC return for every transition in the off-policy replay buffer, we propose the following modification of the original critic objective (3):

J_Q = (1 − α)(Q(s_t, a_t) − Q′)^2 + α(Q(s_t, a_t) − Q_M)^2,  (9)

where Q_M is the value returned by the lookup operation and α is a hyperparameter controlling the contribution of episodic memory. With α = 0 the objective reduces to the common Q-learning loss. In our experiments we found small values of α beneficial: in the reported results, α is set to 0.1 for Walker, Hopper, InvertedPendulum, and InvertedDoublePendulum, and to 0.15 for Swimmer.

The memory module is similar to the replay buffer: both are needed to sample off-policy data during the Q-learning update. The difference is that the memory module stores low-dimensional representations of the state-action pairs for efficient lookup, and discounted returns instead of rewards. The outline of the proposed architecture for calculating Q-estimates is presented in Figure 2.
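In a PyTorch-style training step, the modified critic objective of Eq. (9) could be computed as in the sketch below, where `q_target` is the target-network estimate Q′ of Eq. (4) and `q_m` is the batch of episodic lookup values Q_M; these names are ours.

def emac_critic_loss(critic, s, a, q_target, q_m, alpha=0.1):
    """Eq. (9): convex combination of the TD term and the episodic MC term."""
    q = critic(s, a)
    td_term = (q - q_target).pow(2).mean()      # standard target-network term
    episodic_term = (q - q_m).pow(2).mean()     # penalty toward stored MC returns
    return (1 - alpha) * td_term + alpha * episodic_term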

4.2 Alleviating Q-value Overestimation

Q-value overestimation is a common issue in Q-learning based methods [Thrun and Schwartz, 1993; Lillicrap et al., 2015]. In the discrete action setting, the value estimate is updated in a greedy fashion from a suboptimal Q-function containing some error, y = r + γ max_{a′} Q(s′, a′). As a result, the maximum over the actions, together with its error, will generally be greater than the true maximum [Thrun and Schwartz, 1993]. This bias is then propagated through the Bellman residual, resulting in a consistent overestimation of the Q-function. The work of [Fujimoto et al., 2018] studies the presence of the same overestimation in the actor-critic setting and proposes a clipped variant of Double Q-learning that reduces it. In our work we show that the additional episodic Q-estimate in the critic loss can also serve as a tool to reduce overestimation bias.

Figure 3: Q-value overestimation. (a) Comparison between the predicted Q-value and its true estimate. (b) Comparison between the stored episodic MC estimate and the true estimate.

Following the procedure described in [Fujimoto et al., 2018], we perform an experiment that measures the Q-value predicted by the critic against a true Q-estimate. We compare the value estimate of the critic, the true value estimate, and the episodic MC return stored in the memory module. As the true value estimate we use the discounted return of the current policy starting from a state sampled randomly from the replay buffer. The discounted return is calculated from true rewards for a maximum of 1000 steps, or fewer if the episode ends earlier. We measure the true value estimate every 5000 steps during training of 100000 steps, and use batch averages for the critic’s Q-value estimate and the episodic return. The learning behaviour on the Walker2d-v3 and Hopper-v3 domains is shown in Figure 3. In (a) we show that the problem of Q-value overestimation exists for both TD3 and EMAC, as both Q-value predictions are higher than their true estimates; the Q-value prediction of EMAC shows less tendency toward overestimation. In (b) we compare the true value estimate with the episodic MC return and show that the latter behaves more realistically than the critic’s estimate. Here, the difference between the true estimate and the MC return is that episodic returns are obtained from past suboptimal policies, whereas the true value estimate is calculated using the current policy. The training curves in (b) show that the MC return is lower than the true estimate.
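A sketch of how such a true value estimate can be computed is given below. It assumes a gym-style environment with a hypothetical env.set_state(s) helper for restoring a sampled state, and an actor callable that maps an observation to an action; neither name comes from the paper.

def true_value_estimate(env, actor, states, gamma=0.99, max_steps=1000):
    """Monte-Carlo estimate of the true Q-value of the current policy,
    averaged over start states sampled from the replay buffer."""
    estimates = []
    for s in states:
        obs = env.set_state(s)          # hypothetical helper: restore the sampled state
        ret, discount, done, step = 0.0, 1.0, False, 0
        while not done and step < max_steps:
            obs, r, done, _ = env.step(actor(obs))
            ret += discount * r
            discount *= gamma
            step += 1
        estimates.append(ret)
    return sum(estimates) / len(estimates)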

The experiment shows that episodic Monte-Carlo returns from the past are consistently more pessimistic than the corresponding critic values, which makes incorporating the MC return into the objective beneficial for training. As a result, the proposed objective with the episodic MC return shows less tendency toward overestimation than the TD3 model. Given a state-action pair, the MC return produced by a suboptimal policy for that state may be used as a second, stabilizing estimate of the critic. Therefore, the additional loss component, an MSE between the episodic return and the critic estimate, may be interpreted as a penalty for critic overestimation.

Algorithm 1 EMAC

1: Initialize actor network π_φ and critic network Q_θ
2: Initialize target critic network Q_{θ′}
3: Initialize replay buffer B and episodic buffer B_E
4: Initialize memory module M
5: for t = 1 to T do
6:   Select action a ∼ π_φ(s)
7:   Receive next state s′, reward r, done flag d
8:   Store (s, a, r, s′, d) in B_E
9:   if d = True then
10:    for i = 1 to |B_E| do
11:      Calculate discounted episode return R_i
12:      Store (s_i, a_i, r_i, s′_i, d_i) in B
13:      Add ([s_i, a_i], R_i) to M
14:    end for
15:    Free B_E
16:  end if
17:  Sample mini-batch (s_b, a_b, r_b, s′_b, d_b) from B
18:  Q_M ← Lookup(s_b, a_b) from M
19:  Update critic Q_θ by minimizing the loss:
20:    J_{Q_θ} = (1 − α)(Q − Q′)^2 + α(Q − Q_M)^2
21:  Update policy π_φ
22:  Update target critic: θ′ ← τθ + (1 − τ)θ′
23: end for
24: return φ

4.3 Episodic-based Experience Replay Prioritization

Prioritization of the sampled experiences for the off-policy update of the Q-function approximation is a well-studied tool for improving sample efficiency and agent stability [Schaul et al., 2015; Wang et al., 2016; Kapturowski et al., 2018]. The common approach is formulated in the work of [Schaul et al., 2015], where the normalized TD-error is used as the criterion of transition importance during sampling. We propose a slightly different prioritization scheme that uses the episodic return as the criterion for sampling preference. We formulate the probability of sampling a transition as

P(i) = p_i^β / \sum_k p_k^β,  (10)

where the priority p_i of transition i is the MC return stored in the memory module. The exponent β controls the degree of prioritization, with β = 0 corresponding to the uniform, non-prioritized case. The interpretation of such prioritization is an increased reuse of transitions that gave high returns in the past. Shifting the reward distribution used for the off-policy update may result in divergent behaviour, so large values of β can destabilize training. In our experiments a coefficient value of 0.5 gave promising results, improving on the non-prioritized version of the algorithm.
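A minimal sketch of this sampling scheme (Eq. 10) is shown below; it assumes the stored MC returns used as priorities are non-negative (e.g., shifted by their minimum), which is our assumption rather than a detail stated in the paper.

import numpy as np

def sample_indices(mc_returns, batch_size, beta=0.5, rng=np.random):
    """Episodic-return-based prioritized sampling (Eq. 10).

    beta = 0 recovers uniform sampling over the replay buffer.
    """
    p = np.asarray(mc_returns, dtype=np.float64) ** beta
    probs = p / p.sum()
    return rng.choice(len(probs), size=batch_size, p=probs)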


Figure 4: Evaluation results on OpenAI Gym Benchmark.

Environment                | EMAC             | EMAC-NoPr       | DDPG    | TD3     | SAC
Walker2d-v3                | 2236.88 ±808     | 2073 ±768.88    | 129.73  | 1008.32 | 1787.28
Hopper-v3                  | 1969.16 ±1142.58 | 1837.3 ±794.17  | 1043.78 | 989.96  | 2411.02
Swimmer-v3                 | 80.53 ±33.07     | 62.56 ±25.78    | 27.45   | 38.91   | 56.7
InvertedPendulum-v2        | 1000             | 1000            | 909.98  | 1000    | 1000
InvertedDoublePendulum-v2  | 9332.56 ±11.78   | 9327.8 ±11.2    | 9205.81 | 9309.73 | 9357.17

Table 1: Average return over 10 trials of 200000 time steps. ± corresponds to one standard deviation over the 10 trials.

The stored sampling probabilities are recalculated each time a new episode enters the memory module and are subsequently used when sampling from the off-policy replay buffer.

5 Experiments

The EMAC algorithm is summarized in Algorithm 1. We compare our algorithm with the model-free off-policy algorithms DDPG [Lillicrap et al., 2015], TD3 [Fujimoto et al., 2018], and SAC [Haarnoja et al., 2018a]. For DDPG and TD3 we use the implementations of [Fujimoto et al., 2018]; for SAC we implemented the model following [Haarnoja et al., 2018a]. All networks have the same architecture in terms of the number of hidden layers, non-linearities, and hidden dimension sizes. In our experiments we focus on a small-data regime, giving each algorithm 200000 time steps in an environment.

We evaluate our algorithm on a set of OpenAI gym domains [Dhariwal et al., 2017]. Each environment is run for 200000 time steps with the corresponding number of network update steps. Evaluation is performed every 1000 steps, and the reported value is the average over 10 evaluation episodes from different seeds without any exploration. We report results for both the prioritized (EMAC) and non-prioritized (EMAC-NoPr) versions of the algorithm. The results are reported over 10 random seeds; the values in Table 1 are the average return of the last training episode over all seeds. The training curves of the compared algorithms are presented in Figure 4. Our algorithm outperforms DDPG and TD3 on all tested environments and SAC on three out of five environments.

Network parameters are updated with the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 0.001. All models consist of two hidden layers of size 256 for both the actor and the critic, with rectified linear unit (ReLU) non-linearities. For the first 1000 time steps we do not use the actor for action selection and instead choose actions randomly for exploration.
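The stated sizes correspond to networks of roughly the following shape (a PyTorch sketch based only on the dimensions above; the tanh output scaling on the actor is our assumption for bounded action spaces, not a detail given in the paper):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # assumed bounded actions
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

# Both networks are trained with Adam at a learning rate of 0.001, e.g.
# torch.optim.Adam(actor.parameters(), lr=1e-3)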

During architecture design we study the dependency between the size of the projected state-action representation and the final metrics. We observe a small increase in efficiency with larger projected dimensions. Training curves for projection sizes of 4, 16, and 32 on the Walker2d-v3 and Hopper-v3 domains are depicted in Figure 5. Unfortunately, a larger projection size tends to slow down the lookup operation, so in all our experiments we use the minimal projection size of 4 for faster computation. The work of [Pritzel et al., 2017] leverages a KD-tree data structure to store and search experiences. In contrast, we keep the table-based structure of the stored experiences but place the module on a CUDA device to perform vectorized l2-distance calculations. All our experiments are performed on a single NVIDIA 1080 Ti card.

Figure 5: Performance of EMAC with different sizes of projected state-action pairs.

Figure 6: Comparison of the prioritized (EMAC) and non-prioritized (EMAC-NoPr) versions.

Due to the low-data regime we are able to store all incoming transitions without replacement, both in the replay buffer and in the episodic memory module. We set the episodic memory size to 200000 transitions, although our experiments do not indicate a loss in performance with smaller sizes of 100 and 50 thousand records. The parameter k, which determines the number of nearest neighbours used in the weighted episodic return calculation, is set to 1 or 2, depending on which gives the better result. We notice that k larger than 3 leads to worse performance.
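The vectorized, GPU-resident lookup mentioned above can be sketched as follows (PyTorch, illustrative only; tensor shapes and names are our choice):

import torch

def gpu_lookup(keys, returns, queries, k=2, eps=1e-3):
    """Vectorized l2-distance lookup for a whole mini-batch on a CUDA device.

    keys:    (N, u) stored projected state-action pairs
    returns: (N,)   stored MC returns
    queries: (B, u) projected queries for the current mini-batch
    """
    # Squared l2 distances between every query and every stored key (Eq. 6).
    d = torch.cdist(queries, keys, p=2).pow(2) + eps        # (B, N)

    # k nearest neighbours per query, weighted by a softmax over -distance (Eqs. 7-8).
    top_d, idx = torch.topk(d, k, dim=1, largest=False)     # (B, k)
    w = torch.softmax(-top_d, dim=1)                        # (B, k)
    return (w * returns[idx]).sum(dim=1)                    # (B,)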

To evaluate the episodic-based prioritization we compare the prioritized and non-prioritized versions of EMAC on multiple environments, with the prioritization parameter β set to 0.5. The learning behaviour of the two versions is shown in Figure 6, and the average returns for both are provided in Table 1.

6 Discussion

We present Episodic Memory Actor-Critic (EMAC), a deep reinforcement learning algorithm that exploits episodic memory in continuous control problems. EMAC uses a non-parametric data structure to store and retrieve experiences from the past. The episodic memory module is involved in the training process via an additional term in the critic’s loss. The motivation behind this approach is that the actor depends directly on the critic, so improving the critic’s quality yields a stronger and more efficient policy. We do not exploit episodic memory during policy evaluation, which means the memory module is used only within the network update step. The loss modification demonstrates a sample-efficiency improvement over the DDPG baseline. We further show that introducing prioritization based on the episodic memories improves our results. An experimental study of Q-value overestimation shows that the proposed approach has less tendency toward critic overestimation, thus providing faster and more stable training.

Our experiments show that leveraging episodic memory gives superior results compared with the baseline algorithms DDPG and TD3 on all tested environments, and also outperforms SAC on three out of five environments. We hypothesize that the applicability of the proposed method depends on environment complexity. In particular, we struggle to outperform SAC on an environment as complicated as Humanoid, which has a larger action space than the other OpenAI gym domains.

We now outline directions for future work. As noted in [Daw et al., 2005], both the animal and the human brain exploit multiple memory systems while solving various tasks. Episodic memory is often associated with hippocampal involvement [Blundell et al., 2016; Lin et al., 2018] as a long-term memory. Although the gradient-update-based policy may be seen as a working memory, it may be beneficial to study the role of a separate short-term memory mechanism alongside episodic memory for better decision making. Another interesting direction is to use a different state-action representation for storing experiences. Although the random projection transfers the distance relations from the original space to the smaller one, it does not capture topological similarity between the state-action records. One possible way to overcome this issue is to use differentiable embeddings for the state-action representation. Unfortunately, the changing nature of such embeddings entails constant or periodic memory updates, which is itself an engineering challenge. We believe that our work demonstrates the benefit of using episodic memory for continuous control tasks and opens further research directions in this area.

References

[Bellemare et al., 2013] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[Blundell et al., 2016] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

[Daw et al., 2005] Nathaniel D. Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704–1711, 2005.

[Dhariwal et al., 2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines. https://github.com/openai/baselines, 2017.

[Fujimoto et al., 2018] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[Haarnoja et al., 2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[Haarnoja et al., 2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[Johnson and Lindenstrauss, 1984] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

[Kapturowski et al., 2018] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lengyel and Dayan, 2007] Mate Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. Advances in Neural Information Processing Systems, 20:889–896, 2007.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Lin et al., 2018] Zichuan Lin, Tianqi Zhao, Guangwen Yang, and Lintao Zhang. Episodic memory deep Q-networks. arXiv preprint arXiv:1805.07603, 2018.

[Lin, 1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

[Pritzel et al., 2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.

[Schaul et al., 2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. 2014.

[Thrun and Schwartz, 1993] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

[Vikbladh et al., 2017] Oliver Vikbladh, Daphna Shohamy, and Nathaniel Daw. Episodic contributions to model-based reinforcement learning. In Annual Conference on Cognitive Computational Neuroscience, CCN, 2017.

[Wang et al., 2016] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[Zhang et al., 2019] Zhizheng Zhang, Jiale Chen, Zhibo Chen, and Weiping Li. Asynchronous episodic deep deterministic policy gradient: Toward continuous control in computationally complex environments. IEEE Transactions on Cybernetics, 2019.
