
Visual Goal-Directed Meta-Learning with Contextual Planning Networks

Corban G. Rivera, David A. Handelman
Intelligent Systems Center
Johns Hopkins Applied Physics Lab
Laurel, MD 20723
[email protected]

Abstract— The goal of meta-learning is to generalize to new tasks and goals as quickly as possible. Ideally, we would like approaches that generalize to new goals and tasks on the first attempt. Toward that end, we introduce contextual planning networks (CPN). Tasks are represented as goal images and used to condition the approach. We evaluate CPN along with several other approaches adapted for zero-shot goal-directed meta-learning. We evaluate these approaches across 24 distinct manipulation tasks using Metaworld benchmark tasks. We found that CPN outperformed several approaches and baselines on one task and was competitive with existing approaches on others. We demonstrate the approach on a physical platform on Jenga tasks using a Kinova Jaco robotic arm.

I. INTRODUCTION

When confronted with a new task, humans are able to generalize from previously learned skills and experience to perform new tasks on the first attempt. Equipping artificial intelligence (AI) agents with similar capabilities would increase their value as robotic teammates by enabling "improvisation" of problem solutions. If an AI could learn to transfer learned skills to new tasks in a way similar to humans, it would make the robotic teammate much more useful and trusted as a partner. It is well documented that state-of-the-art AI struggles when presented with a limited number of training examples of new tasks, let alone succeeding in a "zero-shot" task with no previous experience attempting it [1]. One key challenge is learning abstractions and modularity that will allow novel tasks to be completed on the first attempt.

There are several related problems that have received a lot of attention in the literature, including meta-learning and meta-imitation learning, which we describe briefly below.

Meta-learning Meta-learning, or "learning to learn," has a rich history in machine learning literature [2], [3], [4], [5]. Given a family of tasks, training experience for each meta-training task, and performance measures, an algorithm is capable of learning to learn if its performance improves with both additional experience and tasks.

Meta-learning generally refers to the problem of quickly adapting to new tasks given prior experience on dissimilar tasks. These methods frequently include an outer loop which is used to select parameters for an inner loop which is then used to solve the task [2], [4]. Some approaches rely on conditioning networks on a task embedding [6]. Other approaches optimize directly for their ability to fine tune the model quickly to a library of known tasks [7], [8].

Fig. 1. Metaworld is a manipulation benchmark that includes 50 distinct tasks for multi-task learning and meta-RL experiments. The tasks include distinct objects and affordances. We adapted the Metaworld benchmark tasks to create 24 zero-shot meta-learning from visual demonstration tasks for evaluation. The use of visual observations and goals is slightly more challenging than the existing Metaworld benchmark formulations, which are based on coordinate-based observations and goals.

Meta-imitation learning Imitation learning allows autonomous agents to learn a skill from demonstrations. Neural-network-based policies can provide a basic level of generalization to new scenarios; however, training these policies may require many demonstrations. Research into few-shot learning and rapid adaptation has focused on decreasing the number of demonstrations or trials of the test task needed to successfully perform the new task [9], [7]. In the limit, zero demonstrations of the new task are provided. This is a special case of zero-shot learning in the policy domain, which has become an emerging sub-field within meta-learning [10]. Meta-imitation learning refers to meta-learning approaches that use imitation learning during the meta-training phase.

Contextual Planning Networks The problem that we are solving here is distinct from the standard formulations of meta-learning and meta-imitation learning in that we aim to complete a meta-test task on the first attempt, assuming we are given demonstrations of the meta-training tasks and a single image of the completion of the meta-test task. To solve this problem, we introduce an approach called contextual planning networks (CPN) that learns a combined representation of the policy and dynamics using an objective that directly optimizes for the ability to plan towards a goal state represented as an image. Our approach draws inspiration from several previous approaches [11], [6], [8] and builds upon them. The approach combines planning by backpropagation with task embedding and neuromodulation.

Task Objectives given as an Image We focus on approaches that represent their goal as an image. The idea of using an image as a goal has been studied previously [12], [13]. An image provides a flexible format that can easily accommodate diverse tasks. The challenge comes from finding representations of the goal image that allow the policy to accomplish the test task. Using an image as a task goal has unique challenges, as many pixel-level details of the image may be irrelevant or misleading for the task. Instead, many approaches have explored latent representations of the images that are intended to capture the salient aspects of the task from the image [13], [11]. To discover these latent representations, prior work has mostly focused on unsupervised objectives or objectives that are disconnected from learning the policy [13], [14].

Planning by Backpropagation The strength of planning for the purpose of meta-learning is the ability to transfer an existing policy towards a new task and potentially complete a previously unseen goal.

Planning through latent space with gradient descent is a core concept shared by several previous works, in which the same embedding is used for both the goal and the current state. The basic idea is that the distance between the latent representation of the current state and the latent representation of the goal could be exploited for planning purposes [10]. The idea can be extended by using a dynamics model to predict the latent state resulting from an action. Prior work [11], [15], [16], [17], [18] has demonstrated the value of using the dynamics model to unroll the policy over a planning horizon with the goal of minimizing the distance between the latent representations of the final predicted state and the goal.

The primary contribution of this paper is a visual goal-directed meta-learning approach that allows existing policies to be repurposed for new tasks with a chance of succeeding on the first attempt. Our experiments show that the approach improves the task success rate on the first attempt for one task in the Metaworld [19] task library and is competitive with recent approaches and baseline algorithms for many others. In a later experiment, we show that the approach works on a physical robot using a Kinova Jaco robotic arm and tasks based on the game Jenga.

II. RELATED WORK

Describing tasks by images has been studied by several authors [14], [13]. These approaches do not focus on zero-shot goal-directed task transfer. Other work used an imitation learning objective to optimize a representation for a goal image [20], [21]. In contrast, our work focuses on both imitation learning and planning. Single-shot imitation learning via images has also been explored [22], [23], [1], [24]. Single-shot visual imitation learning approaches are related in the sense that the final image of the single demonstration could be thought of as the goal image.

Meta-learning in the context of reinforcement learning is referred to as Meta-RL [25], [26], [7], [8], [6]. These methods are related in that they evaluate approaches on their ability to learn new tasks with as few steps from the new task as possible. They are different in that they require environments with extrinsic rewards. In contrast, the work here is based on meta-imitation learning, although inverse reinforcement learning (IRL) [27], [28] could be used to approximate an objective function that would allow Meta-RL approaches to be applied to learning from demonstrations. Instead of coupling IRL with Meta-RL, in our experiments we adapted concepts from Meta-RL approaches into the visual learning-from-demonstration context, explicitly for the purpose of comparison with a baseline approach that conditions the policy on both a state and a task embedding [6].

The earliest work on planning by gradient descent was introduced decades ago [29] and was based on known model dynamics. Later work introduced planning with learned dynamics models [30]. Model-based planning was also explored for environments with discrete actions [31]; this approach relies on unsupervised pretraining. Others have explored using planning through latent space with the goal of predicting a value function [15], [16], [17]. These approaches focus on environments with extrinsic rewards. Other work has looked at approaches for goal-directed imitation learning [18] conditioned on a sequence of images of the test task.

III. GOAL-DIRECTED ZERO-SHOT META-LEARNING

We first introduce some meta-learning preliminaries, then we present the problem statement before describing the approach and implementation.

A. PRELIMINARIES

Meta-learning, or "learning to learn," aims to complete new tasks with very little (if any) training data. To achieve this, a meta-learning algorithm is first pretrained on a repertoire of tasks $T_i$ in a meta-train phase. The algorithm is then evaluated on a disjoint task $T_j$. As part of meta-training, the meta-learning algorithm can learn generalizable structure between tasks that allows it to successfully complete the meta-test task.

A task $T_i$ is a finite-horizon Markov decision process (MDP), $\{S, A, r_i, P_i, H\}$, with state space $S$ given as images, action space $A$, reward function $r_i : S \times A \to \mathbb{R}$, dynamics $P_i(s_{t+1} \mid s_t, a_t)$, and horizon $H$. It is assumed that the environment dynamics and goals may vary across tasks.

We explore a subset of reward specifications $R$ corresponding to the mean squared error $\|f_\phi(s_t) - f_\phi(s_g)\|^2$ between the embeddings of the state $s_t \in S$ at time $t$ and a goal $s_g \in S$, where $f_\phi$ is an image embedding function with parameters $\phi$.
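
To make this reward specification concrete, below is a minimal PyTorch sketch (not the authors' code) of the latent goal-distance objective. The `Encoder` module is a hypothetical stand-in for $f_\phi$, and its layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the image embedding function f_phi.
class Encoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, img):                  # img: (B, 3, 84, 84)
        return self.fc(self.conv(img))

def latent_goal_distance(f_phi, s_t, s_g):
    """||f_phi(s_t) - f_phi(s_g)||^2, averaged over the batch."""
    return ((f_phi(s_t) - f_phi(s_g)) ** 2).sum(dim=-1).mean()
```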

B. PROBLEM STATEMENT

The goal is to meta-train an agent such that it is more successful at performing a new test task $T_j$ on the first attempt, given a goal $g_j$ for test task $j$ presented as an image.

Meta-train: The agent observes and learns from $m = 1, \ldots, M$ task demonstrations.

Meta-test: The pretrained agent is given a new goal $g_j$ that corresponds to the goal for the test task $T_j$.

C. UNIVERSAL PLANNING NETWORKS

The goal of this meta-learning approach is to directly optimize for plannable representations [11]. The approach is inspired by other well-known approaches like MAML [7] that optimize a model for its ability to quickly fine-tune to a new task. The UPN approach is illustrated in Figure 2. The model produces an action plan $\hat{a}_{t:t+H}$ conditioned on initial and goal observations $o_t$ and $o_g$, where $\hat{a}_t$ denotes the predicted action at time $t$ over horizon $H$. Both the encoder $f_\phi$ and the combined policy and dynamics model $g_\theta$ are approximated with neural networks and are fully differentiable. The inner loop unrolls predictions of intermediate states $x_{t:t+H}$ and actions $\hat{a}_{t:t+H}$ over the horizon $H$. The objective of the inner loop is to minimize the distance between the latent representation of the goal $x_g$ and the final predicted state $x_{t+H}$ using the Huber loss. Updates to the model using this loss result in updated actions $\hat{a}_{t:t+H}$. In the learning-from-demonstrations scenario, the outer loop uses a behavior cloning objective that minimizes the distance, over all time points, between the predicted action $\hat{a}_t$ and the given action $a_t$ using a mean squared error loss. The planning horizon $H$ and the number of inner-loop updates $U$ are hyperparameters of the approach. In our experiments, hyperparameters are set such that the predicted state embedding at the end of the horizon is as close as possible to the embedding of the goal.

Fig. 2. Universal planning networks architecture. The architecture can be described as an inner loop and an outer loop. The objective of the inner loop is to minimize the distance in latent space between the final rolled-out state prediction and the latent representation of the goal. For learning from demonstrations, the outer loop is the behavior cloning objective, which minimizes the distance between the predicted and observed actions.
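
As a concrete illustration of the inner and outer loops, here is a minimal PyTorch sketch under assumed interfaces for $f_\phi$ and $g_\theta$; the function names, optimizer choice, and learning rate are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Assumed interfaces: f_phi(image) -> latent, g_theta(latent, action) -> next latent.
def plan_actions(f_phi, g_theta, o_t, o_g, horizon=5, inner_updates=1,
                 action_dim=4, lr=0.1):
    x_t, x_g = f_phi(o_t), f_phi(o_g)
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(inner_updates):
        x = x_t
        for a in actions:                       # unroll the learned dynamics
            x = g_theta(x, a)
        plan_loss = F.huber_loss(x, x_g)        # final latent vs. goal latent
        opt.zero_grad()
        plan_loss.backward(retain_graph=True)   # gradients flow into the plan
        opt.step()
    return actions

def behavior_cloning_loss(planned_actions, demo_actions):
    # Outer-loop objective during meta-training: match the planned actions
    # to the demonstrated actions with a mean squared error loss.
    return F.mse_loss(planned_actions, demo_actions)
```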

D. NEUROMODULATED META-LEARNING

Neuromodulation in deep networks is inspired by the plasticity response observed for human neurons. In our approach, this is achieved by learning and applying attenuation to network weights directly, conditioned on the input [8]. This is in contrast to commonly used notions of attention that learn and apply attenuation directly to the input, conditioned on the input [32]. Figure 3 illustrates the network architecture of the combined dynamics and policy network.

Fig. 3. (a) Network architecture of the observation and goal embedding. (b) Structure of the combined policy and dynamics model. A key structural difference between this model and an MLP architecture is the introduction of attention conditioned on the input that attenuates the weights of the model directly.

Given fully connected layers of the form $y = ax + b$, we introduce weight attenuation functions $u_\beta(x)$ and $v_\gamma(x)$ conditioned on the input $x$. The function $u_\beta(x)$ is parameterized by $\beta$ and yields a tensor that matches the shape of $a$ with values in the range $[0, 1]$. The bias attenuation function $v_\gamma(x)$ is parameterized by $\gamma$ and yields a tensor that matches the size of $b$, also with values in the range $[0, 1]$. The neuromodulated fully connected layer computes $y = (u_\beta(x) \odot a)x + (v_\gamma(x) \odot b)$, where $\odot$ denotes element-wise multiplication.

In our experiments, both $u_\beta(x)$ and $v_\gamma(x)$ are approximated using a 2-layer fully-connected neural network with an intermediate rectified linear activation and a terminal sigmoid activation.
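
The following PyTorch sketch shows one way such a neuromodulated fully connected layer could be implemented; the module name, hidden width, and initialization are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

# Computes y = (u_beta(x) ⊙ a) x + v_gamma(x) ⊙ b, where u and v are small
# networks with sigmoid outputs that attenuate the weights and bias.
class NeuromodulatedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=32):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # u_beta: [0, 1] mask with the same shape as the weight matrix
        self.u = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim * in_dim), nn.Sigmoid())
        # v_gamma: [0, 1] mask with the same shape as the bias
        self.v = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, in_dim)
        out_dim, in_dim = self.weight.shape
        w_mask = self.u(x).view(-1, out_dim, in_dim)
        w = w_mask * self.weight                            # per-example attenuation
        y = torch.einsum("boi,bi->bo", w, x)                # (u(x) ⊙ a) x
        return y + self.v(x) * self.bias                    # + v(x) ⊙ b
```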

E. TASK-EMBEDDING META-LEARNING

Task embedding in reactive networks has been explored for meta-learning [6], [18]. A recent approach called PEARL [6] extends the typical policy formulation $p(a \mid s)$ by additionally conditioning on a latent representation of the task, $p(a \mid s, x_g)$. An additional KL-divergence objective is used to organize the task embedding into a desired distribution. This approach was demonstrated in the context of reinforcement learning. We adapted this approach to learn from demonstration by adopting a behavior-cloning objective. In experiments, we refer to this baseline approach as TE-BC, for task-embedding behavior cloning.
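
A rough sketch of the TE-BC baseline as adapted here, assuming a shared convolutional encoder like the one sketched earlier; the optional KL term on the task embedding is omitted, and all names and sizes are illustrative rather than drawn from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reactive policy conditioned on embeddings of both the observation and the
# goal image, trained with a behavior-cloning loss.
class TaskEmbeddingBCPolicy(nn.Module):
    def __init__(self, encoder, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = encoder                 # shared between obs and goal
        self.head = nn.Sequential(
            nn.Linear(2 * latent_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, obs_img, goal_img):
        x = torch.cat([self.encoder(obs_img), self.encoder(goal_img)], dim=-1)
        return self.head(x)

def bc_loss(policy, obs_img, goal_img, demo_action):
    return F.mse_loss(policy(obs_img, goal_img), demo_action)
```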

F. CONTEXT-AWARE PLANNING NETWORKS

In this paper, we introduce context-aware planning networks, which extends the ideas from universal planning networks illustrated in Figure 2 with neuromodulation illustrated in Figure 3(b) and visual task embedding illustrated in Figure 3(a). Task embedding is integrated by modifying the combined policy and dynamics model to additionally condition on $x_g$, as shown in Figure 3(b). Neuromodulation is integrated into the full model by replacing the fully connected layers used to approximate $g_\theta(x_t, x_g)$ with neuromodulated fully-connected layers as described earlier. The combined dynamics and policy network has a shared trunk that passes the image embedding to a linear layer. After the shared trunk, the network splits into dynamics and policy branches. The policy branch is composed of linear layers whose output size matches the number of continuous actions. The dynamics branch is composed of two linear layers and is also conditioned on the action prediction. In experiments with Metaworld, the number of continuous actions is 4 and the internal linear layers are 32 units wide.
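
The structural description above could be realized along the following lines; this is a sketch, not the released model. Plain linear layers stand in for the neuromodulated layers sketched earlier, and the layer widths mirror the 32-unit, 4-action configuration mentioned in the text.

```python
import torch
import torch.nn as nn

# Shared trunk over the state and goal embeddings, a policy branch producing
# the continuous actions, and a dynamics branch conditioned on the predicted
# action. In CPN these linear layers would be neuromodulated.
class PolicyDynamicsModel(nn.Module):
    def __init__(self, latent_dim=32, action_dim=4, hidden=32):
        super().__init__()
        self.trunk = nn.Linear(2 * latent_dim, hidden)          # [x_t, x_g] -> h
        self.policy = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, action_dim))
        self.dynamics = nn.Sequential(nn.Linear(hidden + action_dim, hidden),
                                      nn.ReLU(),
                                      nn.Linear(hidden, latent_dim))

    def forward(self, x_t, x_g):
        h = torch.relu(self.trunk(torch.cat([x_t, x_g], dim=-1)))
        a_t = self.policy(h)                                    # predicted action
        x_next = self.dynamics(torch.cat([h, a_t], dim=-1))     # predicted next latent
        return a_t, x_next
```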

IV. RESULTS

We designed our experiments to answer the following questions: (i) Conditioned on a goal image, how effective is the CPN approach at completing meta-test tasks on the first attempt compared to UPN, imitation learning, and task-embedding approaches? (ii) Can the approach work on a physical robot?

A. Methods for comparison

We compared to two reactive imitation learning approaches and two planning approaches, along with a random control. We included a behavior cloning based approach (BC), which uses a sequence of convolutional layers to encode visual observations and output actions. We compare to a Task Embedding Behavior Cloning (TE-BC) approach that extends the behavior cloning model by additionally conditioning on an embedding of the goal image. In this approach, both the observation and the goal image are processed by the same learned convolutional encoding network.

Additionally, we compare to two approaches that plan over a finite horizon by gradient descent. The first approach is based on the Universal Planning Network with a horizon of 5 (UPN) [11]. We also compare to an approach that extends UPN with neuromodulation for the integrated policy and dynamics model along with task embedding (TE-CPN). A single backpropagation update step was used by each planning algorithm to plan towards the goal. For these experiments, the planning horizon was fixed at 5. Both planning-based approaches were evaluated using the model predictive control strategy described by Srinivas et al. [11], which amounts to replanning actions over a fixed horizon at each time step but only taking the first action in the plan.
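
For concreteness, a minimal sketch of that model-predictive control evaluation loop is shown below; `plan_actions` refers to the earlier planning sketch, and the step budget and Gym-style environment interface (returning observations already formatted as image tensors and a "success" flag in `info`) are assumptions rather than details from the paper.

```python
# Replan a fixed-horizon action sequence toward the goal at every step,
# but execute only the first action of each plan.
def run_episode(env, f_phi, g_theta, goal_img, max_steps=150, horizon=5):
    obs = env.reset()
    success = False
    for _ in range(max_steps):
        plan = plan_actions(f_phi, g_theta, obs, goal_img,
                            horizon=horizon, inner_updates=1)
        obs, _, done, info = env.step(plan[0].detach().numpy())
        success = success or bool(info.get("success", False))
        if done or success:
            break
    return success
```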

B. Metaworld tasks

We structured our experiments around 24 distinct visuomotor tasks derived from the Metaworld benchmark [19]. Two example tasks are shown in Figure 1. The tasks contain different objects with different affordances. Additionally, each task contains selectable sub-task variation. The original Metaworld benchmarks were designed for multitask and meta-RL with vector representations of states and goals. We adapted the Metaworld benchmark tasks to make them suitable for approaches that learn from demonstration with visual observations and goals. The introduction of visual observations adds additional complexity that is not part of the existing Metaworld experiments.

C. Experiment design

We adopted a cross-validation structure for the experiment. Specifically, to evaluate the first-attempt success rate on a meta-test task, the meta-training set consisted of all other tasks.

Training data For each of the 24 tasks, we generated 100 successful demonstrations using heuristic control. For each trial, the position of key objects was randomized within preconfigured bounds. Trajectories in the dataset consisted of 84x84x3 images of each state, including the final state, which was treated as the goal image. Each approach was trained for 50 epochs.

Evaluation For the task being evaluated, we measured the average task success rate over 20 trials. We controlled the random seed to ensure that each approach was evaluated on the same distribution of task variants.

The results of the experiments are summarized in Figure 4. We found that 7 of the 24 evaluation tasks had at least one approach that was able to complete the task on the first attempt. For the door-lock task, CPN and UPN achieved over 50% success on the first attempt, while the other, reactive approaches performed close to chance. For the drawer-close, dial-turn, and coffee-button tasks, all of the approaches performed well above chance.

D. Jenga domain on a physical robot

We wanted to test the ability of CPN to extrapolate to novel goal targets on physical hardware. Kinova Jaco arms and an oversized Jenga tower were used for the experiment. The task was to demonstrate the ability to poke target blocks in the tower in positions that were offset along an axis not seen during training.

Heuristic behaviors were used to create a dataset of 21 trajectories of the Jaco arm poking a block on the lower row from different starting positions. Figure 5 illustrates the experimental setup. In this experiment, CPN was reconfigured to be conditioned on a block-position goal represented as a three-dimensional vector. The horizon length and number of planning updates were selected to minimize the deviation between the predicted endpoint of the trajectory and the target block position. In Figure 5 and the supplementary video, we show that the approach was able to plan towards multiple targets that were offset along an axis not seen during training.

Fig. 4. First-attempt success rate for several meta-learning and baseline approaches based on reactive policies and planning. We organized an experiment around the Metaworld series of manipulation tasks. For each evaluation task, the approaches were trained using 100 successful demonstrations from all of the other tasks. Each approach was trained for the same number of epochs. Average % task success is on the x-axis. Evaluated tasks are shown on the y-axis. We compare reactive imitation learning based on behavior cloning (BC), an approach that extends BC by conditioning the policy on both the observation and goal embedding (TE-BC), the universal planning network (UPN), and contextual planning networks (TE-CPN), which extends UPN with neuromodulation and task embedding. A random policy is included to illustrate the likelihood of task completion by chance.

V. DISCUSSION

Of the tasks that were solvable on the first attempt, we were surprised to find that there was no clear superior approach among the approaches that we tested. The best approach frequently varied by task. Occasionally, basic reactive approaches fared as well as more sophisticated planning-based approaches. We found that the CPN approach outperformed the other approaches in the door-lock task, and we found that CPN was competitive with the other approaches for the other tasks in the Metaworld experiment.

In the Jenga experiment, we found that CPN was successful at planning towards goals that were not seen during training. This experiment highlights the ability of the CPN approach to extrapolate to new goal targets not seen during training. The experiment represents extrapolation because the goal targets were offset along an axis that was not in the training data.

Fig. 5. Physical demonstration of planning towards novel goals in Jenga. CPN was trained using trajectories of the Jaco arm reaching towards blocks on the lower level. CPN was used to create a trajectory for a target block that was offset along an axis that was not varied during training.

While we fully appreciate that the task could be accomplished via inverse kinematics, we hope the experiment illustrates our contribution to the more fundamental challenge of zero-shot goal-conditioned task planning.

VI. ACKNOWLEDGEMENTS

We would like to thank Chace Ashcraft, I-Jeng Wang, and Marie Chau for technical and manuscript feedback.

REFERENCES

[1] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese, “Neural task programming: Learning to generalize across hierarchical tasks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.

[2] S. Thrun and L. Pratt, Learning to Learn: Introduction and Overview. Boston, MA: Springer US, 1998, pp. 3–17. [Online]. Available: https://doi.org/10.1007/978-1-4615-5529-2_1

[3] J. Schmidhuber, “Gödel machines: Fully self-referential optimal universal self-improvers,” in Artificial general intelligence. Springer, 2007, pp. 199–226.

[4] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei, “On the optimization of a synaptic learning rule,” in Preprints Conf. Optimality in Artificial and Biological Neural Networks, vol. 2. Univ. of Texas, 1992.

[5] S. Hochreiter, A. S. Younger, and P. R. Conwell, “Learning to learn using gradient descent,” in International Conference on Artificial Neural Networks. Springer, 2001, pp. 87–94.

[6] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen, “Efficient off-policy meta-reinforcement learning via probabilistic context variables,” in International Conference on Machine Learning, 2019, pp. 5331–5340.

[7] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.

[8] S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney, “Learning to continually learn,” arXiv preprint arXiv:2002.09571, 2020.

[9] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.

[10] K. Liu, T. Kurutach, C. Tung, P. Abbeel, and A. Tamar, “Hallucinative topological memory for zero-shot visual planning,” arXiv preprint arXiv:2002.12336, 2020.

[11] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn, “Universal planning networks,” arXiv preprint arXiv:1804.00645, 2018.

[12] K. Deguchi and I. Takahashi, “Image-based simultaneous control of robot and target object motions by direct-image-interpretation method,” in Proceedings 1999 IEEE/RSJ International Conference on Intelligent Robots and Systems. Human and Environment Friendly Robots with High Intelligence and Emotional Quotients (Cat. No. 99CH36289), vol. 1. IEEE, 1999, pp. 375–380.

[13] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems, 2015, pp. 2746–2754.

[14] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 512–519.

[15] J. Oh, S. Singh, and H. Lee, “Value prediction network,” in Advances in Neural Information Processing Systems, 2017, pp. 6118–6128.

[16] D. Silver, H. Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, et al., “The predictron: End-to-end learning and planning,” in International Conference on Machine Learning. PMLR, 2017, pp. 3191–3199.

[17] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al., “Mastering Atari, Go, chess and shogi by planning with a learned model,” arXiv preprint arXiv:1911.08265, 2019.

[18] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-shot visual imitation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2050–2053.

[19] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” in Conference on Robot Learning, 2020, pp. 1094–1100.

[20] P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” arXiv preprint arXiv:1612.06699, 2016.

[21] P. Sermanet, C. Lynch, J. Hsu, and S. Levine, “Time-contrastive networks: Self-supervised learning from multi-view observation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2017, pp. 486–487.

[22] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2146–2153.

[23] A. Zhou, E. Jang, D. Kappler, A. Herzog, M. Khansari, P. Wohlhart, Y. Bai, M. Kalakrishnan, S. Levine, and C. Finn, “Watch, try, learn: Meta-learning from demonstrations and reward,” arXiv preprint arXiv:1906.03352, 2019.

[24] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner, “TACO: Learning task decomposition via temporal alignment for control,” arXiv preprint arXiv:1803.01840, 2018.

[25] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.

[26] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL2: Fast reinforcement learning via slow reinforcement learning,” 2016.

[27] A. Y. Ng, S. J. Russell, et al., “Algorithms for inverse reinforcement learning,” in ICML, vol. 1, 2000, p. 2.

[28] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, p. 1.

[29] H. J. Kelley, “Gradient theory of optimal flight paths,” ARS Journal, vol. 30, no. 10, pp. 947–954, 1960.

[30] J. Schmidhuber, “An on-line algorithm for dynamic reinforcement learning and planning in reactive environments,” in 1990 IJCNN International Joint Conference on Neural Networks. IEEE, 1990, pp. 253–258.

[31] M. Henaff, W. F. Whitney, and Y. LeCun, “Model-based planning in discrete action spaces,” arXiv preprint arXiv:1705.07177, 2017.

[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.