
Learning Robot Task Planning primitives by means of Long Short-Term Memory networks

Federico Vendramin1, Elisa Tosello1, Nicola Castaman2, and Stefano Ghidoni1

Abstract— This work proposes a Long Short-Term Memory network that learns robot task planning primitives when solving TAMP problems. Given a task, the network selects the appropriate sequence of actions to be performed, along with their arguments, while facing environment uncertainty and action/motion failures. The proposed tests ask a robot to swap two cubes on a table. For this task assignment, an expert policy is proposed that generates an artificial dataset for training. An action-selection accuracy of up to 97% was obtained on sequences of the same length as the training ones; an accuracy of up to 91% was obtained on longer task sequences.

I. INTRODUCTION

This paper aims to solve Task and Motion Planning (TAMP) problems by using Deep Neural Networks (DNNs). Specifically, it focuses on Robot Learning of task planning primitives by means of Long Short-Term Memory (LSTM) networks [1].

Traditional TAMP algorithms lack generality, scalability, and completeness. They are not able to handle environment uncertainty or motion failures when executing primitive actions (e.g., a person moving in the robot workspace or an object slipping from the robot end-effector). This work attempts to face these limitations by introducing Artificial Intelligence (AI). The questions the authors try to answer are the following. Can NNs be exploited to learn a task and produce an online planner by selecting the correct action to be performed? Will this planner be able to react to changes in the environment or unexpected failures without recomputing, each time, the entire plan? Will it be able to learn how to change the task execution according to the encountered motion failures? Based on these open questions, the challenges faced and the contributions of this work follow.

A. Challenges

• Online Planning and Execution: the main goal is to use neural networks to explore the feasibility of an online task planner. The network should be able to reason about the environment and react to unexpected changes;

• Learning: since neural networks are trained end to end, the authors aim to implement a network able to learn, with supervision, how to predict the next action necessary to achieve a specific task;

• Dataset: a public dataset to use for training the network is lacking. The authors aim to fill this gap, knowing that data collection can be resource expensive if done inefficiently.

1 Federico Vendramin, Elisa Tosello, and Stefano Ghidoni are with IAS-Lab, Dept. of Information Engineering, University of Padova, Padova, Italy. [email protected], [email protected], [email protected]

2 Nicola Castaman is with IAS-Lab, Dept. of Information Engineering, University of Padova, Padova, Italy and IT+Robotics srl, Vicenza, Italy. [email protected]

B. Contributions

An agent able to learn how to reason about assigned tasks is proposed. This agent correctly handles environment observations by producing a sequence of actions that fulfills the assigned task. This sequence can be seen as a program that invokes robot motion primitives. Tests focus on a toy example: the swap of two cubes on a table. Experiments were executed using an artificial environment generated by an expert policy. This environment depicts the positions of the objects and the gripper status (full or empty).

• Online Planning and Execution: the proposed model can detect robotic manipulation failures such as unsuccessful grasps or wrong object placements. It reacts to these action failures by selecting the actions that restore the right action preconditions and complete the task;

• Learning: an accuracy of up to 97% has been reached when evaluating the network on sequences of the same length as the training ones (with maximum sequence length maxlen = 30). An accuracy of up to 91% has been reached when evaluating the network on longer sequences (maxlen = 100), even though these sequences were encountered for the first time;

• Dataset: an artificial dataset generator is proposed, able to produce the synthetic examples necessary for training.

II. RELATED WORK

Various researchers focus on the resolution of TAMP problems. Initial works investigated a hierarchical architecture that solves task and motion planning problems in isolation: the task planner decides what actions to perform and the motion planner decides how to realize these actions. An example is the Hierarchical Planning in the Now (HPN) [2] approach: it interleaves planning and execution, reducing search depth but requiring reversible actions when backtracking.

The main issue is that the environment is not deterministic: task and motion planners should cooperate so that only geometrically valid plans are sent to the executor. Task and motion planning integration has then been explored by implementing approaches able to search in the hybrid task-geometry space and distribute the search across the two spaces. [3], for example, extends a hierarchical task planner with geometric primitives, using shared literals that relate task-level symbols with motion-level geometric entities. [4] interfaces off-the-shelf task and motion planners using a heuristic to remove objects that potentially block the robot's path. The Robosynth framework [5] uses a Satisfiability Modulo Theories (SMT) solver ([6], [7]) to generate task and motion plans from a static roadmap, employing plan outlines to guide the planning process. FFRob [8] develops an FF-like [9] task-layer heuristic based on a lazily-expanded roadmap. Finally, the Task-Motion Kit [10] proposes an Iteratively Deepened Task and Motion Planning (IDTMP) method that uses a constraint-based task planner to compute different candidate plans. In detail, the Z3 [11] SMT solver is used as task planner: it generates alternative task plans by maintaining a stack of constraints and performing repeated satisfiability checks as constraints are pushed onto and popped from the stack. The Open Motion Planning Library (OMPL) [12] is used for motion planning and the Flexible Collision Library (FCL) [13] is used for collision checking.

Combining task and motion planning in this way still does not guarantee generality, completeness, and scalability. The aforementioned algorithms, for example, are not able to face environment uncertainty. One way to overcome these limitations involves the introduction of AI. An example is the Neural Programmer-Interpreter (NPI) [14] network: a recurrent and compositional neural network that learns how to represent and execute programs. It has three learnable components: a task-agnostic recurrent core, a persistent key-value program memory, and a domain-specific encoder that enables a single NPI to operate in multiple perceptually different environments with distinct affordances. The core is an LSTM network that at each time step selects the next program to run, conditioned on the current observation of the environment. Another example is [15], which selects sub-goals using Deep Learning in Minecraft. It interleaves perception, goals, reasoning, and acting with the visual observations of the environment captured by the in-game character's camera. Basic motion primitives compose the action plan with respect to four sub-goals: walking forward, creating stairs, removing obstacles, and bridging obstacles. The sub-goal selection is done by a CNN trained on the environment observations. Finally, [16] investigates the ability of neural networks to learn both linear temporal logic constraints and control policies in order to generate task plans in complex environments.

To guarantee the greatest possible generalization, Meta-Learning can be introduced. It allows training the model by exploiting very few significant samples of a multitude of different tasks. An example is Neural Task Programming (NTP) [17]: a robot learning framework based on NPI. This tool has been tested on robotic manipulators for tasks like block stacking and sorting. NTP interprets a task specification and instantiates a hierarchical policy as a neural program, where the lowest level of the hierarchy is made of primitive actions executed by the robot API.

III. LONG SHORT-TERM MEMORY NETWORKS

The authors decided to implement the neural network by means of Long Short-Term Memory (LSTM) networks [1]. LSTMs, in fact, address the vanishing and exploding gradient problem of other NNs: the output of an LSTM depends only on the memory status and on the current input. Moreover, LSTMs can remember long-term interactions by maintaining an internal state which depends on previous inputs. Focusing on task planning, this means that LSTMs let users work on temporal sequences, allowing a cognitive choice of the actions to be performed.

An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron that computes an activation (using an activation function) of a weighted sum. Hence the notion of "gate": every gate controls the flow of values that goes through the connections of the LSTM. In this way, LSTMs learn how to control the flow of information inside the cell and choose what to remember and what to forget.

Formally, the state $s_i^t$ of the $i$-th cell at time $t$ equals the sum of the input-gate and forget-gate contributions. The input gate $g_i^t$ can be defined as:

$$g_i^t = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^t + \sum_j W_{i,j}^g h_j^{t-1}\Big) \quad (1)$$

where $x_j^t$ is the input vector and $h_j^{t-1}$ is the hidden state vector at the previous step. In order to compute $g_i^t$, gating is performed on both $x_j^t$ and $h_j^{t-1}$: $x_j^t$ is multiplied by $U^g$, the matrix of input weights; $h_j^{t-1}$ is multiplied by $W^g$, the matrix of recurrent weights. $g_i^t$ is the sigmoid of the sum of these terms and the bias $b_i^g$. Similarly, the forget gate $f_i^t$ is defined as:

$$f_i^t = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^t + \sum_j W_{i,j}^f h_j^{t-1}\Big) \quad (2)$$

The internal state can now be defined as:

$$s_i^t = f_i^t s_i^{t-1} + g_i^t \, \sigma\Big(b_i + \sum_j U_{i,j} x_j^t + \sum_j W_{i,j} h_j^{t-1}\Big) \quad (3)$$

The authors highlight that no repeated multiplication by the same matrix $W$ is carried out: the recurrent update is given by the sum of only two components.

The output vector $h_i^t$ is consequently obtained by:

$$h_i^t = \tanh(s_i^t)\, q_i^t \quad (4)$$

$$q_i^t = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^t + \sum_j W_{i,j}^o h_j^{t-1}\Big) \quad (5)$$
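To make the gating concrete, the following minimal NumPy sketch implements one forward step according to Eqs. (1)–(5). It is an illustration, not the authors' implementation; the input size and parameter initialization are assumptions (the paper only specifies 50 hidden units per LSTM layer).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step following Eqs. (1)-(5). p holds the U, W, b
    parameters of the input (g), forget (f), cell-candidate (c)
    and output (o) paths."""
    g = sigmoid(p["bg"] + p["Ug"] @ x_t + p["Wg"] @ h_prev)                    # Eq. (1): input gate
    f = sigmoid(p["bf"] + p["Uf"] @ x_t + p["Wf"] @ h_prev)                    # Eq. (2): forget gate
    s = f * s_prev + g * sigmoid(p["bc"] + p["Uc"] @ x_t + p["Wc"] @ h_prev)   # Eq. (3): internal state
    q = sigmoid(p["bo"] + p["Uo"] @ x_t + p["Wo"] @ h_prev)                    # Eq. (5): output gate
    h = np.tanh(s) * q                                                         # Eq. (4): output vector
    return h, s

def make_params(n_in, n_hid, rng):
    """Illustrative random parameters for the four gated paths."""
    p = {}
    for gate in "gfco":
        p["U" + gate] = rng.standard_normal((n_hid, n_in))
        p["W" + gate] = rng.standard_normal((n_hid, n_hid))
        p["b" + gate] = np.zeros(n_hid)
    return p

# Toy usage: n_in is illustrative; the paper uses 50 hidden units per layer.
n_in, n_hid = 8, 50
rng = np.random.default_rng(0)
p = make_params(n_in, n_hid, rng)
h = s = np.zeros(n_hid)
h, s = lstm_step(rng.standard_normal(n_in), h, s, p)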


IV. OUR USE CASE AND NEURAL NETWORK

A. Use Case

A robot is asked to swap two cubes, A and B, on a table (see Figure 1): it should pick A from its start position (xA, yA), place it on a temporary location (xTEMP, yTEMP), pick B from (xB, yB), place it on (xA, yA), pick A again and place it on (xB, yB) (see Figure 2). The table is discretized as a chessboard where cubes can be placed in the center of every chessboard cell. Learning this task means learning the sequence of actions to be performed in order to achieve the assigned goal (see Listing 1) and learning how to recover from failures. Indeed, an action can fail, e.g., the grasped object can slip from the robot's gripper or can be wrongly placed. This implies that a recovery routine has to be performed in order to restore the right action preconditions (see Listing 2).

begin
  PickA();
  Place(xTEMP, yTEMP);
  PickB();
  Place(xA, yA);
  PickA();
  Place(xB, yB);
end;

Listing 1: The action sequence of the swapping task when no failure is perceived.

begin
  PickA();              // Failed!
  PickA();              // Rerun
  Place(xTEMP, yTEMP);
  PickB();
  Place(xA, yA);        // Failed!
  PickB();              // Pick B again, but this fails too!
  PickB();
  Place(xA, yA);
  PickA();
  Place(xB, yB);
end;

Listing 2: The action sequence of the swapping task when the robot fails to pick and place cubes: a rerun has to be performed in order to recover the correct scene.

B. Dataset

Neural networks need a dataset of examples to learn from. In the case under analysis, acquiring this data could be challenging. This is why the authors propose a Dataset Generator with an expert policy that generates useful synthetic data, composed of examples in which the assigned task is completed while managing any failures.

In detail, the expert policy generates a sequence of Input-Output pairs called Environment and Output, respectively. Environment is the observed high-level encoding of the environment; Output is the desired output of the network. The description of the environment is composed of the coordinates (xA, yA) and (xB, yB) of the two cubes, the status of the gripper (1 if full, 0 if empty), and the temporary swap position (xTEMP, yTEMP). Each coordinate is a single input to the network. In a continuous domain, this would correspond to the Input Environment of Figure 3, where every coordinate is in the range 0-1. Performed tests demonstrated that the network performs better in a discrete domain (the chessboard of Section IV-A). In this case, each coordinate is represented by 10 inputs (10 for the x coordinate and 10 for the y one) that are either 1 or 0, reflecting whether the corresponding cell of the chessboard is occupied. The output is composed of the sequence of Actions to be performed together with their Arguments, i.e., the temporary swap position.

Fig. 1: The environment used for the implementation of the assigned task: a simulated Baxter robot has to swap two cubes on a table.

Fig. 2: The assigned task: move A to a temporary position; move B to A's initial position; move A to B's initial position.
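As an illustration of the discrete encoding described above, the following sketch builds the one-hot input vector for a 10 x 10 chessboard. It is not the authors' code, and the exact input layout (ordering of the coordinates and of the gripper flag) is an assumption.

import numpy as np

GRID = 10  # 10 inputs per coordinate, as described above

def one_hot(index, size=GRID):
    """Return a size-long 0/1 vector with a single 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_environment(xa, ya, xb, yb, x_tmp, y_tmp, gripper_full):
    """Discrete observation: one-hot x/y for cube A, cube B and the
    temporary swap position, plus one unit for the gripper status."""
    coords = [xa, ya, xb, yb, x_tmp, y_tmp]
    return np.concatenate([one_hot(c) for c in coords] + [[1.0 if gripper_full else 0.0]])

# Example: A at (2, 3), B at (7, 1), temporary cell (5, 5), gripper empty.
obs = encode_environment(2, 3, 7, 1, 5, 5, gripper_full=False)
print(obs.shape)  # (61,) with this assumed layout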

The generator is able to introduce an error probability that simulates the failure of an action. E.g., the action of picking a cube can fail, meaning that the cube falls from the robot gripper. In this case the robot needs to pick up the object again, and this action is introduced in the sequence.


Fig. 3: A continuous training example. Input: (xA, yA), (xB, yB), the gripper status (1.0 if holding an object, 0 otherwise), and the temporary swap position (xTEMP, yTEMP). Output: the action to be performed together with the (x, y) arguments of the place primitive.

TABLE I: List of hyperparameters used to train the network

  Hyperparameter      Value
  Training Examples   3000
  Batch Size          4
  Epochs              Up to 150
  Optimizer           ADAM
  Loss Function       Cross-entropy

Therefore, the authors generated the dataset by introducing a 20% failure probability for each action.
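A possible reconstruction of such an expert-policy generator is sketched below. The action names follow Listings 1 and 2 and the 20% failure rate follows the text, while the recovery logic and the function interface are assumptions; pairing each step with an encoded observation (see the encoding sketch above) is omitted for brevity.

import random

FAIL_PROB = 0.20  # per-action failure probability, as stated in the text

def expert_action_sequence(xa, ya, xb, yb, x_tmp, y_tmp, rng=random):
    """Generate the target action sequence for one synthetic episode.
    Failed actions are followed by the recovery steps of Listing 2;
    for brevity, a re-issued pick is assumed to succeed."""
    nominal = [("PickA", None), ("Place", (x_tmp, y_tmp)),
               ("PickB", None), ("Place", (xa, ya)),
               ("PickA", None), ("Place", (xb, yb))]
    seq = []
    for action, args in nominal:
        while True:
            seq.append((action, args))
            if rng.random() >= FAIL_PROB:            # action succeeded: move on
                break
            if action == "Place":                    # a failed place drops the held cube,
                # so the matching pick is re-issued before retrying the place
                seq.append(("PickB" if args == (xa, ya) else "PickA", None))
            # a failed pick is simply retried on the next loop iteration
    seq.append(("end", None))
    return seq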

C. Training

Once the dataset is generated, it is used to train the network. This means that every $i$-th training example has the form

$$\{(In; Out)_i \mid In_t = \{in_1, in_2, \ldots, in_n\},\ Out_t = \{out_1, out_2, \ldots, out_o\},\ \forall t \in [0, maxlen]\} \quad (6)$$

where $(In; Out)_i$ is a pair of sequences: $In$ is the sequence of $n$ features representing the environment observations; $Out$ is the sequence of $o$ features describing the target output. Every sequence has a maximum length of maxlen.
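How such variable-length (In; Out) sequences could be arranged into fixed-size training tensors is sketched below; zero-padding up to maxlen is an assumption about the preprocessing, not something stated in the paper (maxlen = 30 follows the contributions section).

import numpy as np

MAXLEN = 30  # maximum training-sequence length reported in the contributions

def pad_episode(inputs, outputs, n_in, n_out, maxlen=MAXLEN):
    """Stack one episode's observation and target vectors into fixed-shape
    (maxlen, n_in) and (maxlen, n_out) arrays, zero-padded after the episode ends."""
    X = np.zeros((maxlen, n_in))
    Y = np.zeros((maxlen, n_out))
    T = min(len(inputs), maxlen)
    X[:T] = np.asarray(inputs[:T])
    Y[:T] = np.asarray(outputs[:T])
    return X, Y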

Fig. 4: Structure of the implemented NN: three LSTM hidden layers and three output layers, one for the action to be performed, one for xTEMP, and one for yTEMP.

Every training run is characterized by 150 epochs and 3000 training examples, with a batch size of 4 samples. ADAM [18], with default parameters, is used as training algorithm; indeed, ADAM is able to adaptively change the learning rate. The selected loss function is the cross-entropy (see Table I).

D. The implemented Neural Network

Let $\varepsilon_t \in E$ be the environment observation at time $t$ and let $s_t \in \mathbb{R}^D$ be its fixed-length state encoding extracted by a domain-specific encoder $f_{enc}: E \to \mathbb{R}^D$. The network takes the environment representation $s_t$ as input, possibly along with other inputs like motion planner failures, and computes hidden states $h_t \in \mathbb{R}^M$, which sum up the progression of the task in the current environment. Three LSTM hidden layers of 50 units each compose the neural network (see Figure 4). There are three output layers: the next action to be performed, computed as $f_{Act}: \mathbb{R}^M \to A$, with $A$ the finite set of actions; the temporary position $x_{TEMP}$; and the temporary position $y_{TEMP}$. $f_{Arg}^i: \mathbb{R}^M \to \mathbb{R}^k$ is used to compute the $(x_{TEMP}, y_{TEMP})$ arguments in the selected discrete domain, with $\mathbb{R}^k$ denoting a vector of $k$ real numbers (in a discrete domain, the problem in analysis reduces to a classification problem). After the computation, the Softmax activation function is used to obtain a $k$-dimensional vector with values in $[0, 1]$ whose sum is equal to 1. The argument with the maximum value is selected as output.
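A minimal Keras sketch of this architecture follows. It is an illustrative reconstruction rather than the authors' code: the input width, the number of action classes, and the use of per-time-step output heads are assumptions; the three 50-unit LSTM layers, the three softmax heads, and the ADAM/cross-entropy/batch-size-4 training setup follow the text and Table I.

from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes: 61 input features (see the encoding sketch above),
# 4 action classes (PickA, PickB, Place, end) and a 10x10 grid.
N_FEATURES, N_ACTIONS, GRID, MAXLEN = 61, 4, 10, 30

obs = keras.Input(shape=(MAXLEN, N_FEATURES), name="environment")

# Three LSTM hidden layers of 50 units each, as in Figure 4.
x = layers.LSTM(50, return_sequences=True)(obs)
x = layers.LSTM(50, return_sequences=True)(x)
x = layers.LSTM(50, return_sequences=True)(x)

# Three softmax output heads: the next action and the two place arguments.
action = layers.TimeDistributed(layers.Dense(N_ACTIONS, activation="softmax"), name="action")(x)
x_tmp = layers.TimeDistributed(layers.Dense(GRID, activation="softmax"), name="x_temp")(x)
y_tmp = layers.TimeDistributed(layers.Dense(GRID, activation="softmax"), name="y_temp")(x)

model = keras.Model(obs, [action, x_tmp, y_tmp])

# Training setup from Table I: ADAM optimizer and cross-entropy loss.
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(X, [Y_action, Y_x, Y_y], batch_size=4, epochs=150)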

E. Results

The network is able to efficiently recover from failure (see Figure 5).

Accuracies of 97.3% and 99.5% are reached on the validation sets for the action layer and the argument layers, respectively (see Figures 6a and 6b).

In Figure 7a the training and validation losses of the action layer are reported. Immediately after the first 15 epochs, the validation loss starts to deviate from the training loss, meaning that the model is beginning to overfit. Nevertheless, the validation accuracy does not get worse: even though the prediction is a little further from the target value, Softmax selects the maximum output, turning on an output that is still correct. The spike observed after 120 epochs is due to the small batch size and to the changes of the learning rate introduced by ADAM; using a larger batch would smooth the losses. The training of the parameters for the argument layers, instead, is smooth (see Figure 7b).

Fig. 5: The performed experiment: a simulated Baxter robot has to swap two cubes on a table.

F. Platform Specifications

Experiments have been implemented in Keras [19]; Tensorflow [20] is used as back-end. They were performed on a machine with 8 GB of DDR3 memory and an Intel i5 2500k CPU at 4.2 GHz, running the Ubuntu 16.04 operating system. Tensorflow was compiled from source to take advantage of the SSE instructions and consequently obtain a performance improvement. In this way, training times range from a few minutes for the cases with hundreds of parameters up to 2 or 3 hours when thousands of parameters are involved. Tests showed a performance deterioration when using Tensorflow with GPU support (nVidia GTX 970): the low batch size and the low cardinality of the input set do not lead to a speedup factor sufficient to improve the training time. Thus, the CPU is preferred.

Fig. 6: Accuracies during training (blue line) and validation (red line) using the error-prone discrete domain. (a) Action accuracy; (b) single argument accuracy.

Fig. 7: Losses during training (blue line) and validation (red line) using the error-prone discrete domain. (a) Action loss; (b) single argument loss.

Experiments have been performed in simulation; Gazebo [21] is used as simulator and the implementation is based on the Robot Operating System [22] (Kinetic version).

V. CONCLUSION

This paper proposed a three-layer LSTM neural network able to learn a manipulation task, identify failures during action execution, and recover from those failures so as to complete the task. An artificial dataset, composed of synthetic data, has been generated and used to train the network.

As future work, the authors aim to improve the generalization capabilities of the proposed network by integrating it with the generalization-via-recursion approach described in [23]: this approach subdivides complex tasks into simpler ones and applies recursion so as to improve generalization. Another idea is that of creating a Domain Specific Encoder: an encoder able to represent the main features through a finite vector of elements. Indeed, in this work a fixed high-level representation of the environment is provided as input to the NN: a table with two cubes on its top, where cube identification and pose estimation can be computed by using computer vision algorithms. To make this representation more general, objects in the scene should be, for example, of different type, shape, and number. This can be achieved by using a neural network as encoder that learns to represent the environment by a feature vector of fixed size. This is a prerequisite for Meta-Learning, the long-term goal of this work. Meta-Learning allows training the model by exploiting very few significant samples of a multitude of different tasks. In this way, users avoid retraining the network each time a different task is assigned, and the network can efficiently generalize on new, first-time-seen tasks.

REFERENCES

[1] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, Oct. 2000. [Online]. Available: http://dx.doi.org/10.1162/089976600300015015
[2] L. P. Kaelbling and T. Lozano-Perez, "Integrated task and motion planning in belief space," International Journal of Robotics Research, vol. 32, no. 9-10, 2013. [Online]. Available: http://lis.csail.mit.edu/pubs/tlp/IJRRBelFinal.pdf
[3] M. Gharbi, R. Lallement, and R. Alami, "Combining symbolic and geometric planning to synthesize human-aware plans: toward more efficient combined search," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 6360–6365.
[4] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel, "Combined task and motion planning through an extensible planner-independent interface layer," in IEEE International Conference on Robotics and Automation (ICRA), 2014.
[5] S. Nedunuri, S. Prabhu, M. Moll, S. Chaudhuri, and L. E. Kavraki, "SMT-based synthesis of integrated task and motion plans from plan outlines," in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 655–662.
[6] L. De Moura and N. Bjørner, "Satisfiability modulo theories: Introduction and applications," Commun. ACM, vol. 54, no. 9, pp. 69–77, 2011.
[7] C. Barrett, R. Sebastiani, S. Seshia, and C. Tinelli, Satisfiability Modulo Theories, 1st ed., ser. Frontiers in Artificial Intelligence and Applications, 2009, vol. 185, no. 1, pp. 825–885.
[8] C. R. Garrett, T. Lozano-Perez, and L. P. Kaelbling, FFRob: An Efficient Heuristic for Task and Motion Planning. Cham: Springer International Publishing, 2015, pp. 179–195.
[9] J. Hoffmann and B. Nebel, "The FF planning system: Fast plan generation through heuristic search," J. Artif. Int. Res., vol. 14, no. 1, pp. 253–302, 2001.
[10] N. T. Dantam, S. Chaudhuri, and L. E. Kavraki, "The task motion kit," Robotics and Automation Magazine, 2018, to appear.
[11] L. de Moura and N. Bjørner, "Z3: An efficient SMT solver," in Tools and Algorithms for the Construction and Analysis of Systems, C. R. Ramakrishnan and J. Rehof, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 337–340.
[12] I. A. Sucan, M. Moll, and L. E. Kavraki, "The open motion planning library," IEEE Robotics Automation Magazine, vol. 19, no. 4, pp. 72–82, 2012.
[13] J. Pan, S. Chitta, and D. Manocha, "FCL: A general purpose library for collision and proximity queries," in 2012 IEEE International Conference on Robotics and Automation, 2012, pp. 3859–3866.
[14] S. Reed and N. de Freitas, "Neural programmer-interpreters," in International Conference on Learning Representations (ICLR), 2016. [Online]. Available: http://arxiv.org/pdf/1511.06279v3
[15] Selecting Subgoals using Deep Learning in Minecraft: A Preliminary Report, New York, NY, 2016.
[16] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, "Combining neural networks and tree search for task and motion planning in challenging environments," CoRR, vol. abs/1703.07887, 2017.
[17] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese, "Neural task programming: Learning to generalize across hierarchical tasks," CoRR, vol. abs/1710.01813, 2017.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[19] F. Chollet et al., "Keras," https://keras.io, 2015.
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[21] C. Aguero, N. Koenig, I. Chen, H. Boyer, S. Peters, J. Hsu, B. Gerkey, S. Paepcke, J. Rivero, J. Manzo, E. Krotkov, and G. Pratt, "Inside the virtual robotics challenge: Simulating real-time robotic disaster response," Automation Science and Engineering, IEEE Transactions on, vol. 12, no. 2, pp. 494–506, April 2015.
[22] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source robot operating system," in ICRA Workshop on Open Source Software, 2009.
[23] J. Cai, R. Shin, and D. Song, "Making neural programming architectures generalize via recursion," CoRR, vol. abs/1704.06611, 2017.