
Visuospatial Skill Learning for Robots

S. Reza Ahmadzadeh a,*, Fulvio Mastrogiovanni b, Petar Kormushev c

a School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, USA
b Department of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genova, via Opera Pia 13, 16145 Genova, Italy
c Dyson School of Design Engineering, Imperial College London, London SW7 2AZ, United Kingdom

Abstract

A novel skill learning approach is proposed that allows a robot to acquire human-like visuospatial skills for object manipulation tasks. Visuospatial skills are attained by observing spatial relationships among objects through demonstrations. The proposed Visuospatial Skill Learning (VSL) is a goal-based approach that focuses on achieving a desired goal configuration of objects relative to one another while maintaining the sequence of operations. VSL is capable of learning and generalizing multi-operation skills from a single demonstration, while requiring minimum prior knowledge about the objects and the environment. In contrast to many existing approaches, VSL offers simplicity, efficiency, and user-friendly human-robot interaction. We also show that VSL can be easily extended towards 3D object manipulation tasks, simply by employing point cloud processing techniques. In addition, a robot learning framework, VSL-SP, is proposed by integrating VSL, Imitation Learning, and a conventional planning method. In VSL-SP, the sequence of performed actions is learned using VSL, while the sensorimotor skills are learned using a conventional trajectory-based learning approach. Such integration easily extends robot capabilities to novel situations, even by users without programming ability. In VSL-SP, the internal planner of VSL is integrated with an existing action-level symbolic planner. Using the underlying constraints of the task and the symbolic predicates extracted by VSL, the symbolic representation of the task is updated. Therefore, the planner maintains a generalized representation of each skill as a reusable action, which can be used in planning and performed independently during the learning phase. The proposed approach is validated through several real-world experiments.

Keywords: robot learning, visuospatial skill, imitation learning, learning from demonstration, visual perception

1. Introduction

During the past two decades, several robot skill learning approaches based on human demonstrations have been proposed (Ijspeert et al., 2013; Argall et al., 2009). Many of them address motor skill learning, in which new sensorimotor skills are transferred to the robot using policy derivation techniques. Motor skill learning approaches can be categorized into two main groups: trajectory-based

* Corresponding author
Email addresses: [email protected] (S. Reza Ahmadzadeh), [email protected] (Fulvio Mastrogiovanni), [email protected] (Petar Kormushev)

and goal-based. To emulate the demonstrated skill, the former group puts the focus on recording and regenerating trajectories (Ijspeert et al., 2013) or intermittent forces (Kronander and Billard, 2012). For instance, Bentivegna et al. (2004) proposed a method in which a humanoid robot learns to play air hockey by learning primitives such as velocity and positional trajectories. By combining Reinforcement Learning with Imitation Learning, Kormushev et al. (2010) taught a robot to flip pancakes, bootstrapping the learning process with a few observed demonstrations.

In many cases, however, it is not the trajectory that is of central importance, but the goal of the task (e.g. solving a jigsaw puzzle). Learning every



single trajectory in such tasks actually increases the complexity of the learning process unnecessarily (Niekum et al., 2012). To address this drawback, the latter group of approaches, which focus on the goal of the task, has been proposed (Verma and Rao, 2005; Dantam et al., 2012; Chao et al., 2011). There is a large body of literature on grammars from the Computational Linguistics and Computer Science communities, with a number of applications related to Robotics (Niekum et al., 2012; Dantam et al., 2012). Furthermore, a number of symbolic learning approaches exist, which focus on goal configuration rather than action execution (Chao et al., 2011). However, in order to ground the symbols, they inherently comprise many steps, namely segmentation, clustering, object recognition, structure recognition, symbol generation, syntactic task modeling, motion grammar, and rule generation. Such approaches require a significant amount of a priori knowledge to be manually engineered into the system (Niekum et al., 2012; Dantam et al., 2012). In addition, most of the above-mentioned approaches assume the availability of information on the internal state of the tutor, such as joint angles, which cannot be accessed easily by humans to imitate the observed behavior.

An alternative to motor skill learning is visual learning. These approaches are based on observing demonstrations and using human-like visual skills to replicate a task (Kuniyoshi et al., 1994; Lopes and Santos-Victor, 2005).

We propose a novel visual skill learning approach for interactive robot learning tasks. Unlike motor skill learning approaches, our framework utilizes visual perception as the main source of information for learning new skills from a demonstration. The proposed approach, which we refer to as Visuospatial Skill Learning (VSL), uses visuospatial skills to replicate the demonstrated task. A visuospatial skill is the capability to visually perceive the spatial relationship between objects. VSL requires minimum a priori knowledge to learn a sequence of operations from a single demonstration. In contrast to many previous approaches, VSL exhibits simplicity, efficiency, and user-friendly human-robot interaction. Rather than relying on complicated models of human actions, labeled human data, or object recognition, our approach allows a robot to learn a variety of complex tasks effortlessly, simply by observing and reproducing the visual relationship among objects. We demonstrate the feasibility of the proposed approach in several experiments in

which the robot learns to organize objects of different shape and color on a tabletop workspace to accomplish a goal configuration. In the conducted experiments the robot acquires and reproduces main capabilities such as absolute and relative object placement, classification, turn-taking, user intervention to modify the reproduction, and multiple operations performed on the same object.

In Section 4, we show that, by employing point cloud processing techniques instead of 2D image processing methods, VSL can be extended for object manipulation tasks in 3D space. The new variant of VSL is called VSL-3D.

Integrating VSL and a conventional motor learning approach such as Imitation Learning enables a robot to acquire new sensorimotor skills directly from demonstrations. In Section 5, we show that by this integration, the need for a classical equation-based trajectory generation module can be avoided. In addition, in order to present the learned skills in a symbolic form, a standard action-level planner is combined with VSL. Learning preconditions and effects of the actions in VSL and making symbolic plans based on the learned knowledge addresses a long-standing problem of Artificial Intelligence, namely Symbol Grounding (Harnad, 1990). This framework is denoted as VSL-SP and is described in Section 5.

In the rest of this paper, we use the term VSL for the original version of our proposed method, explained and used for 2D experiments in Section 3. We use the term VSL-3D for the 3D extension of VSL, explained in Section 4. The framework that integrates VSL, Imitation Learning, and symbolic planning, explained in Section 5, is referred to as VSL-SP.

The paper is organized as follows. Related work is reviewed in Section 2. The terminology and methodology of the proposed approach, together with a few implementation steps, are explained in detail in Section 3. Results for simulated and real-world experiments are reported in Section 3.4 and Section 3.5, respectively. VSL-3D, the extension of the proposed approach for dealing with 3D applications, is presented in Section 4. Results reported in Section 4.2 show that VSL-3D is readily attainable, although it requires more sophisticated image processing techniques. VSL-SP, the integrated framework comprising a motor skill learning approach, VSL, and a symbolic planner, is described in Section 5. The implementation steps for the proposed framework are discussed in Section 5.5.


The feasibility and capability of the proposed framework are experimentally validated and the results are reported in Section 5.6. Conclusions of the paper are drawn in Section 6.

2. Related Work

Visual skill learning, or learning by watching, is one of the most powerful mechanisms of learning in humans. It has been shown that even infants can imitate both facial and manual gestures (Meltzoff and Moore, 1983). Learning by watching has been investigated in Cognitive Science as a source of higher order intelligence and fast acquisition of knowledge (Rizzolatti et al., 1996; Park et al., 2008). In the following, related and recent approaches on the topic of robot learning using visual perception are discussed. The mentioned works include those that (a) can be categorized as goal-based, (b) utilize visual perception as the main source of information, and (c) focus on learning object manipulation tasks.

One of the most influential works on plan extraction has been discussed by Ikeuchi and Suehiro (1994). By obtaining and segmenting a continuous sequence of images, their method extracts assembly plans. To extract transitions between sets of object relations, their approach depends on a model of the environment and predefined coordinate systems.

The early work of Kuniyoshi et al. (1994) focused on acquiring reusable high-level task knowledge by watching a demonstration. Their approach employs a model of the environment, a predefined action database, multiple vision sensors and basic visual features. Their approach cannot detect rotations and can only deal with rectangular objects which are not in contact with each other.

Asada et al. (2000) proposed an approach based on the epipolar constraint for reconstructing the robot's view, on which adaptive visual servoing is applied to imitate an observed motion. Instead of using coordinate transformation, two sets of stereo cameras are used, one for the robot and the other for the demonstrator.

Ehrenmann et al. (2001) used multiple sensors for learning pick-and-place tasks in a kitchen environment. A magnetic field based tracking system, an active trinocular camera head, and a data glove were used in their experiments. Detected hand configurations were matched against a predefined symbol database using pre-trained neural networks.

Yeasin and Chaudhuri (2000) proposed a similar approach that focuses on extracting and classifying subtasks for grasping tasks. Their approach captures data by tracking the tutor's hand during demonstrations, generates trajectories, and extracts subtasks using neural networks.

A visual learning by imitation approach presented by Lopes and Santos-Victor (2005) utilizes neural networks and viewpoint transformation to map visual perception to visuo-motor processes. They adopted a Bayesian formulation for gesture imitation. The method proposed by Pardowitz et al. (2006) extracts knowledge about a pick-and-place task from a sequence of actions. They extract the proper primitive actions and the constraints of the task from demonstrations.

Pastor et al. (2009) presented an approach to encoding motor skills into a non-linear differential equation that can reproduce those skills afterward. The Object-Action Complex (Kruger et al., 2009) was used to provide primitive actions with semantics. However, the resulting symbolic actions were never used in a high-level planner.

Ekvall and Kragic (2008) proposed a symbolic learning approach in which a logical model for a STRIPS-like planner is learned from multiple human demonstrations. For each object, a geometric model and a set of features are given to the algorithm. The method can only deal with rectangular objects, and the precision of the positioning is decided using the K-means clustering method. The spatial relations model presented by Golland et al. (2010) has several limitations which prevent it from being implemented on a real robot. For instance, the approach assumes perfect and noiseless visual information and a perfect object segmentation. Furthermore, a small grammar is used, which means that the system works only for a few carefully constructed expressions.

Chao et al. (2011) proposed a symbolic goal-based approach for grounding discrete concepts from continuous perceptual data using unsupervised learning. Task goals are learned using Bayesian inference. The approach was implemented on tasks including a single pick-and-place operation in which the initial object configuration is randomized.

Mason and Lopes (2011) used verbal commands to instruct a robot to perform a room tidying task. Primitive actions are predefined and the approach uses a database including all possible discrete configurations of objects. The features are learned


using a beta regression classifier and the planner finds a solution using simulated annealing. The conducted experiments are of particular value for us because they can be easily learned and reproduced using our approach.

Kroemer and Peters (2011) proposed a two-layer framework for learning manipulation tasks. One layer optimizes the parameters of motor primitives, whereas the other relies on pre- and post-conditions to model actions. In contrast to VSL, pre- and post-conditions are manually engineered in the system. Aksoy et al. (2011) showed how symbolic representations can be linked to the trajectory level using spatiotemporal relations between objects and hands in the workspace. In a similar way, Aein et al. (2013) showed how such grounded symbolic representations can be employed together with low-level trajectory information to create action libraries for imitating observed actions.

There is a large body of literature on grammars from the Computational Linguistics and Computer Science communities, with a number of applications related to Robotics (Niekum et al., 2012; Dantam et al., 2012). Niekum et al. (2012) proposed a goal-based method for learning from trajectory-based demonstrations. Segmentation and recognition are achieved using a Beta-Process Autoregressive Hidden Markov Model (HMM), while Dynamic Movement Primitives are used to reproduce the skill. The approach needs to assign a coordinate frame to each detected object and it cannot deal with rotations. Increasing the number of operations in the demonstrations makes the segmentation process increasingly more prone to errors. Sometimes during the segmentation process, extra skills can be extracted mistakenly.

Dantam et al. (2011) proposed the Motion Grammar to plan the operation of a robot through a context-free grammar. The method was used to learn human-robot chess games in which a point cloud of the chessboard was used to detect the pieces. Dantam et al. (2012) presented an approach for the visual analysis of demonstrations and automatic policy extraction for an assembly task. They used several techniques such as segmentation, clustering, symbol generation and abstraction. Guadarrama et al. (2013) proposed a natural language interface for grounding nouns and spatial relations. The system uses a classifier trained on a huge database of collected images for each object. The presented applications and the spatial representations are limited to 2D information.

By introducing a stack-based domain specific language for describing object repositioning tasks, Feniello et al. (2014) built a framework based on our initial VSL approach (Ahmadzadeh et al., 2013b). By performing demonstrations on a tablet interface, the time required for teaching is reduced and the reproduction phase can be validated before execution of the task in the real world. Various types of real-world experiments have been conducted, including sorting, kitting, and packaging tasks.

There is a long history of research in the area of interpreting spatial relations (Burgard et al., 1999; Kelleher et al., 2006; Moratz and Tenbrink, 2006; Skubic et al., 2004). In most of these works, a model of spatial relations is built by hard-coding the meaning of the spatial relations. Instead, some recent works concluded that a learned model of spatial relations can outperform hard-coded models (Golland et al., 2010; Guadarrama et al., 2013).

Visuospatial Skill Learning (VSL) is a goal-based approach that utilizes visual perception as the main source of information. It focuses on achieving the desired goal configuration of objects relative to one another while maintaining the sequence of operations. We show that VSL is capable of learning and generalizing multi-operation skills from a single demonstration while requiring minimum prior knowledge about the objects and the environment. We also show that VSL can be extended towards 3D object manipulation tasks, simply by employing point cloud processing techniques. In contrast to many existing approaches, VSL offers simplicity, efficiency and user-friendly human-robot interaction. In addition, we propose a learning framework (VSL-SP) by integrating imitation learning, VSL, and a conventional planning method. By integrating VSL with a conventional trajectory-based learning approach, we show that primitive actions can be learned directly through demonstrations instead of being programmed manually into the system. The proposed integrated approach easily extends and adapts robot capabilities to novel situations, even by users without programming ability. To utilize the advantages of standard symbolic planners, in VSL-SP the internal planner is integrated with an existing action-level symbolic planner. The planner represents a symbolic description of the skills. Using the underlying constraints of the task and the extracted symbolic predicates (i.e. action preconditions and effects) identified by VSL, it updates the symbolic representation while the skills are being learned. Therefore the planner maintains a


generalized representation of each skill as a reusable action, which can be used in planning and performed independently during the learning phase. We validate our approach through several real-world experiments with a Barrett WAM manipulator and an iCub humanoid robot.

3. Introduction to Visuospatial Skill Learning

Visuospatial Skill Learning (VSL) is a goal-based robot Learning from Demonstration approach. A tutor demonstrates a sequence of operations on a set of objects. Each operation consists of a set of actions (e.g. pick and place). In this Section, we consider pick-and-place object manipulation tasks in which achieving the goal of the task and retaining the sequence of operations are particularly important. In Section 5, we also consider other actions such as push and pull. We assume the virtual experimental setup illustrated in Figure 1a, which consists of a robot manipulator equipped with a gripper, a tabletop workspace, a set of objects, and a vision sensor. The sensor can be mounted above the workspace to observe the tutor performing the task. Using VSL, the robot learns new skills from demonstrations by extracting spatial relationships among objects. Afterward, starting from a random initial object configuration, the robot can perform a new sequence of operations resulting in reaching the same goal as the one demonstrated by the tutor. A high-level flow diagram, shown in Figure 1b, illustrates the two main phases of VSL, namely demonstration and reproduction. In the demonstration phase, for each action, a set of observations is recorded and used for the match finding process during the reproduction phase. During the match finding process, observations containing similar sets of features are matched. In this Section, first, the basic terms for describing VSL are defined. Then, the problem statement is described and, finally, the VSL approach is explained.

3.1. Terminology

In order to understand VSL, it is necessary to introduce a few terms used throughout the paper. These are, in order:

• World: the workspace of the robot observable by the vision sensor. The world includes objects that are used during the learning task and reconfigured by the human tutor and the robot.

Figure 1: Virtual setup and flow diagram for VSL. (a) Virtual experimental setup for a VSL task consisting of a robot manipulator, a vision sensor, a set of objects, and a workspace. (b) A high-level flow diagram of VSL illustrating the demonstration phase (action, then recording an observation) and the reproduction phase (loading an observation, match finding, then action).

• Frame: a bounding box defining a rectangle in 2D space or a cuboid in 3D space. The size of the frame can be fixed or variable. The maximum size of the frame is equal to the size of the world.

• Observation: the captured context of the world from a predefined viewpoint using a specific frame. An observation can be a 2D image or a 3D point cloud.

• Pre-action observation: an observation captured just before the action is executed. The robot searches for preconditions in the pre-action observations before selecting and executing an action.

• Post-action observation: an observation captured just after the action is executed. The robot perceives the effects of the executed actions in the post-action observations.



The set of actions contains different primitive skills such as pick, place, push, or pull. VSL assumes that actions are known to the robot and that the robot can execute each action when required. In Section 5, VSL-SP extends VSL by teaching the primitive actions to the robot through demonstrations (Ahmadzadeh et al., 2015).

3.2. Problem Statement

Formally, we define visuospatial skill learning as a tuple

$$\mathcal{V} = \{\mathcal{W}, \mathcal{O}, \mathcal{F}, \mathcal{A}, \mathcal{C}, \Pi, \phi\}, \qquad (1)$$

where $\mathcal{W} \in \mathbb{R}^{m \times n}$ is a matrix which represents the context of the world, including the workspace and all objects ($\mathcal{W}_D$ and $\mathcal{W}_R$ refer to the world during the demonstration and reproduction phases, respectively). $\mathcal{O}$ is a set of observation dictionaries $\mathcal{O} = \{\mathcal{O}^{Pre}, \mathcal{O}^{Post}\}$; $\mathcal{O}^{Pre}$ and $\mathcal{O}^{Post}$ are observation dictionaries comprising a sequence of pre-action and post-action observations, respectively, such that $\mathcal{O}^{Pre} = \langle \mathcal{O}^{Pre}(1), \mathcal{O}^{Pre}(2), \dots, \mathcal{O}^{Pre}(\eta) \rangle$ and $\mathcal{O}^{Post} = \langle \mathcal{O}^{Post}(1), \mathcal{O}^{Post}(2), \dots, \mathcal{O}^{Post}(\eta) \rangle$, where $\eta$ is the number of operations performed by the tutor during the demonstration phase. As a consequence, $\mathcal{O}^{Pre}(i)$ represents the pre-action observation captured during the $i$-th operation. $\mathcal{F} \in \mathbb{R}^{m \times n}$ is an observation frame used to capture observations. $\mathcal{A}$ is a set of primitive actions defined in the learning task (e.g. pick). $\mathcal{C}$ is a set of constraint dictionaries $\mathcal{C} = \{\mathcal{C}^{Pre}, \mathcal{C}^{Post}\}$; $\mathcal{C}^{Pre}$ and $\mathcal{C}^{Post}$ are constraint dictionaries comprising a sequence of pre-action and post-action constraints, respectively. $\Pi$ is a policy, or an ordered action sequence, extracted from demonstrations. $\phi$ is a vector containing extracted features from observations (e.g. SIFT features).
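For concreteness, the elements of this tuple can be held in a small set of plain data containers. The following Python sketch is illustrative only and is not the authors' implementation; all class and field names are hypothetical.

from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class Operation:
    """Data recorded for the i-th operation demonstrated by the tutor."""
    pre_obs: np.ndarray                 # O^Pre(i): pre-action observation (image or cloud)
    post_obs: np.ndarray                # O^Post(i): post-action observation
    pre_constraint: dict = field(default_factory=dict)    # C^Pre(i)
    post_constraint: dict = field(default_factory=dict)   # C^Post(i)
    features: Optional[np.ndarray] = None  # phi_i, e.g. SIFT descriptors of the moved object
    action: str = "pick-and-place"      # entry of Pi drawn from the action set A

@dataclass
class VSLTask:
    world_shape: tuple                  # W in R^{m x n}
    frame_shape: tuple                  # F, the observation frame
    actions: List[str]                  # A: primitive actions known to the robot
    operations: List[Operation] = field(default_factory=list)  # holds O, C, Pi, phi

    @property
    def eta(self) -> int:
        """Number of operations performed by the tutor during the demonstration."""
        return len(self.operations)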

3.3. Methodology

Pseudo-code of VSL is sketched in Algorithm 1. At the beginning (line 1), objects are randomly placed in the world ($\mathcal{W}_D$) and the size of the frame ($\mathcal{F}_D$) is set equal to the size of the world. We assume that the robot has knowledge of a set of primitive actions $\mathcal{A}$ and is capable of executing them. For instance, the robot is capable of moving towards a desired given pose and executing a pick action. In Section 5, we explain how a primitive action library can be built.

In some tasks, to specify different spatial concepts, the tutor can use a landmark. For instance, a vertical borderline can be used to divide the workspace into two areas representing right and left zones. A landmark can be either static or dynamic. A static landmark is fixed with respect to the world during both demonstration and reproduction. A dynamic landmark, however, can be relocated by the tutor before the reproduction phase starts. Both types of landmarks are shown in this paper. In case any landmarks (e.g. a label or a borderline) are used in the demonstration phase, the robot should be able to detect them in the world (line 2). However, we assume that the robot cannot manipulate a landmark.

During the demonstration phase, VSL captures one pre-action observation ($\mathcal{O}^{Pre}$) and one post-action observation ($\mathcal{O}^{Post}$) for each operation executed by the tutor, using the specified frame ($\mathcal{F}_D$) (lines 5 and 6). We assume that during each operation only one object can be affected by an action. The pre-action and post-action observations are used to detect the object on which the action is executed and the pose of the object before and after the action execution ($\mathcal{P}^{Pre}$, $\mathcal{P}^{Post}$) (line 7). VSL is capable of detecting objects during the demonstration process without any a priori knowledge. Practically, this can be achieved using simple image processing techniques such as image subtraction and thresholding applied to a pair of raw observations. For each detected object, a symbolic representation ($\mathcal{B}$) is created (line 7). Such a symbolic representation of an object can be used in case the reproduction phase of VSL is integrated with a high-level symbolic planner, as shown in Section 5. In addition, VSL extracts a feature vector ($\phi$) for each detected object (line 7). In order to extract $\phi$, any feature extraction method can be used. In this Section, we use the Scale Invariant Feature Transform (SIFT) algorithm (Lowe, 2004). SIFT is one of the most popular feature-based methods and is able to detect and describe local features that are invariant to scaling and rotation.
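As a rough illustration of this step, the sketch below uses OpenCV to subtract two observations, threshold the difference to localize the moved object, and describe it with SIFT. The threshold value and the use of the largest contour are illustrative choices, not the authors' exact pipeline.

import cv2
import numpy as np

def get_object(pre_obs: np.ndarray, post_obs: np.ndarray, thresh: int = 40):
    # Image subtraction: the moved object is where the two observations differ.
    diff = cv2.absdiff(pre_obs, post_obs)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)

    # Keep the largest changed region as the object affected by the action.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

    # Describe the object patch with SIFT (invariant to scaling and rotation).
    patch = cv2.cvtColor(pre_obs[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = cv2.SIFT_create().detectAndCompute(patch, None)

    # The pre-action pose is approximated here by the centre of the patch.
    pre_pose = (x + w / 2.0, y + h / 2.0)
    return pre_pose, keypoints, descriptors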

Pose vectors and the detected landmarks ($\mathcal{L}$) are used to identify the preconditions and effects of the executed action through spatial reasoning (line 8). For instance, if $\mathcal{P}^{Pre}$ is above a horizontal borderline and $\mathcal{P}^{Post}$ is below the line, the precondition of the action is that the object is above the line, and the effect of the execution of the action is that the object is below the line. By observing the predicates of an executed action, the action itself can be identified from the set of actions $\mathcal{A}$ (line 9). The sequence of identified actions is then stored in a policy vector $\Pi$.
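A toy version of this spatial reasoning step, for the borderline example above, could look as follows; the predicate names and the single-landmark assumption are illustrative.

def get_constraints(pre_pose, post_pose, borderline_y):
    # Image coordinates: smaller y means the object lies above the borderline.
    side = lambda pose: "above" if pose[1] < borderline_y else "below"
    return {"object": side(pre_pose)}, {"object": side(post_pose)}

def get_action(pre_constraint, post_constraint):
    # The action is identified from how the predicates change across the operation.
    if pre_constraint["object"] != post_constraint["object"]:
        return "move-across-line"
    return "move-on-same-side"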


Input: {W, F, A}
Output: {O, P, Π, C, B, φ}

1  Initialization: place objects in the world randomly
2  L = detectLandmarks(W)
3  i = 1
   // Part I: Demonstration
4  for each Action do
5      O_i^Pre = getPreActionObs(W_D, F_D)
6      O_i^Post = getPostActionObs(W_D, F_D)
7      [B_i, P_i^Pre, P_i^Post, φ_i] = getObject(O_i^Pre, O_i^Post)
8      [C_i^Pre, C_i^Post] = getConstraint(B_i, P_i^Pre, P_i^Post, L)
9      Π_i = getAction(A, C_i^Pre, C_i^Post)
10     i = i + 1
11 end
   // Part II: Reproduction
12 for j = 1 to i do
13     P_j^*Pre = findBestMatch(W_R, O_j^Pre, φ_j, C_j^Pre, L)
14     P_j^*Post = findBestMatch(W_R, O_j^Post, φ_j, C_j^Post, L)
15     executeAction(P_j^*Pre, P_j^*Post, Π_j)
16 end

Algorithm 1: Pseudo-code for VSL.


During the reproduction phase, VSL is able to execute the learned sequence of actions independently of any external action planner (Ahmadzadeh et al., 2013a,b). In this case, VSL observes the new world ($\mathcal{W}_R$) in which the objects have been replaced randomly. Comparing the recorded pre- and post-action observations with the new world, VSL detects the best match for each object, considering the extracted features, the new location of the landmarks, and the constraints extracted during the learning phase (lines 13 and 14). Although the findBestMatch function can use any metric (e.g. a window search method) to find the best matching observation, to be consistent, in all of our experiments in this Section we use SIFT. Afterward, we apply the Random Sample Consensus method (RANSAC) in order to estimate the transformation matrix $\mathbf{T}_{sift}$ from the set of matches. A normalization constant $\alpha$ is used to obtain a unique normalized representation and to avoid unnatural interpolation results. The normalized matrix $\alpha\mathbf{T}_{sift}$ can be decomposed into simple transformation elements,

$$\alpha\mathbf{T}_{sift} = \mathbf{T}\,\mathbf{R}_{\theta}\,\mathbf{R}_{-\varphi}\,\mathbf{S}_v\,\mathbf{R}_{\varphi}\,\mathbf{P},$$

where $\mathbf{R}_{\pm\varphi}$ are rotation matrices that align the axes for the horizontal and vertical scaling of $\mathbf{S}_v$; $\mathbf{R}_{\theta}$ is another rotation matrix that orients the shape into its final orientation; $\mathbf{T}$ is a translation matrix; and lastly $\mathbf{P}$ is a pure projective matrix. An affine matrix $\mathbf{T}_A$, that is, the remainder of $\alpha\mathbf{T}$ after extracting $\mathbf{P}$, can be calculated as $\mathbf{T}_A = \alpha\mathbf{T}\mathbf{P}^{-1}$. $\mathbf{T}$ is extracted by taking the third column of $\mathbf{T}_A$, and $\mathbf{A} \in \mathbb{R}^{2\times 2}$ is the remainder of $\mathbf{T}_A$. $\mathbf{A}$ can be further decomposed using SVD such that $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$, where $\mathbf{D}$ is a diagonal matrix, and $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices. Finally, we can calculate

$$\mathbf{S}_v = \begin{bmatrix} \mathbf{D} & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{R}_{\theta} = \begin{bmatrix} \mathbf{U}\mathbf{V}^{T} & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{R}_{\varphi} = \begin{bmatrix} \mathbf{V}^{T} & 0 \\ 0 & 1 \end{bmatrix}.$$

In a pick-and-place task, since $\mathbf{R}_{\theta}$ is calculated for both the pick and the place operations ($\mathbf{R}_{\theta}^{Pick}$, $\mathbf{R}_{\theta}^{Place}$), the pick and place rotation angles of the objects are extracted as follows:

$$\theta^{Pick} = \arctan\!\left(\frac{\mathbf{R}_{\theta}^{Pick}(2,2)}{\mathbf{R}_{\theta}^{Pick}(2,1)}\right), \qquad (2)$$

$$\theta^{Place} = \arctan\!\left(\frac{\mathbf{R}_{\theta}^{Place}(2,2)}{\mathbf{R}_{\theta}^{Place}(2,1)}\right). \qquad (3)$$
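The sketch below illustrates this match finding and decomposition step with OpenCV: SIFT descriptors are matched with a ratio test, a projective transform is estimated with RANSAC, and the in-plane rotation is recovered from the SVD of its affine block. The ratio threshold, the reprojection error, and the atan2 angle convention are illustrative choices rather than the exact procedure above.

import cv2
import numpy as np

def find_best_match(observation: np.ndarray, world: np.ndarray):
    sift = cv2.SIFT_create()
    obs_gray = cv2.cvtColor(observation, cv2.COLOR_BGR2GRAY)
    world_gray = cv2.cvtColor(world, cv2.COLOR_BGR2GRAY)
    kp_o, des_o = sift.detectAndCompute(obs_gray, None)
    kp_w, des_w = sift.detectAndCompute(world_gray, None)

    # Ratio-test filtering of nearest-neighbour matches (Lowe, 2004).
    good = [m for m, n in cv2.BFMatcher().knnMatch(des_o, des_w, k=2)
            if m.distance < 0.75 * n.distance]
    if len(good) < 4:
        return None

    src = np.float32([kp_o[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_w[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC estimate of the projective transform T_sift from the matches.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    H = H / H[2, 2]                       # normalized representation

    # Translation from the third column; rotation from the SVD of the 2x2 block.
    t = H[:2, 2]
    U, _, Vt = np.linalg.svd(H[:2, :2])
    R_theta = U @ Vt
    theta = np.degrees(np.arctan2(R_theta[1, 0], R_theta[0, 0]))
    return t, theta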

VSL relies on vision, which might be obstructed by other objects, by the tutor's body, or, during the reproduction, by the robot's arm. Therefore, for the physical implementation of the VSL approach, special care needs to be taken to avoid such obstructions. Figure 2 shows the result of match finding using SIFT applied to an observation and a new world.

After finding the best match, the algorithm extracts the pose of the object before and after action execution, $\mathcal{P}_j^{*Pre}$ and $\mathcal{P}_j^{*Post}$ (lines 13 and 14).


Figure 2: The result of image subtraction and thresholding for a place action (right), and the match finding result using SIFT in the 4th operation during reproduction of the Domino task (left).

Finally, an action is selected from the policy $\Pi$ and, together with the pre- and post-action poses, is sent to the executeAction function (line 15). This function selects the $\mathcal{A}_j$ primitive action.

To perform a pick-and-place operation, the robot must execute a set of primitive actions consisting of reaching, grasping, relocating, and releasing. The reaching and relocating actions are trajectory-based skills which can either be manually programmed into the system or taught to the robot by a tutor, for instance through a learning by demonstration technique (Ahmadzadeh et al., 2015), as shown in Section 5. In the original VSL (Ahmadzadeh et al., 2013b), a simple trajectory generation strategy was employed, which we explain here. The extracted pick and place points ($\mathcal{P}_j^{*Pick}$, $\mathcal{P}_j^{*Place}$), together with the corresponding rotation angles ($\theta_j^{Pick}$, $\theta_j^{Place}$), are used to generate a trajectory for the corresponding operation. For each pick-and-place operation, the desired Cartesian trajectory of the end-effector is a cyclic movement between three key points: a rest point, a pick point, and a place point. Figure 3a illustrates a complete trajectory cycle generated for a pick-and-place operation. Also, four different profiles of rotation angles are depicted in Figure 3b. The robot starts from the rest point while the rotation angle is equal to zero and moves smoothly along the red curve towards the pick point, $\mathcal{P}_j^{*Pick}$. During this movement, the robot's hand rotates to satisfy the pick rotation angle, $\theta_j^{Pick}$, according to the rotation angle profile. Then the robot picks up an object and relocates it along the green curve to the place point, $\mathcal{P}_j^{*Place}$, while the hand is rotating to meet the place rotation angle, $\theta_j^{Place}$. Then, the robot places the object at the place point and finally moves back along the blue curve to the rest point. In order to form each trajectory, initial conditions (i.e. initial positions and velocities) and a specific duration must be defined. Thereby, a geometric path is defined which can be expressed in the parametric form of the following equations:

$$p_x = a_3 s^3 + a_2 s^2 + a_1 s + a_0, \qquad (4)$$

$$p_y = b_3 s^3 + b_2 s^2 + b_1 s + b_0, \qquad (5)$$

$$p_z = h\left[1 - \left|\left(\tanh^{-1}\!\big(h_0 (s - 0.5)\big)\right)^{\kappa}\right|\right], \qquad (6)$$

where $s$ is a function of time $t$ ($s = s(t)$), and $p_x = p_x(s)$, $p_y = p_y(s)$, and $p_z = p_z(s)$ are the 3D elements of the geometric spatial path. The $a_i$ and $b_i$ coefficients in (4) and (5) are calculated using the initial and final conditions. $\kappa$ in (6) is the curvature; $h_0$ and $h$ are the initial height and the height of the curve at the middle point of the path, respectively. $h$ and $h_0$ can either be provided by the tutor or detected through depth information provided by an RGB-D sensor. In addition, the time is smoothly distributed with a 3rd order polynomial between $t_{start}$ and $t_{final}$, both of which are instructed by the user. Moreover, to generate a rotation angle trajectory for the robot's hand, a trapezoidal profile is used together with the $\theta^{Pick}$ and $\theta^{Place}$ calculated in (2) and (3). As shown in Figure 3a, the trajectory generation module can also deal with objects placed at different heights (different z-axis levels).
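A minimal sketch of this path generator is given below: cubic polynomials in $s$ for $p_x$ and $p_y$ with zero boundary velocities, and the bell-shaped vertical profile of Eq. (6). The parameter values ($h$, $h_0$, $\kappa$, number of samples, key-point positions) are illustrative.

import numpy as np

def cubic_coeffs(p_start, p_end):
    # a0..a3 for p(s) = a3 s^3 + a2 s^2 + a1 s + a0 with zero velocity at s = 0 and s = 1.
    delta = p_end - p_start
    return p_start, 0.0, 3.0 * delta, -2.0 * delta

def segment(p_start, p_end, h=0.15, h0=1.52, kappa=1.0, steps=100):
    s = np.linspace(0.0, 1.0, steps)
    a0, a1, a2, a3 = cubic_coeffs(p_start[0], p_end[0])
    b0, b1, b2, b3 = cubic_coeffs(p_start[1], p_end[1])
    px = a3 * s**3 + a2 * s**2 + a1 * s + a0
    py = b3 * s**3 + b2 * s**2 + b1 * s + b0
    # Eq. (6): lift the end-effector over the objects; h0 is chosen here so the
    # profile returns close to table level at both ends of the segment.
    pz = h * (1.0 - np.abs(np.arctanh(h0 * (s - 0.5)))**kappa)
    return np.stack([px, py, pz], axis=1)

# One pick-and-place cycle: rest -> pick -> place -> rest (positions are illustrative).
rest, pick, place = (0.0, 0.0), (0.3, 0.2), (0.5, -0.1)
cycle = np.vstack([segment(rest, pick), segment(pick, place), segment(place, rest)])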

It is worth noting that, to reproduce the task with more general capabilities, a generic symbolic planner can be utilized instead of the reproduction part of the algorithm. Such an extension is described in Section 5.

Another vital element for implementing a pick-and-place operation is the grasping strategy. There are many elaborate ways to perform grasp synthesis for known or unknown objects (Bohg et al., 2014; Su et al., 2012). Since the problem of grasping is not the main focus of our research, we implement a simple but efficient grasping method using the torque sensors of the robots used in the experiments. The grasping position is calculated using the center of the corresponding pre-action observation. The grasping module first opens all the fingers and closes them after the hand is located above the desired object. The fingers stop closing when the measured torque exceeds a pre-defined threshold value. In addition, by estimating a bounding box


Figure 3: The generated trajectory cycle including position and orientation profiles for a spatial pick-and-place task. (a) A full cycle of spatial trajectory generated for a pick-and-place operation. (b) The generated angle of rotation, θ, for the robot's hand (angle of rotation [deg] over time [sec]).

for the target observation, the bounding box values are used to decide which axis is more convenient for grasping.
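A hedged sketch of this torque-threshold strategy is shown below. The hand interface (open_fingers, close_fingers_by, finger_torques) is hypothetical and would be replaced by the driver API of the actual hand; the threshold value is illustrative.

import time

TORQUE_LIMIT = 0.8  # illustrative threshold

def grasp(hand, step=0.02, timeout=5.0):
    hand.open_fingers()                      # start from a fully open hand
    t0 = time.time()
    while time.time() - t0 < timeout:
        hand.close_fingers_by(step)          # close the fingers incrementally
        if max(hand.finger_torques()) > TORQUE_LIMIT:
            return True                      # measured torque exceeds threshold: stop
        time.sleep(0.01)
    return False                             # nothing was grasped within the timeout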

3.4. Simulation

In this Section, a simulated experiment is designed to gain an understanding of how VSL operates and to show the main idea of VSL without dealing with practical limitations and implementation difficulties. For this simulation, as shown in Figure 4, a set of 2D objects is created which the tutor can manipulate and assemble on an empty workspace using the keyboard or mouse. Each operation consists of a pick and a place action, which are executed by holding and releasing a mouse button.

In this task, the world includes two sets, each containing three visually identical objects (i.e. three blue 'house bodies' and three brown 'roofs'). As can be seen in Figure 4a, the tutor selects the 'roof' objects arbitrarily and places them on top of the 'bodies'. However, in the 3rd operation, the tutor intentionally puts the 'roof' at the bottom of the 'house body'. The goal of this experiment is to show VSL's capability of disambiguating multiple alternative matches. If the algorithm uses a fixed search frame ($\mathcal{F}_R$) that is smaller than the size of the 'bodies' (i.e. the blue objects in the world), then, as shown in the first and second sub-figures of Figure 4b, the two captured observations can become equivalent (i.e. $\mathcal{O}_1 = \mathcal{O}_2$) and the 3rd operation might be performed incorrectly (see the incorrect reproduction in Figure 4c). The reason is that, due to the size of the frame, the system perceives a section of the world no bigger than the size of a 'house body'. The system is not aware that an object is already assembled on top and it will select the first matching pre-place position to place the last object there. To resolve this problem we use adaptive-size frames during the match finding process, in which the size of the frame starts from the smallest feasible size and grows up to the size of the world. The function findBestMatch in Algorithm 1 is responsible for creating and changing the size of the frame adaptively in each step. This technique helps the robot to resolve the ambiguity issue by adding more context inside the observation, which in effect narrows down the possible matches and leaves only a single matching observation. Figure 4c shows the sequence of reproduction including correct and incorrect operations.
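The adaptive-frame idea can be summarized by the following sketch, where match_at stands in for any matching metric (e.g. the SIFT-based one used here); the step size and the uniqueness test are illustrative.

def find_unique_match(world, observation, match_at, min_size, max_size, step=20):
    size = min_size
    while size <= max_size:
        candidates = match_at(world, observation, frame_size=size)
        if len(candidates) == 1:
            return candidates[0]   # ambiguity resolved: a single match remains
        size += step               # grow the frame to add more context
    return None                    # still ambiguous even at the size of the world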

3.5. Real-world Experiments

In this Section, five real-world experiments are described. As can be seen in Figure 5, the setup for all the experiments consists of a torque-controlled 7-DOF Barrett WAM robotic arm equipped with a 3-finger Barrett Hand, a tabletop working area, a set of objects, and a CCD camera mounted above the workspace (not necessarily perpendicular to it). The resolution of the captured images is 1280 × 960 pixels. In all the conducted experiments, the robot learns simple object manipulation tasks including pick-and-place actions. In order to perform a pick-and-place operation, the extracted pick and place poses are used to make a cyclic trajectory as explained in Section 3.3. In the demonstration phase, the size of the frame for the pre-pick observation is set equal to the size of the biggest object in the world, and the size of the frame for the pre-place observation is set two to three times bigger than the size of the biggest object in the world. In the reproduction phase, the size of the frame is set equal to the size of the world. Note that, in all the illustrations, the straight and curved arrows


Figure 4: Roof placement simulated experiment to illustrate VSL's capability of disambiguation of multiple alternative matches. (a) Demonstration. (b) Three steps of adaptive frame-size. (c) Reproduction, showing correct and incorrect operations.

are used just to show the sequence of operations, not the actual trajectories for performing the movements. Table 1 summarizes the different capabilities of VSL emphasized in each task.

In order to test the repeatability of VSL and to identify the possible factors of failure, the captured observations from the real-world experiments were used while excluding the robot from the loop. All other parts of the loop were kept intact and each experiment was repeated three times. The result shows that less than 5% of pick-and-place operations failed. The main failure factor is the match finding error, which can be resolved by adjusting the parameters of SIFT-RANSAC or using alternative match finding algorithms. Noise in the images and occlusion of the objects can be listed as two other potential factors of failure. Despite the fact that VSL is scale-invariant, color-invariant, and view-invariant, it has some limitations. For instance, if the tutor accidentally moves one object while operating another, the algorithm may fail to find a pick/place position. One possible solution is to combine classification techniques together with the image subtraction and thresholding techniques

Figure 5: The experimental setup for a VSL task.

to detect multi-object movements. Such an extension is out of the scope of the presented work.

Alphabet Ordering

In the first task, the world includes four cubic objects labeled with the letters A, B, C, and D. The world also includes a static right-angle baseline as a landmark ($\mathcal{L}$). The goal is to reconfigure and sort the set of objects with respect to the baseline according to a previous demonstration. As reported in Table 1, this task emphasizes VSL's capability of relative positioning of an object with respect to other surrounding objects (a visuospatial skill). This is achieved through the use of visual observations capturing both the object of interest and its surrounding objects (i.e. its context). In addition, the baseline is provided to show the capability of absolute positioning of VSL. It shows that we can teach the robot to attain absolute positioning of objects without defining any explicit a priori knowledge. Figure 6a shows the sequence of operations in the demonstration phase. Recording pre-pick and pre-place observations, the robot learns the sequence of operations. Figure 6b shows the sequence of operations produced by VSL starting from a novel world (i.e. a new initial configuration) achieved by randomizing the object locations in the world. The accompanying video shows the execution of this task (Video, 2013a).

Animal Puzzle

In this task, the world includes two sets of objects which complete a 'frog' and a 'giraffe' puzzle. There are also two labels (i.e. landmarks) in the world, a 'pond' and a 'tree'. The goal is to


Table 1: Capabilities of VSL illustrated in each real-world experiment.

Capability              Animal    Alphabet   Tower of   Animals vs.   Domino
                        Puzzle    Ordering   Hanoi      Machines
Relative positioning      X          X          X           -            X
Absolute positioning      -          X          X           -            -
Classification            -          -          -           X            -
Turn-taking               -          -          X           X            X
User intervention         -          -          X           X            X

Figure 6: Alphabet ordering. (a) The sequence of the operations in the demonstration phase by the tutor. (b) The sequence of the operations in the reproduction phase by the robot. The initial configuration of the objects in the world is different in (a) and (b). Black arrows show the operations.

assemble the set of objects for each animal with respect to the labels according to the demonstration. Figure 7a shows the sequence of operations in the demonstration phase. To show the capability of generalization, the 'tree' and the 'pond' labels are randomly relocated by the tutor before the reproduction phase. Figure 7b shows the sequence of operations reproduced by VSL after learning the spatial relationships among objects. This experiment shows VSL's capability of relative positioning reported in Table 1. In the previous task, the final object configuration in the reproduction and the demonstration phases is always the same. In this experiment, however, by removing the fixed baseline from the world, the final result can be a totally new configuration of objects. The accompanying video shows the execution of this task (Video, 2013a).

Tower of Hanoi

This experiment considers the famous Tower of Hanoi puzzle, which consists of a number of disks of different sizes and three bases or rods, which are actually landmarks. The objective of the puzzle is to move the entire stack to another rod.

Figure 7: Animal puzzle. (a) The sequence of the operations in the demonstration phase by the tutor. (b) The sequence of the operations in the reproduction phase by the robot. The initial and the final object configurations in the world are different in (a) and (b). Black arrows show the operations.

This experiment demonstrates almost all capabilities of VSL. Two of these capabilities are not covered by the previous experiments. Firstly, our approach enables the user to intervene to modify the reproduction. Such a capability can be used to move the disks to another base (e.g. to move the stack of disks to the third base instead of the second). This can be achieved only if the user performs the very first operation in the reproduction phase and moves the smallest disk to the third base instead of the second. Secondly, VSL enables the user to perform multiple operations on the same object during the learning task. The sequence of reproduction is shown in Figure 8.

Animals vs. Machines: A Classification Task

This is an interactive task that demonstrates VSL's capability of classifying objects. In this task, we show that, interestingly, the same VSL algorithm can be used to learn a classification task. The robot is provided with four objects, two 'animals' and two 'machines'. Two labeled bins are used in this experiment for classifying the objects. Similar to the previous tasks, the objects, labels, and bins are not known to the robot initially.


Figure 8: The sequence of the reproduction for the Tower of Hanoi experiment to illustrate the main capabilities of VSL.

First, all the objects are randomly placed in the world. The tutor randomly picks objects one by one and places them in the corresponding bins. Then, in the reproduction phase, the tutor places one of the objects each time, in a different sequence with respect to the demonstration. Differently from the previous tasks, the robot does not follow the operations sequentially but searches the pre-pick observation dictionary for the best matching pre-pick observation. The selected pre-pick observation is used for reproduction. The tutor can modify the sequence of operations in the reproduction phase by presenting the objects to the robot in a different order with respect to the demonstration. The sequence of operations in the demonstration phase is illustrated in Figure 9. Each row represents one pick-and-place operation. During each operation, the tutor picks an object and moves it to the proper bin. The sets of pre-pick and pre-place observations can be seen in the left and right columns, respectively. Figure 10 shows two operations during the reproduction phase. The accompanying video shows the execution of this task (Video, 2013b).
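A minimal sketch of this modified reproduction loop is given below, reusing the VSLTask container sketched in Section 3.2; similarity and execute stand in for the SIFT-based metric and the robot's pick-and-place routine, and are hypothetical.

def reproduce_classification(world, task, similarity, execute):
    # Score the presented object against every recorded pre-pick observation.
    scores = [similarity(world, op.pre_obs, op.features) for op in task.operations]
    best = max(range(len(scores)), key=scores.__getitem__)
    execute(task.operations[best])   # place the object in the corresponding bin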

Domino: A Turn-taking Task

The goal of this experiment is to show that VSL can also deal with tasks involving the cognitive behavior of turn-taking. In this task, the world includes a set of objects, all of which are rectangular tiles. Every two pieces of the puzzle fit together to form an object (Figure 11). In the demonstration phase, the tutor first demonstrates all the operations.

Figure 9: The sequence of operations in the demonstration phase. Each column represents one pick-and-place operation. In each operation, the tutor picks one object and classifies it either as an 'animal' or a 'machine'. The selected object in each operation is shown in the middle row.

In order to learn the spatial relationships, the system uses the modified algorithm from the classification task. In the reproduction phase, the tutor starts the game by placing the first object (or another one) in a random place. The robot then takes its turn, and finds and places the next matching domino piece. The tutor can also modify the sequence of operations in the reproduction phase by presenting the objects to the robot in a different order with respect to the demonstration. The sequence of reproduction is shown in Figure 11.

4. VSL-3D: An Extension to 3D Tasks

In this Section, an extension of VSL is discussed. Although the main structure of the algorithm remains unaltered, VSL-3D requires more sophisticated image processing methods, including 3D feature estimation, calculation of surface properties, dealing with noise, 3D match finding, and pose estimation.

4.1. Point Cloud Processing

For a VSL-3D task, point cloud processing methods should be used in both the demonstration and reproduction phases. In the demonstration phase, for each operation, VSL-3D captures one pre-action and one post-action point cloud.


Figure 10: Two operations during the reproduction phase. Red crosses on objects and on bins show the detected positions for pick and place actions, respectively.

The captured raw point clouds are filtered using a pass-through filter to cut the values outside the workspace. This reduces the number of points by removing unnecessary information (i.e. distant points) from the raw point clouds. The limits of filtering for each axis are derived from the workspace size. From each pair of filtered point clouds, pre-action and post-action observations are generated. At this stage, to increase the speed of further processing, the point clouds are downsampled through voxelizing. The points in a voxel are approximated by their centroid. To avoid decreasing the data resolution, the leaf size of the voxel grid must be selected appropriately. The points generated in the voxelization phase are stored in a Kd-tree structure. Then, K Nearest-Neighbor search (KNN) is applied to extract the downsampled point cloud by finding correspondences between groups of points. The search precision (i.e. error bounds) and the distance threshold should be set properly for the search algorithm. The result of applying the described process to a pair of captured observations is depicted in Figure 12.

In the reproduction phase, each captured world observation ($\mathcal{W}_R$), after being voxelized, is compared with the corresponding recorded observations in the dictionaries and a metric is applied to find the best match. Surface normals are estimated for each cloud using a Kd-tree and a search for neighboring points within a pre-defined radius. The estimated surface normals are then used to find local features. There are several methods to represent point features (Steder et al., 2010). An acceptable feature descriptor should show the same characteristics in the presence of rigid transformations, different sampling densities, and noise. In this Section, the Fast Point Feature Histogram (FPFH) proposed by Rusu et al. (2009) is used, which is a computationally efficient version of the Point Feature Histogram. FPFH encodes a point's k-neighborhood geometrical properties by generalizing the mean curvature around the point using a multi-dimensional histogram of values. FPFH is chosen because it is 6D pose-invariant and copes with varying sampling densities or noise levels, although it depends on the quality of the surface normal estimate at each point. The match finding and pose estimation process for the 3D experiments is accomplished by utilizing three different algorithms: Iterative Closest Point (ICP), Pre-rejective RANSAC (Buch et al., 2013), and SAmple Consensus Initial Alignment (SAC-IA) (Rusu et al., 2009). Figure 13 shows some results of the match finding process.
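The sketch below illustrates such a pipeline with the Open3D library as a stand-in for the point cloud library actually used here: pass-through cropping, voxel downsampling, normal estimation, FPFH features, and ICP refinement of a coarse alignment. All numeric parameters are illustrative.

import numpy as np
import open3d as o3d

def preprocess(cloud, workspace_min, workspace_max, voxel=0.002):
    # Pass-through filter: crop everything outside the workspace bounds.
    box = o3d.geometry.AxisAlignedBoundingBox(np.array(workspace_min),
                                              np.array(workspace_max))
    cloud = cloud.crop(box)
    # Voxel grid downsampling: points in a voxel are replaced by their centroid.
    cloud = cloud.voxel_down_sample(voxel_size=voxel)
    # Surface normals from a hybrid radius/KNN neighbourhood.
    cloud.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
    return cloud

def fpfh(cloud, radius=0.025):
    # Fast Point Feature Histogram descriptors computed from the estimated normals.
    return o3d.pipelines.registration.compute_fpfh_feature(
        cloud, o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=100))

def refine_with_icp(source, target, init=np.eye(4), dist=0.01):
    # Local ICP refinement of a coarse alignment (e.g. from RANSAC or SAC-IA).
    result = o3d.pipelines.registration.registration_icp(
        source, target, dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation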

The main differences between implementing VSL in 2D and in 3D are reported in Table 2. The table includes the techniques used in this paper. However, the implementation of VSL is not limited to these techniques and, depending on the application, various image processing and point cloud processing techniques can be employed.

4.2. Results

In this section, the feasibility and capability of VSL-3D are experimentally validated. The setup is similar to the one described in Section 3.5, where the CCD camera is replaced with an RGB-D sensor, namely an Asus Xtion Pro Live.

Table Cleaning

In the first experiment with VSL-3D, the world includes an empty box and a set of objects which are randomly placed in the workspace. The goal is to clean up the table by picking the objects and placing two of them inside the empty box and the final one in front of the box, according to the demonstration. The tutor demonstrates the task and the observations are captured by the sensor. Before the reproduction phase starts, the objects are reshuffled randomly in the workspace. Figure 14 shows the sequence of operations reproduced by the robot. Although in some operations the objects are occluded by the front wall of the box, the robot can accomplish the task successfully.



Figure 11: The sequences of reproduction performed by the robot and the tutor for the turn-taking task of domino.

Table 2: The main differences between VSL and VSL-3D approaches.

Observations: VSL uses 2D images; VSL-3D uses 3D point clouds.
Applicable for: VSL handles 3-DoF reconfiguration (X, Y, θ); VSL-3D handles 6-DoF reconfiguration.
Transformation: VSL uses 2D-to-2D (3 × 3); VSL-3D uses 3D-to-3D (4 × 4).
Capturing observations: VSL uses background subtraction and thresholding; VSL-3D uses voxelizing, a Kd-tree, and KNN search.
Match finding: VSL uses 2D feature estimation, 2D metrics (SIFT), and RANSAC; VSL-3D uses 3D feature estimation, voxelizing, normal/curvature estimation, and 3D metrics (ICP/RANSAC/SAC-IA).
Main assumptions: VSL assumes no overlap among objects, objects with the same height, and objects placed at the same level; VSL-3D allows partially visible objects, objects with different heights, and objects placed at different levels.

Figure 12: Result of the subtraction process between a pre-pick (top-left) and a pre-place (bottom-left) point cloud. The original clouds are voxelized at a resolution of 2 mm and the distance threshold for the KNN search is set to 1 cm.

Stacking Objects

In the next pick-and-place task, objects are randomly placed on the table (the world). The scene includes two dynamic landmarks (i.e. two boxes) which are repositioned by the tutor before starting the reproduction phase. In the first phase, the tutor starts stacking the objects on top of the two boxes. The sequence of operations is captured and, together with the initial and final object configurations, is shown in Figure 15a.

After the learning phase is finished, the robot can reproduce the stacking task in the workspace even if the objects and the boxes are randomly placed. The corresponding sequence of operations is shown in Figure 15b.

Results

In both experiments, the pre- and post-action observations are extracted from a single demonstration. These observations are point cloud data which are actually 2.5D and are acquired from a single point of view. Since VSL does not use a priori knowledge about the objects, objects that are reconfigured in the world might not be detected and recognized successfully during the reproduction phase. If an object is observed in the demonstration phase, there is no guarantee that it can be detected in the reproduction phase when observed from a new point of view, because the object may have been transformed before the reproduction phase. Thus, in order to implement a more robust version of VSL-3D, more robust features for both textured and textureless objects and modifications in the point cloud processing part of the algorithm are needed. However, the main steps of the algorithm remain unaltered.



(a) RGB-D (b) ICP (c) Pre-rejective RANSAC (d) SAC-IA

Figure 13: (a) illustrates the captured RGB-D cloud of a scene. The results of the match finding process using different methods are shown in (b), (c), and (d).

Figure 14: The sequence of operations reproduced by the robot for the table cleaning task.

Since image processing and point cloud processing are not the focus of this paper, we direct the reader towards appropriate references, e.g. (Ekvall and Kragic, 2008; Papazov et al., 2012; Guadarrama et al., 2013).

5. Learning Symbolic Representations of Actions using VSL

As described in Section 3 and Section 4, VSL and VSL-3D allow a robot to identify the spatial relationships among objects and learn a sequence of actions for achieving the goal of a task. We have shown that VSL is capable of learning and generalizing different skills such as object reconfiguration, classification, and turn-taking interactions from a single demonstration. One of the shortcomings of VSL is that the primitive actions such as pick, place, and reach have to be manually programmed in the robot. For instance, in Section 3.3, a simple trajectory generation module was devised that is able to produce trajectories for the pick and place primitive actions. This issue can be addressed by combining VSL with a trajectory-based learning approach such as Imitation Learning (Ijspeert et al., 2001). The obtained framework can learn the primitive actions of the task using conventional Learning from Demonstration strategies, while it learns the sequence of actions, preconditions, and effects of the actions using VSL. We discuss this solution in this section.

Furthermore, in order to take advantage of standard symbolic planners, the original planner of VSL can be replaced with an already available action-level symbolic planner, for instance, SGPlan 6 by Chen et al. (2004). The integration of VSL and a symbolic planner brings significant advantages over using VSL alone. Learning the preconditions and effects of the actions in VSL and making symbolic plans based on the learned knowledge addresses a long-standing problem of Artificial Intelligence, namely Symbol Grounding (Harnad, 1990).

There are a few approaches in which a high-level symbolic representation is formed using low-level action descriptions. The method proposed by Jetchev et al. (2013) finds relational, probabilistic STRIPS operators by searching for a symbol grounding that maximizes a metric balancing transition predictability with model size. Konidaris et al. (2014) avoid this search by directly deriving the necessary symbol grounding from the actions.

In this section, a new robot learning framework is proposed by integrating a sensorimotor learning framework (Imitation Learning), VSL, and a symbolic-level action representation and planning layer.



(a) Demonstration

(b) Reproduction

Figure 15: Demonstration and reproduction phases in a stacking task. The initial and final object configurations are shown in the top rows. The bottom rows, from left to right, show the sequence of operations.

The capabilities and performance of the proposed framework, which is called VSL-SP, are validated using real-world experiments with an iCub robot.

5.1. Overview

The proposed learning approach, VSL-SP, consists of three layers:

• Imitation Learning (IL), which is suitable for teaching trajectory-based skills (i.e. actions) to the robot, as has been posited by Argall et al. (2009). IL relaxes the assumption about having to encode actions manually.

• Visuospatial Skill Learning (VSL), which allows us to capture the spatial relationships between an object and its surrounding objects and to extract a sequence of actions to reach the goal of the task.

• Symbolic Planning (SP), which uses a symbolic representation of actions to plan or re-plan the execution of the task. A symbolic planner allows the system to generalize not only to different initial states, but also to different final states.

A high-level flow diagram of VSL-SP, including the three layers together with their inputs and outputs, can be seen in Figure 17. IL is employed to teach sensorimotor skills to the robot (e.g. pull and push). Learned skills are stored in a primitive action library, which is accessible by the planner. VSL captures an object's context for each demonstrated action. This context is the basis of the visuospatial representation and implicitly encodes the relative position of the object with respect to the other objects simultaneously.
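As a rough illustration of how these layers could fit together at execution time, the Python sketch below grounds each symbolic plan step in a learned primitive action; the library contents, the plan, and all names are hypothetical placeholders, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# Primitive action library: symbolic action name -> learned sensorimotor skill
# (in VSL-SP this would trigger a DMP rollout rather than a print statement).
ActionLibrary = Dict[str, Callable[[str], None]]

def execute_plan(plan: List[Tuple[str, str]], library: ActionLibrary) -> None:
    """Ground each symbolic step (action, object) of the planner's output in its learned skill."""
    for action, obj in plan:
        library[action](obj)

library: ActionLibrary = {
    "push": lambda obj: print(f"executing learned push on {obj}"),
    "pull": lambda obj: print(f"executing learned pull on {obj}"),
}
plan = [("pull", "Bgreen"), ("pull", "Borange")]   # hypothetical output of the symbolic planner
execute_plan(plan, library)
```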



Figure 16: The iCub robot interacting with objects.

In order to employ the advantages of symbolic-level action planners, a PDDL 3.0 compliant planner is integrated with IL and VSL to provide a general-purpose representation of a planning domain. The identified preconditions and effects modify the formal PDDL definition of the actions (i.e. the planning domain). Once formal action definitions are obtained, it is possible to exploit the planning domain to reason upon scenarios that are not strictly related to what has been learned, e.g. involving different object configurations, a different number of objects, or entirely different objects. One of the main advantages of VSL-SP is that it extracts and utilizes multi-modal information from demonstrations, including both sensorimotor information and visual perception, which is used to build symbolic representation structures for actions.

5.2. Learning Sensorimotor Skills using Imitation Learning

IL enables robots to learn and reproduce trajectory-based skills from a set of demonstrations through kinesthetic teaching (Ijspeert et al., 2013). In VSL-SP, IL is utilized to teach primitive actions (i.e. pull and push) to the robot. In particular, Dynamic Movement Primitives (DMP), which are designed for modeling attractor behaviors of autonomous nonlinear dynamical systems, are utilized (Ijspeert et al., 2013).

In order to create a pattern for each action, multiple desired trajectories in terms of position, velocity, and acceleration have to be demonstrated and recorded by a human operator, in the form of a vector $[y_{demo}(t), \dot{y}_{demo}(t), \ddot{y}_{demo}(t)]$, where $t \in \{1, \ldots, P\}$ and $P$ is the number of datapoints. A controller converts the desired position, velocity, and acceleration, $y$, $\dot{y}$, and $\ddot{y}$, into motor commands. DMP employs a damped spring model that can be modulated with nonlinear terms such that it achieves the desired attractor behavior, as follows:

\tau \dot{z} = \alpha_z (\beta_z (g - y) - z) + f + C_f,
\tau \dot{y} = z.    (7)

In (7), position and velocity are represented by $y$ and $z$, respectively, $\tau$ is a time constant, and $\alpha_z$ and $\beta_z$ are positive constants. With $\beta_z = \alpha_z/4$, $y$ monotonically converges toward the goal $g$. $f$ is the forcing term and $C_f$ is an application-dependent coupling term. Trajectories are recorded independently of time. Instead, the canonical system is defined as:

\tau \dot{x} = -\alpha_x x + C_c,    (8)

where $\alpha_x$ is a constant and $C_c$ is an application-dependent coupling term. $x$ is a phase variable, where $x = 1$ indicates the start time and $x$ close to zero means that the goal $g$ has been reached. Starting from some arbitrarily chosen initial state $x_0$, such as $x_0 = 1$, the state $x$ converges monotonically to zero. The forcing term $f$ in (7) is chosen as follows:

f(x) = \frac{\sum_{i=1}^{N} \psi_i(x) w_i}{\sum_{i=1}^{N} \psi_i(x)} \, x (g - y_0),    (9)

where $\psi_i$, $i = 1, \ldots, N$, are fixed exponential basis functions, $\psi_i(x) = \exp(-h_i(x - c_i)^2)$, where $h_i$ and $c_i$ are the widths and centers of the basis functions, respectively, and $w_i$ are weights. $y_0$ is the initial state. The parameter $g$ is the target coinciding with the end of the movement, $g = y_{demo}(t = P)$, and $y_0 = y_{demo}(t = 0)$. The parameter $\tau$ must be adjusted to the duration of the demonstration. The learning of the parameters $w_i$ is accomplished with a locally weighted regression method, because it is very fast and each kernel learns independently of the others (Ijspeert et al., 2013).
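The following is a compact numerical sketch of Eqs. (7)-(9) for a single degree of freedom, assuming Euler integration, a heuristic choice of basis widths, and locally weighted regression for the weights; the coupling terms $C_f$ and $C_c$ are omitted and all parameter values are illustrative rather than those used on the robot.

```python
import numpy as np

class DMP1D:
    """One-dimensional discrete DMP following Eqs. (7)-(9); coupling terms C_f, C_c omitted."""
    def __init__(self, n_basis=20, alpha_z=25.0, alpha_x=8.0):
        self.alpha_z, self.beta_z, self.alpha_x = alpha_z, alpha_z / 4.0, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))  # basis centres in phase space
        self.h = n_basis ** 1.5 / self.c                        # heuristic basis widths
        self.w = np.zeros(n_basis)

    def _psi(self, x):
        return np.exp(-self.h * (x - self.c) ** 2)

    def fit(self, y_demo, dt):
        """Locally weighted regression of the forcing-term weights from one demonstration."""
        P = len(y_demo)
        self.tau, self.y0, self.g = P * dt, y_demo[0], y_demo[-1]
        yd = np.gradient(y_demo, dt)
        ydd = np.gradient(yd, dt)
        x = np.exp(-self.alpha_x * np.arange(P) * dt / self.tau)            # canonical system (8)
        f_target = self.tau ** 2 * ydd - self.alpha_z * (self.beta_z * (self.g - y_demo) - self.tau * yd)
        s = x * (self.g - self.y0)
        for i in range(len(self.w)):                                        # each kernel learns independently
            psi_i = np.exp(-self.h[i] * (x - self.c[i]) ** 2)
            self.w[i] = np.sum(s * psi_i * f_target) / (np.sum(s * psi_i * s) + 1e-10)

    def rollout(self, dt, y0=None, g=None):
        """Reproduce the motion from a (possibly new) start and goal by integrating (7)-(8)."""
        y = self.y0 if y0 is None else y0
        g = self.g if g is None else g
        y_start, z, x, traj = y, 0.0, 1.0, []
        while x > 1e-3:
            psi = self._psi(x)
            f = (psi @ self.w) / (psi.sum() + 1e-10) * x * (g - y_start)    # forcing term (9)
            z += dt * (self.alpha_z * (self.beta_z * (g - y) - z) + f) / self.tau
            y += dt * z / self.tau
            x += dt * (-self.alpha_x * x) / self.tau
            traj.append(y)
        return np.array(traj)
```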

In the conducted experiments two primitive actions, namely pull and push, are used, each of which is demonstrated to the robot through kinesthetic teaching and learned using DMP. The robot can reproduce the action from a new initial pose towards the target. To achieve this goal, the pull action is decomposed into reach-after and move-backward sub-actions, and the push action is decomposed into reach-before and move-forward sub-actions.



Figure 17: A flow diagram illustrating the proposed integrated approach, VSL-SP, comprising three main layers. The robot learns to reproduce primitive actions through Imitation Learning. It also learns the sequence of actions and identifies the constraints of the task using VSL. The symbolic planner is employed to solve new symbolic tasks and execute the plans.

Figure 18: The tutor is teaching the primitive actions including pull and push to the robot using kinesthetic teaching.

The recorded sets of demonstrations for the reach-before and reach-after sub-actions are depicted as black curves in Figure 19. Each red trajectory in Figure 19 illustrates a reproduction. In both sub-figures, the goal (i.e. the object) is shown in green. Finally, the learned primitive actions are stored in the primitive action library. Later, the planner connects the symbolic actions to the library of actions. The move-forward and move-backward sub-actions are straight-line trajectories that can be learned or implemented directly in the trajectory generation module.

5.3. Learning Action Sequences by VSL

At this stage, a library of primitive actions has been built. The robot learns the sequence of actions required to achieve the goal of the demonstrated task through VSL. It also identifies spatial relationships among objects by observing demonstrations. As shown in Figure 17, both IL and VSL observe the same demonstrations (not necessarily at the same time) but utilize different types of data. IL uses sensorimotor information (i.e. trajectories), whereas VSL uses visual data.

5.4. Generalization of Learned Skills as Symbolic Action Models

In order to use the learned primitive actions for general-purpose task planning, the actions need to be represented in a symbolic, discrete, scenario-independent form.

(a) reach-before sub-action. (b) reach-after sub-action.

Figure 19: The recorded sets of demonstrations for the two sub-actions are shown in black. The reproduced trajectories from an arbitrary initial position towards the target (the green square) are shown in red.

To this aim, a symbolic representation for both the pull and push actions based on the PDDL 3.0 formalism is defined. However, it is necessary to learn the preconditions and effects of each action in the primitive action library. The sensorimotor information in the skill learning process is exploited to derive symbolic predicates, instead of defining them manually.



As shown in Figure 20, the robot first perceives the initial workspace, in which one object is placed far from the robot. Then the robot is asked to execute the pull action, and after the action is finished, the robot perceives the workspace again. The pre-action and post-action observations are shown in Figure 20. The robot uses VSL to extract the preconditions and effects of the pull action considering the landmark in the workspace (i.e. the black line in this case). The same steps are repeated for the push action. By applying this procedure, the domain file in the PDDL formalism is updated automatically. The preconditions and effects for the pull and push actions are reported in Table 3.

Figure 20: By extracting the preconditions and effects of the primitive actions while executing them, the robot generalizes the learned sensorimotor skills as symbolic actions.

Table 3: The preconditions and effects extracted by VSL and written to the planner's domain file.

Action          push          pull
precondition    ¬(far ?b)     (far ?b)
effect          (far ?b)      ¬(far ?b)
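As a hedged illustration of how the domain file could be updated automatically from these extracted predicates, the sketch below renders a PDDL action schema for each primitive; the domain name, file name, and schema layout are assumptions made only for this example, not the actual VSL-SP implementation.

```python
def action_schema(name, precondition, effect):
    """Render a PDDL action schema from the precondition/effect extracted by VSL (Table 3)."""
    return (f"(:action {name}\n"
            f"  :parameters (?b)\n"
            f"  :precondition {precondition}\n"
            f"  :effect {effect})\n")

# Predicates extracted by VSL for the two primitive actions (Table 3).
extracted = {
    "push": {"precondition": "(not (far ?b))", "effect": "(far ?b)"},
    "pull": {"precondition": "(far ?b)",       "effect": "(not (far ?b))"},
}

with open("domain.pddl", "w") as f:            # hypothetical domain file name
    f.write("(define (domain vsl-sp)\n  (:predicates (far ?b))\n")
    for name, p in extracted.items():
        f.write(action_schema(name, p["precondition"], p["effect"]))
    f.write(")\n")
```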

5.5. Implementation

In this step, it is possible to generalize each action, as well as the demonstrated action sequences, using a PDDL 3.0 compliant formalism. It is noteworthy that in the present work symbolic knowledge has not been bootstrapped from scratch. On the contrary, starting from the knowledge of the action types performed during the demonstration, symbolic-level knowledge is updated with two kinds of constraints. The former class includes constraints (in the form of PDDL predicates) related to preconditions and effects. This is done by mapping the elements of the observation dictionaries $O_{Pre}$ and $O_{Post}$ to relevant predicates. In the considered scenario, one predicate is enough to characterize the push and pull actions, namely (far ?b), where far is the predicate name and ?b is a variable that can be grounded with a specific object that can be pushed and pulled. Specifically, push(?b) is an action that makes the predicate truth value switch from ¬(far ?b) to (far ?b), whereas the opposite holds for pull(?b). Visual information about the location of objects in the world is mapped to the proper truth value of the (far ?b) predicate. The execution of a push demonstration on a green cube (as processed by VSL) allows the system to understand that ¬(far Bgreen) holds before the action and (far Bgreen) holds after the action, and the two predicates contribute to the :precondition and :effect lists of the push action defined in PDDL.

The second class includes constraints (in the form of PDDL predicates) related to the trajectory of the plan that is executed in the demonstration. This can be done by analyzing the sequence of actions as represented in the VSL model (i.e. the implicit chain defined by predicates in the $O_{Post}$ and $O_{Pre}$ observation dictionaries). In PDDL 3.0, it is possible to define additional constraints, which implicitly limit the search space when the planner attempts to find a solution to the planning problem. In particular, given a specific planning domain (in the form of push and pull actions, as obtained by reasoning on the VSL model), the problem can be constrained to force the planner to follow a specific trajectory in the solution space. One possible constraint is to impose, for instance, the order in which predicate truth values must be satisfied. As an example, let us assume the VSL model represents the fact that Bgreen must always be pushed after Bblue. According to our formal definition, this implies that the predicate (far Bgreen) must hold sometime after the predicate (far Bblue) holds. To encode this constraint in the PDDL formalism, it is sufficient to use the constraint (somewhere-after (far Bgreen) (far Bblue)). If the constraint is used in the planning problem, the planner attempts to find a plan where this sequence is preserved, thereby executing push(Bgreen) strictly after push(Bblue). Finally, it is noteworthy that if such a constraint is removed, the planner is free to choose any suitable sequence of actions to be executed (independently of the demonstration), provided that the goal state is reached.
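A small sketch of how such ordering constraints could be generated automatically from the action order observed by VSL is given below; the function name and the way the constraint block is assembled are illustrative assumptions, not the actual VSL-SP code.

```python
def ordering_constraints(ordered_objects):
    """For consecutive objects in the demonstrated order, emit PDDL 3.0 'somewhere-after'
    constraints, e.g. (somewhere-after (far Bgreen) (far Bblue)) when Bblue precedes Bgreen."""
    pairs = zip(ordered_objects[1:], ordered_objects[:-1])
    lines = [f"    (somewhere-after (far {later}) (far {earlier}))"
             for later, earlier in pairs]
    return "  (:constraints (and\n" + "\n".join(lines) + "))\n"

# Order extracted from the VSL model of the demonstration (illustrative).
print(ordering_constraints(["Bblue", "Bgreen"]))
```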

5.6. Results

In order to evaluate the capabilities and performance of VSL-SP, a set of experiments is performed with an iCub humanoid robot. As shown in Figure 16, the robot is standing in front of a tabletop with some polyhedral objects.



All the objects have the same size but different colors and textures. First, the tutor teaches a set of primitive actions, including pull and push, to the robot using Imitation Learning. For each action, a set of demonstrations is captured through kinesthetic teaching, while the robot is perceiving the workspace (Figure 18). Using Imitation Learning, the robot learns to reproduce each action from different initial positions of the hand towards the desired object. The learned primitive actions are stored in a library. In addition, to introduce the concepts of far and near, a black landmark, which represents a borderline, is placed in the middle of the robot's workspace. The robot can detect the landmark using conventional edge detection and thresholding techniques. Also, the size of the frames (F_D, F_R) is defined equal to the size of the world. Figure 21 shows some instances of image processing results from the conducted demonstrations. In this section, blob detection and filtering-by-color techniques are utilized, considering that the iCub robot is capable of detecting the height of the table at the beginning of each experiment by touching it. To detect which object has been moved and the place it has been moved to, background subtraction is used. Moreover, a simple (boolean) spatial reasoning is applied that utilizes the detected pose of the object and the detected borderline and then decides whether the object is far or near. However, to extract more sophisticated constraints, qualitative spatial reasoning languages such as the Region Connection Calculus by Randell et al. (1992) can be employed.
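The sketch below illustrates these perception steps with OpenCV 4, assuming rectified images and an already detected borderline row; the threshold value and the far/near convention (an object whose centroid lies beyond the borderline row counts as far) are placeholders chosen only for this example.

```python
import cv2

def moved_object_centroid(pre_img, post_img, thresh=30):
    """Background subtraction between pre- and post-action observations,
    then the centroid of the largest changed blob."""
    diff = cv2.absdiff(cv2.cvtColor(pre_img, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(post_img, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blob = max(contours, key=cv2.contourArea)            # largest changed region
    m = cv2.moments(blob)
    return m["m10"] / m["m00"], m["m01"] / m["m00"]      # (x, y) centroid in pixels

def is_far(centroid_y, borderline_y):
    """Boolean spatial predicate: the object is 'far' if its centroid lies beyond
    the detected borderline row in the image."""
    return centroid_y < borderline_y
```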


Figure 21: (a) and (b) are rectified pre-action and post-action observations; (c) is a rectified pre-action observation for the next operation; (d) is obtained by background subtraction between (a) and (b); (e) is obtained by background subtraction between (b) and (c); (f) is the result of detecting the borderline for spatial reasoning.

A Simple VSL Task

This experiment evaluates the obtained connection between the symbolic actions and their equivalent sensorimotor skills. Initially, there are three objects on the table. The tutor provides the robot with the sequence of actions depicted in Figure 22a. The robot perceives the demonstration, records the observations, and extracts the sequence of actions using VSL. In the reproduction phase, the objects are reshuffled in the workspace. For the reproduction of the task, instead of SP (the symbolic planner), the internal planner of VSL is exploited. The result shows that, starting from an arbitrary initial configuration of the objects in the workspace, VSL is capable of extracting the ordered sequence of actions and reproducing the same task as expected. As shown in Figure 22b, the robot achieves the goal of the task using VSL. The accompanying video shows the execution of this task (Video, 2015).

Keeping All Objects Near

In the previous experiment, it has been shown that the internal planner of VSL is capable of reproducing the task. However, VSL-SP is more capable of generalizing the learned skill. In this experiment, the robot learns a simple game including an implicit rule: keep all objects near. For each object, two properties are defined: color and position. The symbolic planner provides VSL-SP with the capability of generalizing over the number and even the shape of the objects. In order to utilize this capability, the tutor performs a set of demonstrations to eliminate the effect of color, implying that any object, independent of its color, should not be far from the robot. Three sets of demonstrations are performed by the tutor, which are shown in Figure 23. Each demonstration starts with a different initial configuration of objects. The performed demonstrations imply the following rules:

r_1^1 : \langle B_{green} \rightarrow near \rangle \; \mathrm{IF} \; \langle B_{orange} \; \mathrm{is \; far} \rangle
r_1^2 : \langle B_{orange} \rightarrow near \rangle \; \mathrm{IF} \; \langle B_{green} \; \mathrm{is \; near} \rangle
r_2^1 : \langle B_{green} \rightarrow near \rangle \; \mathrm{IF} \; \langle B_{orange} \; \mathrm{is \; near} \rangle
r_3^1 : \langle B_{orange} \rightarrow near \rangle \; \mathrm{IF} \; \langle B_{green} \; \mathrm{is \; far} \rangle    (10)

where $r_i^j$ indicates the $j$th rule extracted from the $i$th demonstration. The first demonstration includes two rules, $r_1^1$ and $r_1^2$, because two actions are executed. It can be seen that the right-hand parts of $r_1^1$ and $r_2^1$ eliminate each other, as do the right-hand parts of $r_1^2$ and $r_3^1$. Also, the color attribute on the left side of the rules eliminates the effect of color on action execution.



(a) Demonstration. (b) Reproduction.

Figure 22: A simple VSL task. In both sub-figures, the bottom rows show the robot's point of view.

During the demonstration, for each object, the robot extracts the spatial relationships between the objects and the borderline, and decides whether the object is far or near (i.e. not far). It then applies the extracted precondition of $r_1^1 : \langle B_? \rightarrow near \rangle$ to that object. Afterward, the robot is capable of performing the learned skill on a varying number of objects and even on unknown objects with different colors. The accompanying video shows the execution of this task (Video, 2015).

Figure 23: Three different sets of demonstrations devised by the tutor for implying the rule of the game in the second experiment.

Pull All Objects in a Given Order

In the previous experiment, the robot learns that all objects must be kept close, without giving importance to the order of actions. In this experiment, to give priority to the sequence of actions to be performed, the color attribute of each object is included as well. For instance, consider the ordered set 〈orange, green, yellow〉, in which the orange cube and the yellow cube are the most and least important cubes, respectively. This means that we want the orange cube to be moved before the green one, and the green cube before the yellow one. The robot has to learn that the most important cube has to be operated first. It has been shown that learning a sequence of actions in which the order is important is one of the main capabilities of VSL. Therefore, if the task was demonstrated once and the reproduction was done by VSL (and not by VSL-SP), the robot would learn the task from one single demonstration. In addition, VSL provides the robot with the capability of generalizing the task over the initial configuration of the objects.

Furthermore, using VSL-SP, it is possible to encode such an ordering also in a planning domain, at the cost of adding extra constraints (in the form of predicates) to the definition of the planning problem. In this case the planner should include two extra predicates, namely (somewhere-after (far Bgreen) (far Borange)) and (somewhere-after (far Byellow) (far Bgreen)). The inclusion of these predicates must be managed at an extra-logic level, e.g. by extracting the sequence of actions using VSL and inserting the corresponding constraints in the planner problem definition. In this case, there is no need for more demonstrations, and expressing the order of the sequence as preconditions can be done automatically. The accompanying video shows the execution of this task (Video, 2015).



Figure 24: In the third experiment, using VSL-SP, the robot needs six sets of demonstrations, for each of which the initial (a) and final (b) object configurations are shown. Using VSL only, the robot requires a single demonstration, for which the sequence of operations is shown (a-d).

6. Conclusions

In this paper, a novel skill learning approach has been proposed, which allows a robot to acquire new skills for object manipulation by observing a demonstration. VSL utilizes visual perception as the main source of information. As opposed to conventional trajectory-based learning approaches, VSL is a goal-based robot learning approach that focuses on achieving the desired goal configuration of objects relative to one another while maintaining the sequence of operations. VSL is capable of learning and generalizing multi-operation skills from a single demonstration while requiring minimum a priori knowledge about the environment. It has been shown that integrating VSL with a conventional trajectory-based approach eliminates the need to program the primitive actions manually and to tune them according to the desired task. The robot learns the required primitive actions through kinesthetic teaching using Imitation Learning. It then learns the sequence of actions to reach the desired goal of the task through VSL. The obtained framework easily extends and adapts robot capabilities to novel situations, even by users without programming ability. It has been illustrated that, to utilize the advantages of standard symbolic planners, VSL's original planner can be replaced with an already available action-level symbolic planner. In order to ground the actions, the symbolic actions are defined in the planner and VSL maps the identified preconditions and effects into a formalism suitable to be used at the symbolic level. Experimental results have shown the improvement in generalization when finding a solution that can be acquired from both multiple and single demonstrations. In contrast to many existing approaches, VSL offers simplicity, efficiency, and user-friendly human-robot interaction. The feasibility and efficiency of VSL have been validated through simulated and real-world experiments.

References


Aein M.J., Aksoy E.E., Tamosiunaite M., Papon J., Ude A., Worgotter F. Toward a library of manipulation actions based on semantic object-action relations. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2013. p. 4555–62.

Ahmadzadeh S.R., Kormushev P., Caldwell D.G. Interactive robot learning of visuospatial skills. In: Advanced Robotics (ICAR), 16th International Conference on. IEEE; 2013a. p. 1–8.

Ahmadzadeh S.R., Kormushev P., Caldwell D.G. Visuospatial skill learning for object reconfiguration tasks. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2013b. p. 685–91.

Ahmadzadeh S.R., Paikan A., Mastrogiovanni F., Natale L., Kormushev P., Caldwell D.G. Learning symbolic representations of actions from human demonstrations. In: Robotics and Automation (ICRA), IEEE International Conference on. Seattle, Washington, USA: IEEE; 2015. p. 3801–8.

Aksoy E.E., Abramov A., Dorr J., Ning K., Dellen B., Worgotter F. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research 2011;:1–29.

Argall B.D., Chernova S., Veloso M., Browning B. A survey of robot learning from demonstration. Robotics and Autonomous Systems 2009;57(5):469–83.

Asada M., Yoshikawa Y., Hosoda K. Learning by observation without three-dimensional reconstruction. Intelligent Autonomous Systems (IAS) 2000;:555–60.

Bentivegna D.C., Atkeson C.G., Ude A., Cheng G. Learning to act from observation and practice. International Journal of Humanoid Robotics 2004;1(04):585–611.

Bohg J., Morales A., Asfour T., Kragic D. Data-driven grasp synthesis: A survey. Robotics, IEEE Transactions on 2014;30:289–309.

Buch A.G., Kraft D., Kamarainen J.K., Petersen H.G., Kruger N. Pose estimation using local structure-specific shape and appearance context. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2013. p. 2080–7.

Burgard W., Cremers A.B., Fox D., Hahnel D., Lakemeyer G., Schulz D., Steiner W., Thrun S. Experiences with an interactive museum tour-guide robot. Artificial Intelligence 1999;114(1):3–55.

Chao C., Cakmak M., Thomaz A.L. Towards grounding concepts for transfer in goal learning from demonstration. In: Development and Learning (ICDL), IEEE International Conference on. IEEE; volume 2; 2011. p. 1–6.

Chen Y., Hsu C.W., Wah B.W. SGPlan: Subgoal partitioning and resolution in planning. In: Edelkamp et al. (Edelkamp, Hoffmann, Littman, & Younes); 2004.

Dantam N., Essa I., Stilman M. Linguistic transfer of human assembly tasks to robots. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2012. p. 237–42.

Dantam N., Koine P., Stilman M. The motion grammar for physical human-robot games. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2011. p. 5463–9.

Ehrenmann M., Rogalla O., Zollner R., Dillmann R. Teaching service robots complex tasks: Programming by demonstration for workshop and household environments. In: Field and Service Robots (FSR), International Conference on. volume 1; 2001. p. 397–402.

Ekvall S., Kragic D. Robot learning from demonstration: a task-level planning approach. International Journal of Advanced Robotic Systems 2008;5(3):223–34.

Feniello A., Dang H., Birchfield S. Program synthesis by examples for object repositioning tasks. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2014. p. 4428–35.

Golland D., Liang P., Klein D. A game-theoretic approach to generating spatial descriptions. In: Empirical Methods in Natural Language Processing, Conference on. Association for Computational Linguistics; 2010. p. 410–9.

Guadarrama S., Riano L., Golland D., Gouhring D., Jia Y., Klein D., Abbeel P., Darrell T. Grounding spatial relations for human-robot interaction. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2013. p. 1640–7.

Harnad S. The symbol grounding problem. Physica D: Nonlinear Phenomena 1990;42(1):335–46.

Ijspeert A.J., Nakanishi J., Hoffmann H., Pastor P., Schaal S. Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation 2013;25(2):328–73.

Ijspeert A.J., Nakanishi J., Schaal S. Trajectory formation for imitation with nonlinear dynamical systems. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; volume 2; 2001. p. 752–7.

Ikeuchi K., Suehiro T. Toward an assembly plan from observation. I. Task recognition with polyhedral objects. Robotics and Automation, IEEE Transactions on 1994;10(3):368–85.

Jetchev N., Lang T., Toussaint M. Learning grounded relational symbols from continuous data for abstract reasoning. 2013.

Kelleher J.D., Kruijff G.J.M., Costello F.J. Proximity in context: an empirically grounded computational model of proximity for processing topological spatial expressions. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2006. p. 745–52.

Konidaris G., Kaelbling L.P., Lozano-Perez T. Constructing symbolic representations for high-level planning. In: Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014.

Kormushev P., Calinon S., Caldwell D.G. Robot motor skill coordination with EM-based reinforcement learning. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. 2010. p. 3232–7.

Kroemer O., Peters J. A flexible hybrid framework for modeling complex manipulation tasks. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2011. p. 1856–61.

Kronander K., Billard A. Online learning of varying stiffness through physical human-robot interaction. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2012. p. 1842–9.

Kruger N., Piater J., Worgotter F., Geib C., Petrick R., Steedman M., Ude A., Asfour T., Kraft D., Omrcen D., et al. A formal definition of object-action complexes and examples at different levels of the processing hierarchy. PACO-PLUS Technical Report 2009.

Kuniyoshi Y., Inaba M., Inoue H. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. Robotics and Automation, IEEE Transactions on 1994;10(6):799–822.

Lopes M., Santos-Victor J. Visual learning by imitation with motor representations. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 2005;35(3):438–49.

Lowe D.G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2004;60(2):91–110.

Mason M., Lopes M. Robot self-initiative and personalization by learning through repeated interactions. In: Human-Robot Interaction (HRI), 6th ACM/IEEE International Conference on. IEEE; 2011. p. 433–40.

Meltzoff A.N., Moore M.K. Newborn infants imitate adult facial gestures. Child Development 1983;:702–9.

Moratz R., Tenbrink T. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition and Computation 2006;6(1):63–107.

Niekum S., Osentoski S., Konidaris G., Barto A.G. Learning and generalization of complex tasks from unstructured demonstrations. In: Intelligent Robots and Systems (IROS), IEEE/RSJ International Conference on. IEEE; 2012. p. 5239–46.

Papazov C., Haddadin S., Parusel S., Krieger K., Burschka D. Rigid 3D geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research 2012;31(4):538–53.

Pardowitz M., Zollner R., Dillmann R. Incremental acquisition of task knowledge applying heuristic relevance estimation. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2006. p. 3011–6.

Park G., Ra S., Kim C., Song J. Imitation learning of robot movement using evolutionary algorithm. In: Proceedings of the 17th World Congress, International Federation of Automatic Control (IFAC). 2008. p. 730–5.

Pastor P., Hoffmann H., Asfour T., Schaal S. Learning and generalization of motor skills by learning from demonstration. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2009. p. 763–8.

Randell D.A., Cui Z., Cohn A.G. A spatial logic based on regions and connection. Knowledge Representation and Reasoning 1992;92:165–76.

Rizzolatti G., Fadiga L., Gallese V., Fogassi L. Premotor cortex and the recognition of motor actions. Cognitive Brain Research 1996;3(2):131–41.

Rusu R.B., Blodow N., Beetz M. Fast point feature histograms (FPFH) for 3D registration. In: Robotics and Automation (ICRA), IEEE International Conference on. IEEE; 2009. p. 3212–7.

Skubic M., Perzanowski D., Blisard S., Schultz A., Adams W., Bugajska M., Brock D. Spatial language for human-robot dialogs. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 2004;34(2):154–67.

Steder B., Rusu R.B., Konolige K., Burgard W. NARF: 3D range image features for object recognition. In: Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). volume 44; 2010.

Su Y., Wu Y., Lee K., Du Z., Demiris Y. Robust grasping for an under-actuated anthropomorphic hand under object position uncertainty. In: Humanoid Robots (Humanoids), IEEE-RAS International Conference on. 2012. p. 719–25.

Verma D., Rao R. Goal-based imitation as probabilistic inference over graphical models. Advances in Neural Information Processing Systems 2005;18:1393–400.

Video. Video accompanying this paper, available online. https://youtu.be/1xFeOZaMpK4; 2013a.

Video. Video accompanying this paper, available online. https://youtu.be/W6zgdve3e1A; 2013b.

Video. Video accompanying this paper, available online. https://youtu.be/baF9bCidcyw; 2015.

Yeasin M., Chaudhuri S. Toward automatic robot programming: learning human skill from visual data. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 2000;30(1):180–5.
