
Towards Learning Task Intentions from Human Demonstrations

Tim Welschehold Nichola Abdo Christian Dornhege Wolfram Burgard

Abstract— Learning from demonstrations is an active research area and many methods to instruct robots have been developed recently. However, learning tasks directly from a human teacher is still an unsolved problem. In this paper we present an approach that learns action models for joint motions of the robot base and gripper as well as a representation of the overall task, both from observing demonstrations carried out by a human teacher. To this end we adapt RGBD observations of the human teacher to the capabilities of the robot for action learning. Furthermore, we introduce a framework called teach-and-improvise to imitate tasks without explicitly specifying their goals. We demonstrate the effectiveness of both action and task learning in complex real world experiments.

I. INTRODUCTION

To pave the way for complex mobile robots into common households, the ability to easily instruct a robot is essential. Users need to be able to teach their robot to achieve a custom task in their own home environments without any explicit knowledge of how robots learn such tasks. In this work we address the problem of jointly learning both action models and task goals from human teacher demonstrations. We consider the teacher to demonstrate to the robot how to manipulate the objects in a scene in order to achieve a goal. The envisioned teaching process should be intuitive and manageable for non-expert users. Thus, for learning the individual actions we rely neither on kinesthetic teaching as in [1] nor on programming skills. The challenge in the context of human action demonstrations arises from the fact that these cannot be directly reproduced by the robot due to differing grasping capabilities and kinematic constraints. To overcome this issue the demonstrations must be adapted to the robot. Grasp planners [2], while able to generate high-quality grasps, lack the desired connection to the specific task demonstration and are thus not suited to always choose the appropriate grasp. In our work we introduce a system that is able to adapt the observed human grasp to the robot's capabilities, while also considering the observed underlying grasping and manipulation motions. We generate both robot gripper and base motions based on the demonstrated human hand and torso motions.

Most current work on learning the embedding of action models in tasks builds on existing predicates to instantiate or extract symbolic or logical rules from teacher demonstrations, as in Höfer and Brock [3]. In this work we do not rely on an expert to program those predicates; instead, we assume no prior semantic knowledge about tasks and actions, and moreover rely only on a small number of teacher demonstrations to learn them. When addressing the task learning itself, we focus on inferring the desired goal states from few demonstrations. A major challenge in this context is the potential ambiguity in the demonstrations. The teacher might demonstrate different valid object configurations, compare Fig. 1. Without prior knowledge about the task, we aim at generalizing these configurations and reproducing the task from unseen initial states without explicitly specifying a goal state. Traditional task and motion planning techniques [4] expect a concrete goal state description as input. In this work we propose a teach-and-improvise approach to generate the goals by formulating an optimization problem in which we maximize the likelihood that the final goal state aligns with the intention of the teacher given the demonstrations.

All authors are with the Institute of Computer Science, University of Freiburg, Germany. This work has been supported by the Baden-Württemberg Stiftung and the German Research Foundation under research unit BU 865/8-1 (HYBRIS).

Fig. 1: Teaching of a task. The left column depicts a few steps of the task of arranging the objects on the table. The right column shows the final states for three different demonstrations. Note that the task goals can be ambiguous.

II. PROBLEM STATEMENT

In our work we aim to enable a robot to learn a task from RGBD observations of human teacher demonstrations. The challenge lies in learning both the representation of the individual actions as well as the intention and composition of the task itself. We consider a task to consist of the individual manipulations of all objects involved in it. We assume that each object manipulation can be segmented into three parts: the reaching or grasping, the manipulation itself, and the releasing or retreat. For learning action models, a demonstration is given by the 6-DoF trajectories of the human hand and torso as well as of the handled objects. Hand and object trajectories describe how the object should be handled, while the torso motion gives insight into the general positioning for the action. As these trajectories are not directly executable by the robot, the action learning problem is to adapt them to the capabilities of the robot while retaining the intended demonstrated movements. For task learning, a demonstration consists of the manipulation actions performed for the task and the intermediate states of the world between them. A state is defined by all object poses, and the final state is considered to be an instance of the task goal. Here, different demonstrations of the same task are allowed to lead to ambiguous goals, and the object manipulations may be performed in different orders. The task learning problem is therefore to produce action sequences for previously unseen initial states of the same task that imitate the intention of the user without explicitly formulating a goal state.
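To make this representation concrete, the following minimal Python sketch organizes a demonstration as described above: per-action 6-DoF trajectories of hand, torso, and object with indices marking the reach/manipulate/release segmentation, plus a task-level sequence of object-pose states whose last entry is one observed goal instance. All class and field names are illustrative assumptions, not the authors' data structures.

```python
# Hypothetical data structures for the demonstration representation
# described above; names and fields are illustrative only.
from dataclasses import dataclass
from typing import Dict, List

Pose6D = List[float]  # e.g. [x, y, z, qx, qy, qz, qw] encoding one 6-DoF pose

@dataclass
class ActionDemonstration:
    """One object manipulation, segmented into its three phases."""
    hand_trajectory: List[Pose6D]    # 6-DoF human hand poses over time
    torso_trajectory: List[Pose6D]   # 6-DoF human torso poses over time
    object_trajectory: List[Pose6D]  # 6-DoF poses of the handled object
    # Indices splitting the trajectories into reach/grasp, manipulation,
    # and release/retreat segments.
    grasp_index: int
    release_index: int

@dataclass
class TaskDemonstration:
    """A full task demonstration: actions plus intermediate world states."""
    actions: List[ActionDemonstration]
    # Each state maps object identifiers to their poses; the last entry is
    # one observed instance of the (possibly ambiguous) task goal.
    states: List[Dict[str, Pose6D]]
```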

III. ACTION LEARNING

We propose an approach in which the robot motion is designed to follow the demonstrations as closely as possible while deviating as necessary to fulfill the constraints posed by its geometry. We account for the robot's grasping skills and kinematics as well as for collisions with the environment. Furthermore, for most actions we can assume that the grasp on the object is fixed during manipulation and that all trajectories should be smooth in the sense that consecutive poses are near each other. Incorporating these constraints and assumptions, we formulate a graph optimization problem to adapt the recorded human motions such that they are suited for the robot to learn a motion model. For details on the graph structure and the implementation we refer to our recent work [5], [6].
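The actual graph formulation is given in [5], [6]; as a rough illustration of the kinds of terms involved, the sketch below scores a candidate robot trajectory by its closeness to the demonstration, its smoothness, and a fixed grasp offset to the object. Poses are simplified to 3-D positions and collision and kinematic constraints are omitted, so the function and weights are assumptions for illustration only.

```python
# Illustrative cost for adapting a demonstrated motion to the robot:
# stay close to the demonstration, keep consecutive poses smooth, and keep
# a constant grasp offset to the object. Not the authors' implementation.
import numpy as np

def adaptation_cost(robot_poses, demo_poses, object_poses,
                    grasp_offset, w_demo=1.0, w_smooth=5.0, w_grasp=10.0):
    robot_poses = np.asarray(robot_poses)    # T x 3 candidate gripper positions
    demo_poses = np.asarray(demo_poses)      # T x 3 recorded hand positions
    object_poses = np.asarray(object_poses)  # T x 3 recorded object positions

    # Closeness to the human demonstration.
    demo_term = np.sum((robot_poses - demo_poses) ** 2)
    # Smoothness: consecutive poses should be near each other.
    smooth_term = np.sum(np.diff(robot_poses, axis=0) ** 2)
    # Fixed grasp: keep a constant offset to the object during manipulation
    # (applied to the whole trajectory here for simplicity).
    grasp_term = np.sum((robot_poses - (object_poses + grasp_offset)) ** 2)

    return w_demo * demo_term + w_smooth * smooth_term + w_grasp * grasp_term
```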

IV. TASK LEARNING

Task learning computes a sequence of manipulation actions $a_t$ and their corresponding intermediate goals $s_t$ that represent the intention of the demonstrations for a given initial state $s_0$. This is done without explicitly specifying any goal state. Based on pairwise object relations, we define an intention likelihood $\Psi(s_T)$, which captures how well a state $s_T$ fits as the intended task goal.

We further introduce templates $\lambda_t$ which serve as underlying frames in which the actions can be interpreted and executed. Exploiting the intention likelihood, we formulate a search-based optimization with

$$\max_{a_{0:T-1},\,\lambda_{0:T-1},\,s_{1:T}} \; \Psi(s_T) \;-\; \sum_{t=0}^{T-1} \mathrm{cost}(a_t) \qquad (1)$$

where $a_{0:T-1}$, $s_{1:T}$, and $\lambda_{0:T-1}$ are the desired manipulations, their corresponding goals, and the interpreting reference frames. We use Monte Carlo Tree Search to search for task solutions and to reason about interpretations of actions and their goals.
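As an illustration of how the objective in Eq. (1) could be evaluated for a candidate plan, the sketch below combines a placeholder intention likelihood, here a toy score comparing pairwise object distances in the final state to those observed in the demonstrations, with summed action costs. The concrete form of $\Psi$ and of the cost model is not specified here, so all helpers are assumptions.

```python
# Toy evaluation of the objective in Eq. (1) for a candidate plan.
import math
from typing import Callable, Dict, List, Tuple

State = Dict[str, Tuple[float, float, float]]  # object name -> 3-D position

def pairwise_relation_likelihood(state: State,
                                 demo_distances: Dict[Tuple[str, str], float],
                                 sigma: float = 0.05) -> float:
    """Placeholder for Psi: reward final states whose pairwise object
    distances resemble those seen in the demonstrations."""
    score = 0.0
    for (a, b), demo_d in demo_distances.items():
        d = math.dist(state[a], state[b])
        score += math.exp(-((d - demo_d) ** 2) / (2.0 * sigma ** 2))
    return score

def plan_score(final_state: State,
               actions: List[str],
               intention_likelihood: Callable[[State], float],
               action_cost: Callable[[str], float]) -> float:
    """Objective of Eq. (1): Psi(s_T) minus the summed action costs."""
    return intention_likelihood(final_state) - sum(action_cost(a) for a in actions)
```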

A. Tree Structure

Classical Monte Carlo Tree Search iteratively grows a search tree to approximate the returns of states and actions using Monte Carlo simulations. Typically there are two types of nodes: decision and chance nodes. Decision nodes select actions according to the node's associated state. The chance nodes reason about potential outcomes of the action, usually reflecting the stochasticity of the addressed problem. In our case we also use decision nodes to decide how to proceed from a given state. For the chance nodes we also consider a stochasticity, which, in contrast to unexpected or random events, results from the ambiguity in the intended goal of the task. Fig. 2 shows the structure of the proposed tree. For further insight and a general overview of MCTS-based algorithms and applications we refer the reader to Browne et al. [7].

Fig. 2: Proposed structure for the Monte Carlo Tree Search. The red circles denote action-selection nodes, which are used to reason about actions. Template-selection nodes are depicted as blue squares and are used to select templates for the corresponding actions, i.e., which stationary object should be used as reference to position the moved object. The green hexagons describe goal-selection nodes. These generate goals for the currently selected actions and templates.
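The three node types of Fig. 2 could be represented, for example, by the following schematic Python classes; names and fields are illustrative assumptions rather than the authors' implementation.

```python
# Schematic node types for the proposed search tree (Fig. 2).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # backed-up score estimate

@dataclass
class ActionSelectionNode(Node):
    """Reasons about which object manipulation to perform next."""
    state: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)

@dataclass
class TemplateSelectionNode(Node):
    """Chooses the template, i.e. which stationary object serves as the
    reference frame for positioning the moved object."""
    action: str = ""

@dataclass
class GoalSelectionNode(Node):
    """Samples a goal pose for the selected action and template."""
    template: str = ""
```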

B. Teach-and-Improvise

Our algorithm consists of four main steps. First, we select a promising leaf in the tree. Then the selected leaf is expanded by appending template-selection, goal-selection, and action-selection nodes. Using the intention likelihood, we simulate scores for the added nodes and, in a last step, backpropagate them to compute the best solutions. We repeat this for a predefined number of iterations or until no node can be extended further. The branch with the highest score gives us the plan for the task.
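The four steps map onto a generic MCTS-style loop such as the following sketch, where the selection, expansion, scoring, and backup rules are passed in as placeholder callables standing in for the details of the teach-and-improvise algorithm described above.

```python
# Generic loop mirroring the four steps above; all helpers are placeholders.
def teach_and_improvise(root, select_leaf, expand, simulate_score,
                        backpropagate, best_branch, max_iterations=1000):
    for _ in range(max_iterations):
        leaf = select_leaf(root)                 # 1) select a promising leaf
        if leaf is None:                         # stop if no node can be extended
            break
        children = expand(leaf)                  # 2) append template-, goal- and
                                                 #    action-selection nodes
        for child in children:                   # 3) simulate scores for the new
            child.value = simulate_score(child)  #    nodes via the intention likelihood
        backpropagate(leaf, children)            # 4) backpropagate to update the
                                                 #    best solutions along the path
    return best_branch(root)                     # highest-scoring branch = task plan
```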

V. EXPERIMENTAL EVALUATION

A. Action Learning

We attached markers to the hand and torso to record the human teacher's poses. Using the data of human demonstrations adapted by our approach, we build action models using a Gaussian mixture model representation. Fig. 3 shows exemplary trajectories generated for the actions of grasping a door handle and opening the door. A detailed experimental evaluation of the action learning can be found in our recent work [5], [6].
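As a minimal illustration of this modeling step, a Gaussian mixture could be fit to the adapted trajectory data, for example with scikit-learn as sketched below. The exact parameterization of the joint gripper and base model is described in [5], [6]; this sketch is a simplification under assumed inputs.

```python
# Fit a Gaussian mixture model to adapted trajectory data (illustrative).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_action_model(adapted_trajectories, n_components=5):
    """adapted_trajectories: list of T_i x D arrays, e.g. stacked gripper and
    base poses (plus a phase index) after adaptation to the robot."""
    data = np.vstack(adapted_trajectories)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(data)
    return gmm
```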

B. Task Learning

1) Improvising Task Goals: Using our proposed approach we conducted experiments to reproduce demonstrated object arrangements from random starting states. Fig. 4 shows how the system improvises solutions for the task demonstrated in Fig. 1 for different starting states. The third row of Fig. 4 shows the proposed solution for the same task amid clutter of other, irrelevant objects.

Fig. 3: The top row illustrates exemplary generated trajectories to grasp the handle (left) and then open the door (right). The bottom row shows an image of the action demonstration (left) and the PR2 during task execution (middle and right).

Fig. 4: Improvising solutions for the task of arranging the objects on the table. The left column visualizes the respective initial states. The right column shows the solutions generated by our approach. The third row shows an improvised solution in clutter of other objects. In this case as many relations as possible are satisfied, while others, i.e., the relations concerning the location on the table, cannot be fulfilled.

2) Robot Experiments: We implemented our approach on a PR2 robot. Fig. 5 shows the accomplished improvised solution for the object arrangement task, where the individual manipulation actions were carried out by an existing motion planner.

VI. CONCLUSION AND OUTLOOK

We have shown that it is possible for a robot to learn task and action representations directly from a human by observing demonstrations with an RGBD camera. On the one hand, we presented an approach to learn mobile manipulation actions relying only on human demonstrations and knowledge about the robot's kinematic and grasping capabilities.

Fig. 5: On the top row we show two final states of demonstrations for the task of arranging the objects on the table. In the bottom row we see a random initial state with the same objects (left) and the improvised arrangement achieved by the robot (right).

On the other hand, we demonstrated how demonstrations of tasks can be used to infer solutions that satisfy the user's intention. Experiments show that we are able to reproduce complex actions with constrained trajectories as well as to improvise solutions for tasks involving multiple objects.

So far the described modules are implemented independently on the robot. Current work aims to integrate the approach for learning individual actions into our task learning approach. Our experiments have shown that, on the one hand, the representation of actions as a joint model for base and arm motions is well suited for concatenation as part of complex tasks: it allows an easy transition between actions, as it is able to reposition the robot base for the next action's demands. On the other hand, the task learning module provides the information about which object to handle and the desired goal pose. So far these were manually given to the robot when reproducing demonstrated actions. Experiments with the task learning framework only considered point-to-point actions that do not require a certain trajectory. The combination of both approaches will allow complex actions to be included as part of the planning, for example, opening a shelf door to fetch something from inside.

REFERENCES

[1] S. Calinon, T. Alizadeh, and D. G. Caldwell, "On improving the extrapolation capability of task-parameterized movement models," in Int. Conf. on Intelligent Robots and Systems (IROS), 2013.

[2] D. Fischinger, A. Weiss, and M. Vincze, "Learning grasps with topographic features," Int. Journal of Robotics Research, vol. 34, no. 9, pp. 1167–1194, 2015.

[3] S. Höfer and O. Brock, "Coupled learning of action parameters and forward models for manipulation," in Int. Conf. on Intelligent Robots and Systems (IROS), 2016.

[4] M. Gharbi, R. Lallement, and R. Alami, "Combining symbolic and geometric planning to synthesize human-aware plans: Toward more efficient combined search," in Int. Conf. on Intelligent Robots and Systems (IROS), 2015.

[5] T. Welschehold, C. Dornhege, and W. Burgard, "Learning manipulation actions from human demonstrations," in Int. Conf. on Intelligent Robots and Systems (IROS), 2016.

[6] ——, "Learning mobile manipulation actions from human demonstrations," in Int. Conf. on Intelligent Robots and Systems (IROS), 2017.

[7] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, pp. 1–43, 2012.