
Learning How to Avoid an Obstacle from Human Demonstration

A. M. Ghalamzan E., C. Paxton, G. Hager, L. Bascetta

Abstract—Dynamic Movement Primitives (DMP) have been proposed to teach a robot how to perform a simple task. A major challenge in this approach is how the robot can use the learned skill in the presence of an obstacle, since obstacle avoidance is a serious concern in robotic problems. In prior work, the problem of obstacle avoidance has been explicitly accounted for in the DMP formulation; as a result, a person who is not a roboticist cannot teach the robot a desired response through DMP. A person may respond differently to different obstacles, and we may wish the robot to learn those particular responses. We build upon dynamic movement primitives with a reward function that is learned from the demonstration and captures the response of an individual to an obstacle. This method allows a user to teach a robot the desired responses to different obstacles.

I. INTRODUCTION

Dynamic Movement Primitives (DMP) have been proposed to teach a robot a skill from a task demonstration and reproduce the task with different start and end points, an approach inspired by biological systems [7, 10]. A challenge that arises in this task reproduction is the problem of adapting to a new environment with obstacles.

For example, assume that a person wishes to teach a household service robot how to set a dinner table. He/she may walk the robot through the task and teach it how to bring the dishes from the kitchen and put them on the table. DMP may be used by the robot to learn the appropriate path that it needs to follow. However, the learned path is not applicable when there is a cat or a chair in the robot's way. In this case, the robot should ideally be capable of adapting the learned skill to the new situation based on the provided demonstrations. Prior work explicitly planned for obstacle avoidance and combined it with DMP [3]. This does not allow a user to teach the robot a desired response to an object class. Here, we propose a method to teach the robot by demonstrating how to respond to different objects. In this example, the person can teach the robot to keep far from the cat and not very far from the chair. This is the most convenient way for a user who is not a roboticist to show the robot how to respond to different objects.

A range of methods has been developed to cope with new environments in the presence of obstacles in combination with DMP. In prior work by Calinon et al. [3], the variability of different demonstrations along the movement was used to estimate the stiffness matrices in a modified DMP. A risk indicator modulating a repulsive force was then defined to enable the robot to safely avoid collision with a human. In other work [5], a model of a task was developed using a dynamical system modulated by a Gaussian Mixture Model (GMM). This was combined with reinforcement learning to enable the robot to learn a new way of accomplishing a task in a constrained environment. A modified DMP model was initialized with imitation learning in [8], and reinforcement learning was again used to compute the optimal parameter values of the model for a new environment. In another work, a dynamic potential field that depends on the relative distance and velocity between the end-effector and the obstacle was combined with DMP to solve the problem of obstacle avoidance [9]. In this work, the gradient of the potential field was added to the acceleration term of the differential equation of DMP. In [6], the relative angle between the end-effector velocity and the vector connecting the end-effector position to the obstacle position is modified based on the relative position and velocity of the end-effector and the obstacle. A perturbation term, which is a function of this relative angle, is then added to the DMP formulation to guarantee that the generated trajectory does not collide with the obstacle.

The main challenge addressed in these works was to solve the problem of obstacle avoidance in combination with the use of DMP for robot learning from demonstration. In these works, the obstacle avoidance parameters were fixed explicitly in the formulation. Therefore, they do not provide the flexibility needed to teach the robot the desired responses to different objects.

We propose an approach that builds upon DMP [4]: a reward function learned from human demonstration that incorporates both the skill obtained by the DMP and the desired user response to different obstacles. In this way, we benefit from the use of DMP, which produces a nominal path, as well as from the capability of inverse optimal control to capture the responses of a user to different obstacles. Inverse optimal control has been proposed as a way to recover a reward function from a set of demonstrations [1, 15].

The remainder of this paper is organized as follows. In Section II the problem formulation is presented. Then, in Section II-A, the Robot Learning from Demonstration (RLfD) problem is formulated in terms of an optimal control problem. Section III describes how a reward function underlying a set of task demonstrations is computed. In Section IV a case study is presented to illustrate how the proposed method is capable of learning a reward function from a data set collected with the da Vinci robot.

II. PROBLEM FORMULATION

Many robotic problems can be formulated as an optimal control problem. The optimal control approach computes the optimal action a ∈ A at each state s ∈ S for a new unobserved environment. The optimal control problem is defined by a state space s ∈ R^n, an action space a ∈ R^m, a state transition function T(s_k, a_k) : R^{n+m} → R^n, and a reward function R(s_k) : R^n → R. Here the reward function is assumed to be a function of the state, R(s_k = {x_k, f_k}), including the state of the actor x ∈ R^p and the features of the environment f ∈ R^q, such that n = p + q.

We assume a nominal model learned by DMP, which is the optimal solution for performing the task with no obstacles in the environment. Furthermore, we assume that there exists at least one optimal way to perform the task in the presence of obstacles. We assume that we have a set of task demonstrations D such that each task demonstration ζ_d ∈ D, ∀d = 1, ..., D, is an optimal solution to an unknown reward function in the corresponding environment, Sc_d = {O_1, ..., O_L}, ∀d = 1, ..., D, where D is the number of demonstrations and L is the number of obstacles O in the dth scene. Therefore, the recovered reward function must be a function of the corresponding environmental features as well as of the learned nominal model. In other words, the features of the environment depend on x given a scene; however, for the sake of simplicity, we write f instead of f(x).
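For concreteness, the objects introduced above (scenes, demonstrations, and the feature map f) can be summarized in a few lines of Python. This is only an illustrative sketch; all names in it (Scene, Demonstration, features) are hypothetical and not part of the method itself.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Scene:
        """Environment Sc_d: the positions O_1, ..., O_L of the obstacles."""
        obstacles: list            # each entry is an np.ndarray of shape (p,)

    @dataclass
    class Demonstration:
        """One demonstrated trajectory zeta_d = {x_1, ..., x_K} in scene Sc_d."""
        states: np.ndarray         # shape (K, p): actor states x_k
        scene: Scene

    def features(x, scene):
        """Environmental features f(x) for a single-obstacle scene,
        e.g. f_k = x_k - O, as used later in eq. (3)."""
        return x - scene.obstacles[0]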

A. Optimal Control Formulation

Here we focus on an episodic, deterministic robot learning from demonstration problem with a fixed time horizon K, discrete time, a continuous state space, a continuous action space, and a known world model. Given a set of task demonstrations D in different environments, the system is going to learn the underlying reward function R of D. We decompose the underlying reward function R into two components: a nominal component R_N(x_k), whose optimal solution is identical to the path produced by the learned nominal model, and an adaptation component R_A(f_k), which encodes the response of the robot to the environmental features. Given a reward function, we assume the robot maximizes the expected return ρ_π = Σ_{k=1}^{K-1} R(s_{k+1}):

π* = argmax_π Σ_{k=1}^{K-1} R(s_{k+1}) = argmax_π Σ_{k=1}^{K-1} [ R_N(x_{k+1}) + R_A(f_{k+1}) ]
subj. to  s_{k+1} = T(s_k, a_k),   a_k ∈ U,    (1)

where π = {a_1, ..., a_{K-1}} is the sequence of actions that the robot takes to accomplish the task and U ⊆ A is a polyhedral region that is a feasible subset of the set of all actions A. By following the optimal policy π* maximizing (1), the sequence of states that the robot visits according to its state transition function is ζ = {s_1, s_2, ..., s_K}, where s_1 is a given initial condition.
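As an illustration of eq. (1), the return of a candidate action sequence can be evaluated by rolling the state forward through the transition function and summing the two reward components. The sketch below is a minimal Python version in which T, R_N, and R_A are assumed to be supplied as callables.

    def rollout_return(s1, actions, T, R_N, R_A):
        """Evaluate the return of eq. (1) for a given action sequence.
        s1      : initial state s_1 = (x_1, f_1)
        actions : [a_1, ..., a_{K-1}]
        T       : transition function, s_{k+1} = T(s_k, a_k)
        R_N, R_A: nominal and adaptation reward components."""
        s, total = s1, 0.0
        for a in actions:
            s = T(s, a)            # s_{k+1} = T(s_k, a_k)
            x, f = s               # actor state and environmental features
            total += R_N(x) + R_A(f)
        return total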

1) Nominal component of the reward function: We represent the model producing the nominal path as Dynamic Movement Primitives, which encode the observed trajectory in a dynamical system that is then used to reproduce the trajectory for a new end point [4]. We impose a quadratic reward function around the path generated by the learned DMP to encourage the controller to take actions that result in robot states close to the nominal path, giving us the nominal reward function:

R_N(x_k : Q) = -(x_k - x_k^N)^T Q (x_k - x_k^N),   x_k^N ∈ ζ,   k = 1, ..., K,    (2)

where x_k^N ∈ R^n is the point on the nominal path ζ, learned by DMP, such that the line segment x_k^N x_k is perpendicular to ζ.

2) Adaptation component of the reward function: The optimal solution for the adaptation component is not invariant over different environments Sc_d, ∀d = 1, ..., D; for a new given environment, Sc_new, the actor needs to determine a specific optimal path model of the task. Since the deviation from the nominal model is local, a Gaussian function with covariance matrix R, learned from the demonstration data set based on the features of the environment, is employed to give the reward function a local effect. This yields the adaptation component of the reward function R_A:

R_A(f_k : R) = -exp(-f_k^T R^{-1} f_k),    (3)

where f_k ∈ R^q is the vector of environmental features at x_k captured during the dth demonstration, e.g. f_k(x_k, Sc_d) = (x_k - O), where O is the position of an added obstacle in the dth scene, Sc_d.

Accordingly, the general reward function characterizing the demonstrated behavior in a different environment with an added obstacle is a combination of the adaptation component (eq. (3)) and the nominal component (eq. (2)):

R(x_k, Sc_d : θ) = -(x_k - x_k^N)^T Q (x_k - x_k^N) - exp(-f_k^T R^{-1} f_k),    (4)

where θ = {Q, R}, and Q and R are positive definite matrices.
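As a concrete illustration, the combined reward of eq. (4) can be written in a few lines of numpy. This is only a sketch, in which the nominal point x_k^N is assumed to come from the learned DMP path and f_k from the feature map of the scene.

    import numpy as np

    def reward(x_k, x_nominal_k, f_k, Q, R):
        """Combined reward of eq. (4): a quadratic penalty around the
        nominal DMP path (eq. (2)) plus a Gaussian-shaped penalty on the
        environmental features (eq. (3)). Q and R are positive definite."""
        dx = x_k - x_nominal_k
        nominal = -dx @ Q @ dx                                  # R_N(x_k : Q)
        adaptation = -np.exp(-f_k @ np.linalg.solve(R, f_k))    # R_A(f_k : R)
        return nominal + adaptation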

III. INVERSE OPTIMAL CONTROL

Inverse optimal control aims at finding a reward function whose optimal solution is as close as possible to the demonstrations. Given an estimated reward function, one can use existing methods, such as dynamic programming or reinforcement learning, to find a solution ζ_{R(θ, Sc_d)} to eq. (1). Therefore, to learn the parameters of the reward function, we minimize the cumulative distance between the solution to the learned reward function ζ_{R(θ, Sc_d)} and the demonstrations:

θ = argmin_θ Σ_{d=1}^{D} Σ_{k=1}^{K} ‖ ζ_{R(θ, Sc_d)}(k) - ζ_d(k) ‖,    (5)

where ζ_d(k) and ζ_{R(θ, Sc_d)}(k) are the corresponding points on the demonstration and on the solution to the estimated reward function. Because the objective is nonlinear, a stochastic optimization method [14] is employed to find a solution to eq. (5); this numerical method finds a locally optimal solution to the nonconvex objective function.
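The fit of eq. (5) can be organized as an outer search over θ = {Q, R} around an inner solver of eq. (1). The paper uses the stochastic optimization method of [14]; the sketch below substitutes a plain random search purely for illustration, solve_reward is a hypothetical stand-in for any solver of eq. (1) (such as the MPC described in the next subsection), and the 2-D parameter dimensions follow the planar case study of Section IV.

    import numpy as np

    def ioc_fit(demos, solve_reward, n_samples=200, seed=0):
        """Estimate theta = {Q, R} by (approximately) minimizing eq. (5).
        demos        : list of (zeta_d, scene) pairs, zeta_d of shape (K, p)
        solve_reward : callable (theta, scene) -> trajectory of shape (K, p)."""
        rng = np.random.default_rng(seed)
        best_theta, best_cost = None, np.inf
        for _ in range(n_samples):
            # sample candidate positive definite (diagonal) Q and R
            theta = {"Q": np.diag(rng.uniform(0.1, 10.0, size=2)),
                     "R": np.diag(rng.uniform(1e-4, 1e-1, size=2))}
            cost = 0.0
            for zeta_d, scene in demos:
                zeta_r = solve_reward(theta, scene)
                cost += np.sum(np.linalg.norm(zeta_r - zeta_d, axis=1))
            if cost < best_cost:
                best_theta, best_cost = theta, cost
        return best_theta

Any sampling-based or gradient-free optimizer can replace the random search here; the essential structure is the nested loop in which each candidate θ is scored by how closely the induced optimal trajectories match the demonstrations.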

A. Solution to the learned reward function

In order to compute the optimal solution to the learned reward function for a new scenario, e.g. a new arrangement of obstacles with different initial and goal positions, we maximize the expected return of eq. (1). In a finite-horizon problem, optimal control aims at finding the optimal policy by determining a sequence of actions a maximizing the expected return. Model predictive control (MPC) is employed to find an optimal solution to the learned reward function. Considering a prediction time horizon H, the optimal action corresponding to the problem in eq. (1) at the kth time step can be formulated as follows:

a_k = argmax_{a_k} Σ_{h=k}^{k+H} [ -(x_h - x_h^N)^T Q (x_h - x_h^N) - exp(-f_h^T R^{-1} f_h) ]
subj. to  x_{h+1} = A x_h + B a_h,   h = k, ..., (k+H),
          a_h ∈ A,   x_h ∈ X,   k = 1, 2, ..., K,    (6)

where X and A are the polyhedral feasible sets of actor states and actions, respectively, and π* = {a_1, ..., a_{K-1}} is the sequence of optimal actions. In eq. (6) we consider the transition function of the actor in eq. (1) to be a stable second-order dynamical system, as given by eq. (7):

x_{k+1} = A x_k + B a_k.    (7)

We use this linearized dynamical model of the robot to find the optimal solution to the estimated reward function. To find a solution to eq. (6), we use minConf with a quasi-Newton strategy and limited-memory BFGS updates by Schmidt [11].
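A minimal sketch of one receding-horizon step of eq. (6) is given below, assuming the linear model of eq. (7). The paper solves this problem with minConf [11]; here SciPy's L-BFGS-B is substituted, and the polyhedral constraints on X and A are omitted for brevity, so this is an unconstrained approximation rather than the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def mpc_step(x_k, A, B, Q, R, x_nominal, obstacle, H):
        """One receding-horizon step of eq. (6): optimize the actions
        a_k, ..., a_{k+H-1} under x_{h+1} = A x_h + B a_h and return the
        first action. x_nominal holds the H nominal path points ahead."""
        m = B.shape[1]

        def neg_return(a_flat):
            a_seq = a_flat.reshape(H, m)
            x, cost = x_k, 0.0
            for h in range(H):
                x = A @ x + B @ a_seq[h]                  # predicted x_{h+1}
                dx = x - x_nominal[h]                     # deviation from nominal
                f = x - obstacle                          # feature f_h = x_h - O
                cost += dx @ Q @ dx + np.exp(-f @ np.linalg.solve(R, f))
            return cost                                   # minus the return

        res = minimize(neg_return, np.zeros(H * m), method="L-BFGS-B")
        return res.x.reshape(H, m)[0]                     # apply the first action

In practice only the first action is applied and the optimization is repeated at the next time step with the updated state, which is what gives the receding-horizon behavior.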

It is worth mentioning that the asymptotic stability of the proposed MPC formulation for a path planning problem was discussed by Xu et al. [13].

IV. CASE STUDY

To demonstrate how the proposed method can be applied to a real scenario, the da Vinci robot shown in Fig. 1(b) was used to collect a set of demonstrations of a simple task: moving an object from point PI to point PG of the designed structure shown in Fig. 1(a).

Fig. 1. (a) The task model used for data collection with the da Vinci surgical robot, with a single obstacle (marker) and the nominal path that the expert follows in the absence of obstacles (green line); (b) the da Vinci setup used to collect expert demonstrations.

The operator was told to maximize the distances from both walls while performing the task. A marker was fixed to the scene during performance of the task as an obstacle, and the distance to the marker was used as an environmental feature f. The operator was asked to perform the task several times with different positions of the marker; these demonstrations constitute our training data set and can be seen in Fig. 2(b). As shown in this figure, the operator responded differently to the same environmental stimulus, resulting in noisy responses to the same obstacle. Since Nonlinear Principal Component Analysis (NLPCA) has been shown to capture the principal curve underlying a data set, we used NLPCA [12] to estimate a nonlinear principal component of the data set and extract a noise-free input for training the DMP.
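NLPCA can be realized in several ways; as a rough sketch, a bottleneck autoencoder (one standard realization of NLPCA) can project the noisy demonstration points onto a one-dimensional nonlinear component. The code below uses scikit-learn's MLPRegressor as a stand-in for the NLPCA toolbox of [12]; the layer sizes and training settings are illustrative assumptions, not the settings used in the experiment.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def denoise_demonstrations(X):
        """Autoencoder-style stand-in for NLPCA: map the noisy 2-D
        demonstration points X (shape (N, 2)) through a network with a
        one-unit bottleneck and use the reconstruction as a noise-free
        approximation of the underlying principal curve."""
        ae = MLPRegressor(hidden_layer_sizes=(10, 1, 10), activation="tanh",
                          solver="lbfgs", max_iter=5000, random_state=0)
        ae.fit(X, X)            # train the network to reconstruct its input
        return ae.predict(X)    # denoised points on the learned curve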

A stochastic optimization method [14] is employed to find the underlying reward function of the set of demonstrations using eq. (5). The resulting reward function, along with the demonstrations and the corresponding solutions to the recovered reward function, is shown in Fig. 2 for different positions of the obstacle. The reward function results in a smooth trajectory even if the demonstrations are noisy; this is highly desirable in many robotic tasks.

DS-GMR [4], a modified version of DMP, is employed to generate a smooth path when the target point P'_G is different from the demonstrated one, P_G. Fig. 3 shows the new target point, the path generated by DS-GMR, the corresponding reward function in the presence of an obstacle, and the solution to the obtained reward function. The reward function combines the nominal component and the adaptation component to generalize the observed behavior to the new target point P'_G and the new environment, e.g. a new obstacle location. The proposed approach combines NLPCA, used to reduce the effect of noise on the demonstrations, the nominal model, used to generalize the observed task to a new target point, and the adaptation component, used to capture the response of the actor to the features of the environment, into a single reward function. This reward function can cope with noisy demonstrations, perturbed target positions, and different arrangements of obstacles.

Fig. 2. The contours of the learned reward function for the da Vinci experiment, the corresponding demonstrations (green lines), and the computed optimal solutions to the reward function. Areas with hot colors represent higher reward.

Fig. 3. The estimated nominal path (thick green line) with the initial point P_I and target point P_G, and the path computed by DS-GMR (thin blue line) for the new target point P'_G. The contour of the learned reward function is based on the path computed by DS-GMR. O is an obstacle added to the scene, and the corresponding computed solution to the learned reward function is shown with a black dashed line.

The results shown in Figs. 2 and 3 illustrate the effectiveness of the approach in capturing the response to an obstacle and generalizing the learned skill to the new target point.

The human responses to two different objects are simulated, and the learned reward functions together with the corresponding demonstrations are shown in Fig. 4. This example illustrates how a user can teach the robot his/her own desired responses to different obstacles. Then, based on the object type, the corresponding component of the reward function can be computed.

Fig. 4. The contours of the learned reward function in the case of two different objects, with the corresponding demonstrations (black lines).

We are going to apply the proposed method to the problem in which a robot learns how to sweep objects on a table, a challenging problem explored in prior work [2]. In this scenario, the robot learns, using DS-GMR, the skill of moving a small brush from an initial position to the position of a dust pan when there is no object on the table. We then add two types of objects to the table, object A and object B; the robot recognizes the object types and learns how to move object A to the dust pan while avoiding collision with object B.

V. CONCLUSION

In this work, we discussed the problem of learning obstacle avoidance to improve learning from demonstration with DMP. This is a particularly challenging problem for robot learning from demonstration, in which the robot is to learn a new skill from a set of human demonstrations. In prior work, the problem of obstacle avoidance has been addressed by combining obstacle avoidance planning methods with DMP. In this case, the corresponding obstacle avoidance method and its parameters must be determined explicitly by a programmer. Here, we proposed a method of robot learning from demonstration that enables a robot to learn how to respond to an obstacle from human demonstrations. To do so, we built upon dynamic movement primitives with a reward function that is learned from demonstration and captures the response of an individual to obstacles. This approach allows the robot to learn a skill from demonstrations and to adapt that skill to avoid collision with obstacles.

REFERENCES

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pages 1–9. ACM, 2004.
[2] T. Alizadeh, S. Calinon, and D. Caldwell. Learning from demonstrations with partially observable task parameters. In Proceedings of the IEEE International Conference on Robotics and Automation, 2014.
[3] S. Calinon, I. Sardellitti, and D. G. Caldwell. Learning-based control strategy for safe human-robot interaction exploiting task and robot redundancies. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 249–254. IEEE, 2010.
[4] S. Calinon, Z. Li, T. Alizadeh, N. G. Tsagarakis, and D. G. Caldwell. Statistical dynamical systems for skills acquisition in humanoids. In 12th IEEE-RAS International Conference on Humanoid Robots, pages 323–329. IEEE, 2012.
[5] F. Guenter, M. Hersch, S. Calinon, and A. Billard. Reinforcement learning for imitating constrained reaching movements. Advanced Robotics, 21(13):1521–1544, 2007.
[6] H. Hoffmann, P. Pastor, D. Park, and S. Schaal. Biologically-inspired dynamical systems for movement generation: automatic real-time goal adaptation and obstacle avoidance. In IEEE International Conference on Robotics and Automation, pages 2587–2592. IEEE, 2009.
[7] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 2, pages 1398–1403. IEEE, 2002.
[8] P. Kormushev, S. Calinon, and D. G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3232–3237. IEEE, 2010.
[9] D. Park, H. Hoffmann, P. Pastor, and S. Schaal. Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields. In 8th IEEE-RAS International Conference on Humanoid Robots, pages 91–98. IEEE, 2008.
[10] S. Schaal. Dynamic movement primitives: a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pages 261–280. Springer, 2006.
[11] M. Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, 2010.
[12] M. Scholz. Validation of nonlinear PCA. Neural Processing Letters, 36(1):21–30, 2012.
[13] B. Xu, A. Kurdila, and D. Stilwell. A hybrid receding horizon control method for path planning in uncertain environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4887–4892. IEEE, 2009.
[14] A. M. Zanchettin, A. Calloni, and M. Lovera. Robust magnetic attitude control of satellites. IEEE/ASME Transactions on Mechatronics, 18(4):1259–1268, 2013.
[15] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.