
Movement Primitives via Optimization

Anca D. Dragan, Katharina Muelling, J. Andrew Bagnell, and Siddhartha S. Srinivasa

The Robotics Institute, Carnegie Mellon University

Abstract—We formalize the problem of adapting a demonstrated trajectory to a new start and goal configuration as an optimization problem over a Hilbert space of trajectories: minimize the distance between the demonstration and the new trajectory subject to the new end point constraints. We show that the commonly used version of Dynamic Movement Primitives (DMPs) implements this minimization in the way it adapts demonstrations, for a particular choice of the Hilbert space norm. The generalization to arbitrary norms enables the robot to select a more appropriate norm for the task, as well as learn how to adapt the demonstration from the user. Our experiments show that this can significantly improve the robot's ability to accurately generalize the demonstration.

I. Introduction

We focus on the problem of learning motor skills from demonstration, in which a user demonstrates a trajectory to a robot (like the gray trajectory in Fig.1), for instance through kinesthetic demonstration, and the robot adapts it to new conditions that it faces, such as a new start or goal configuration (the blue trajectory in Fig.1). This problem is important in learning motion skills from demonstration [1], as well as in learning from experience using trajectory libraries [2, 3].

Among several tools for addressing this problem [1, 4–6], a commonly used one is a Dynamic Movement Primitive (DMP) [7, 8]. DMPs have seen wide application across a variety of domains, including biped locomotion [9], grasping [10], placing and pouring [11], dart throwing [12], ball paddling [13], pancake flipping [14], playing pool [15], and handing over an object [16].

DMPs represent a demonstration as a dynamical system tracking a moving target configuration, and adapt it to new start and goal constraints by simply changing the start and goal parameters in the equation of the moving target. The adaptation process is the same regardless of the task and of the user, and is merely one instance of a larger problem.

Our work introduces a generalization of this adaptation process. We provide a variational characterization of the problem by formalizing the adaptation of a demonstrated trajectory to new endpoints as an optimization over a Hilbert space of trajectories (Sec. II). We find the closest trajectory to the demonstration in the linear subspace induced by the new endpoint constraints (Fig.2). Distance (the notion of "closer") is measured by the norm induced by the inner product in the space.

Using this formalism, different choices for the inner product lead to different adaptation processes. We prove that DMPs implement this optimization in the way they adapt trajectories, for a particular choice of norm (Sec. III). We do so by proving that when updating the endpoints, the moving target tracked by the dynamical system adapts (as in Fig.1(b)) using the very same norm that we often use in trajectory optimizers like CHOMP [17] (we denote this norm by A).

Fig. 1: (a) Using a norm M for adaptation propagates the change in the start and goal, from {s, g} to {ŝ, ĝ}, to the rest of the trajectory, changing ξ_D into ξ̂. The difference between the two as a function of time is plotted in blue. (b) In contrast, DMPs represent the demonstration as a spring damper system tracking a moving target trajectory T_D, compute differences f_D (purple) between T_D and the straight line trajectory, and apply the same differences to the new straight line trajectory between the new endpoints. This results in a new target trajectory T̂ for the dynamical system to track. When M = A, the velocity norm typically used in CHOMP [17], the two adaptations are equivalent. In general, different norms M would lead to different adaptations.

We then show that this also implies that the adaptation in the trajectory space, obtained by then tracking the adapted target, is also the result of optimizing a norm based on A.

Beyond providing a deeper understanding of DMPs and what criteria they are inherently optimizing when adapting demonstrations, our generalization frees the robot from a fixed adaptation process by enabling it to use any inner product (or norm). Because computing the minimum norm adaptation is near-instant, any such adaptation process can be used in the DMP to obtain the new moving target trajectory.

Thus, we can select a more appropriate norm based on the task at hand (Sec. IV-A). What is more, if the user is willing to provide a few examples of how to adapt the trajectory as well, then the robot can learn the desired norm (Sec. IV-B): the robot can learn from the user not only the trajectory, but also how to adapt the trajectory to new situations.

We conduct an experimental analysis of the benefit of learning a norm, both with synthetic data where we have ground truth and with kinesthetic demonstrations on a robot arm. Our results show a significant improvement in how well the norm that the robot learns is able to reconstruct a holdout set of demonstrations, compared to the default DMP norm.

Overall, we contribute a deeper theoretical understanding of DMPs that relates them to trajectory optimization, and also leads to practical benefits for learning from demonstration that help broaden the use of DMP-like algorithms.

II. Hilbert Norm Minimization

In this section, we formalize trajectory adaptation as a Hilbert norm minimization problem. We then derive the solution to this problem, and study the case in which translating trajectories carries no penalty. This is the case for the norm stemming from a common trajectory optimization objective, as well as for the norm DMPs use in their adaptation process.

A. Problem Statement

Trajectories are functions ξ : [0, 1] → Q mapping time to robot configurations. We allow the time index to be either discrete or continuous. Given a demonstrated trajectory ξ_D, we propose to adapt it to a new start ŝ (the robot's starting configuration) and a new goal ĝ by solving:

$$\hat{\xi} = \arg\min_{\xi \in \Xi} \|\xi_D - \xi\|^2_M \quad \text{s.t.} \;\; \xi(0) = \hat{s}, \;\; \xi(1) = \hat{g} \tag{1}$$

where M is the norm defined by the inner product in the Hilbert space of trajectories, ||ξ||²_M = ⟨ξ, ξ⟩. Fig.2 illustrates this problem. Different inner products lead to different Ms, which in turn lead to different adaptations.

B. Solution

In general, M induces a linear operator. When time is discrete, M is a matrix, and ||ξ||²_M = ξᵀMξ. The Lagrangian of Eq.(1) is

$$\mathcal{L} = (\xi_D - \xi)^T M (\xi_D - \xi) + \lambda^T(\xi(0) - \hat{s}) + \gamma^T(\xi(1) - \hat{g}) \tag{2}$$

Taking the gradient w.r.t. ξ, λ, and γ:

$$\nabla_\xi \mathcal{L} = M(\xi_D - \xi) + (\lambda, 0, \dots, 0)^T + (0, \dots, 0, \gamma)^T \tag{3}$$

$$\nabla_\lambda \mathcal{L} = \xi(0) - \hat{s}, \quad \nabla_\gamma \mathcal{L} = \xi(1) - \hat{g} \tag{4}$$

Thus, the solution is:

$$\hat{\xi} = \xi_D + M^{-1}(\lambda, 0, \dots, 0, \gamma)^T \tag{5}$$

where the vectors λ and γ are set by Eq.(4). This has an intuitive interpretation: correct the start and the goal, and propagate the differences across the trajectory in a manner dictated by the norm M (Fig.1(a)). Fig.2 depicts the geometry of the space.
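To make the closed form concrete, the following is a minimal NumPy sketch of Eqs.(4)-(5) in the discrete case, for a single degree of freedom and a full-rank M; the function name and the discretization into N waypoints are our assumptions, not the paper's:

```python
import numpy as np

def adapt_min_norm(xi_d, M, s_new, g_new):
    """Minimum-M-norm adaptation, Eq. (5): xi_hat = xi_d + M^{-1}(lambda, 0, ..., 0, gamma)^T.

    xi_d: (N,) demonstrated waypoints for one DOF; M: (N, N) full-rank norm matrix.
    """
    N = len(xi_d)
    E = np.zeros((N, 2))
    E[0, 0] = 1.0   # selects the start waypoint
    E[-1, 1] = 1.0  # selects the goal waypoint
    Minv_E = np.linalg.solve(M, E)  # M^{-1} applied to the two constraint directions
    # Set (lambda, gamma) so that the endpoint constraints of Eq. (4) hold.
    multipliers = np.linalg.solve(E.T @ Minv_E,
                                  np.array([s_new - xi_d[0], g_new - xi_d[-1]]))
    return xi_d + Minv_E @ multipliers
```

For a multi-DOF trajectory with no coupling between joints, the same computation can be applied to each joint independently.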

C. Free Translations

Oftentimes, we are interested in being able to translate trajectories at no cost, i.e. if ξ̂ = ξ + ξ_k, with ξ_k(t) = k, ∀t (a constant valued trajectory), then ||ξ − ξ̂||_M = 0, ∀k. However, that makes M a semi-norm, as ⟨ξ_k, ξ_k⟩ = 0, ∀k, which makes the problem ill-posed.

Fig. 2: We adapt ξ_D by finding the closest trajectory to it that satisfies the new end point constraints. The x axis is the start-goal tuple, and the y axis is the rest of the trajectory. M warps the space, transforming (hyper)spheres into (hyper)ellipsoids. The space of all adaptations of ξ_D is a linear subspace of Ξ.

1) Why Free Translations: One example of wide applicability stems from the integral over squared velocities along the trajectory, a criterion often used in trajectory optimization [17–19]. Let

$$C[\xi] = \frac{1}{2}\int \dot{\xi}(z)^T \dot{\xi}(z)\, dz = \frac{1}{2}\, \xi^T A \xi \tag{6}$$

with A = KᵀK, K being the finite differencing matrix. Then M = A is such a semi-norm, as every constant trajectory ξ_k has norm 0:

$$\xi_k^T A \xi_k = 2C[\xi_k] = 2\int 0 \, dz = 0, \;\; \forall k$$

The CHOMP trajectory optimizer [17] is often implemented using this velocity norm to measure distances between trajectories.
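As an illustration of why A is only a semi-norm, here is one way (our construction, assuming a uniform discretization) to build A = KᵀK from a forward finite-differencing matrix K; a constant trajectory then has zero cost:

```python
import numpy as np

def velocity_seminorm_matrix(N, dz=1.0):
    """A = K^T K as in Eq. (6), with K the (N-1) x N forward finite-differencing matrix."""
    K = np.zeros((N - 1, N))
    for i in range(N - 1):
        K[i, i], K[i, i + 1] = -1.0 / dz, 1.0 / dz  # (K @ xi)[i] ~ velocity of segment i
    return K.T @ K

A = velocity_seminorm_matrix(5)
xi_k = 3.0 * np.ones(5)    # a constant trajectory
print(xi_k @ A @ xi_k)     # 0.0: constant trajectories have zero (semi-)norm
```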

In the next section, we show that this norm A is the norm that DMPs minimize in the way they adapt the trajectory being tracked by the spring damper system to new start and goal configurations.

2) Handling Free Translations: CHOMP bypasses the semi-norm problem because the trajectory endpoints are constants. Similarly, the key to free translations while maintaining a full norm is fixing one of the endpoints, e.g. the starting configuration: one can adapt the trajectory's goal in a restricted space of trajectories that all have the same (constant) start, and then translate the result to the new starting configuration.

Let Ξ_{s=k} be the subspace of trajectories whose starting configuration is a constant k: ξ(0) = k for all ξ ∈ Ξ_{s=k} ⊂ Ξ. M is a full norm in Ξ_{s=k}, as no translations are allowed.

Let σ_k : Ξ_{s=k} → Ξ_{s=0}, σ_k(ξ) = ξ − ξ_k, be the function that translates trajectories from Ξ_{s=k} to start at s = 0. This function is bijective, with σ_k⁻¹(ξ) = ξ + ξ_k.

We can reformulate Eq.(1) as finding the closest trajectory within Ξ_{s=0} that ends at ĝ − ŝ, and translating this trajectory to the new start ŝ, thereby obtaining a trajectory from ŝ to ĝ − ŝ + ŝ = ĝ:

$$\hat{\xi} = \sigma_{\hat{s}}^{-1}\left( \arg\min_{\xi \in \Xi_{s=0}} \|\sigma_s(\xi_D) - \xi\|^2_M \right) \quad \text{s.t.} \;\; \xi(1) = \hat{g} - \hat{s} \tag{7}$$


The solution to this, following a derivation analogous to Sec. II-B, is to take the demonstration translated to 0, correct the goal to ĝ − ŝ, propagate this change to the rest of the trajectory via M, and then translate the result to the new start:

$$\hat{\xi} = \sigma_{\hat{s}}^{-1}\left(\sigma_s(\xi_D) + M^{-1}(0, \dots, 0, \gamma)^T\right) \tag{8}$$

with γ set such that ξ̂(1) = ĝ. For a norm M with no coupling between joints, and m the last entry of M⁻¹, this becomes:

$$\hat{\xi} = \sigma_{\hat{s}}^{-1}\left(\sigma_s(\xi_D) + \frac{1}{m}\, M^{-1}(0, \dots, 0, (\hat{g} - \hat{s}) - (g - s))^T\right) \tag{9}$$

This corrects the goal in Ξ_{s=0} from g − s to ĝ − ŝ, effectively changing the goal in Ξ from g to ĝ.¹
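A minimal sketch of Eq.(9) for a single joint, under our assumptions that the trajectory is discretized into N waypoints and that M̄ (denoted M_bar below) is the full-rank restriction of M to the start-fixed subspace of waypoints 1..N−1; the function name is hypothetical:

```python
import numpy as np

def adapt_free_translation(xi_d, M_bar, s_new, g_new):
    """Eq. (9): translate to a zero start, correct the goal via M, translate to the new start."""
    s, g = xi_d[0], xi_d[-1]
    xi_0 = xi_d - s                      # sigma_s(xi_D): demonstration translated to start at 0
    rhs = np.zeros(len(xi_d) - 1)
    rhs[-1] = (g_new - s_new) - (g - s)  # goal correction in the start-fixed subspace
    M_inv = np.linalg.inv(M_bar)
    correction = (M_inv @ rhs) / M_inv[-1, -1]  # (1/m) M^{-1}(0, ..., 0, delta)^T
    xi_hat = xi_0.copy()
    xi_hat[1:] += correction             # waypoint 0 stays fixed at 0
    return xi_hat + s_new                # sigma_{s_new}^{-1}: translate to the new start
```

The division by M_inv[-1, -1] is the 1/m factor: it scales the correction so that the adapted trajectory ends exactly at ĝ − ŝ before the final translation.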

III. DMP Adaptation as a Special Case of Hilbert Norm Minimization

In this section, we summarize a commonly used version of DMPs, and write it as a target tracker with a moving target. Next, we show that the adaptation of the tracked target to a new start and goal is an instance of Hilbert norm minimization (Theorem 1). Finally, we show that this induces an adaptation in trajectory space that is an instance of norm minimization (Theorem 2).

A. DMPs

A commonly used version [10, 11, 15, 16] of a DMP is a second order linear dynamical system stimulated with a non-linear forcing term:

$$\tau^2 \ddot{\xi}(t) = K(g - \xi(t)) - D\tau\dot{\xi}(t) - K(g - s)u + K f(u) \tag{10}$$

where K(g − ξ(t)) is an attractor towards the goal, K(g − s)u avoids jumps at the beginning of the movement, Dτξ̇(t) is a damper, and K f(u) is a nonlinear forcing term. Here u is a phase variable generated by the dynamical system

$$\tau \dot{u} = -\alpha u$$

Thus, u maps time from 1 to (almost) 0:

$$u(t) = e^{-\frac{\alpha}{\tau} t} \tag{11}$$

B. DMP Adaptation as Tracked Target Adaptation

Let z = 1 − u. We can reformulate a DMP as a target tracker with a moving target, T(z):

$$\tau^2 \ddot{\xi}(t) = K(T(z) - \xi(t)) - D\tau\dot{\xi}(t) \tag{12}$$

with T(z) moving from s to g as a function of z, on a straight line at constant speed in z, plus a deviation f as a function of z, f(z) = f(u):

$$T(z) = s + z(g - s) + f(z) \tag{13}$$

¹Note that here we are overloading M. In Eq.(7), we are measuring norms in a space of trajectories with constant start 0, which is a lower dimensional space of trajectories ξ̄ : (0, 1] → Q that do not contain the starting configuration (which is not a variable). In this space, we can define a norm M̄ by ||ξ̄||_M̄ = ||ξ||_M, with ξ(0) = 0 and ξ(z) = ξ̄(z) ∀z ∈ (0, 1]. M̄ is then of dimensionality one less than M and full rank, and is what we actually use in Eq.(9).

Given a demonstration ξ_D, one forms a DMP by computing f_D(z) from Eq.(12).² To generalize to a new ŝ and ĝ, the target changes from Eq.(14) to Eq.(15):

$$T_D(z) = s + z(g - s) + f_D(z) \tag{14}$$

$$\hat{T}(z) = \hat{s} + z(\hat{g} - \hat{s}) + f_D(z) \tag{15}$$

The linear function from s to g is adapted to the new endpoints, becoming ŝ + z(ĝ − ŝ) (black trajectories in Fig.1(b)), and the deviation f_D remains fixed (purple deviations in Fig.1(b)).
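In code, the target adaptation of Eqs.(14)-(15) is just a re-anchored straight line plus the unchanged deviation. A sketch with our variable names, using a placeholder f_D in lieu of one computed from a real demonstration via Eq.(12):

```python
import numpy as np

def adapted_target(z, f_d, s_new, g_new):
    """Eq. (15): straight line from s_new to g_new plus the demonstration's deviation f_D."""
    return s_new + z * (g_new - s_new) + f_d

z = np.linspace(0.0, 1.0, 100)        # phase variable, z = 1 - u
f_d = 0.2 * np.sin(2.0 * np.pi * z)   # placeholder deviation; normally computed via Eq. (12)
T_hat = adapted_target(z, f_d, s_new=0.5, g_new=2.0)
```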

C. Relation to Hilbert Norm Minimization

The adaptation of the target being tracked by the DMP, from T_D to T̂, is a special case of the Hilbert norm adaptation from ξ_D to ξ̂, when the norm M = A from Eq.(6).

To prove this, we show the equivalence between the DMP adapted trajectory T̂ and the outcome of the Hilbert norm minimization ξ̂ from Eq.(9), i.e. that T̂ = ξ̂.

We do this in two steps. Since T̂ is the sum of a straight line trajectory (as a function of z) and a fixed deviation, we first show that Eq.(7) adapts a straight line trajectory to another straight line when M is the norm A. Next we show that when adding a nonzero deviation to the initial trajectory, the same deviation is added by Eq.(7) to the adapted trajectory.

Therefore, we first focus on the case where f_D = 0. In this case, the targets are straight lines from the start to the goal, moving at constant speed: T(z) = ξ_straight(z) = (g − s)z + s, and T̂(z) = ξ̂_straight(z) = (ĝ − ŝ)z + ŝ.

In Lemma 3, we show that the adaptation of ξ_straight to a new start ŝ and a new goal ĝ with respect to the norm A matches ξ̂_straight. We build to this via two other lemmas, where the key is to represent straight lines in terms of the norm A. We first prove that ξ_straight minimizes ξᵀAξ (Lemma 1). This enables us to write out ξ_straight in terms of A (Lemma 2).

We then generalize this to non-zero f_D, using the fact that f_D is not actually changed by the norm M in Eq.(9).

Lemma 1: ξ_straight is the solution to minimizing Eq.(6): constant speed straight line trajectories have minimum norm under A.

Proof: We show that the solution to Eq.(6) is a straight line with constant velocity, just like ξ_straight. The gradient of C is

$$\nabla_\xi C = -\ddot{\xi}$$

and setting this to 0 results in ξ = az + b. ξ(0) = s ⇒ b = s, and ξ(1) = g ⇒ a = g − s. Thus, ξ = (g − s)z + s = ξ_straight.

Lemma 2: σ_s(ξ_straight) = (1/m) A⁻¹(0, …, 0, g − s)ᵀ, with m the last entry of A⁻¹ as in Eq.(9): we can write constant speed straight line trajectories in closed form in terms of A.

²Typically, there is a smoothing step before adaptation where f_D is fitted by some basis functions, $f_D(u) = \frac{\sum_i \psi_i(u)\theta_i u}{\sum_i \psi_i(u)}$. The same smoothing can be applied to a trajectory before performing Hilbert norm minimization.


Proof: From Lemma 1 and from C[ξ] = ½ ξᵀAξ, we infer that σ_s(ξ_straight), which is the straight line from 0 to g − s, is the solution to

$$\min_{\xi \in \Xi_{s=0}} \xi^T A \xi \quad \text{s.t.} \;\; \xi(1) = g - s \tag{16}$$

Writing the Lagrangian and taking the gradient as before, we get that σ_s(ξ_straight) = (1/m) A⁻¹(0, …, 0, g − s)ᵀ: this term is the straight line from 0 to g − s.

Lemma 3: ξ̂_straight is the solution to Eq.(7) for ξ_D = ξ_straight: constant speed straight lines get adapted by A to constant speed straight lines.

Proof: From Lemma 2, the term (1/m) M⁻¹(0, …, 0, (ĝ − ŝ) − (g − s))ᵀ from Eq.(9) is the straight line from 0 to (ĝ − ŝ) − (g − s), i.e. ((ĝ − ŝ) − (g − s))z.

Thus, Eq.(9) becomes

$$\hat{\xi} = \sigma_{\hat{s}}^{-1}\left(\sigma_s(\xi_{\text{straight}}) + \tfrac{1}{m}\, M^{-1}(0, \dots, 0, (\hat{g} - \hat{s}) - (g - s))^T\right)$$
$$\Rightarrow \;\; \hat{\xi} = \sigma_{\hat{s}}^{-1}\big((g - s)z + ((\hat{g} - \hat{s}) - (g - s))z\big)$$
$$\Rightarrow \;\; \hat{\xi} = \sigma_{\hat{s}}^{-1}\big((\hat{g} - \hat{s})z\big)$$
$$\Rightarrow \;\; \hat{\xi} = (\hat{g} - \hat{s})z + \hat{s} \;\Rightarrow\; \hat{\xi} = \hat{\xi}_{\text{straight}}$$

Theorem 1: T̂ is the solution to Eq.(7) for ξ_D = T_D: straight lines plus deviations get adapted by A to straight lines plus the same deviations, like the target trajectories in DMPs.

Proof: When f_D = 0, T_D = ξ_straight and T̂ = ξ̂_straight, so the theorem follows from Lemma 3.

When f_D ≠ 0, the demonstrated target is T_D = ξ_straight + f_D, and the adapted target is T̂ = ξ̂_straight + f_D. This adapted target still matches the solution in Eq.(9):

$$\hat{\xi} = \sigma_{\hat{s}}^{-1}\left(\sigma_s(\xi_{\text{straight}} + f_D) + \tfrac{1}{m}\, M^{-1}(0, \dots, 0, (\hat{g} - \hat{s}) - (g - s))^T\right)$$
$$\Rightarrow \;\; \hat{\xi} = \sigma_{\hat{s}}^{-1}\left(\sigma_s(\xi_{\text{straight}}) + f_D + \tfrac{1}{m}\, M^{-1}(0, \dots, 0, (\hat{g} - \hat{s}) - (g - s))^T\right)$$
$$\Rightarrow \;\; \hat{\xi} = \sigma_{\hat{s}}^{-1}\big(f_D + (\hat{g} - \hat{s})z\big)$$
$$\Rightarrow \;\; \hat{\xi} = f_D + (\hat{g} - \hat{s})z + \hat{s} \;\Rightarrow\; \hat{\xi} = \hat{T}$$

Therefore, the target adaptation that the DMP performs, from T_D to T̂, is none other than the Hilbert norm minimization from Eq.(1), with the same norm as the one often used in trajectory optimization algorithms like CHOMP.

Norm Minimization Directly in the Trajectory Space. Because the tracked target adaptation from T_D to T̂ is a Hilbert norm minimization, the corresponding adaptation in the space of trajectories, which adapts ξ_D into ξ̂ by tracking T̂, is also the result of a Hilbert norm minimization.

To see this, let β : ξ ↦ T be the function mapping a demonstrated trajectory to the corresponding tracked target as in Eq.(12). Given a particular spring damper system, β is a bijection: every demonstrated trajectory maps to a unique tracked target, and every tracked target maps to a unique trajectory when tracked by the spring damper. Furthermore, β is linear, due to Eq.(12) and the additivity and homogeneity of differentiation.

Because β is bijective and linear, the norm A in the tracked target space induces a norm P in the trajectory space: ||ξ||_P = ||β(ξ)||_A.

Theorem 2: The final trajectory obtained by tracking the adapted target T̂, ξ̂ = β⁻¹(T̂), is the closest trajectory to ξ_D that satisfies the new endpoint constraints with respect to the norm P: the final trajectory in a DMP is the result of Eq.(1) for M = P.

Proof: Assume there exists ξ̄ with endpoints ŝ and ĝ such that ||ξ_D − ξ̄||_P < ||ξ_D − ξ̂||_P, i.e. ξ̄ is closer to ξ_D than ξ̂ is. Then ||β(ξ_D − ξ̄)||_A < ||β(ξ_D − ξ̂)||_A ⇒ ||β(ξ_D) − β(ξ̄)||_A < ||β(ξ_D) − β(ξ̂)||_A ⇒ ||T_D − β(ξ̄)||_A < ||T_D − T̂||_A, which contradicts Theorem 1: we know that T̂ is the closest target to T_D w.r.t. the norm A given the endpoint constraints, thus β(ξ̄) cannot be closer.

Therefore, DMPs adapt trajectories by minimizing a norm that depends on both A (the norm used to adapt the tracked target) and the particulars of the dynamical system (represented here by the function β).

IV. Implications

Theoretical Implications. Our work connects DMPs to trajectory optimization, providing an understanding of what objective the DMP adaptation process is inherently optimizing.

Our work also opens the door for handling obstacle avoidance via planning. Currently with DMPs, obstacles that appear as part of new situations influence the adapted trajectory in a reactive manner, akin to a potential field. Certain more difficult situations, however, require using a motion planner for successful obstacle avoidance, which reasons about the entire trajectory and not just the current configuration. Using our generalization, a trajectory optimizer akin to CHOMP can search for a trajectory that minimizes the adaptation norm (as opposed to the trajectory norm, as in CHOMP) while avoiding collisions.

Practical Implications. First, the generalization frees us from the default A norm, and enables us to select more appropriate norms for each task. We discuss this benefit in Sec. IV-A.

Second, the generalization gives the robot the opportunity to learn how to adapt trajectories from the user. If the user is willing to provide not only a demonstration, but also a few adaptations of that demonstration to different start and goal configurations, then the robot can use this set of trajectories to learn the desired norm M. We describe an algorithm for doing so in Sec. IV-B.

Aside 1 — Computation. The adaptation in a DMP happens instantly, by instantiating the start and goal variables with new values.

Fig. 3: A is not always an appropriate choice for the Hilbert norm. Each of the plots compares adapting a different original trajectory ξ_D (gray) using A (ξ_A, blue) vs. using a better norm (ξ_M, orange): (a) Minimum Jerk, (b) Reweighing Velocities, (c) Coupling Timepoints. The norms we used, much like A, do not allow free rotations, but free rotations could be obtained similarly to free translations (Sec. II-C).

Fig. 4: The different changes to the norm structure result in different adaptation effects: (a) Minimum Velocity (A), (b) Minimum Jerk (M1), (c) Reweighing Velocities (M2), (d) Coupling Timepoints (M3).

Hilbert norm minimization has an analytical solution, with computational complexity in the discrete case dominated by a single matrix multiplication. This means any DMP can adapt its moving target using norm minimization.

Aside 2 — Using a Spring Damper. DMPs first cast the trajectory as a moving target tracked by a spring damper, and adapt the moving target trajectory. Hilbert norm minimization can be used to adapt trajectories both for the moving target and for the demonstrated trajectory itself. The decision to use a spring damper is independent from the adaptation process.

A. Selecting a Better Norm

The norm A can lead to good adaptations (see Fig.1), but it is not always the most suitable norm. Fig.3 shows three cases where a different norm leads to better adaptations. In all three cases, the better norm is a modification of the matrix structure of A (as shown in Fig.4).

The first case, Fig.3(a), uses a demonstrated trajectory that minimizes jerk. Therefore, using a norm that stems from jerk as opposed to velocities results in the correct adaptation: the minimum jerk trajectory (orange). This norm is band diagonal, like A, but has a wider band, because computing the jerk requires terms further away from the current trajectory point than computing velocities does (Fig.4(b)).

The second case, Fig.3(b), uses a demonstrated trajectory that moves faster in the middle than it does at the beginning and end. Therefore, a norm that weighs velocities in the middle of the trajectory less than velocities at the endpoints (unlike A, for which the velocities at every time point matter equally) results in the adaptation in orange: the trajectory remains a straight line, and follows a velocity profile similar to the demonstration. This norm is a reweighing of the rows of A (Fig.4(c)).

The third case, Fig.3(c), uses a loop as the demonstrated trajectory. The demonstration itself is not necessarily minimizing any L2 norm. However, a more appropriate norm for adapting this demonstration couples waypoints that are distant in time but close in space: instead of only minimizing velocities, it also minimizes the distance between the two points that begin and end the loop. Unlike A, which is band diagonal, this norm also has entries far from the diagonal, depending on how far apart in time these two waypoints are (Fig.4(d)).
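To make the three modifications concrete, here are sketches of how such matrices could be constructed (our constructions, mirroring the structures in Fig.4; the exact weights and coupled indices are task-specific assumptions):

```python
import numpy as np

def jerk_norm(N, dz=1.0):
    # Fig. 4(b): third-order differences yield a band-diagonal matrix with a wider band.
    K3 = np.zeros((N - 3, N))
    for i in range(N - 3):
        K3[i, i:i + 4] = np.array([-1.0, 3.0, -3.0, 1.0]) / dz**3
    return K3.T @ K3

def reweighted_velocity_norm(N, weights, dz=1.0):
    # Fig. 4(c): reweigh the rows of A, e.g. small weights for mid-trajectory
    # segments and large weights near the endpoints.
    K = np.zeros((N - 1, N))
    for i in range(N - 1):
        K[i, i], K[i, i + 1] = -1.0 / dz, 1.0 / dz
    return K.T @ np.diag(weights) @ K

def coupled_timepoints_norm(A, i, j, c=1.0):
    # Fig. 4(d): additionally penalize c * (xi[i] - xi[j])^2 for two waypoints that are
    # distant in time but close in space, adding entries far from the diagonal.
    A = A.copy()
    A[i, i] += c
    A[j, j] += c
    A[i, j] -= c
    A[j, i] -= c
    return A
```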

B. Learning a Better Norm

As we saw in the previous section, different norms result in different ways of adapting a demonstrated trajectory. If the user providing the demonstration is willing to also provide example adaptations to new endpoints, then the robot can learn the norm M from these examples: instead of adapting trajectories in a predefined way, the robot can learn from the user how it should adapt trajectories.


Let D = {ξ_i} be the set of user demonstrations, each of them corresponding to a different tuple of endpoints (ξ_i(0), ξ_i(1)). The robot needs to find a norm M such that for each pair of trajectories (ξ_i, ξ_j) ∈ D × D, ξ_j is the closest trajectory to ξ_i out of all trajectories between the new endpoints ξ_j(0) and ξ_j(1), i.e. find a norm that explains why the user adapted ξ_i into ξ_j and not into any other trajectory:

$$\|\xi_i - \xi_j\|_M \leq \|\xi_i - \xi\|_M, \;\; \forall \xi \in \Xi_{s=\xi_j(0)}^{g=\xi_j(1)} \tag{17}$$

Equivalently:

$$\|\xi_i - \xi_j\|^2_M \leq \min_{\xi \in \Xi} \|\xi_i - \xi\|^2_M \quad \text{s.t.} \;\; \xi(0) = \xi_j(0), \;\; \xi(1) = \xi_j(1) \tag{18}$$

One way to find an M under these constraints is to follow Maximum Margin Planning [20]. We find M by minimizing the following expression:

$$\min_{M} \; \sum_{i,j} \left( \|\xi_i - \xi_j\|^2_M - \min_{\substack{\xi \in \Xi \\ \xi(0)=\xi_j(0),\; \xi(1)=\xi_j(1)}} \left[ \|\xi_i - \xi\|^2_M - \mathcal{L}(\xi, \xi_j) \right] \right) \quad \text{s.t.} \;\; M \succ 0 \tag{19}$$

with L a loss function, e.g. a function evaluating to 0 when the trajectory matches ξ_j and to 1 otherwise, and M ≻ 0 the positive-definiteness constraint.

If ξ*_ij is the optimal solution to the inner minimization problem, then the gradient update is:

$$M = M - \alpha \sum_{i,j} \left[ (\xi_i - \xi_j)(\xi_i - \xi_j)^T - (\xi_i - \xi^*_{ij})(\xi_i - \xi^*_{ij})^T \right] \tag{20}$$

followed by a projection onto the space of positive definite matrices.

Aside 3 — Geometry. An M that satisfies all the constraints only exists if the demonstrations in D lie in a linear subspace of Ξ of dimensionality 2d, with d the number of degrees of freedom: the adaptation induces a foliation of the space, with each linear subspace of a demonstration and all its adaptations to new endpoints forming a plaque of the foliation. Fig.2 depicts such a linear subspace, obtained by adapting ξ_D.

This follows from Eq.(5): the space of all adaptations of a trajectory is parametrized by the vectors λ and γ. Similarly, when we allow free translations, the linear subspace has dimensionality d (Eq.(8)). Note that there are many norms that satisfy the constraints in this case, because only a subset of the rows of M⁻¹ are used in the adaptation.

When the demonstrations do not form such a linear subspace, the algorithm will find an approximate M that minimizes the criterion in Eq.(19). We study the effects of noise in the next section. Other techniques for finding an approximate M, such as least squares or PCA, would also apply, but they would minimize different criteria, e.g. the difference between the trajectories themselves (∑ ||ξ_j − ξ*_ij||²), and not the difference between the norms.
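Putting Eqs.(19) and (20) together, the learning loop might look like the following sketch (our structure; solve_inner stands in for the loss-augmented inner minimization of Eq.(19), which we assume is provided, e.g. via a closed form analogous to Eq.(5)):

```python
import numpy as np

def project_pd(M, eps=1e-6):
    """Project a symmetric matrix onto the positive definite cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return V @ np.diag(np.maximum(w, eps)) @ V.T

def learn_norm(demos, M_init, solve_inner, alpha=1e-3, iters=100):
    """Max-margin norm learning: subgradient step of Eq. (20), then PD projection.

    demos: list of (N,) trajectories, each demonstrated for different endpoints.
    solve_inner(xi_i, xi_j, M): returns the loss-augmented minimizer xi*_ij of Eq. (19).
    """
    M = M_init.copy()
    for _ in range(iters):
        grad = np.zeros_like(M)
        for xi_i in demos:
            for xi_j in demos:
                if xi_i is xi_j:
                    continue
                xi_star = solve_inner(xi_i, xi_j, M)
                d_demo, d_star = xi_i - xi_j, xi_i - xi_star
                grad += np.outer(d_demo, d_demo) - np.outer(d_star, d_star)
        M = project_pd(M - alpha * grad)
    return M
```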

V. Experimental Analysis

We divide our experiments into two parts. The first experiment (Sec. V-A) analyzes the ability to learn a norm from only a few demonstrations, under different noise conditions. We do this on synthetically generated data so that we can manipulate the noise and compare the results to ground truth. We assume an underlying norm, generate noisy demonstrations based on it, and test the learner's ability to recover the norm. The second experiment tests the benefit of learning the norm with real kinesthetic demonstrations on a robot arm (Sec. V-B).

A. Synthetic Data

To analyze the dependency of learning the norm on the number of demonstrations, we generate demonstrations for different endpoints using a given norm M and some arbitrary initial trajectory. We then use the training data to learn a norm M̂. For simplicity, we focus on norms that allow free translations and that do not couple different joints (similar to A).

1) Dependent Measures: We test the quality of a learned norm M̂ using two measures (which significantly correlate, see Analysis): one is about the norm itself, and the other is about the effect it has on adaptations.

Waypoint Error: This measure captures deviations of the behavior induced by the learned norm from the desired behavior. We generate a test set of 1200 new start and goal configuration tuples for testing, leading to 1200 adapted trajectories using M as ground truth. We then adapt the demonstrated trajectory to each tuple using the learned norm M̂. For each obtained trajectory, we measure the mean waypoint deviation from the ground truth trajectories, and combine these into an average across the entire set.

Norm Error: This measure captures deviations in the learned norm itself (between M̂ and M). Because only the last row of M⁻¹ (which we denote M⁻¹_N) affects the resulting adaptation, we compute the norm of the component of the normalized M̂⁻¹_N that is orthogonal to the true normalized M⁻¹_N.
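Under our reading of this measure, a sketch of its computation (vector and function names are ours):

```python
import numpy as np

def norm_error(Minv_learned, Minv_true):
    """Magnitude of the component of the normalized last row of the learned M^{-1}
    that is orthogonal to the normalized last row of the true M^{-1}."""
    a = Minv_learned[-1] / np.linalg.norm(Minv_learned[-1])
    b = Minv_true[-1] / np.linalg.norm(Minv_true[-1])
    return np.linalg.norm(a - (a @ b) * b)
```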

2) Ideal Demonstrations: We first test learning from ideal demonstrations, meaning demonstrations perfectly adapted using M, without any noise.

Because of the structure that M imposes on the optimal adaptations (a linear subspace of dimensionality 2d in general, d for free translations), only a few ideal demonstrations are necessary to perfectly retrieve M: 3 in the general case, and 2 in the case of free translations.

As a sanity check, we ran an experiment in which we chose the starting trajectory from Fig.1 and generated 100 random norms. For each norm, we computed the two measures above. The resulting error was exactly 0 in each case: the learning algorithm perfectly retrieved the underlying norm.

3) Tolerance to Noise: Real demonstrations will not be perfect adaptations – they will be noisy.


Fig. 5: Left: an ideal adapted trajectory (gray), a noisy adapted trajectory (red) that we use for training, and the reproduction using the learned norm (green), with a 6-fold average reduction in noise. Center: the error on a test set as a function of the number of training examples. Right: the error on a test set as a function of the amount of noise, compared to the magnitude of the noise (red). Error bars show standard error on the mean – when not visible, the error bars are smaller than the marker size.

Fig. 6: (a) Robot setup; (b) demonstrations; (c) norm A adaptations; (d) learned norm adaptations. A comparison between adapting trajectories with the default A metric (c) and adapting using a learned metric (d) on a holdout set of demonstrated pointing gestures (shown in black). The trajectory ξ_D used for adaptation is in gray. Note that the adaptation happens in the full configuration space of the robot, but here we plot the end effector traces for visualization. The learned norm more closely reproduces two of the trajectories, and has higher error in the third. Overall, the error decreases significantly (see Fig.7).

Fig. 7: The average waypoint error (rad), as a function of the number of demonstrations, on a holdout set of pointing gesture demonstrations on the HERB robot, for the adaptations obtained using the learned norm, compared to the error when using the default A.

With noise comes the necessity for more than the minimal number of demonstrations, and the questions of how many demonstrations are needed and how robust the learning is to the amount of noise.

Manipulated Variables. In this experiment, we study these questions by manipulating two factors: (1) the number of demonstrations, and (2) the amount of noise we add to the adaptations in the training data.

We added Gaussian noise to the ideal adaptations using a covariance matrix that adds more noise to the middle of the trajectory than the endpoints (since the endpoints are fixed when requesting an adaptation).

For the first factor – number of demonstrations – we started at 2 (the minimum number required), and chose exponentially increasing levels (2, 4, 8, 16, 32, 64) to get an idea of the scale of the number of demonstrations needed. For the second factor, we scaled the covariance matrix (by 1, 10, 100, 1000, 10000) up to the point where the average noise for a trajectory waypoint was 50% of the average distance from start to goal (which we considered an extreme amount that far exceeds the levels we expect to see in practice). This resulted in 30 total conditions, and we ran the experiment with 30 different random seeds for each condition.

Hypotheses:

H1a. (Sanity Check) The number of demonstrations positively affects the learned norm quality.

H1b. (We Only Need a Small Number of Examples) There is a point beyond which increasing the number of examples results in practically equivalent norm quality.

H2a. (Sanity Check) The amount of noise negatively affects norm quality.

H2b. (Learning is Tolerant to Noise) The waypoint error is significantly lower than the noise on the training examples.

Analysis. The waypoint error and norm error measures were indeed significantly correlated (standardized Cronbach's α = 0.95), suggesting that the waypoint error also captures the deviation from the real norm.

A factorial least squares regression revealed that, in line with H1a and H2a, both factors were significant: as the number of demonstrations increased, the error decreased (F(1, 867) = 24.07, p < .0001), and as the amount of noise increased, the error increased (F(1, 867) = 628.35, p < .0001).

Fig.5 plots these two effects. In support of H1b, the error stops decreasing after 8 demonstrations (it takes a difference threshold of 0.3 for an equivalence test between the error at 8 and the error at 16 to reject the hypothesis that they are practically the same with p = .04). This suggests that learning the norm can happen from relatively few demonstrations.

In support of H2b, the error was significantly lower than the noise in the training trajectories (t(899) = 19.35, p < .0001): on average, the error was lower by a factor of 6.71, and this factor increased significantly with the number of demonstrations (F(1, 869) = 869.01, p < .0001).

B. Real Data

Our simulation study compared the learned norm to ground truth. Next, we were interested in studying the benefits of learning the norm with real kinesthetic demonstrations on a robot arm.

We collected 9 expert demonstrations of pointing gestures on the HERB robot [21], where the task was to point to a particular location on a board, as in Fig.6(a). We chose pointing as a task because the shape of the adapted trajectories is important for such gestures. We used up to 6 of these trajectories for training, and held out 3 for testing.

Dependent Measures. We use the waypoint error measure from before, this time measured against the noisy holdout set as opposed to ground truth. We cannot use the norm error since we no longer have access to the true norm M.

Manipulated Variables. We used both the learned norm and the default A norm from Eq.(6) to generate adaptations of the same original demonstration (its end effector trace is shown in gray in Fig.6(c and d)). Note that even though the learned norm has access to more than the original demonstration, we used this demonstration only when testing the adaptation, to remain fair to the default norm. In practice, if the user provides multiple demonstrations, the one corresponding to the situation closest to the test situation could be used for adaptation.

We also manipulated how many of the 6 demonstrations the learning algorithm used.

Hypotheses:

H3. (Data Improves Performance) As before, we expect that the number of demonstrations positively affects performance of the learned norm, i.e. the error in reproducing the holdout trajectories decreases as the number of demonstrations increases.

H4. (Learned Norm > Default A) The learned norm has smaller error in reproducing the holdout demonstrations than the default A norm.

Analysis. Fig.6 qualitatively compares the learned and the default norm, and Fig.7 plots our results.

Overall, the performance did tend to improve with the number of demonstrations, but the effect was not significant (F(4, 26) = 1.31, p = .29). In support of H4, the error was significantly lower overall when learning the norm than when using the DMP default (t(30) = 31.96, p < .0001), suggesting that for real kinesthetic demonstrations, there is indeed a practical benefit to the generalization we propose in this paper.

VI. Discussion and Future Work

In this paper, we formalized the trajectory adaptation problem as Hilbert norm minimization, and showed that this is a generalization of the process DMPs use to adapt demonstrated trajectories to new endpoints. Our work can be used for tasks in which DMPs are typically employed, in particular those tasks for which the shape of the trajectory is important (such as gestures): by learning the adaptation norm from the user, the robot can produce trajectories that better match the desired shape in new situations (as in our experiments).

We also envision an alternate use, geared towards skills for which success is the main driver (such as closing a lid or hitting a tennis ball): rather than learning the norm from the user, the robot can learn a good norm through reinforcement, targeting an increase in the success rate compared to the default norm.

Further avenues of future work include testing the norm learning on data from novice users. This would require addressing the problem of enabling novices to provide good demonstrations, which is a separate line of research and the reason we opted for expert demonstrations in this paper.

Finally, we plan on investigating skills beyond pointing gestures. In particular, since minimum L2 norm adaptations lie in a low dimensional linear subspace of the trajectory space, we aim to explore what types of skills best meet this assumption.

References

[1] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[2] Nikolay Jetchev and Marc Toussaint. Trajectory prediction: learning to map situations to robot trajectories. In ICML, 2009.

[3] Martin Stolle and Christopher G. Atkeson. Policies based on trajectory libraries. In ICRA, 2006.

[4] Sylvain Calinon, Florent Guenter, and Aude Billard. On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 37(2):286–298, 2007.

[5] Gu Ye and Ron Alterovitz. Demonstration-guided motion planning. In ISRR, 2011.

[6] John Schulman, Jonathan Ho, Cameron Lee, and Pieter Abbeel. Learning from demonstrations through the use of non-rigid registration. In ISRR, 2013.

[7] Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Learning attractor landscapes for learning motor primitives. In NIPS, 2003.

[8] Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation, 25(2):328–373, 2013.

[9] Jun Nakanishi, Jun Morimoto, Gen Endo, Gordon Cheng, Stefan Schaal, and Mitsuo Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2):79–91, 2004.

[10] Peter Pastor, Ludovic Righetti, Mrinal Kalakrishnan, and Stefan Schaal. Online movement adaptation based on previous sensor experiences. In IROS, 2011.

[11] Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generalization of motor skills by learning from demonstration. In ICRA, 2009.

[12] Jens Kober, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust robot movements to new situations. In IJCAI, 2011.

[13] Jens Kober and Jan Peters. Learning motor primitives for robotics. In ICRA, 2009.

[14] Petar Kormushev, Sylvain Calinon, and Darwin G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In IROS, 2010.

[15] Peter Pastor, Mrinal Kalakrishnan, Sachin Chitta, Evangelos Theodorou, and Stefan Schaal. Skill learning and task outcome prediction for manipulation. In ICRA, 2011.

[16] Miguel Prada, Anthony Remazeilles, Ansgar Koene, and Satoshi Endo. Dynamic movement primitives for human-robot interaction: comparison with human behavioral observation. In IROS, 2013.

[17] Matt Zucker, Nathan Ratliff, Anca D. Dragan, Mihail Pivtoraiko, Matthew Klingensmith, Christopher M. Dellin, J. Andrew Bagnell, and Siddhartha S. Srinivasa. CHOMP: Covariant Hamiltonian optimization for motion planning. IJRR, 32(9-10):1164–1193, 2013.

[18] Mrinal Kalakrishnan, Sachin Chitta, Evangelos Theodorou, Peter Pastor, and Stefan Schaal. STOMP: Stochastic trajectory optimization for motion planning. In ICRA, 2011.

[19] Chonhyon Park, Jia Pan, and Dinesh Manocha. ITOMP: Incremental trajectory optimization for real-time replanning in dynamic environments. In ICAPS, 2012.

[20] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In ICML, 2006.

[21] Siddhartha S. Srinivasa, Dmitry Berenson, Maya Cakmak, Alvaro Collet, Mehmet Remzi Dogar, Anca D. Dragan, Ross A. Knepper, Tim Niemueller, Kyle Strabala, Mike Vande Weghe, et al. HERB 2.0: Lessons learned from developing a mobile manipulator for the home. Proceedings of the IEEE, 100(8), 2012.