Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning

Yevgen Chebotar* 1 2 Karol Hausman* 1 Marvin Zhang* 3 Gaurav Sukhatme 1 Stefan Schaal 1 2 Sergey Levine 3

Abstract

Reinforcement learning algorithms for real-world robotic applications must be able to handle complex, unknown dynamical systems while maintaining data-efficient learning. These requirements are handled well by model-free and model-based RL approaches, respectively. In this work, we aim to combine the advantages of these approaches. By focusing on time-varying linear-Gaussian policies, we enable a model-based algorithm based on the linear-quadratic regulator that can be integrated into the model-free framework of path integral policy improvement. We can further combine our method with guided policy search to train arbitrary parameterized policies such as deep neural networks. Our simulation and real-world experiments demonstrate that this method can solve challenging manipulation tasks with comparable or better performance than model-free methods while maintaining the sample efficiency of model-based methods.

1. Introduction

Reinforcement learning (RL) aims to enable automatic acquisition of behavioral skills, which can be crucial for robots and other autonomous systems to behave intelligently in unstructured real-world environments. However, real-world applications of RL have to contend with two often opposing requirements: data-efficient learning and the ability to handle complex, unknown dynamical systems that might be difficult to model. Real-world physical systems, such as robots, are typically costly and time-consuming to run, making it highly desirable to learn using the lowest possible number of real-world trials. Model-based methods tend to excel at this (Deisenroth et al., 2013), but suffer from significant bias, since complex unknown dynamics cannot always be modeled accurately enough to produce effective policies. Model-free methods have the advantage of handling arbitrary dynamical systems with minimal bias, but tend to be substantially less sample-efficient (Kober et al., 2013; Schulman et al., 2015). Can we combine the efficiency of model-based algorithms with the final performance of model-free algorithms in a method that we can practically use on real-world physical systems?

*Equal contribution. ¹University of Southern California, Los Angeles, CA, USA. ²Max Planck Institute for Intelligent Systems, Tübingen, Germany. ³University of California Berkeley, Berkeley, CA, USA. Correspondence to: Yevgen Chebotar <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Figure 1. Real robot tasks used to evaluate our method. Left: the hockey task, which involves discontinuous dynamics. Right: the power plug task, which requires a high level of precision. Both of these tasks are learned from scratch without demonstrations.

As we will discuss in Section 2, many prior methods that combine model-free and model-based techniques achieve only modest gains in efficiency or performance (Heess et al., 2015; Gu et al., 2016). In this work, we aim to develop a method in the context of a specific policy representation: time-varying linear-Gaussian controllers. The structure of these policies provides us with an effective option for model-based updates via iterative linear-Gaussian dynamics fitting (Levine & Abbeel, 2014), as well as a simple option for model-free updates via the path integral policy improvement (PI2) algorithm (Theodorou et al., 2010).

Although time-varying linear-Gaussian (TVLG) policies are not as powerful as representations such as deep neural networks (Mnih et al., 2013; Lillicrap et al., 2016) or RBF networks (Deisenroth et al., 2011), they can represent arbitrary trajectories in continuous state-action spaces. Furthermore, prior work on guided policy search (GPS) has shown that TVLG policies can be used to train general-purpose parameterized policies, including deep neural network policies, for tasks involving complex sensory inputs such as vision (Levine & Abbeel, 2014; Levine et al., 2016). This yields a general-purpose RL procedure with favorable stability and sample complexity compared to fully model-free deep RL methods (Montgomery et al., 2017).

The main contribution of this paper is a procedure for optimizing TVLG policies that integrates both fast model-based updates via iterative linear-Gaussian model fitting and corrective model-free updates via the PI2 framework. The resulting algorithm, which we call PILQR, combines the efficiency of model-based learning with the generality of model-free updates and can solve complex continuous control tasks that are infeasible for either linear-Gaussian models or PI2 by itself, while remaining orders of magnitude more efficient than standard model-free RL. We integrate this approach into GPS to train deep neural network policies and present results both in simulation and on a real robotic platform. Our real-world results demonstrate that our method can learn complex tasks, such as hockey and power plug plugging (see Figure 1), each with less than an hour of experience and no user-provided demonstrations.

2. Related Work

The choice of policy representation has often been a crucial component in the success of an RL procedure (Deisenroth et al., 2013; Kober et al., 2013). Trajectory-centric representations, such as splines (Peters & Schaal, 2008), dynamic movement primitives (Schaal et al., 2003), and TVLG controllers (Lioutikov et al., 2014; Levine & Abbeel, 2014) have proven particularly popular in robotics, where they can be used to represent cyclical and episodic motions and are amenable to a range of efficient optimization algorithms. In this work, we build on prior work in trajectory-centric RL to devise an algorithm that is both sample-efficient and able to handle a wide class of tasks, all while not requiring human demonstration initialization.

More general representations for policies, such as deep neural networks, have grown in popularity recently due to their ability to process complex sensory input (Mnih et al., 2013; Lillicrap et al., 2016; Levine et al., 2016) and represent more complex strategies that can succeed from a variety of initial conditions (Schulman et al., 2015; 2016). While trajectory-centric representations are more limited in their representational power, they can be used as an intermediate step toward efficient training of general parameterized policies using the GPS framework (Levine et al., 2016). Our proposed trajectory-centric RL method can also be combined with GPS to supervise the training of complex neural network policies. Our experiments demonstrate that this approach is several orders of magnitude more sample-efficient than direct model-free deep RL algorithms.

Prior algorithms for optimizing trajectory-centric policies can be categorized as model-free methods (Theodorou et al., 2010; Peters et al., 2010), methods that use global models (Deisenroth et al., 2014; Pan & Theodorou, 2014), and methods that use local models (Levine & Abbeel, 2014; Lioutikov et al., 2014; Akrour et al., 2016). Model-based methods typically have the advantage of being fast and sample-efficient, at the cost of making simplifying assumptions about the problem structure, such as smooth, locally linearizable dynamics or continuous cost functions. Model-free algorithms avoid these issues by not modeling the environment explicitly and instead improving the policy directly based on the returns, but this often comes at a cost in sample efficiency. Furthermore, many of the most popular model-free algorithms for trajectory-centric policies use example demonstrations to initialize the policies, since model-free methods require a large number of samples to make large, global changes to the behavior (Theodorou et al., 2010; Peters et al., 2010; Pastor et al., 2009).

Prior work has sought to combine model-based and model-free learning in several ways. Farshidian et al. (2014) also use LQR and PI2, but do not combine these methods directly into one algorithm, instead using LQR to produce a good initialization for PI2. Their work assumes the existence of a known model, while our method uses estimated local models. A number of prior methods have also looked at incorporating models to generate additional synthetic samples for model-free learning (Sutton, 1990; Gu et al., 2016), as well as using models for improving the accuracy of model-free value function backups (Heess et al., 2015). Our work directly combines model-based and model-free updates into a single trajectory-centric RL method without using synthetic samples that degrade with modeling errors.

3. Preliminaries

The goal of policy search methods is to optimize the parameters $\theta$ of a policy $p(\mathbf{u}_t|\mathbf{x}_t)$, which defines a probability distribution over actions $\mathbf{u}_t$ conditioned on the system state $\mathbf{x}_t$ at each time step $t$ of a task execution. Let $\tau = (\mathbf{x}_1, \mathbf{u}_1, \dots, \mathbf{x}_T, \mathbf{u}_T)$ be a trajectory of states and actions. Given a cost function $c(\mathbf{x}_t, \mathbf{u}_t)$, we define the trajectory cost as $c(\tau) = \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)$. The policy is optimized with respect to the expected cost of the policy

$$J(\theta) = E_p[c(\tau)] = \int c(\tau)\, p(\tau)\, d\tau,$$

where $p(\tau)$ is the policy trajectory distribution given the system dynamics $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$:


$$p(\tau) = p(\mathbf{x}_1) \prod_{t=1}^{T} p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)\, p(\mathbf{u}_t|\mathbf{x}_t).$$

One policy class that allows us to employ an efficient model-based update is the TVLG controller $p(\mathbf{u}_t|\mathbf{x}_t) = \mathcal{N}(\mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t, \Sigma_t)$. In this section, we present the model-based and model-free algorithms that form the constituent parts of our hybrid method. The model-based method is an extension of a KL-constrained LQR algorithm (Levine & Abbeel, 2014), which we shall refer to as LQR with fitted linear models (LQR-FLM). The model-free method is a PI2 algorithm with per-time-step KL-divergence constraints that is derived in previous work (Chebotar et al., 2017).
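As a concrete illustration, a TVLG controller is fully described by its per-time-step gains, offsets, and covariances. The following is a minimal numpy sketch of sampling actions from such a controller; the class name, array shapes, and interface are our own illustration, not the authors' implementation:

```python
import numpy as np

class TVLGPolicy:
    """Time-varying linear-Gaussian controller p(u_t | x_t) = N(K_t x_t + k_t, Sigma_t)."""

    def __init__(self, K, k, Sigma):
        # K: (T, dU, dX) feedback gains, k: (T, dU) offsets, Sigma: (T, dU, dU) covariances.
        self.K, self.k, self.Sigma = K, k, Sigma
        self.T = K.shape[0]

    def act(self, x, t, rng):
        """Sample u_t ~ N(K_t x_t + k_t, Sigma_t) for state x at time step t."""
        mean = self.K[t].dot(x) + self.k[t]
        return rng.multivariate_normal(mean, self.Sigma[t])

# Usage: policy = TVLGPolicy(K, k, Sigma); u = policy.act(x, t=0, rng=np.random.default_rng(0))
```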

3.1. Model-Based Optimization of TVLG Policies

The model-based method we use is based on the iterative linear-quadratic regulator (iLQR) and builds on prior work (Levine & Abbeel, 2014; Tassa et al., 2012). We provide a full description and derivation in Appendix 8.1.

We use samples to fit a TVLG dynamics model $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t) = \mathcal{N}(\mathbf{f}_{x,t} \mathbf{x}_t + \mathbf{f}_{u,t} \mathbf{u}_t, \mathbf{F}_t)$ and assume a twice-differentiable cost function. Tassa et al. (2012) showed that we can compute a second-order Taylor approximation of our Q-function and optimize this with respect to $\mathbf{u}_t$ to find the optimal action at each time step $t$. To deal with unknown dynamics, Levine & Abbeel (2014) impose a KL-divergence constraint between the updated policy $p^{(i)}$ and previous policy $p^{(i-1)}$ to stay within the space of trajectories where the dynamics model is approximately correct. We similarly set up our optimization as

$$\min_{p^{(i)}} E_{p^{(i)}}[Q(\mathbf{x}_t,\mathbf{u}_t)] \quad \text{s.t.} \quad E_{p^{(i)}}\big[D_{\mathrm{KL}}(p^{(i)} \,\|\, p^{(i-1)})\big] \leq \epsilon_t. \quad (1)$$

The main difference from Levine & Abbeel (2014) is that we enforce separate KL constraints for each linear-Gaussian policy rather than a single constraint on the induced trajectory distribution (i.e., compare Eq. (1) to the first equation in Section 3.1 of Levine & Abbeel (2014)).

LQR-FLM has substantial efficiency benefits over model-free algorithms. However, as our experimental results in Section 6 show, the performance of LQR-FLM is highly dependent on being able to model the system dynamics accurately, causing it to fail for more challenging tasks.
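The dynamics fit used above can be illustrated, in its most bare-bones form, by an independent regularized regression at each time step. The paper defers the actual fitting procedure to Appendix 8.1, so the following is only a sketch under that simplification, with hypothetical array shapes and an added constant term:

```python
import numpy as np

def fit_tvlg_dynamics(X, U, reg=1e-6):
    """Fit p(x_{t+1} | x_t, u_t) = N(f_x x_t + f_u u_t + f_c, F_t) by per-time-step regression.

    X: (N, T+1, dX) sampled states, U: (N, T, dU) sampled actions from N rollouts.
    Returns a list of (f_xu, f_c, F) tuples, one per time step.
    """
    N, T, dU = U.shape
    dX = X.shape[2]
    dynamics = []
    for t in range(T):
        xu = np.concatenate([X[:, t], U[:, t]], axis=1)       # (N, dX + dU)
        xu1 = np.concatenate([xu, np.ones((N, 1))], axis=1)   # append a bias column
        target = X[:, t + 1]                                  # next states, (N, dX)
        # Ridge-regularized least squares: solve (xu1^T xu1 + reg*I) W = xu1^T target.
        A = xu1.T.dot(xu1) + reg * np.eye(dX + dU + 1)
        W = np.linalg.solve(A, xu1.T.dot(target))             # (dX + dU + 1, dX)
        f_xu, f_c = W[:-1].T, W[-1]                           # linear term and constant
        resid = target - xu1.dot(W)
        F = resid.T.dot(resid) / N + reg * np.eye(dX)         # Gaussian noise covariance
        dynamics.append((f_xu, f_c, F))
    return dynamics
```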

3.2. Policy Improvement with Path Integrals

PI2 is a model-free RL algorithm based on stochastic optimal control. A detailed derivation of this method can be found in Theodorou et al. (2010).

Each iteration of PI2 involves generating $N$ trajectories by running the current policy. Let $S(\mathbf{x}_{i,t}, \mathbf{u}_{i,t}) = c(\mathbf{x}_{i,t}, \mathbf{u}_{i,t}) + \sum_{j=t+1}^{T} c(\mathbf{x}_{i,j}, \mathbf{u}_{i,j})$ be the cost-to-go of trajectory $i \in \{1, \dots, N\}$ starting in state $\mathbf{x}_{i,t}$ by performing action $\mathbf{u}_{i,t}$ and following the policy $p(\mathbf{u}_t|\mathbf{x}_t)$ afterwards. Then, we can compute probabilities $P(\mathbf{x}_{i,t}, \mathbf{u}_{i,t})$ for each trajectory starting at time step $t$:

$$P(\mathbf{x}_{i,t}, \mathbf{u}_{i,t}) = \frac{\exp\!\left(-\frac{1}{\eta_t} S(\mathbf{x}_{i,t}, \mathbf{u}_{i,t})\right)}{\int \exp\!\left(-\frac{1}{\eta_t} S(\mathbf{x}_{i,t}, \mathbf{u}_{i,t})\right) d\mathbf{u}_{i,t}}. \quad (2)$$

The probabilities follow from the Feynman-Kac theorem applied to stochastic optimal control (Theodorou et al., 2010). The intuition is that trajectories with lower costs receive higher probabilities, and the policy distribution shifts towards a lower-cost trajectory region. The costs are scaled by $\eta_t$, which can be interpreted as the temperature of a soft-max distribution. This is similar to the dual variables $\eta_t$ in LQR-FLM in that they control the KL step size; however, they are derived and computed differently. After computing the new probabilities $P$, we update the policy distribution by reweighting each sampled control $\mathbf{u}_{i,t}$ by $P(\mathbf{x}_{i,t}, \mathbf{u}_{i,t})$ and updating the policy parameters by a maximum likelihood estimate (Chebotar et al., 2017).
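Eq. (2) and the reweighting step admit a compact implementation. The following numpy sketch is our own simplification; the full update of Chebotar et al. (2017) also handles per-time-step KL constraints and covariance estimation, which we omit here:

```python
import numpy as np

def pi2_weights(S, eta):
    """Soft-max weights over N sampled trajectories at one time step (cf. Eq. (2)).

    S: (N,) costs-to-go S(x_{i,t}, u_{i,t}); eta: temperature eta_t.
    """
    S = S - S.min()              # subtract the best cost for numerical stability
    w = np.exp(-S / eta)
    return w / w.sum()

def reweighted_mean_control(U, w):
    """Maximum-likelihood mean control from probability-weighted samples.

    U: (N, dU) sampled controls at time step t; w: (N,) weights from pi2_weights.
    """
    return (w[:, None] * U).sum(axis=0)
```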

To relate PI2 updates to LQR-FLM optimization of a constrained objective, which is necessary for combining these methods, we can formulate the following theorem.

Theorem 1. The PI2 update corresponds to a KL-constrained minimization of the expected cost-to-go $S(\mathbf{x}_t, \mathbf{u}_t) = \sum_{j=t}^{T} c(\mathbf{x}_j, \mathbf{u}_j)$ at each time step $t$:

$$\min_{p^{(i)}} E_{p^{(i)}}[S(\mathbf{x}_t, \mathbf{u}_t)] \quad \text{s.t.} \quad E_{p^{(i-1)}}\big[D_{\mathrm{KL}}(p^{(i)} \,\|\, p^{(i-1)})\big] \leq \epsilon,$$

where $\epsilon$ is the maximum KL-divergence between the new policy $p^{(i)}(\mathbf{u}_t|\mathbf{x}_t)$ and the old policy $p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)$.

Proof. The Lagrangian of this problem is given by

$$\mathcal{L}(p^{(i)}, \eta_t) = E_{p^{(i)}}[S(\mathbf{x}_t, \mathbf{u}_t)] + \eta_t \Big( E_{p^{(i-1)}}\big[D_{\mathrm{KL}}(p^{(i)} \,\|\, p^{(i-1)})\big] - \epsilon \Big).$$

By minimizing the Lagrangian with respect to $p^{(i)}$ we can find its relationship to $p^{(i-1)}$ (see Appendix 8.2), given by

$$p^{(i)}(\mathbf{u}_t|\mathbf{x}_t) \propto p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)\, E_{p^{(i-1)}}\!\left[\exp\!\left(-\frac{1}{\eta_t} S(\mathbf{x}_t, \mathbf{u}_t)\right)\right]. \quad (3)$$

This gives us an update rule for $p^{(i)}$ that corresponds exactly to reweighting the controls from the previous policy $p^{(i-1)}$ based on their probabilities $P(\mathbf{x}_t, \mathbf{u}_t)$ described earlier. The temperature $\eta_t$ now corresponds to the dual variable of the KL-divergence constraint.

The temperature $\eta_t$ can be estimated at each time step separately by optimizing the dual function

$$g(\eta_t) = \eta_t \epsilon + \eta_t \log E_{p^{(i-1)}}\!\left[\exp\!\left(-\frac{1}{\eta_t} S(\mathbf{x}_t, \mathbf{u}_t)\right)\right], \quad (4)$$

with derivation following from Peters et al. (2010).
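Because $g(\eta_t)$ is a scalar function of a single positive variable, it can be minimized numerically from the sampled costs-to-go. A possible sketch using scipy; the bounds and the log-sum-exp stabilization are our own choices, not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_temperature(S, epsilon, eta_min=1e-4, eta_max=1e4):
    """Minimize g(eta) = eta*epsilon + eta*log E[exp(-S/eta)] over the temperature eta_t.

    S: (N,) sampled costs-to-go at time step t; epsilon: KL-divergence bound.
    """
    S = S - S.min()  # shifting S by a constant shifts g by that constant, so the argmin is unchanged

    def dual(eta):
        z = -S / eta
        m = z.max()
        log_mean_exp = m + np.log(np.mean(np.exp(z - m)))  # stable log E[exp(-S/eta)]
        return eta * epsilon + eta * log_mean_exp

    res = minimize_scalar(dual, bounds=(eta_min, eta_max), method='bounded')
    return res.x
```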


PI2 was used by Chebotar et al. (2017) to solve several challenging robotic tasks, such as door opening and pick-and-place, where they achieved better final performance than LQR-FLM. However, due to its greater sample complexity, PI2 required initialization from demonstrations.

4. Integrating Model-Based Updates into PI2

Both PI2 and LQR-FLM can be used to learn TVLG policies and both have their strengths and weaknesses. In this section, we first show how the PI2 update can be broken up into two parts, with one part using a model-based cost approximation and another part using the residual cost error after this approximation. Next, we describe our method for integrating model-based updates into PI2 by using our extension of LQR-FLM to optimize the linear-quadratic cost approximation and performing a subsequent update with PI2 on the residual cost. We demonstrate in Section 6 that our method combines the strengths of PI2 and LQR-FLM while compensating for their weaknesses.

4.1. Two-Stage PI2 update

To integrate a model-based optimization into PI2, we can divide it into two steps. Given an approximation $\hat{c}(\mathbf{x}_t, \mathbf{u}_t)$ of the real cost $c(\mathbf{x}_t, \mathbf{u}_t)$ and the residual cost $\tilde{c}(\mathbf{x}_t, \mathbf{u}_t) = c(\mathbf{x}_t, \mathbf{u}_t) - \hat{c}(\mathbf{x}_t, \mathbf{u}_t)$, let $\hat{S}_t = \hat{S}(\mathbf{x}_t, \mathbf{u}_t)$ be the approximated cost-to-go of a trajectory starting with state $\mathbf{x}_t$ and action $\mathbf{u}_t$, and $\tilde{S}_t = \tilde{S}(\mathbf{x}_t, \mathbf{u}_t)$ be the residual of the real cost-to-go $S(\mathbf{x}_t, \mathbf{u}_t)$ after approximation. We can rewrite the PI2 policy update rule from Eq. (3) as

$$p^{(i)}(\mathbf{u}_t|\mathbf{x}_t) \propto p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)\, E_{p^{(i-1)}}\!\left[\exp\!\left(-\frac{1}{\eta_t}\big(\hat{S}_t + \tilde{S}_t\big)\right)\right] \propto \hat{p}(\mathbf{u}_t|\mathbf{x}_t)\, E_{p^{(i-1)}}\!\left[\exp\!\left(-\frac{1}{\eta_t} \tilde{S}_t\right)\right], \quad (5)$$

where $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$ is given by

$$\hat{p}(\mathbf{u}_t|\mathbf{x}_t) \propto p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)\, E_{p^{(i-1)}}\!\left[\exp\!\left(-\frac{1}{\eta_t} \hat{S}_t\right)\right]. \quad (6)$$

Hence, by decomposing the cost into its approximation and the residual approximation error, the PI2 update can be split into two steps: (1) update using the approximated costs $\hat{c}(\mathbf{x}_t, \mathbf{u}_t)$ and samples from the old policy $p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)$ to get $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$; (2) update $p^{(i)}(\mathbf{u}_t|\mathbf{x}_t)$ using the residual costs $\tilde{c}(\mathbf{x}_t, \mathbf{u}_t)$ and samples from $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$.
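To make the decomposition concrete, the approximated and residual costs-to-go can be computed directly from per-step costs with reverse cumulative sums. A small numpy sketch with hypothetical array shapes:

```python
import numpy as np

def split_cost_to_go(c_real, c_hat):
    """Split sampled costs-to-go into approximated and residual parts.

    c_real: (N, T) true sampled costs c(x_t, u_t) along N trajectories.
    c_hat:  (N, T) quadratic cost approximation evaluated on the same samples.
    Returns (S_hat, S_tilde), each (N, T), with S_hat[i, t] = sum_{j >= t} c_hat[i, j].
    """
    S_real = np.cumsum(c_real[:, ::-1], axis=1)[:, ::-1]  # reverse cumulative sums = costs-to-go
    S_hat = np.cumsum(c_hat[:, ::-1], axis=1)[:, ::-1]
    return S_hat, S_real - S_hat
```

The first stage (Eq. (6)) only ever sees the approximated part, which is what LQR-FLM optimizes in the next subsection, while the second stage reweights with the residual.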

4.2. Model-Based Substitution with LQR-FLM

We can use Theorem (1) to rewrite Eq. (6) as a constrained optimization problem

$$\min_{\hat{p}} E_{\hat{p}}\big[\hat{S}(\mathbf{x}_t, \mathbf{u}_t)\big] \quad \text{s.t.} \quad E_{p^{(i-1)}}\big[D_{\mathrm{KL}}(\hat{p} \,\|\, p^{(i-1)})\big] \leq \epsilon.$$

Thus, the policy $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$ can be updated using any algorithm that can solve this optimization problem. By choosing a model-based approach for this, we can speed up the learning process significantly. Model-based methods are typically constrained to some particular cost approximation; however, PI2 can accommodate any form of $\tilde{c}(\mathbf{x}_t, \mathbf{u}_t)$ and thus will handle arbitrary cost residuals.

LQR-FLM solves the type of constrained optimization problem in Eq. (1), which matches the optimization problem needed to obtain $\hat{p}$, where the cost-to-go $\hat{S}$ is approximated with a quadratic cost and a linear-Gaussian dynamics model.¹ We can thus use LQR-FLM to perform our first update, which enables greater efficiency but is susceptible to modeling errors when the fitted local dynamics are not accurate, such as in discontinuous systems. We can use a PI2 optimization on the residuals to correct for this bias.

4.3. Optimizing Cost Residuals with PI2

In order to perform a PI2 update on the residual costs-to-go $\tilde{S}$, we need to know what $\hat{S}$ is for each sampled trajectory. That is, what is the cost-to-go that is actually used by LQR-FLM to make its update? The structure of the algorithm implies a specific cost-to-go formulation for a given trajectory: namely, the sum of quadratic costs obtained by running the same policy under the TVLG dynamics used by LQR-FLM. A given trajectory can be viewed as being generated by a deterministic policy conditioned on a particular noise realization $\xi_{i,1}, \dots, \xi_{i,T}$, with actions given by

$$\mathbf{u}_{i,t} = \mathbf{K}_t \mathbf{x}_{i,t} + \mathbf{k}_t + \sqrt{\Sigma_t}\, \xi_{i,t}, \quad (7)$$

where $\mathbf{K}_t$, $\mathbf{k}_t$, and $\Sigma_t$ are the parameters of $p^{(i-1)}$. We can therefore evaluate $\hat{S}(\mathbf{x}_t, \mathbf{u}_t)$ by simulating this deterministic controller from $(\mathbf{x}_t, \mathbf{u}_t)$ under the fitted TVLG dynamics and evaluating its time-varying quadratic cost, and then plugging these values into the residual cost.
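A sketch of this evaluation, assuming the fitted dynamics mean and the quadratic cost approximation are available as callables (the names and interfaces here are hypothetical):

```python
def approx_cost_to_go(x, t, K, k, sqrt_Sigma, xi, dynamics_mean, quad_cost):
    """Accumulate the quadratic cost approximation along a noise-conditioned rollout (Eq. (7)).

    x: state at time step t; xi: (T, dU) stored noise realization of this trajectory;
    dynamics_mean(x, u, j) returns the mean of the fitted TVLG dynamics at step j;
    quad_cost(x, u, j) evaluates the quadratic cost approximation at step j.
    """
    T = xi.shape[0]
    S_hat = 0.0
    for j in range(t, T):
        u = K[j].dot(x) + k[j] + sqrt_Sigma[j].dot(xi[j])  # deterministic controls of Eq. (7)
        S_hat += quad_cost(x, u, j)
        x = dynamics_mean(x, u, j)                         # propagate under the fitted model
    return S_hat
```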

In addition to the residual costs $\tilde{S}$ for each trajectory, the PI2 update also requires control samples from the updated LQR-FLM policy $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$. Although we have the updated LQR-FLM policy, we only have samples from the old policy $p^{(i-1)}(\mathbf{u}_t|\mathbf{x}_t)$. However, we can apply a form of the reparametrization trick (Kingma & Welling, 2013) and again use the stored noise realization of each trajectory $\xi_{i,t}$ to evaluate what the control would have been for that sample under the LQR-FLM policy $\hat{p}$. The expectation of the residual cost-to-go in Eq. (5) is taken with respect to the old policy distribution $p^{(i-1)}$. Hence, we can reuse the states $\mathbf{x}_{i,t}$ and their corresponding noise $\xi_{i,t}$ that were sampled while

¹In practice, we make a small modification to the problem in Eq. (1) so that the expectation in the constraint is evaluated with respect to the new distribution $p^{(i)}(\mathbf{x}_t)$ rather than the previous one $p^{(i-1)}(\mathbf{x}_t)$. This modification is heuristic and no longer aligns with Theorem (1), but works better in practice.


Figure 2. We evaluate on a set of simulated robotic manipulation tasks with varying difficulty. Left to right, the tasks involve pushing a block, reaching for a target, and opening a door in 3D.

rolling out the previous policy $p^{(i-1)}$ and evaluate the new controls according to $\hat{\mathbf{u}}_{i,t} = \hat{\mathbf{K}}_t \mathbf{x}_{i,t} + \hat{\mathbf{k}}_t + \sqrt{\hat{\Sigma}_t}\, \xi_{i,t}$. This linear transformation on the sampled control provides unbiased samples from $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$. After transforming the control samples, they are reweighted according to their residual costs and plugged into the PI2 update in Eq. (2).
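The control transformation itself is a batched linear operation. A numpy sketch, using the Cholesky factor as the matrix square root of $\hat{\Sigma}_t$ and assuming the noise realizations were stored during sampling:

```python
import numpy as np

def transform_controls(X, Xi, K_hat, k_hat, Sigma_hat):
    """Re-evaluate stored samples under the updated LQR-FLM policy p_hat.

    X: (N, T, dX) visited states; Xi: (N, T, dU) stored noise realizations.
    Returns U_hat: (N, T, dU) with u_hat = K_hat_t x_{i,t} + k_hat_t + sqrt(Sigma_hat_t) xi_{i,t}.
    """
    chol = np.linalg.cholesky(Sigma_hat)                  # (T, dU, dU) matrix square roots
    U_hat = np.einsum('tux,ntx->ntu', K_hat, X) + k_hat   # feedback plus offset terms
    U_hat += np.einsum('tuv,ntv->ntu', chol, Xi)          # inject the stored noise
    return U_hat
```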

4.4. Summary of PILQR algorithm

Algorithm 1 summarizes our method for combining LQR-FLM and PI2 to create a hybrid model-based and model-free algorithm. After generating a set of trajectories by running the current policy (line 2), we fit TVLG dynamics and compute the quadratic cost approximation $\hat{c}(\mathbf{x}_t, \mathbf{u}_t)$ and approximation error residuals $\tilde{c}(\mathbf{x}_t, \mathbf{u}_t)$ (lines 3, 4). In order to improve the convergence behavior of our algorithm, we adjust the KL-step $\epsilon_t$ of the LQR-FLM optimization in Eq. (1) based inversely on the proportion of the residual costs-to-go to the sampled costs-to-go (line 5). In particular, if the ratio between the residual cost and the overall cost is sufficiently small or large, we increase or decrease, respectively, the KL-step $\epsilon_t$. We then continue with optimizing for the temperature $\eta_t$ using the dual function from Eq. (4) (line 6). Finally, we perform an LQR-FLM update on the cost approximation (line 7) and a subsequent PI2 update using the cost residuals (line 8). As PILQR combines LQR-FLM and PI2 updates in sequence in each iteration, its computational complexity is the sum of both methods. Due to the properties of PI2, the covariance of the optimized TVLG controllers decreases each iteration and the method eventually converges to a single solution.
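Pulling the pieces together, one PILQR iteration (as summarized in Algorithm 1 below) can be sketched at a high level as follows. Every callable stands in for a component described earlier, and the step-adjustment thresholds and factors are placeholders rather than the values used in the paper:

```python
def pilqr_iteration(policy, sample_trajectories, fit_dynamics, quad_cost_approx,
                    lqr_flm_update, pi2_update, epsilon_t):
    """One hybrid model-based / model-free update of a TVLG policy (illustrative only)."""
    # Lines 2-4: collect rollouts, fit TVLG dynamics, build cost approximation and residuals.
    samples = sample_trajectories(policy)
    dynamics = fit_dynamics(samples)
    S_hat, S_tilde = quad_cost_approx(samples, dynamics)
    # Line 5: adjust the KL step based on how much cost the model fails to explain.
    residual_ratio = float(abs(S_tilde).sum() / abs(S_hat + S_tilde).sum())
    if residual_ratio < 0.1:
        epsilon_t *= 2.0      # model explains the cost well: allow a larger model-based step
    elif residual_ratio > 0.5:
        epsilon_t *= 0.5      # model is poor: shrink the model-based step
    # Lines 6-8: compute eta_t (folded into pi2_update here), perform the LQR-FLM update on the
    # approximated cost under the KL constraint, then the PI2 update on the residual costs-to-go.
    p_hat = lqr_flm_update(policy, dynamics, S_hat, epsilon_t)
    return pi2_update(p_hat, samples, S_tilde), epsilon_t
```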

Algorithm 1 PILQR algorithm
1: for iteration $k \in \{1, \dots, K\}$ do
2:   Generate trajectories $\mathcal{D} = \{\tau_i\}$ by running the current linear-Gaussian policy $p^{(k-1)}(\mathbf{u}_t|\mathbf{x}_t)$
3:   Fit TVLG dynamics $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$
4:   Estimate cost approximation $\hat{c}(\mathbf{x}_t, \mathbf{u}_t)$ using fitted dynamics and compute cost residuals: $\tilde{c}(\mathbf{x}_t, \mathbf{u}_t) = c(\mathbf{x}_t, \mathbf{u}_t) - \hat{c}(\mathbf{x}_t, \mathbf{u}_t)$
5:   Adjust LQR-FLM KL step $\epsilon_t$ based on ratio of residual costs-to-go $\tilde{S}$ and sampled costs-to-go $S$
6:   Compute $\eta_t$ using dual function from Eq. (4)
7:   Perform LQR-FLM update to compute $\hat{p}(\mathbf{u}_t|\mathbf{x}_t)$: $\min_{p^{(i)}} E_{p^{(i)}}[Q(\mathbf{x}_t, \mathbf{u}_t)]$ s.t. $E_{p^{(i)}}\big[D_{\mathrm{KL}}(p^{(i)} \,\|\, p^{(i-1)})\big] \leq \epsilon_t$
8:   Perform PI2 update using cost residuals and LQR-FLM actions to compute the new policy: $p^{(k)}(\mathbf{u}_t|\mathbf{x}_t) \propto \hat{p}(\mathbf{u}_t|\mathbf{x}_t)\, E_{p^{(i-1)}}\big[\exp\big(-\frac{1}{\eta_t} \tilde{S}_t\big)\big]$
9: end for

5. Training Parametric Policies with GPS

PILQR offers an approach to perform trajectory optimization of TVLG policies. In this work, we employ mirror descent guided policy search (MDGPS) (Montgomery & Levine, 2016) in order to use PILQR to train parametric policies, such as neural networks. Instead of directly learning the parameters of a high-dimensional parametric or "global policy" with RL, we first learn simple TVLG policies, which we refer to as "local policies" $p(\mathbf{u}_t|\mathbf{x}_t)$, for various initial conditions of the task. After optimizing the local policies, the optimized controls from these policies are used to create a training set for learning the global policy $\pi_\theta$ in a supervised manner. Hence, the final global policy generalizes across multiple local policies.

Using the TVLG representation of the local policies makes it straightforward to incorporate PILQR into the MDGPS framework. Instead of constraining against the old local TVLG policy as in Theorem (1), each instance of the local policy is now constrained against the old global policy:

$$\min_{p^{(i)}} E_{p^{(i)}}[S(\mathbf{x}_t, \mathbf{u}_t)] \quad \text{s.t.} \quad E_{p^{(i-1)}}\big[D_{\mathrm{KL}}(p^{(i)} \,\|\, \pi^{(i-1)})\big] \leq \epsilon.$$

The two-stage update proceeds as described in Section 4.1, with the change that the LQR-FLM policy is now constrained against the old global policy $\pi^{(i-1)}$.
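The subsequent supervised step of MDGPS then regresses the global policy onto the local controllers. A minimal sketch (the actual MDGPS objective matches full Gaussian action distributions with a neural network policy; here we only build mean-action regression targets with hypothetical data structures):

```python
import numpy as np

def global_policy_targets(local_policies, state_rollouts):
    """Build the supervised training set (x_t, local mean action at x_t) for the global policy.

    local_policies: list of TVLG controllers with attributes K (T, dU, dX) and k (T, dU).
    state_rollouts: matching list of arrays, each (num_rollouts, T, dX), of visited states.
    """
    X, U = [], []
    for policy, states in zip(local_policies, state_rollouts):
        for traj in states:                       # traj: (T, dX)
            for t, x in enumerate(traj):
                X.append(x)
                U.append(policy.K[t].dot(x) + policy.k[t])
    return np.array(X), np.array(U)

# The global policy pi_theta is then fit by supervised regression, e.g. minimizing
# sum_i || pi_theta(x_i) - u_i ||^2 with any standard neural network training loop.
```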

6. Experimental Evaluation

Our experiments aim to answer the following questions: (1) How does our method compare to other trajectory-centric and deep RL algorithms in terms of final performance and sample efficiency? (2) Can we utilize linear-Gaussian policies trained using PILQR to obtain robust neural network policies using MDGPS? (3) Is our proposed algorithm capable of learning complex manipulation skills on a real robotic platform? We study these questions through a set of simulated comparisons against prior methods, as well as real-world tasks using a PR2 robot. The performance of each method can be seen in our supplementary video.² Our focus in this work is specifically on robotics tasks that involve manipulation of objects, since such tasks often exhibit elements of continuous and discontinuous dynamics and require sample-efficient methods, making them challenging for both model-based and model-free methods.

²https://sites.google.com/site/icml17pilqr


Figure 3. Average final distance from the block to the goal on one condition of the gripper pusher task. This condition is difficult due to the block being initialized far away from the gripper and the goal area, and only PILQR is able to succeed in reaching the block and pushing it toward the goal. Results for additional conditions are available in Appendix 8.4, and the supplementary video demonstrates the final behavior of each learned policy.

6.1. Simulation Experiments

We evaluate our method on three simulated robotic manipulation tasks, depicted in Figure 2 and discussed below:

Gripper pusher. This task involves controlling a 4 DoF arm with a gripper to push a white block to a red goal area. The cost function is a weighted combination of the distance from the gripper to the block and from the block to the goal.

Reacher. The reacher task from OpenAI gym (Brockman et al., 2016) requires moving the end of a 2 DoF arm to a target position. This task is included to provide comparisons against prior methods. The cost function is the distance from the end effector to the target. We modify the cost function slightly: the original task uses an $\ell_2$ norm, while we use a differentiable Huber-style loss, which is more typical for LQR-based methods (Tassa et al., 2012).

Door opening. This task requires opening a door with a 6 DoF 3D arm. The arm must grasp the handle and pull the door to a target angle, which can be particularly challenging for model-based methods due to the complex contacts between the hand and the handle, and the fact that a contact must be established before the door can be opened. The cost function is a weighted combination of the distance of the end effector to the door handle and the angle of the door.

Additional experimental setup details, including the exact cost functions, are provided in Appendix 8.3.1.
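For illustration, a cost of the weighted-distance form described above might be written as follows; the weights and terms are placeholders, and the exact costs used in the experiments are those in Appendix 8.3.1:

```python
import numpy as np

def gripper_pusher_cost(gripper_pos, block_pos, goal_pos, w_reach=1.0, w_push=5.0):
    """Weighted combination of gripper-to-block and block-to-goal distances."""
    return (w_reach * np.linalg.norm(gripper_pos - block_pos)
            + w_push * np.linalg.norm(block_pos - goal_pos))
```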

We first compare PILQR to LQR-FLM and PI2 on the gripper pusher and door opening tasks. Figure 3 details the performance of each method on the most difficult condition for the gripper pusher task. Both LQR-FLM and PI2 perform

Figure 4. Final distance from the reacher end effector to the target, averaged across 300 random test conditions per iteration. MDGPS with LQR-FLM, MDGPS with PILQR, TRPO, and DDPG all perform competitively. However, as the log scale for the x axis shows, TRPO and DDPG require orders of magnitude more samples. MDGPS with PI2 performs noticeably worse.

significantly worse on the two more difficult conditions of this task. While PI2 improves in performance as we provide more samples, LQR-FLM is bounded by its ability to model the dynamics, and thus predict the costs, at the moment when the gripper makes contact with the block. Our method solves all four conditions with 400 total episodes per condition and, as shown in the supplementary video, is able to learn a diverse set of successful behaviors including flicking, guiding, and hitting the block. On the door opening task, PILQR trains TVLG policies that succeed at opening the door from each of the four initial robot positions. While the policies trained with LQR-FLM are able to reach the handle, they fail to open the door.

Next we evaluate neural network policies on the reacher task. Figure 4 shows results for MDGPS with each local policy method, as well as two prior deep RL methods that directly learn neural network policies: trust region policy optimization (TRPO) (Schulman et al., 2015) and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016). MDGPS with LQR-FLM and MDGPS with PILQR perform competitively in terms of the final distance from the end effector to the target, which is unsurprising given the simplicity of the task, whereas MDGPS with PI2 is again not able to make much progress. On the reacher task, DDPG and TRPO use 25 and 150 times more samples, respectively, to achieve approximately the same performance as MDGPS with LQR-FLM and PILQR. For comparison, amongst previous deep RL algorithms that combined model-based and model-free methods, SVG and NAF with imagination rollouts reported using approximately up to five times fewer samples than DDPG on a similar reacher task (Heess et al., 2015; Gu et al., 2016). Thus we can expect that MDGPS with our method is about


Figure 5. Minimum angle in radians of the door hinge (lower is better), averaged across 100 random test conditions per iteration. MDGPS with PILQR outperforms all other methods we compare against, with orders of magnitude fewer samples than DDPG and TRPO, which is the only other successful algorithm.

one order of magnitude more sample-efficient than SVG and NAF. While this is a rough approximation, it demonstrates a significant improvement in efficiency.

Finally, we compare the same methods for training neural network policies on the door opening task, shown in Figure 5. TRPO requires 20 times more samples than MDGPS with PILQR to learn a successful neural network policy. The other three methods were unable to learn a policy that opens the door despite extensive hyperparameter tuning. We provide additional simulation results in Appendix 8.4.

6.2. Real Robot Experiments

To evaluate our method on a real robotic platform, we use a PR2 robot (see Figure 1) to learn the following tasks:

Hockey. The hockey task requires using a stick to hit a puck into a goal 1.4 m away. The cost function consists of two parts: the distance between the current position of the stick and a target pose that is close to the puck, and the distance between the position of the puck and the goal. The puck is tracked using a motion capture system. Although the cost provides some shaping, this task presents a significant challenge due to the difference in outcomes based on whether or not the robot actually strikes the puck, making it challenging for prior methods, as we show below.

Power plug plugging. In this task, the robot must plug a power plug into an outlet. The cost function is the distance between the plug and a target location inside the outlet. This task requires fine manipulation to fully insert the plug. Our TVLG policies consist of 100 time steps and we control our robot at a frequency of 20 Hz. For further details of the experimental setup, including the cost functions, we refer the reader to Appendix 8.3.2.

Figure 6. Single condition comparison of the hockey task performed on the real robot. Costs lower than the dotted line correspond to the puck entering the goal.

Both of these tasks have difficult, discontinuous dynamics at the contacts between the objects, and both require a high degree of precision to succeed. In contrast to prior works (Daniel et al., 2013) that use kinesthetic teaching to initialize a policy that is then fine-tuned with model-free methods, our method does not require any human demonstrations. The policies are randomly initialized using a Gaussian distribution with zero mean. Such initialization does not provide any information about the task to be performed. In all of the real robot experiments, policies are updated every 10 rollouts and the final policy is obtained after 20-25 iterations, which corresponds to mastering the skill with less than one hour of experience.

In the first set of experiments, we aim to learn a policy that is able to hit the puck into the goal for a single position of the goal and the puck. The results of this experiment are shown in Figure 6. In the case of the prior PI2 method (Theodorou et al., 2010), the robot was not able to hit the puck. Since the puck position has the largest influence on the cost, the resulting learning curve shows little change in the cost over the course of training. The policy to move the arm towards the recorded arm position that enables hitting the puck turned out to be too challenging for PI2 in the limited number of trials used for this experiment. In the case of LQR-FLM, the robot was able to occasionally hit the puck in different directions. However, the resulting policy could not capture the complex dynamics of the sliding puck or the discrete transition, and was unable to hit the puck toward the goal. The PILQR method was able to learn a robust policy that consistently hits the puck into the goal. Using the step adjustment rule described in Section 4.4, the algorithm would shift towards model-free updates from the PI2 method as the TVLG approximation of the dynamics became less accurate. Using our method, the robot was able to get to the final position of the arm


Figure 7. Experimental setup of the hockey task and the success rate of the final PILQR-MDGPS policy. Red and blue: goal positions used for training. Green: new goal position.

using fast model-based updates from LQR-FLM and learn the puck-hitting policy, which is difficult to model, by automatically shifting towards model-free PI2 updates.

In our second set of hockey experiments, we evaluate whether we can learn a neural network policy using the MDGPS-PILQR algorithm that can hit the puck into different goal locations. The goals were spaced 0.5 m apart (see Figure 7). The strategies for hitting the puck into different goal positions differ substantially, since the robot must adjust the arm pose to approach the puck from the right direction and aim toward the target. This makes it quite challenging to learn a single policy for this task. We performed 30 rollouts for three different positions of the goal (10 rollouts each), two of which were used during training. The neural network policy was able to hit the puck into the goal in 90% of the cases (see Figure 7). This shows that our method can learn high-dimensional neural network policies that generalize across various conditions.

The results of the plug experiment are shown in Figure 8. PI2 alone was unable to reach the socket. The LQR-FLM algorithm succeeded only 60% of the time at convergence. In contrast to the peg insertion-style tasks evaluated in prior work that used LQR-FLM (Levine et al., 2015), this task requires very fine manipulation due to the small size of the plug. Our method was able to converge to a policy that plugged in the power plug on every rollout at convergence. The supplementary video illustrates the final behaviors of each method for both the hockey and power plug tasks.³

7. Discussion and Future Work

We presented an algorithm that combines elements of model-free and model-based RL, with the aim of combining the sample efficiency of model-based methods with the ability of model-free methods to improve the policy even in situations where the model's structural assumptions are violated. We show that a particular choice of policy representation, TVLG controllers, is amenable to fast optimization with model-based LQR-FLM and model-free PI2

³https://sites.google.com/site/icml17pilqr

Figure 8. Single condition comparison of the power plug task performed on the real robot. Note that costs above the dotted line correspond to executions that did not actually insert the plug into the socket. Only our method (PILQR) was able to consistently insert the plug all the way into the socket by the final iteration.

algorithms using sample-based updates. We propose a hybrid algorithm based on these two components, where the PI2 update is performed on the residuals between the true sample-based cost and the cost estimated under the local linear models. This algorithm has a number of appealing properties: it naturally trades off between model-based and model-free updates based on the amount of model error, can easily be extended with a KL-divergence constraint for stable learning, and can be effectively used for real-world robotic learning. We further demonstrate that, although this algorithm is specific to TVLG policies, it can be integrated into the GPS framework in order to train arbitrary parameterized policies, including deep neural networks.

We evaluated our approach on a range of challenging simulated and real-world tasks. The results show that our method combines the efficiency of model-based learning with the ability of model-free methods to succeed on tasks with discontinuous dynamics and costs. We further illustrate in direct comparisons against state-of-the-art model-free deep RL methods that, when combined with the GPS framework, our method achieves substantially better sample efficiency. It is worth noting, however, that the application of trajectory-centric RL methods such as ours, even when combined with GPS, requires the ability to reset the environment into consistent initial states (Levine & Abbeel, 2014; Levine et al., 2016). Recent work proposes a clustering method for lifting this restriction by sampling trajectories from random initial states and assembling them into task instances after the fact (Montgomery et al., 2017). Integrating this technique into our method would further improve its generality. An additional limitation of our method is that the form of both the model-based and model-free update requires a continuous action space. Extensions to discrete or hybrid action spaces would require some kind of continuous relaxation, and this is left for future work.


Acknowledgements

The authors would like to thank Sean Mason for his help with preparing the real robot experiments. This work was supported in part by National Science Foundation grants IIS-1614653, IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. Marvin Zhang was supported by a BAIR fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.

References

Akrour, R., Abdolmaleki, A., Abdulsamad, H., and Neumann, G. Model-free trajectory optimization for reinforcement learning. In ICML, 2016.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. Path integral guided policy search. In ICRA, 2017.
Daniel, C., Neumann, G., Kroemer, O., and Peters, J. Learning sequential motor tasks. In ICRA, 2013.
Deisenroth, M., Rasmussen, C., and Fox, D. Learning to control a low-cost manipulator using data-efficient reinforcement learning. In RSS, 2011.
Deisenroth, M., Neumann, G., and Peters, J. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.
Deisenroth, M., Fox, D., and Rasmussen, C. Gaussian processes for data-efficient learning in robotics and control. PAMI, 2014.
Farshidian, F., Neunert, M., and Buchli, J. Learning of closed-loop motion control. In IROS, 2014.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. CoRR, abs/1603.00748, 2016.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
Kingma, D. and Welling, M. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
Kober, J., Bagnell, J., and Peters, J. Reinforcement learning in robotics: a survey. International Journal of Robotics Research, 32(11):1238-1274, 2013.
Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In ICRA, 2015.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. JMLR, 17(1), 2016.
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR, 2016.
Lioutikov, R., Paraschos, A., Neumann, G., and Peters, J. Sample-based information-theoretic stochastic optimal control. In ICRA, 2014.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. In NIPS Workshop on Deep Learning, 2013.
Montgomery, W. and Levine, S. Guided policy search via approximate mirror descent. In NIPS, 2016.
Montgomery, W., Ajay, A., Finn, C., Abbeel, P., and Levine, S. Reset-free guided policy search: efficient deep reinforcement learning with stochastic initial states. In ICRA, 2017.
Pan, Y. and Theodorou, E. Probabilistic differential dynamic programming. In NIPS, 2014.
Pastor, P., Hoffmann, H., Asfour, T., and Schaal, S. Learning and generalization of motor skills by learning from demonstration. In ICRA, 2009.
Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 2008.
Peters, J., Mülling, K., and Altun, Y. Relative entropy policy search. In AAAI, 2010.
Schaal, S., Peters, J., Nakanishi, J., and Ijspeert, A. Control, planning, learning, and imitation with dynamic movement primitives. In IROS Workshop on Bilateral Paradigms on Humans and Humanoids, 2003.
Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In ICML, 2015.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
Sutton, R. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors. In IROS, 2012.
Theodorou, E., Buchli, J., and Schaal, S. A generalized path integral control approach to reinforcement learning. JMLR, 11, 2010.