Abstract arXiv:2005.07513v1 [cs.LG] 15 May 2020

A Distributional View on Multi-Objective Policy Optimization

Abbas Abdolmaleki 1 Sandy H Huang 1 Leonard Hasenclever 1 Michael Neunert 1 H Francis Song 1

Martina Zambelli 1 Murilo F Martins 1 Nicolas Heess 1 Raia Hadsell 1 Martin Riedmiller 1

Abstract

Many real-world problems require trading off mul-tiple competing objectives However these ob-jectives are often in different units andor scaleswhich can make it challenging for practitioners toexpress numerical preferences over objectives intheir native units In this paper we propose a novelalgorithm for multi-objective reinforcement learn-ing that enables setting desired preferences forobjectives in a scale-invariant way We proposeto learn an action distribution for each objectiveand we use supervised learning to fit a parametricpolicy to a combination of these distributions Wedemonstrate the effectiveness of our approach onchallenging high-dimensional real and simulatedrobotics tasks and show that setting different pref-erences in our framework allows us to trace outthe space of nondominated solutions

1 IntroductionReinforcement learning (RL) algorithms do an excellentjob at training policies to optimize a single scalar rewardfunction Recent advances in deep RL have made it possi-ble to train policies that exceed human-level performanceon Atari (Mnih et al 2015) and Go (Silver et al 2016)perform complex robotic manipulation tasks (Zeng et al2019) learn agile locomotion (Tan et al 2018) and evenobtain reward in unanticipated ways (Amodei et al 2016)

However many real-world tasks involve multiple possiblycompeting objectives For instance choosing a financialportfolio requires trading off between risk and return con-trolling energy systems requires trading off performanceand cost and autonomous cars must trade off fuel costs ef-ficiency and safety Multi-objective reinforcement learning(MORL) algorithms aim to tackle such problems (Roijerset al 2013 Liu et al 2015) A common approach is scalar-ization based on preferences across objectives transformthe multi-objective reward vector into a single scalar re-

Equal contribution 1DeepMind Correspondence to AbbasAbdolmaleki ltaabdolmalekigooglecomgt Sandy H Huangltshhuanggooglecomgt

Figure 1 We demonstrate our approach in four complex continu-ous control domains in simulation and in the real world Videosare at httpsitesgooglecomviewmo-mpo

ward (eg by taking a convex combination) and then usestandard RL to optimize this scalar reward

It is tricky though for practitioners to pick the appropriatescalarization for a desired preference across objectives be-cause often objectives are defined in different units andorscales For instance suppose we want an agent to com-plete a task while minimizing energy usage and mechan-ical wear-and-tear Task completion may correspond to asparse reward or to the number of square feet a vacuumingrobot has cleaned and reducing energy usage and mechani-cal wear-and-tear could be enforced by penalties on powerconsumption (in kWh) and actuator efforts (in N or Nm)respectively Practitioners would need to resort to usingtrial and error to select a scalarization that ensures the agentprioritizes actually doing the task (and thus being useful)over saving energy

Motivated by this we propose a scale-invariant approach forencoding preferences derived from the RL-as-inference per-spective Instead of choosing a scalarization practitionersset a constraint per objective Based on these constraints welearn an action distribution per objective that improves onthe current policy Then to obtain a single updated policythat makes these trade-offs we use supervised learning tofit a policy to the combination of these action distributionsThe constraints control the influence of each objective onthe policy by constraining the KL-divergence between eachobjective-specific distribution and the current policy Thehigher the constraint value the more influence the objec-tive has Thus a desired preference over objectives can beencoded as the relative magnitude of these constraint values

Fundamentally scalarization combines objectives in rewardspace whereas our approach combines objectives in distri-

arX

iv2

005

0751

3v1

[cs

LG

] 1

5 M

ay 2

020


bution space thus making it invariant to the scale of rewardsIn principle our approach can be combined with any RLalgorithm regardless of whether it is off-policy or on-policyWe combine it with maximum a posteriori policy optimiza-tion (MPO) (Abdolmaleki et al 2018ab) an off-policyactor-critic RL algorithm and V-MPO (Song et al 2020)an on-policy variant of MPO We call these two algorithmsmulti-objective MPO (MO-MPO) and multi-objective V-MPO (MO-V-MPO) respectively

Our main contribution is providing a distributional view onMORL which enables scale-invariant encoding of prefer-ences We show that this is a theoretically-grounded ap-proach that arises from taking an RL-as-inference perspec-tive of MORL Empirically we analyze the mechanics ofMO-MPO and show it finds all Pareto-optimal policies in apopular MORL benchmark task Finally we demonstratethat MO-MPO and MO-V-MPO outperform scalarized ap-proaches on multi-objective tasks across several challenginghigh-dimensional continuous control domains (Fig 1)

2 Related Work21 Multi-Objective Reinforcement Learning

Multi-objective reinforcement learning (MORL) algorithmsare either single-policy or multiple-policy (Vamplew et al2011) Single-policy approaches seek to find the optimalpolicy for a given scalarization of the multi-objective prob-lem Often this scalarization is linear but other choices havealso been explored (Van Moffaert et al 2013)

However the scalarization may be unknown at training timeor it may change over time Multiple-policy approacheshandle this by finding a set of policies that approximatesthe true Pareto front Some approaches repeatedly call asingle-policy MORL algorithm with strategically-chosenscalarizations (Natarajan amp Tadepalli 2005 Roijers et al2014 Mossalam et al 2016 Zuluaga et al 2016) Otherapproaches learn a set of policies simultaneously by using amulti-objective variant of the Q-learning update rule (Barrettamp Narayanan 2008 Moffaert amp Nowe 2014 Reymond ampNowe 2019 Yang et al 2019) or by modifying gradient-based policy search (Parisi et al 2014 Pirotta et al 2015)

Most existing approaches for finding the Pareto front arelimited to discrete state and action spaces in which tabularalgorithms are sufficient Although recent work combiningMORL with deep RL handles high-dimensional observa-tions this is in domains with low-dimensional and usuallydiscrete action spaces (Mossalam et al 2016 van Seijenet al 2017 Friedman amp Fontaine 2018 Abels et al 2019Reymond amp Nowe 2019 Yang et al 2019 Nottinghamet al 2019) In contrast we evaluate our approach on con-

tinuous control tasks with more than 20 action dimensions1

A couple of recent works have applied deep MORL to findthe Pareto front in continuous control tasks these worksassume scalarization and rely on additionally learning ei-ther a meta-policy (Chen et al 2019) or inter-objectiverelationships (Zhan amp Cao 2019) We take an orthogonalapproach to existing approaches one encodes preferencesvia constraints on the influence of each objective on thepolicy update instead of via scalarization MO-MPO canbe run multiple times with different constraint settings tofind a Pareto front of policies

22 Constrained Reinforcement Learning

An alternate way of setting preferences is to enforce thatpolicies meet certain constraints For instance thresholdlexicographic ordering approaches optimize a (single) ob-jective while meeting specified threshold values on the otherobjectives (Gabor et al 1998) optionally with slack (Wrayet al 2015) Similarly safe RL is concerned with learningpolicies that optimize a scalar reward while not violatingsafety constraints (Achiam et al 2017 Chow et al 2018)this has also been studied in the off-policy batch RL setting(Le et al 2019) Related work minimizes costs while ensur-ing the policy meets a constraint on the minimum expectedreturn (Bohez et al 2019) but this requires that the desiredor achievable reward is known a priori In contrast MO-MPO does not require knowledge of the scale of rewardsIn fact often there is no easy way to specify constraints onobjectives eg it is difficult to figure out a priori how muchactuator effort a robot will need to use to perform a task

23 Multi-Task Reinforcement Learning

Multi-task reinforcement learning can also be cast as aMORL problem Generally these algorithms learn a separatepolicy for each task with shared learning across tasks (Tehet al 2017 Riedmiller et al 2018 Wulfmeier et al 2019)In particular Distral (Teh et al 2017) learns a shared priorthat regularizes the per-task policies to be similar to eachother and thus captures essential structure that is sharedacross tasks MO-MPO differs in that the goal is to learn asingle policy that must trade off across different objectives

Other multi-task RL algorithms seek to train a single agentto solve different tasks and thus need to handle differentreward scales across tasks Prior work uses adaptive normal-ization for the targets in value-based RL so that the agent

1MORL for continuous control tasks is difficult because thepolicy can no longer output arbitrary action distributions whichlimits how well it can compromise between competing objectivesIn state-of-the-art RL algorithms for continuous control policiestypically output a single action (eg D4PG (Barth-Maron et al2018)) or a Gaussian (eg PPO (Schulman et al 2017) MPO(Abdolmaleki et al 2018b) and SAC (Haarnoja et al 2018))


cares equally about all tasks (van Hasselt et al 2016 Hesselet al 2019) Similarly prior work in multi-objective opti-mization has dealt with objectives of different units andorscales by normalizing objectives to have similar magnitudes(Marler amp Arora 2005 Grodzevich amp Romanko 2006 Kimamp de Weck 2006 Daneshmand et al 2017 Ishibuchi et al2017) MO-MPO can also be seen as doing adaptive nor-malization but for any preference over objectives not justequal preferences

In general invariance to reparameterization of the functionapproximator has been investigated in optimization liter-ature resulting in for example natural gradient methods(Martens 2014) The common tool here is measuring dis-tances in function space instead of parameter space usingKL-divergence Similarly in this work to achieve invari-ance to the scale of objectives we use KL-divergence overpolicies to encode preferences

3 Background and NotationMulti Objective Markov Decision Process In this pa-per we consider a multi-objective RL problem defined by amulti-objective Markov Decision Process (MO-MDP) TheMO-MDP consists of states s isin S and actions a isin Aan initial state distribution p(s0) transition probabilitiesp(st+1|st at) which define the probability of changingfrom state st to st+1 when taking action at reward func-tions rk(s a) isin RNk=1 per objective k and a discountfactor γ isin [0 1) We define our policy πθ(a|s) as astate conditional distribution over actions parametrized byθ Together with the transition probabilities this givesrise to a state visitation distribution micro(s) We also con-sider per-objective action-value functions The action-value function for objective k is defined as the expectedreturn (ie cumulative discounted reward) from choosingaction a in state s for objective k and then following pol-icy π Qπk (s a) = Eπ[

suminfint=0 γ

trk(st at)|s0 = s a0 = a]We can represent this function using the recursive expres-sion Qπk (st at) = Ep(st+1|stat)

[rk(st at) + γV πk (st+1)

]

where V πk (s) = Eπ[Qπk (s a)] is the value function of π forobjective k

Problem Statement For any MO-MDP there is a setof nondominated policies ie the Pareto front A policyis nondominated if there is no other policy that improvesits expected return for an objective without reducing theexpected return of at least one other objective Given apreference setting our goal is to find a nondominated policyπθ that satisfies those preferences In our approach a settingof constraints does not directly correspond to a particularscalarization but we show that by varying these constraintsettings we can indeed trace out a Pareto front of policies

Algorithm 1 MO-MPO One policy improvement step1 given batch size (L) number of actions to sample (M ) (N )

Q-functions Qπoldk (s a)Nk=1 preferences εkNk=1 previous

policy πold previous temperatures ηkNk=1 replay buffer Dfirst-order gradient-based optimizer O

23 initialize πθ from the parameters of πold4 repeat5 Collect dataset si aij Qijk

LMNijk where

6 M actions aij sim πold(a|si) and Qijk = Qπoldk (si aij)

78 Compute action distribution for each objective9 for k = 1 N do

10 δηk larr nablaηkηkεk + ηksumLi

1Llog

(sumMj

1M

exp(Q

ijkηk

))11 Update ηk based on δηk using optimizer O12 qijk prop exp(

Qijkηk

)

13 end for1415 Update parametric policy16 δπ larr minusnablaθ

sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)17 (subject to additional KL regularization see Sec 422)18 Update πθ based on δπ using optimizer O1920 until fixed number of steps21 return πold = πθ

4 MethodWe propose a policy iteration algorithm for multi-objectiveRL Policy iteration algorithms decompose the RL probleminto two sub-problems and iterate until convergence

1 Policy evaluation estimate Q-functions given policy2 Policy improvement update policy given Q-functions

Algorithm 1 summarizes this two-step multi-objective pol-icy improvement procedure In Appendix E we explain howthis can be derived from the ldquoRL as inferencerdquo perspective

We describe multi-objective MPO in this section and explainmulti-objective V-MPO in Appendix D When there is onlyone objective MO-(V-)MPO reduces to (V-)MPO

41 Multi-Objective Policy Evaluation

In this step we learn Q-functions to evaluate the previouspolicy πold We train a separate Q-function per objectivefollowing the Q-decomposition approach (Russell amp Zim-dars 2003) In principle any Q-learning algorithm can beused as long as the target Q-value is computed with respectto πold2 In this paper we use the Retrace objective (Munos

2Russell amp Zimdars (2003) prove critics suffer from ldquoillusion ofcontrolrdquo if they are trained with conventional Q-learning (Watkins1989) In other words if each critic computes its target Q-valuebased on its own best action for the next state then they overesti-mate Q-values because in reality the parametric policy πθ (thatconsiders all criticsrsquo opinions) is in charge of choosing actions


et al 2016) to learn a Q-function Qπoldk (s aφk) for each

objective k parameterized by φk as follows

minφkN1

Nsumk=1

E(sa)simD

[(Qretk (s a)minusQπold

k (s aφk))2]

where Qretk is the Retrace target for objective k and the

previous policy πold and D is a replay buffer containinggathered transitions See Appendix C for details

42 Multi-Objective Policy Improvement

Given the previous policy πold(a|s) and associated Q-functions Qπold

k (s a)Nk=1 our goal is to improve the previ-ous policy for a given visitation distribution micro(s)3 To thisend we learn an action distribution for each Q-function andcombine these to obtain the next policy πnew(a|s) This is amulti-objective variant of the two-step policy improvementprocedure employed by MPO (Abdolmaleki et al 2018b)

In the first step for each objective k we learn an improvedaction distribution qk(a|s) such that Eqk(a|s)[Q

πoldk (s a)] ge

Eπold(a|s)[Qπoldk (s a)] where states s are drawn from a visi-

tation distribution micro(s)

In the second step we combine and distill the improveddistributions qk into a new parametric policy πnew (withparameters θnew) by minimizing the KL-divergence betweenthe distributions and the new parametric policy ie

θnew = argminθ

Nsumk=1

Emicro(s)

[KL(qk(a|s)πθ(a|s)

)] (1)

This is a supervised learning loss that performs maximumlikelihood estimate of distributions qk Next we will explainthese two steps in more detail

421 OBTAINING ACTION DISTRIBUTIONS PEROBJECTIVE (STEP 1)

To obtain the per-objective improved action distributionsqk(a|s) we optimize the standard RL objective for eachobjective Qk

maxqk

ints

micro(s)

inta

qk(a|s)Qk(s a) dads (2)

stints

micro(s) KL(qk(a|s)πold(a|s)) ds lt εk

where εk denotes the allowed expected KL divergence forobjective k We use these εk to encode preferences over ob-jectives More concretely εk defines the allowed influenceof objective k on the change of the policy

3In practice we use draws from the replay buffer to estimateexpectations over the visitation distribution micro(s)

For nonparametric action distributions qk(a|s) we can solvethis constrained optimization problem in closed form foreach state s sampled from micro(s) (Abdolmaleki et al 2018b)

qk(a|s) prop πold(a|s) exp(Qk(s a)

ηk

) (3)

where the temperature ηk is computed based on the corre-sponding εk by solving the following convex dual function

ηk = argminη

η εk + (4)

η

ints

micro(s) log

inta

πold(a|s) exp(Qk(s a)

η

)dads

In order to evaluate qk(a|s) and the integrals in (4) we drawL states from the replay buffer and for each state sampleM actions from the current policy πold In practice wemaintain one temperature parameter ηk per objective Wefound that optimizing the dual function by performing a fewsteps of gradient descent on ηk is effective and we initializewith the solution found in the previous policy iteration stepSince ηk should be positive we use a projection operatorafter each gradient step to maintain ηk gt 0 Please refer toAppendix C for derivation details

Application to Other Deep RL Algorithms Since theconstraints εk in (2) encode the preferences over objectivessolving this optimization problem with good satisfactionof constraints is key for learning a policy that satisfies thedesired preferences For nonparametric action distributionsqk(a|s) we can satisfy these constraints exactly One coulduse any policy gradient method (eg Schulman et al 20152017 Heess et al 2015 Haarnoja et al 2018) to obtainqk(a|s) in a parametric form instead However solving theconstrained optimization for parametric qk(a|s) is not exactand the constraints may not be well satisfied which impedesthe use of εk to encode preferences Moreover assuming aparametric qk(a|s) requires maintaining a function approx-imator (eg a neural network) per objective which cansignificantly increase the complexity of the algorithm andlimits scalability

Choosing εk It is more intuitive to encode preferences viaεk rather than via scalarization weights because the formeris invariant to the scale of rewards In other words having adesired preference across objectives narrows down the rangeof reasonable choices for εk but does not narrow down therange of reasonable choices for scalarization weights Inorder to identify reasonable scalarization weights a RL prac-titioner needs to additionally be familiar with the scale ofrewards for each objective In practice we have found thatlearning performance is robust to a wide range of scales forεk It is the relative scales of the εk that matter for encodingpreferences over objectivesmdashthe larger a particular εk iswith respect to others the more that objective k is preferredOn the other hand if εk = 0 then objective k will have


no influence and will effectively be ignored In AppendixA1 we provide suggestions for setting εk given a desiredpreference over objectives

422 FITTING A NEW PARAMETRIC POLICY (STEP 2)

In the previous section for each objective k we have ob-tained an improved action distribution qk(a|s) Next wewant to combine these distributions to obtain a single para-metric policy that trades off the objectives according tothe constraints εk that we set For this we solve a super-vised learning problem that fits a parametric policy to theper-objective action distributions from step 1

θnew = argmaxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s) log πθ(a|s) dads

stints

micro(s) KL(πold(a|s) πθ(a|s)) ds lt β (5)

where θ are the parameters of our policy (a neural network)and the KL constraint enforces a trust region of size β thatlimits the overall change in the parametric policy The KLconstraint in this step has a regularization effect that pre-vents the policy from overfitting to the sample-based actiondistributions and therefore avoids premature convergenceand improves stability of learning (Schulman et al 2015Abdolmaleki et al 2018ab)

Similar to the first policy improvement step we evaluatethe integrals by using the L states sampled from the replaybuffer and the M actions per state sampled from the oldpolicy In order to optimize (5) using gradient descent weemploy Lagrangian relaxation similar to in MPO (Abdol-maleki et al 2018a) (see Appendix C for more detail)

5 Experiments Toy DomainsIn the empirical evaluation that follows we will first demon-strate the mechanics and scale-invariance of MO-MPO ina single-state environment (Sec 51) and then show thatMO-MPO can find all Pareto-dominant policies in a popularMORL benchmark (Sec 52) Finally we show the benefitof using MO-MPO in high-dimensional continuous controldomains including on a real robot (Sec 6) AppendicesA and B contain a detailed description of all domains andtasks experimental setups and implementation details

Baselines The goal of our empirical evaluation is toanalyze the benefit of using our proposed multi-objectivepolicy improvement step (Sec 422) that encodes prefer-ences over objectives via constraints εk on expected KL-divergences rather than via weights wk Thus we primarilycompare MO-MPO against scalarized MPO which relieson linear scalarization weights wk to encode preferencesThe only difference between MO-MPO and scalarized MPOis the policy improvement step for scalarized MPO a single

Figure 2 A simple environment in which the agent starts at stateS0 and chooses to navigate to one of three terminal states Thereare two objectives Taking the left action for instance leads toa reward of 1 for the first objective and 4 for the second

improved action distribution q(a|s) is computed based onsumk wkQk(s a) and a single KL constraint ε

State-of-the-art approaches that combine MORL with deepRL assume linear scalarization as well either learning aseparate policy for each setting of weights (Mossalam et al2016) or learning a single policy conditioned on scalar-ization weights (Friedman amp Fontaine 2018 Abels et al2019) Scalarized MPO addresses the former problemwhich is easier The policy evaluation step in scalarizedMPO is analagous to scalarized Q-learning proposed byMossalam et al (2016) As we show later in Sec 6 evenlearning an optimal policy for a single scalarization is diffi-cult in high-dimensional continuous control domains

51 Simple World

First we will examine the behavior of MO-MPO in a simplemulti-armed bandit with three actions (up right andleft) (Fig 2) inspired by Russell amp Zimdars (2003)We train policies with scalarized MPO and with MO-MPOThe policy evaluation step is exact because the Q-valuefunction for each objective is known it is equal to thereward received for that objective after taking each actionas labeled in Fig 24

We consider three possible desired preferences equal pref-erence for the two objectives preferring the first and pre-ferring the second Encoding preferences in scalarizedMPO amounts to choosing appropriate linear scalarizationweights and in MO-MPO amounts to choosing appropriateεrsquos We use the following weights and εrsquos

bull equal preference weights [05 05] or εrsquos [001 001]bull prefer first weights [09 01] or εrsquos [001 0002]bull prefer second weights [01 09] or εrsquos [0002 001]

We set ε = 001 for scalarized MPO If we start with a4The policy improvement step can also be computed exactly

because solving for the optimal temperature η (or η1 and η2 in theMO-MPO case) is a convex optimization problem and the KL-constrained policy update is also a convex optimization problemwhen there is only one possible state We use CVXOPT (Andersenet al 2020) as our convex optimization solver


uniform policy and run MPO with β = 0001 until thepolicy converges scalarized MPO and MO-MPO result insimilar policies (Fig 3 solid bars) up for equal preferenceright for prefer first and left for prefer second

However if we make the rewards imbalanced by multiply-ing the rewards obtained for the first objective by 20 (egleft now obtains a reward of [20 4]) we see that the poli-cies learned by scalarized MPO shift to preferring the opti-mal action for the first objective (right) in both the equalpreference and prefer second cases (Fig 3 striped bars)In contrast the final policies for MO-MPO are the same asfor balanced rewards because in each policy improvementstep MO-MPO optimizes for a separate temperature ηk thatscales each objectiversquos Q-value function This ηk is com-puted based on the corresponding allowed KL-divergenceεk so when the rewards for any objective k are multipliedby a factor but εk remains the same the computed ηk endsup being scaled by that factor as well neutralizing the effectof the scaling of rewards (see Eq (4))

Even in this simple environment we see that MO-MPOrsquosscale-invariant way of encoding preferences is valuable Inmore complex domains in which the Q-value functionsmust be learned in parallel with the policy the (automatic)dynamic adjustment of temperatures ηk per objective be-comes more essential (Sec 6)

The scale of εk controls the amount that objective k caninfluence the policyrsquos update If we set ε1 = 001 and sweepover the range from 0 to 001 for ε2 the resulting policies gofrom always picking right to splitting probability acrossright and up to always picking up (Fig 4 right) Incontrast setting weights leads to policies quickly convergingto placing all probability on a single action (Fig 4 left)We hypothesize this limits the ability of scalarized MPOto explore and find compromise policies (that perform wellwith respect to all objectives) in more challenging domains

52 Deep Sea Treasure

An important quality of any MORL approach is the abilityto find a variety of policies on the true Pareto front (Roijerset al 2013) We demonstrate this in Deep Sea Treasure(DST) (Vamplew et al 2011) a popular benchmark for test-ing MORL approaches DST consists of a 11times10 grid worldwith ten treasure locations The agent starts in the upper leftcorner (s0 = (0 0)) and has a choice of four actions (mov-ing one square up right down and left) When theagent picks up a treasure the episode terminates The agenthas two objectives treasure value and time penalty Thetime penalty is minus1 for each time step that the agent takesand farther-away treasures have higher treasure values Weuse the treasure values in Yang et al (2019)

We ran scalarized MPO with weightings [w 1 minus w] and

00

05

10equal preference prefer first prefer second

left up right00

05

10

left up right left up right

scalarizeddistributional

scalarized imbalanced rewardsdistributional imbalanced rewards

Figure 3 When two objectives have comparable reward scales(solid bars) scalarized MPO (first row) and MO-MPO (second row)learn similar policies across three different preferences Howeverwhen the scale of the first objective is much higher (striped bars)scalarized MPO shifts to always preferring the first objective Incontrast MO-MPO is scale-invariant and still learns policies thatsatisfy the preferences The y-axis denotes action probability

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

Figure 4 A visualization of policies during learningmdasheach curvecorresponds to a particular setting of weights (left) or εrsquos (right)Policies are initialized as uniform (the blue dot) and are traineduntil convergence Each point (x y) corresponds to the policywith p(left) = x p(right) = y and p(up) = 1minusxminusy Thetop left and bottom right blue stars denote the optimal policy forthe first and second objectives respectively

w isin [0 001 002 1] All policies converged to a pointon the true Pareto front and all but three found the optimalpolicy for that weighting In terms of coverage policieswere found for eight out of ten points on the Pareto front5

We ran MO-MPO on this task as well for a range of εεtime isin [001 002 005] and εtreasure = c lowast εtime wherec isin [05 051 052 15] All runs converged to a policyon the true Pareto front and MO-MPO finds policies for allten points on the front (Fig 5 left) Note that it is the ratioof εrsquos that matters rather than the exact settingsmdashacrossall settings of εtime similar ratios of εtreasure to εtime resultin similar policies as this ratio increases policies tend toprefer higher-value treasures (Fig 5 right)

5Since scalarized MPO consistently finds the optimal policyfor any given weighting in this task with a more strategic selectionof weights we expect policies for all ten points would be found


Figure 5 Left Blue stars mark the true Pareto front for Deep SeaTreasure MO-MPO with a variety of settings for εk discoversall points on the true Pareto front The area of the orange circlesis proportional to the number of εk settings that converged tothat point Right As more preference is given to the treasureobjective (ie as x increases) policies tend to prefer higher-valuetreasures Each orange dot in the scatterplot corresponds to aparticular setting of εk

6 Experiments Continuous Control DomainsThe advantage of encoding preferences via εrsquos rather thanvia weights is apparent in more complex domains We com-pared our approaches MO-MPO and MO-V-MPO againstscalarized MPO and V-MPO in four high-dimensional con-tinuous control domains in MuJoCo (Todorov et al 2012)and on a real robot The domains we consider are

Humanoid We use the humanoid run task defined in Deep-Mind Control Suite (Tassa et al 2018) Policies must opti-mize for horizontal speed h while minimizing energy usageThe task reward is min(h10 1) where h is in meters persecond and the energy usage penalty is action `2-norm Thehumanoid has 21 degrees of freedom and the observationconsists of joint angles joint velocities head height handand feet positions torso vertical orientation and center-of-mass velocity for a total of 67 dimensions

Shadow Hand We consider three tasks on the ShadowDexterous Hand touch turn and orient In the touch andturn tasks policies must complete the task while minimiz-ing ldquopainrdquo A sparse task reward of 10 is given for pressingthe block with greater than 5N of force or for turning thedial from a random initial location to the target locationThe pain penalty penalizes the robot for colliding with ob-jects at high speed this penalty is defined as in Huang et al(2019) In the orient task there are three aligned objectivestouching the rectangular peg lifting it to a given height andorienting it to be perpendicular to the ground All threerewards are between 0 and 1 The Shadow Hand has five

fingers and 24 degrees of freedom actuated by 20 motorsThe observation consists of joint angles joint velocitiesand touch sensors for a total of 63 dimensions The touchand turn tasks terminate when the goal is reached or after 5seconds and the orient task terminates after 10 seconds

Humanoid Mocap We consider a large-scale humanoidmotion capture tracking task similar to in Peng et al (2018)in which policies must learn to follow motion capture ref-erence data6 There are five objectives each capturing adifferent aspect of the similarity of the pose between thesimulated humanoid and the mocap target joint orientationsjoint velocities hand and feet positions center-of-mass po-sitions and certain body positions and joint angles Theseobjectives are described in detail in Appendix B4 In orderto balance these multiple objectives prior work relied onheavily-tuned reward functions (eg Peng et al 2018) Thehumanoid has 56 degrees of freedom and the observation is1021-dimensional consisting of proprioceptive observationsas well as six steps of motion capture reference frames Intotal we use about 40 minutes of locomotion mocap datamaking this an extremely challenging domain

Sawyer Peg-in-Hole We train a Rethink Robotics Sawyerrobot arm to insert a cylindrical peg into a hole while min-imizing wrist forces The task reward is shaped towardpositioning the peg directly above the hole and increasesfor insertion and the penalty is the `1-norm of Cartesianforces measured by the wrist force-torque sensor The latterimplicitly penalizes contacts and impacts as well as exces-sive directional change (due to the gripperrsquos inertia inducingforces when accelerating) We impose a force thresholdto protect the hardwaremdashif this threshold is exceeded theepisode is terminated The action space is the end effectorrsquosCartesian velocity and the observation is 102-dimensionalconsisting of Cartesian position joint position and velocitywrist force-torque and joint action for three timesteps

61 Evaluation Metric

We run MO-(V-)MPO and scalarized (V-)MPO with a widerange of constraint settings εk and scalarization weightswk respectively corresponding to a wide range of possibledesired preferences (The exact settings are provided inAppendix A) For tasks with two objectives we plot thePareto front found by each approach We also computethe hypervolume of each found Pareto front this metric iscommonly-used for evaluating MORL algorithms (Vamplewet al 2011) Given a set of policies Π and a reference policyr that is dominated by all policies in this set this metric isthe hypervolume of the space of all policies that dominater and are dominated by at least one policy in Π We useDEAP (Fortin et al 2012) to compute hypervolumes

6This task was developed concurrently by Anonymous (2020)and does not constitute a contribution of this paper


3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

MO-MPO scalarized MPO

5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

MO-MPO scalarized MPOscalarized MPOMO-MPO

touch end of training

Figure 6 Pareto fronts found by MO-MPO and scalarized MO-MPO for humanoid run (top row) and Shadow Hand tasks Eachdot represents a single trained policy Corresponding hypervol-umes are in Table 1 Task reward is discounted for touch and turnwith a discount factor of 099 For orient the x and y axes are thetotal reward for the lift and orientation objectives respectively

62 Results Humanoid and Shadow Hand

In the run touch and turn tasks the two objectives arecompetingmdasha very high preference for minimizing actionnorm or pain as opposed to getting task reward will resultin a policy that always chooses zero-valued actions Acrossthese three tasks the Pareto front found by MO-MPO issuperior to the one found by scalarized MPO with respectto the hypervolume metric (Table 1)7 In particular MO-MPO finds more policies that perform well with respect toboth objectives ie are in the upper right portion of thePareto front MO-MPO also speeds up learning on run andtouch (Fig 6) Qualitatively MO-MPO trains run policiesthat look more natural and ldquohuman-likerdquo videos are athttpsitesgooglecomviewmo-mpo

When we scale the action norm penalty by 10times for runscalarized MPO policies no longer achieve high task rewardwhereas MO-MPO policies do (Fig 6 top right) Thissupports that MO-MPOrsquos encoding of preferences is indeedscale-invariant When the objectives are aligned and havesimilarly scaled rewards as in the orient task MO-MPOand scalarized MPO perform similarly as expected

Ablation We also ran vanilla MPO on humanoid run withthe same range of weight settings as for scalarized MPO toinvestigate how useful Q-decomposition is In vanilla MPOwe train a single critic on the scalarized reward function

7We use hypervolume reference points of [0minus104] for run[0minus105] for run with 10times penalty [0minus20] for touch and turnand the zero vector for orient and humanoid mocap

Task scalarized MPO MO-MPO

Humanoid run mid-training 11times106 33times106

Humanoid run 64times106 71times106

Humanoid run 1times penalty 50times106 59times106


Shadow touch mid-training 143 156Shadow touch 162 164Shadow turn 144 154Shadow orient 28times104 28times104

Humanoid Mocap 386times10minus6 341times10minus6

Table 1 Hypervolume measurements across tasks and approaches

which is equivalent to removing Q-decomposition fromscalarized MPO Vanilla MPO trains policies that achievesimilar task reward (up to 800) but with twice the actionnorm penalty (up to minus104) As a result the hypervolumeof the Pareto front that vanilla MPO finds is more than amagnitude worse than that of scalarized MPO and MO-MPO(22times106 versus 26times107 and 69times107 respectively)

63 Results Humanoid Mocap

The objectives in this task are mostly aligned In contrast tothe other experiments we use V-MPO (Song et al 2020) asthe base algorithm because it outperforms MPO in learningthis task In addition since V-MPO is an on-policy vari-ant of MPO this enables us to evaluate our approach inthe on-policy setting Each training run is very computa-tionally expensive so we train only a handful of policieseach8 None of the MO-V-MPO policies are dominatedby those found by V-MPO In fact although the weightsspan a wide range of ldquopreferencesrdquo for the joint velocity theonly policies found by V-MPO that are not dominated by aMO-V-MPO policy are those with extreme values for jointvelocity reward (either le 0006 or ge 0018) whereas it isbetween 00103 and 00121 for MO-V-MPO policies

Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1) findingpolicies that over- or under- prioritize any objective is unde-sirable Qualitatively the policies trained with MO-V-MPOlook more similar to the mocap reference datamdashthey exhibitless feet jittering compared to those trained with scalarizedV-MPO this can be seen in the corresponding video

64 Results Sawyer Peg-in-Hole

In this task we would like the robot to prioritize successfultask completion while minimizing wrist forces With this

8For MO-V-MPO we set all εk = 001 Also for each objec-tive we set εk = 0001 and set all others to 001 For V-MPOwe fix the weights of four objectives to reasonable values and tryweights of [01 03 06 1 5] for matching joint velocities


0 200000 400000

200

400

number of policy improvement steps

task reward

0 200000 400000

minus12

00minus

800

wrist force penalty

MO-MPO scalarized MPO [095 005] [08 02]

Figure 7 Task reward (left) and wrist force penalty (right) learningcurves for the Sawyer peg-in-hole task The policy trained withMO-MPO quickly learns to optimize both objectives whereas theother two do not Each line represents a single trained policy

in mind for MO-MPO we set εtask = 01 and εforce = 005and for scalarized MPO we try [wtask wforce] = [095 005]and [08 02] We find that policies trained with scalarizedMPO focus all learning on a single objective at the beginningof training we also observed this in the touch task wherescalarized MPO policies quickly learn to either maximizetask reward or minimize pain penalty but not both (Fig 6bottom left) In contrast the policy trained with MO-MPOsimultaneously optimizes for both objectives throughouttraining In fact throughout training the MO-MPO policydoes just as well with respect to task reward as the scalarizedMPO policy that cares more about task reward and similarlyfor wrist force penalty (Fig 7)

7 Conclusions and Future WorkIn this paper we presented a new distributional perspectiveon multi objective reinforcement learning that is derivedfrom the Rl-as-inference perspective This view leads to twonovel multi-objective RL algorithms namely MO-MPO andMO-V-MPO We showed these algorithms enable practition-ers to encode preferences in a scale-invariant way AlthoughMO-(V-)MPO is a single-policy approach to MORL it iscapable of producing a variety of Pareto-dominant policiesA limitation of this work is that we produced this set ofpolicies by iterating over a relatively large number of εrsquoswhich is computationally expensive In future work weplan to extend MO-(V-)MPO into a true multiple-policyMORL approach either by conditioning the policy on set-tings of εrsquos or by developing a way to strategically select εrsquosto train policies for analogous to what prior work has donefor weights (eg Roijers et al (2014))

AcknowledgmentsThe authors would like to thank Guillaume DesjardinsCsaba Szepesvari Jost Tobias Springenberg Steven BohezPhilemon Brakel Brendan Tracey Jonas Degrave JonasBuchli Leslie Fritz Chloe Rosenberg and many others atDeepMind for their support and feedback on this paper

ReferencesAbdolmaleki A Springenberg J T Degrave J Bohez S

Tassa Y Belov D Heess N and Riedmiller M Rela-tive entropy regularized policy iteration arXiv preprintarXiv181202256 2018a

Abdolmaleki A Springenberg J T Tassa Y Munos RHeess N and Riedmiller M Maximum a posterioripolicy optimisation In Proceedings of the Sixth Interna-tional Conference on Learning Representations (ICLR)2018b

Abels A Roijers D M Lenaerts T Nowe A andSteckelmacher D Dynamic weights in multi-objectivedeep reinforcement learning In Proceedings of the 36thInternational Conference on Machine Learning (ICML)2019

Achiam J Held D Tamar A and Abbeel P Constrainedpolicy optimization In Proceedings of the 34th Inter-national Conference on Machine Learning (ICML) pp22ndash31 2017

Amodei D Olah C Steinhardt J Christiano P Schul-man J and Mane D Concrete problems in AI safetyarXiv preprint arXiv160606565 2016

Andersen M S Dahl J and Vandenberghe L CVXOPTA Python package for convex optimization version 12httpscvxoptorg 2020

Anonymous CoMic Co-Training and Mimicry forReusable Skills 2020

Barrett L and Narayanan S Learning all optimal policieswith multiple criteria In Proceedings of the 25th Inter-national Conference on Machine Learning (ICML) pp41ndash47 2008

Barth-Maron G Hoffman M W Budden D Dabney WHorgan D TB D Muldal A Heess N and LillicrapT Distributed distributional deterministic policy gradi-ents In Proceedings of the Sixth International Conferenceon Learning Representations (ICLR) 2018

Bohez S Abdolmaleki A Neunert M Buchli J HeessN and Hadsell R Value constrained model-free contin-uous control arXiv preprint arXiv190204623 2019

Chen X Ghadirzadeh A Bjorkman M and Jensfelt PMeta-learning for multi-objective reinforcement learningarXiv preprint arXiv181103376 2019

Chentanez N Muller M Macklin M Makoviychuk Vand Jeschke S Physics-based motion capture imitationwith deep reinforcement learning In International Con-ference on Motion Interaction and Games (MIG) ACM2018


Chow Y Nachum O Duenez-Guzman E andGhavamzadeh M A Lyapunov-based approach to safereinforcement learning In Proceedings of the 31st Inter-national Conference on Neural Information ProcessingSystems (NeurIPS) pp 8092ndash8101 2018

Daneshmand M Tale Masouleh M Saadatzi M HOzcinar C and Anbarjafari G A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization Scientia Iranica24(6)2977ndash2991 2017

Fortin F-A De Rainville F-M Gardner M-A ParizeauM and Gagne C DEAP Evolutionary algorithms madeeasy Journal of Machine Learning Research 132171ndash2175 July 2012

Friedman E and Fontaine F Generalizing across multi-objective reward functions in deep reinforcement learningarXiv preprint arXiv180906364 2018

Gabor Z Kalmar Z and Szepesvari C Multi-criteriareinforcement learning In Proceedings of the FifteenthInternational Conference on Machine Learning (ICML)pp 197ndash205 1998

Grodzevich O and Romanko O Normalization and othertopics in multi-objective optimization In Proceedings ofthe Fields-MITACS Industrial Problems Workshop 2006

Haarnoja T Zhou A Abbeel P and Levine S Softactor-critic Off-policy maximum entropy deep reinforce-ment learning with a stochastic actor In Proceedings ofthe 35th International Conference on Machine Learning(ICML) pp 1861ndash1870 2018

Heess N Wayne G Silver D Lillicrap T Erez T andTassa Y Learning continuous control policies by stochas-tic value gradients In Advances in Neural InformationProcessing Systems 28 (NIPS) 2015

Hessel M Soyer H Espeholt L Czarnecki W SchmittS and van Hasselt H Multi-task deep reinforcementlearning with popart In Proceedings of the AAAI Confer-ence on Artificial Intelligence volume 33 pp 3796ndash38032019

Huang S H Zambelli M Kay J Martins M F TassaY Pilarski P M and Hadsell R Learning gentle objectmanipulation with curiosity-driven deep reinforcementlearning arXiv preprint arXiv190308542 2019

Ishibuchi H Doi K and Nojima Y On the effect ofnormalization in MOEAD for multi-objective and many-objective optimization Complex amp Intelligent Systems 3(4)279ndash294 2017

Kim I and de Weck O Adaptive weighted sum methodfor multiobjective optimization A new method for paretofront generation Structural and Multidisciplinary Opti-mization 31(2)105ndash116 2006

Kingma D P and Ba J Adam A method for stochasticoptimization In Proceedings of the Third InternationalConference on Learning Representations (ICLR) 2015

Le H Voloshin C and Yue Y Batch policy learningunder constraints In Proceedings of the 36th Inter-national Conference on Machine Learning (ICML) pp3703ndash3712 2019

Levine S Finn C Darrell T and Abbeel P End-to-endtraining of deep visuomotor policies Journal of MachineLearning Research 17(39)1ndash40 2016

Liu C Xu X and Hu D Multiobjective reinforcementlearning A comprehensive overview IEEE Transactionson Systems Man and Cybernetics Systems 45(3)385ndash398 2015

Marler R T and Arora J S Function-transformationmethods for multi-objective optimization EngineeringOptimization 37(6)551ndash570 2005

Martens J New perspectives on the natural gradient methodarXiv preprint arXiv14121193 2014

Merel J Ahuja A Pham V Tunyasuvunakool S LiuS Tirumala D Heess N and Wayne G Hierarchi-cal visuomotor control of humanoids In InternationalConference on Learning Representations (ICLR) 2019

Mnih V Kavukcuoglu K Silver D Rusu A A VenessJ Bellemare M G Graves A Riedmiller M Fidje-land A K Ostrovski G et al Human-level controlthrough deep reinforcement learning Nature 518(7540)529 2015

Moffaert K V and Nowe A Multi-objective reinforcementlearning using sets of Pareto dominating policies Journalof Machine Learning Research 153663ndash3692 2014

Mossalam H Assael Y M Roijers D M and WhitesonS Multi-objective deep reinforcement learning arXivpreprint arXiv161002707 2016

Munos R Stepleton T Harutyunyan A and BellemareM Safe and efficient off-policy reinforcement learningIn Proceedings of the 29th International Conference onNeural Information Processing Systems pp 1054ndash10622016

Natarajan S and Tadepalli P Dynamic preferences inmulti-criteria reinforcement learning In Proceedings ofthe 22nd International Conference on Machine Learning(ICML) pp 601ndash608 2005


Nottingham K Balakrishnan A Deshmukh J Christo-pherson C and Wingate D Using logical specificationsof objectives in multi-objective reinforcement learningarXiv preprint arXiv191001723 2019

Parisi S Pirotta M Smacchia N Bascetta L andRestelli M Policy gradient approaches for multi-objective sequential decision making In Proceedingsof the International Joint Conference on Neural Networks(IJCNN) pp 2323ndash2330 2014

Peng X B Abbeel P Levine S and van de Panne MDeepmimic Example-guided deep reinforcement learn-ing of physics-based character skills ACM Transactionson Graphics 37(4) 2018

Pirotta M Parisi S and Restelli M Multi-objectivereinforcement learning with continuous Pareto frontierapproximation In Proceedings of the Twenty-Ninth AAAIConference on Artificial Intelligence (AAAI) 2015

Reymond M and Nowe A Pareto-DQN Approximat-ing the Pareto front in complex multi-objective decisionproblems In Proceedings of the Adaptive and LearningAgents Workshop at the 18th International Conferenceon Autonomous Agents and MultiAgent Systems AAMAS2019

Riedmiller M Hafner R Lampe T Neunert M De-grave J Van de Wiele T Mnih V Heess N and Sprin-genberg J T Learning by playing - solving sparse rewardtasks from scratch arXiv preprint arXiv1802105672018

Roijers D M Vamplew P Whiteson S and Dazeley RA survey of multi-objective sequential decision-makingJournal of Artificial Intelligence Research 48(1)67ndash1132013

Roijers D M Whiteson S and Oliehoek F A Linearsupport for multi-objective coordination graphs In Pro-ceedings of the International Conference on AutonomousAgents and Multi-Agent Systems (AAMAS) pp 1297ndash1304 2014

Russell S and Zimdars A L Q-decomposition for rein-forcement learning agents In Proceedings of the Twenti-eth International Conference on International Conferenceon Machine Learning (ICML) pp 656ndash663 2003

Schulman J Levine S Abbeel P Jordan M and MoritzP Trust region policy optimization In Proceedings ofthe 32nd International Conference on Machine Learning(ICML) 2015

Schulman J Wolski F Dhariwal P Radford A andKlimov O Proximal policy optimization algorithmsarXiv preprint arXiv170706347 2017

Shadow Robot Company Shadow dexterous handhttpswwwshadowrobotcomproductsdexterous-hand 2020

Silver D Huang A Maddison C J Guez A Sifre Lvan den Driessche G Schrittwieser J Antonoglou IPanneershelvam V Lanctot M Dieleman S GreweD Nham J Kalchbrenner N Sutskever I Lillicrap TLeach M Kavukcuoglu K Graepel T and HassabisD Mastering the game of Go with deep neural networksand tree search Nature 529484ndash503 2016

Song H F Abdolmaleki A Springenberg J T ClarkA Soyer H Rae J W Noury S Ahuja A LiuS Tirumala D Heess N Belov D Riedmiller Mand Botvinick M M V-MPO On-policy maximum aposteriori policy optimization for discrete and continu-ous control In International Conference on LearningRepresentations 2020

Sutton R S Learning to predict by the methods of temporaldifferences Machine learning 3(1)9ndash44 1988

Tan J Zhang T Coumans E Iscen A Bai Y HafnerD Bohez S and Vanhoucke V Sim-to-real Learningagile locomotion for quadruped robots In Proceedingsof Robotics Science and Systems (RSS) 2018

Tassa Y Doron Y Muldal A Erez T Li Yde Las Casas D Budden D Abdolmaleki A Merel JLefrancq A Lillicrap T and Riedmiller M Deepmindcontrol suite arXiv preprint arXiv180100690 2018

Teh Y W Bapst V Czarnecki W M Quan J Kirk-patrick J Hadsell R Heess N and Pascanu R DistralRobust multitask reinforcement learning In Proceedingsof the 31st International Conference on Neural Informa-tion Processing Systems (NeurIPS) 2017

Todorov E Erez T and Tassa Y Mujoco A physicsengine for model-based control In Proceedings of theIEEERSJ International Conference on Intelligent Robotsand Systems (IROS) pp 5026ndash5033 2012

Vamplew P Dazeley R Berry A Issabekov R andDekker E Empirical evaluation methods for multiobjec-tive reinforcement learning algorithms Machine Learn-ing 84(1)51ndash80 Jul 2011

van Hasselt H P Guez A Hessel M Mnih V and Sil-ver D Learning values across many orders of magnitudeIn Advances in Neural Information Processing Systemspp 4287ndash4295 2016

Van Moffaert K Drugan M M and Nowe A Scalar-ized multi-objective reinforcement learning Novel de-sign techniques In Proceedings of the IEEE Symposiumon Adaptive Dynamic Programming and ReinforcementLearning (ADPRL) pp 191ndash199 2013


van Seijen H Fatemi M Romoff J Laroche R BarnesT and Tsang J Hybrid reward architecture for reinforce-ment learning In Proceedings of the 31st InternationalConference on Neural Information Processing Systems(NeurIPS) pp 5398ndash5408 2017

Watkins C J C H Learning from Delayed Rewards PhDthesis Kingrsquos College Cambridge UK May 1989

Wray K H Zilberstein S and Mouaddib A-I Multi-objective MDPs with conditional lexicographic rewardpreferences In Proceedings of the Twenty-Ninth AAAIConference on Artificial Intelligence (AAAI) pp 3418ndash3424 2015

Wulfmeier M Abdolmaleki A Hafner R SpringenbergJ T Neunert M Hertweck T Lampe T Siegel NHeess N and Riedmiller M Regularized hierarchicalpolicies for compositional transfer in robotics arXivpreprint arXiv190611228 2019

Yang R Sun X and Narasimhan K A generalized algo-rithm for multi-objective reinforcement learning and pol-icy adaptation In Proceedings of the 32nd InternationalConference on Neural Information Processing Systems(NeurIPS) pp 14610ndash14621 2019

Zeng A Song S Lee J Rodriguez A and FunkhouserT Tossingbot Learning to throw arbitrary objects withresidual physics In Proceedings of Robotics Scienceand Systems (RSS) 2019

Zhan H and Cao Y Relationship explainable multi-objective optimization via vector value function basedreinforcement learning arXiv preprint arXiv1910019192019

Zuluaga M Krause A and Puschel M ε-pal An ac-tive learning approach to the multi-objective optimizationproblem Journal of Machine Learning Research 17(1)3619ndash3650 2016


A Experiment DetailsIn this section we describe implementation details and spec-ify the hyperparameters used for our algorithm In all ourexperiments the policy and critic(s) are implemented withfeed-forward neural networks We use Adam (Kingma ampBa 2015) for optimization

In our continuous control tasks the policy returns a Gaus-sian distribution with a diagonal covariance matrix ieπθ(a|s) = N (microΣ) The policy is parametrized by aneural network which outputs the mean micro = micro(s) and di-agonal Cholesky factors A = A(s) such that Σ = AAT The diagonal factor A has positive diagonal elements en-forced by the softplus transform Aii larr log(1 + exp(Aii))to ensure positive definiteness of the diagonal covariancematrix

Table 2 shows the default hyperparameters that we use forMO-MPO and scalarized MPO and Table 3 shows the hy-perparameters that differ from these defaults for the hu-manoid Shadow Hand and Sawyer experiments For alltasks that we train policies with MO-MPO for a separatecritic is trained for each objective with a shared first layerThe only exception is the action norm penalty in humanoidrun which is computed exactly from the action We foundthat for both policy and critic networks layer normalizationof the first layer followed by a hyperbolic tangent (tanh) isimportant for the stability of the algorithm

In our experiments comparing the Pareto front found byMO-MPO versus scalarized MPO we ran MO-MPO with arange of εk settings and ran scalarized MPO with a range ofweight settings to obtain a set of policies for each algorithmThe settings for humanoid run and the Shadow Hand tasksare listed in Table 4 The settings for humanoid mocap andSawyer are specified in the main paper in Sec 63 and 64respectively

In the following subsections we first give suggestions forchoosing appropriate εk to encode a particular preference(Sec A1) then we describe the discrete MO-MPO usedfor Deep Sea Treasure (Sec A2) and finally we describeimplementation details for humanoid mocap (Sec A3)

A1 Suggestions for Setting ε

Our proposed algorithm MO-(V-)MPO requires practition-ers to translate a desired preference across objectives to nu-merical choices for εk for each objective k At first glancethis may seem daunting However in practice we havefound that encoding preferences via εk is often more in-tuitive than doing so via scalarization In this subsectionwe seek to give an intuition on how to set εk for differentdesired preferences Recall that each εk controls the influ-ence of objective k on the policy update by constraining theKL-divergence between each objective-specific distribution

Hyperparameter Default

policy networklayer sizes (300 200)take tanh of action mean yesminimum variance 10minus12

maximum variance unbounded

critic network(s)layer sizes (400 400 300)take tanh of action yesRetrace sequence size 8discount factor γ 099

both policy and critic networkslayer norm on first layer yestanh on output of layer norm yesactivation (after each hidden layer) ELU

MPOactions sampled per state 20default ε 01KL-constraint on policy mean βmicro 10minus3

KL-constraint on policy covariance βΣ 10minus5

initial temperature η 1

trainingbatch size 512replay buffer size 106

Adam learning rate 3times10minus4

Adam ε 10minus3

target network update period 200

Table 2 Default hyperparameters for MO-MPO and scalarizedMPO with decoupled update on mean and covariance (Sec C3)

and the current policy (Sec 41) We generally choose εk inthe range of 0001 to 01

Equal Preference When all objectives are equally impor-tant the general rule is to set all εk to the same value Wedid this in our orient and humanoid mocap tasks where ob-jectives are aligned or mostly aligned In contrast it can betricky to choose appropriate weights in linear scalarizationto encode equal preferencesmdashsetting all weights equal to1K (where K is the number of objectives) is only appro-priate if the objectivesrsquo rewards are of similar scales Weexplored this in Sec 51 and Fig 3

When setting all εk to the same value what should thisvalue be The larger εk is the more influence the objectiveswill have on the policy update step Since the per-objectivecritics are learned in parallel with the policy setting εktoo high tends to destabilize learning because early on intraining when the critics produce unreliable Q-values theirinfluence on the policy will lead it in the wrong directionOn the other hand if εk is set too low then it slows downlearning because the per-objective action distribution isonly allowed to deviate by a tiny amount from the currentpolicy and the updated policy is obtained via supervisedlearning on the combination of these action distributions


Humanoid run

policy networklayer sizes (400 400 300)take tanh of action mean nominimum variance 10minus8

critic network(s)layer sizes (500 500 400)take tanh of action no

MPOKL-constraint on policy mean βmicro 5times10minus3


trainingAdam learning rate 2times10minus4

Adam ε 10minus8

Shadow Hand

MPOactions sampled per state 30default ε 001

Sawyer

MPOactions sampled per state 15

trainingtarget network update period 100

Table 3 Hyperparameters for humanoid Shadow Hand andSawyer experiments that differ from the defaults in Table 2

Eventually the learning will converge to more or less thesame policy though as long as εk is not set too high

Unequal Preference When there is a difference in pref-erences across objectives the relative scale of εk is whatmatters The more the relative scale of εk is compared to εlthe more influence objective k has over the policy updatecompared to objective l And in the extreme case whenεl is near-zero for objective l then objective l will haveno influence on the policy update and will effectively beignored We explored this briefly in Sec 51 and Fig 4

One common example of unequal preferences is when wewould like an agent to complete a task while minimiz-ing other objectivesmdasheg an action norm penalty ldquopainrdquopenalty or wrist force-torque penalty in our experimentsIn this case the ε for the task objective should be higherthan that for the other objectives to incentivize the agent toprioritize actually doing the task If the ε for the penaltiesis too high then the agent will care more about minimiz-ing the penalty (which can typically be achieved by simplytaking no actions) rather than doing the task which is notparticularly useful

The scale of εk has a similar effect as in the equal preference

Condition Settings

Humanoid run (one seed per setting)

scalarized MPO wtask = 1minus wpenaltywpenalty isin linspace(0 015 100)

MO-MPO εtask = 01εpenalty isin linspace(10minus6 015 100)

Humanoid run normal vs scaled (three seeds per setting)

scalarized MPO wtask = 1minus wpenaltywpenalty isin 001 005 01

MO-MPO εtask = 01εpenalty isin 001 005 01

Shadow Hand touch and turn (three seeds per setting)


MO-MPO εtask = 001εpenalty isin linspace(0001 0015 15)

Shadow Hand orient (ten seeds per setting)

scalarized MPO wtouch = wheight = worientation = 13MO-MPO εtouch = εheight = εorientation = 001

Table 4 Settings for εk and weights (linspace(x y z) denotes aset of z evenly-spaced values between x and y)

case If the scale of εk is too high or too low then the sameissues arise as discussed for equal preferences If all εkincrease or decrease in scale by the same (moderate) factorand thus their relative scales remain the same then typicallythey will converge to more or less the same policy Thiscan be seen in Fig 5 (right) in the Deep Sea Treasuredomain regardless of whether εtime is 001 002 or 005the relationship between the relative scale of εtreasure andεtime to the treasure that the policy converges to selecting isessentially the same

A2 Deep Sea Treasure

In order to handle discrete actions we make several mi-nor adjustments to scalarized MPO and MO-MPO Thepolicy returns a categorical distribution rather than a Gaus-sian The policy is parametrized by a neural network whichoutputs a vector of logits (ie unnormalized log probabil-ities) of size |A| The KL constraint on the change of thispolicy β is 0001 The input to the critic network is thestate concatenated with a four-dimensional one-hot vectordenoting which action is chosen (eg the up action corre-sponds to [1 0 0 0]gt) Critics are trained with one-steptemporal-difference error with a discount of 0999 Otherthan these changes the network architectures and the MPOand training hyperparameters are the same as in Table 2


Humanoid Mocap

Scalarized VMPOKL-constraint on policy mean βmicro 01KL-constraint on policy covariance βΣ 10minus5

default ε 01initial temperature η 1

TrainingAdam learning rate 10minus4

batch size 128unroll length (for n-step return Sec D1) 32

Table 5 Hyperparameters for the humanoid mocap experiments

A3 Humanoid Mocap

For the humanoid mocap experiments we used the follow-ing architecture for both MO-VMPO and VMPO for thepolicy we process a concatentation of the mocap referenceobservations and the proprioceptive observations by a twolayer MLP with 1024 hidden units per layer This referenceencoder is followed by linear layers to produce the meanand log standard deviation of a stochastic latent variableThese latent variables are then concatenated with the propri-oceptive observations and processed by another two layerMLP with 1024 hidden units per layer to produce the actiondistribution For VMPO we use a three layer MLP with1024 hidden units per layer as the critic For MO-VMPOwe use a shared two layer MLP with 1024 hidden units perlayer followed by a one layer MLP with 1024 hidden unitsper layer per objective In both cases we use k-step returnsto train the critic with a discount factor of 095 Table 5shows additional hyperparameters used in our experiments

B Experimental DomainsWe evaluated our approach on one discrete domain (DeepSea Treasure) three simulated continuous control domains(humanoid Shadow Hand and humanoid mocap) and onereal-world continuous control domain (Sawyer robot) Herewe provide more detail about these domains and the objec-tives used in each task

B1 Deep Sea Treasure

Deep Sea Treasure (DST) is a 11times10 grid-world domainthe state space S consists of the x and y position of the agentand the action spaceA is up right down left Thelayout of the environment and values of the treasures areshown in Fig 8 If the action would cause the agent tocollide with the sea floor or go out of bounds it has noeffect Farther-away treasures have higher values Theepisode terminates when the agent collects a treasure orafter 200 timesteps

237

224

203196

161151140

115

82

07

Figure 8 Deep Sea Treasure environment from Vamplew et al(2011) with weights from Yang et al (2019) Treasures are labeledwith their respective values The agent can move around freelyin the white squares but cannot enter the black squares (ie theocean floor)

There are two objectives time penalty and treasure valueA time penalty of minus1 is given at each time step The agentreceives the value of the treasure it picks up as the rewardfor the treasure objective In other words when the agentpicks up a treasure of value v the reward vector is [minus1 v]otherwise it is [minus1 0]

B2 Shadow Hand

Our robot platform is a simulated Shadow Dexterous Hand(Shadow Robot Company 2020) in the MuJoCo physicsengine (Todorov et al 2012) The Shadow Hand has fivefingers and 24 degrees of freedom actuated by 20 motorsThe observation consists of joint angles joint velocities andtouch sensors for a total of 63 dimensions Each fingertipof our Shadow Hand has a 4times4 spatial touch sensor Thissensor has three channels measuring the normal force andthe x and y-axis tangential forces for a sensor dimension of4times4times3 per fingertip We simplify this sensor by summingacross spatial dimensions resulting in a 1times1times3 observationper fingertip

In the touch task there is a block in the environment that isalways fixed in the same pose In the turn task there is a dialin the environment that can be rotated and it is initialized toa random position between minus30 and 30 The target angleof the dial is 0 The angle of the dial is included in theagentrsquos observation In the orient task the robot interactswith a rectangular peg in the environment the initial andtarget pose of the peg remains the same across episodesThe pose of the block is included in the agentrsquos observationencoded as the xyz positions of four corners of the block(based on how Levine et al (2016) encodes end-effectorpose)


Figure 9 These images show what task completion looks like forthe touch turn and orient Shadow Hand tasks (from left to right)

0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty

Figure 10 In the touch and turn Shadow Hand tasks the painpenalty is computed based on the impact force as plotted here Forlow impact forces the pain penalty is near-zero For high impactforces the pain penalty is equal to the negative of the impact force

B21 BALANCING TASK COMPLETION AND PAIN

In the touch and turn tasks there are two objectives ldquopainrdquopenalty and task completion A sparse task completion re-ward of 1 is given for pressing a block with greater than 5Nof force or turning a dial to a fixed target angle respectivelyIn both tasks the episode terminates when the agent com-pletes the task ie the agent gets a total reward of either0 or 1 for the task completion objective per episode ThePareto plots for these two tasks in the main paper (Fig 6)show the discounted task reward (with a discount factor ofγ = 099) to capture how long it takes agents to completethe task

The ldquopainrdquo penalty is computed as in (Huang et al 2019) Itis based on the impact force m(s sprime) which is the increasein force from state s to the next state sprime In our tasks this ismeasured by a touch sensor on the block or dial The painpenalty is equal to the negative of the impact force scaledby how unacceptable it is

rpain(s a sprime) = minus[1minus aλ(m(s sprime))

]m(s sprime) (6)

where aλ(middot) isin [0 1] computes the acceptability of an impactforce aλ(middot) should be a monotonically decreasing functionthat captures how resilient the robot and the environmentare to impacts As in (Huang et al 2019) we use

aλ(m) = sigmoid(λ1(minusm+ λ2)) (7)

with λ = [2 2]gt The relationship between pain penaltyand impact force is plotted in Fig 10

675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)

orientation objective

Figure 11 In the orient Shadow Hand task the height and orienta-tion objectives have shaped rewards as plotted here

B22 ALIGNED OBJECTIVES

In the orient task there are three non-competing objectivestouching the peg lifting the peg to a target height and ori-enting the peg to be perpendicular to the ground The targetz position and orientation is shown by the gray transparentpeg (Fig 9 right-most) note that the x and y position isnot specified so the robot in the figure gets the maximumreward with respect to all three objectives

The touch objective has a sparse reward of 1 for activat-ing the pegrsquos touch sensor and zero otherwise For theheight objective the target z position is 7cm above theground the pegrsquos z position is computed with respect toits center of mass The shaped reward for this objectiveis 1 minus tanh(50|ztarget minus zpeg|) For the orientation objec-tive since the peg is symmetrical there are eight possibleorientations that are equally valid The acceptable targetorientationsQtarget and the pegrsquos orientation qpeg are denotedas quaternions The shaped reward is computed with respectto the closest target quaternion as

minqisinQtarget

1minus tanh(2lowastd(q qpeg))

where d(middot) denotes the `2-norm of the axis-angle equivalent(in radians) for the distance between the two quaternionsFig 11 shows the shaping of the height and orientationrewards

B3 Humanoid

We make use of the humanoid run task from Tassa et al(2018)9 The observation dimension is 67 and the actiondimension is 21 Actions are joint accelerations with min-imum and maximum limits of minus1 and 1 respectively Forthis task there are two objectives

bull The original task reward given by the environment Thegoal is is to achieve a horizontal speed of 10 meters persecond in any direction This reward is shaped it isequal to min(h10 1) where h is in meters per secondFor this objective we learn a Q-function

9This task is available at githubcomdeepminddm control


bull Limiting energy usage by penalizing taking high-magnitude actions The penalty is the `2-norm of theaction vector ie rpenalty(a) = minusa2 For this ob-jective we do not learn a Q-function Instead wecompute this penalty directly to evaluate a given actionin a state-independent way during policy optimization

B4 Humanoid Mocap

The motion capture tracking task used in this paper wasdeveloped as part of a concurrent submission (Anonymous2020) and does not constitute a contribution of this paperWe use a simulated humanoid adapted from the ldquoCMUhumanoidrdquo available at dm controllocomotion (Merel et al2019) We adjusted various body and actuator parametersto be more comparable to that of an average human Thishumanoid has 56 degrees of freedom and the observation is1021-dimensional

This motion capture tracking task is broadly similar to previ-ous work on motion capture tracking (Chentanez et al 2018Peng et al 2018 Merel et al 2019) The task is associatedwith an underlying set of motion capture clips For thispaper we used roughly 40 minutes of locomotion motioncapture clips from the CMU motion capture database10

At the beginning of a new episode we select a frame uni-formly at random from all frames in the underlying mocapdata (excluding the last 10 frames of each clip) The simu-lated humanoid is then initialized to the pose of the selectedframe

The observations in this environment are various propriocep-tive observations about the current state of the body as wellas the relative target positions and orientations of 13 differ-ent body parts in the humanoidrsquos local frame We providethe agent with the target poses for the next 5 timesteps

The choice of reward function is crucial for motion capturetracking tasks Here we consider five different reward com-ponents each capturing a different aspect of the similarityof the pose between the simulated humanoid and the mocaptarget Four of our reward components were initially pro-posed by Peng et al (2018) The first reward component isbased on the difference in the center of mass

rcom = exp(minus10pcom minus pref

com2)

where pcom and prefcom are the positions of the center of mass

of the simulated humanoid and the mocap reference re-spectively The second reward component is based on thedifference in joint angle velocities

rvel = exp(minus01qvel minus qref

vel2)

where qvel and qrefvel are the joint angle velocities of the simu-

lated humanoid and the mocap reference respectively The

10This database is available at mocapcscmuedu

third reward component is based on the difference in end-effector positions

rapp = exp(minus40papp minus pref

app2)

where papp and prefapp are the end effector positions of the

simulated humanoid and the mocap reference respectivelyThe fourth reward component is based on the difference injoint orientations

rquat = exp(minus2qquat qref

quat2)

where denotes quaternion differences and qquat and qrefquat

are the joint quaternions of the simulated humanoid and themocap reference respectively

Finally the last reward component is based on difference inthe joint angles and the Euclidean positions of a set of 13body parts

rtrunc = 1minus 1

03(

=ε︷︸︸︷bpos minus bref

pos1 + qpos minus qrefpos1)

where bpos and brefpos correspond to the body positions of the

simulated character and the mocap reference and qpos andqref

pos correspond to the joint angles

We include an early termination condition in our task that islinked to this last reward term We terminate the episode ifε gt 03 which ensures that rtrunc isin [0 1]

In our MO-VMPO experiments we treat each reward com-ponent as a separate objective In our VMPO experimentswe use a reward of the form

r =1

2rtrunc +

1

2(01rcom + λrvel + 015rapp + 065rquat)

(8)

varying λ as described in the main paper (Sec 63)

B5 Sawyer

The Sawyer peg-in-hole setup consists of a Rethink RoboticsSawyer robot arm The robot has 7 joints driven by serieselastic actuators On the wrist a Robotiq FT 300 force-torque sensor is mounted followed by a Robotiq 2F85 par-allel gripper We implement a 3D end-effector Cartesianvelocity controller that maintains a fixed orientation of thegripper The agent controls the Cartesian velocity setpointsThe gripper is not actuated in this setup The observationsprovided are the joint positions velocities and torques theend-effector pose the wrist force-torque measurements andthe joint velocity command output of the Cartesian con-troller We augment each observation with the two previousobservations as a history

In each episode the peg is initalized randomly within an8times8times8 cm workspace (with the peg outside of the hole)


The environment is then run at 20 Hz for 600 steps Theaction limits are plusmn005 ms on each axis If the magnitudeof the wrist force-torque measurements exceeds 15 N onany horizontal axis (x and y axis) or 5 N on the vertical axis(z axis) the episode is terminated early

There are two objectives the task itself (inserting the peg)and minimizing wrist force measurements

The reward for the task objective is defined as

rinsertion = max(

02rapproach rinserted

)(9)

rapproach = s(ppeg popening) (10)

rinserted = raligned s(ppegz pbottom

z ) (11)

raligned = ppegxy minus popening

xy lt d2 (12)

where ppeg is the peg position popening the position of thehole opening pbottom the bottom position of the hole and dthe diameter of the hole

s(p1 p2) is a shaping operator

s(p1 p2) = 1minus tanh2(atanh(

radicl)

εp1 minus p22) (13)

which gives a reward of 1minus l if p1 and p2 are at a Euclideandistance of ε Here we chose lapproach = 095 εapproach =001 linserted = 05 and εinserted = 004 Intentionally 004corresponds to the length of the peg such that rinserted = 05if ppeg = popening As a result if the peg is in the approachposition rinserted is dominant over rapproach in rinsertion

Intuitively there is a shaped reward for aligning the pegwith a magnitude of 02 After accurate alignment withina horizontal tolerance the overall reward is dominated bya vertical shaped reward component for insertion The hor-izontal tolerance threshold ensures that the agent is notencouraged to try to insert the peg from the side

The reward for the secondary objective minimizing wristforce measurements is defined as

rforce = minusF1 (14)

where F = (Fx Fy Fz) are the forces measured by thewrist sensor

C Algorithmic DetailsC1 General Algorithm Description

We maintain one online network and one target network foreach Q-function associated to objective k with parametersdenoted by φk and φprimek respectively We also maintain oneonline network and one target network for the policy withparameters denoted by θ and θprime respectively Target net-works are updated every fixed number of steps by copying

parameters from the online network Online networks areupdated using gradient descent in each learning iterationNote that in the main paper we refer to the target policynetwork as the old policy πold

We use an asynchronous actor-learner setup in which ac-tors regularly fetch policy parameters from the learner andact in the environment writing these transitions to the re-play buffer We refer to this policy as the behavior policythroughout the paper The learner uses the transitions inthe replay buffer to update the (online) Q-functions and thepolicy Please see Algorithms 2 and 3 for more details onthe actor and learner processes respectively

Algorithm 2 MO-MPO Asynchronous Actor

1 given (N) reward functions rk(s a)Nk=1 T steps perepisode

2 repeat3 Fetch current policy parameters θ from learner4 Collect trajectory from environment5 τ = 6 for t = 0 T do7 at sim πθ(middot|st)8 Execute action and determine rewards9 r = [r1(st at) rN (st at)]

10 τ larr τ cup (st at r πθ(at|st))11 end for12 Send trajectory τ to replay buffer13 until end of training

C2 Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec 41)is to learn the Q-function Qφk

(s a) for each objective kparametrized by φk These Q-functions are with respectto the policy πold We emphasize that it is valid to use anyoff-the-shelf Q-learning algorithm such as TD(0) (Sutton1988) to learn these Q-functions We choose to use Retrace(Munos et al 2016) described here

Given a replay buffer D containing trajectory snippetsτ = (s0 a0 r0 s1) (sT aT rT sT+1) where rtdenotes a reward vector rk(st at)Nk=1 that consists of ascalar reward for each ofN objectives the Retrace objectiveis as follows

minφk

EτsimD[(Qretk (st at)minus Qφk

(st at))2] (15)

with

Qretk (st at) = Qφprime

k(st at) +

Tsumj=t

γjminust( jprodz=t+1

cz

)δj

δj = rk(sj aj) + γV (sj+1)minus Qφprimek(sj aj)

V (sj+1) = Eπold(a|sj+1)[Qφprimek(sj+1 a)]


Algorithm 3 MO-MPO Asynchronous Learner1 given batch size (L) number of actions (M) (N) Q-functions

(N) preferences εkNk=1 policy networks and replay bufferD

2 initialize Lagrangians ηkNk=1 and ν target networks andonline networks such that πθprime = πθ and Qφprime

k= Qφk

3 repeat4 repeat5 Collect dataset si aij Qijk

LMNijk where

6 si sim D aij sim πθprime(a|si) and Qijk = Qφkprime(si aij)



1Llog

(sumMj

1M

exp(Q

ijkηk

))11 Update ηk based on δηk12 qijk prop exp(

Qijkηk

)

13 end for1415 Update parametric policy with trust region16 sg denotes a stop-gradient17 δπ larr minusnablaθ

sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus

sumLi KL(πθprime(a|si) πθ(a|si)) ds

)19 δν larr nablaνν

(βminussumLi KL(πθprime(a|si) πsg(θ)(a|si)) ds

)20 Update πθ based on δπ21 Update ν based on δν2223 Update Q-functions24 for k = 1 N do25 δφk larr nablaφk

sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret

k as in Eq 1527 Update φk based on δφk

28 end for2930 until fixed number of steps31 Update target networks32 πθprime = πθ Qφk

prime = Qφk

33 until convergence

The importance weights ck are defined as

cz = min

(1πold(az|sz)b(az|sz)

)

where b(az|sz) denotes the behavior policy used to col-lect trajectories in the environment When j = t we set(prodj

z=t+1 cz

)= 1

We also make use of a target network for each Q-function(Mnih et al 2015) parametrized by φprimek which we copyfrom the online network φk after a fixed number of gradientsteps on the Retrace objective (15)

C3 Policy Fitting

Recall that fitting the policy in the policy improvementstage (Sec 422) requires solving the following constrained

optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s) log πθ(a|s) da ds

stints


We first write the generalized Lagrangian equation ie

L(θ ν) =

Nsumk=1

ints

micro(s)

inta

qk(a|s) log πθ(a|s) da ds (17)

+ ν(β minus

ints

micro(s) KL(πold(a|s) πθ(a|s)) ds)

where ν is the Lagrangian multiplier Now we solve thefollowing primal problem

maxθ

minνgt0

L(θ ν)

To obtain the parameters θ for the updated policy we solvefor ν and θ by alternating between 1) fixing ν to its currentvalue and optimizing for θ and 2) fixing θ to its currentvalue and optimizing for ν This can be applied for anypolicy output distribution

For Gaussian policies we decouple the update rules to op-timize for mean and covariance independently as in (Ab-dolmaleki et al 2018b) This allows for setting differentKL bounds for the mean (βmicro) and covariance (βΣ) whichresults in more stable learning To do this we first separateout the following two policies for mean and covariance

πmicroθ (a|s) = N(amicroθ(s)Σθold

(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)

Policy πmicroθ (a|s) takes the mean from the online policy net-work and covariance from the target policy network andpolicy πΣ

θ (a|s) takes the mean from the target policy net-work and covariance from online policy network Now ouroptimization problem has the following form

maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(

log πmicroθ (a|s)πΣθ (a|s)

)da ds

stints

micro(s) KL(πold(a|s) πmicroθ (a|s)) ds lt βmicroints

micro(s) KL(πold(a|s) πΣθ (a|s)) ds lt βΣ (20)

As in (Abdolmaleki et al 2018b) we set a much smallerbound for covariance than for mean to keep the explorationand avoid premature convergence We can solve this op-timization problem using the same general procedure asdescribed above


C4 Derivation of Dual Function for η

Recall that obtaining the per-objective improved action dis-tributions qk(a|s) in the policy improvement stage (Sec421) requires solving a convex dual function for the tem-perature ηk for each objective For the derivation of thisdual function please refer to Appendix D2 of the originalMPO paper (Abdolmaleki et al 2018b)

D MO-V-MPO Multi-Objective On-PolicyMPO

In this section we describe how MO-MPO can be adaptedto the on-policy case in which a state-value function V (s)is learned and used to estimate the advantages for eachobjective We call this approach MO-V-MPO

D1 Multi-objective policy evaluation in the on-policysetting

In the on-policy setting to evaluate the previous policyπold we use advantages A(s a) estimated from a learnedstate-value function V (s) instead of a state-action valuefunction Q(s a) as in the main text We train a separateV -function for each objective by regressing to the stan-dard n-step return (Sutton 1988) associated with each ob-jective More concretely given trajectory snippets τ =(s0 a0 r0) (sT aT rT ) where rt denotes a rewardvector rk(st at)Nk=1 that consists of rewards for all N ob-jectives we find value function parameters φk by optimizingthe following objective

minφ1k

Nsumk=1

Eτ[(G

(T )k (st at)minus V πold

φk(st))

2] (21)

Here G(T )k (st at) is the T -step target for value function

k which uses the actual rewards in the trajectory andbootstraps from the current value function for the restG

(T )k (st at) =

sumTminus1`=t γ

`minustrk(s` a`) + γTminustV πoldφk

(s`+T )The advantages are then estimated as Aπold

k (st at) =

G(T )k (st at)minus V πold

φk(st)

D2 Multi-objective policy improvement in theon-policy setting

Given the previous policy πold(a|s) and estimated advan-tages Aπold

k (s a)k=1N associated with this policy foreach objective our goal is to improve the previous policyTo this end we first learn an improved variational distribu-tion qk(s a) for each objective and then combine and distillthe variational distributions into a new parametric policyπnew(a|s) Unlike in MO-MPO for MO-V-MPO we use thejoint distribution qk(s a) rather than local policies qk(a|s)because without a learned Q-function only one action perstate is available for learning This is a multi-objective vari-

ant of the two-step policy improvement procedure employedby V-MPO (Song et al 2020)

D21 OBTAINING IMPROVED VARIATIONALDISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributionsqk(s a) we optimize the RL objective for each objectiveAk

maxqk

intsa

qk(s a)Ak(s a) dads (22)

st KL(qk(s a)pold(s a)) lt εk

where the KL-divergence is computed over all (s a) εk de-notes the allowed expected KL divergence and pold(s a) =micro(s)πold(a|s) is the state-action distribution associated withπold

As in MO-MPO we use these εk to define the preferencesover objectives More concretely εk defines the allowed con-tribution of objective k to the change of the policy There-fore the larger a particular εk is with respect to others themore that objective k is preferred On the other hand ifεk = 0 is zero then objective k will have no contribution tothe change of the policy and will effectively be ignored

Equation (22) can be solved in closed form

qk(s a) prop pold(s a) exp(Ak(s a)

ηk

) (23)

where the temperature ηk is computed based on the con-straint εk by solving the following convex dual problem

ηk = argminηk

[ηk εk + (24)

ηk log

intsa

pold(s a) exp(Ak(s a)

η

)da ds

]

We can perform the optimization along with the loss bytaking a gradient descent step on ηk and we initialize withthe solution found in the previous policy iteration step Sinceηk must be positive we use a projection operator after eachgradient step to maintain ηk gt 0

Top-k advantages As in Song et al (2020) in practicewe used the samples corresponding to the top 50 of ad-vantages in each batch of data

D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distribu-tions obtained in the previous step into a single parametricpolicy πnew(a|s) that favors all of the objectives accordingto the preferences specified by εk For this we solve a su-pervised learning problem that fits a parametric policy asfollows


πnew = argmaxθ

Nsumk=1

intsa

qk(s a) log πθ(a|s) da ds

stints

KL(πold(a|s) πθ(a|s)) ds lt β (25)

where θ are the parameters of our function approximator(a neural network) which we initialize from the weights ofthe previous policy πold and the KL constraint enforces atrust region of size β that limits the overall change in theparametric policy to improve stability of learning As inMPO the KL constraint in this step has a regularizationeffect that prevents the policy from overfitting to the localpolicies and therefore avoids premature convergence

In order to optimize Equation (25) we employ Lagrangianrelaxation similar to the one employed for ordinary V-MPO (Song et al 2020)

E Multi-Objective Policy Improvement asInference

In the main paper we motivated the multi-objective policyupdate rules from an intuitive perspective In this section weshow that our multi-objective policy improvement algorithmcan also be derived from the RL-as-inference perspective Inthe policy improvement stage we assume that a Q-functionfor each objective is given and we would like to improveour policy with respect to these Q-functions The deriva-tion here extends the derivation for the policy improvementalgorithm in (single-objective) MPO in Abdolmaleki et al(2018a) (in appendix) to the multi-objective case

We assume there are observable binary improvement eventsRkNk=1 for each objective Rk = 1 indicates that ourpolicy has improved for objective k whileRk = 0 indicatesthat it has not Now we ask if the policy has improved withrespect to all objectives ie Rk = 1Nk=1 what would theparameters θ of that improved policy be More concretelythis is equivalent to the maximum a posteriori estimate forRk = 1Nk=1

maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)

where the probability of improvement events depends on θAssuming independence between improvement events Rkand using log probabilities leads to the following objective

maxθ

Nsumk=1

log pθ(Rk = 1) + log p(θ) (27)

The prior distribution over parameters p(θ) is fixed duringthe policy improvement step We set the prior such that πθ

stays close to the target policy during each policy improve-ment step to improve stability of learning (Sec E2)

We use the standard expectation-maximization (EM) algo-rithm to efficiently maximize

sumNk=1 log pθ(Rk = 1) The

EM algorithm repeatedly constructs a tight lower bound inthe E-step given the previous estimate of θ (which corre-sponds to the target policy in our case) and then optimizesthat lower bound in the M-step We introduce a variationaldistribution qk(s a) per objective and use this to decom-pose the log-evidence

sumNk=1 log pθ(Rk = 1) as follows

Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1

KL(qk(s a) pθ(s a|Rk = 1))minus

Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a))

The second term of this decomposition is expanded as

Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log

p(Rk = 1|s a)πθ(a|s)micro(s)

qk(s a)

]

where micro(s) is the stationary state distribution which is as-sumed to be given in each policy improvement step Inpractice micro(s) is approximated by the distribution of thestates in the replay buffer p(Rk = 1|s a) is the likelihoodof the improvement event for objective k if our policy choseaction a in state s

The first term of the decomposition in Equation (28) isalways positive so the second term is a lower bound onthe log-evidence

sumNk=1 log pθ(Rk = 1) πθ(a|s) and

qk(a|s) = qk(sa)micro(s) are unknown even though micro(s) is given

In the E-step we estimate qk(a|s) for each objective k byminimizing the first term given θ = θprime (the parameters ofthe target policy) Then in the M-step we find a new θ byfitting the parametric policy to the distributions from firststep

E1 E-Step

The E-step corresponds to the first step of policy improve-ment in the main paper (Sec 421) In the E-step wechoose the variational distributions qk(a|s)Nk=1 such thatthe lower bound on

sumNk=1 log pθ(Rk = 1) is as tight as


possible when θ = θprime the parameters of the target policyThe lower bound is tight when the first term of the decompo-sition in Equation (28) is zero so we choose qk to minimizethis term We can optimize for each variational distributionqk independently

qk(a|s) = argminq

Emicro(s)

[KL(q(a|s) pθprime(s a|Rk = 1))

]= argmin

qEmicro(s)

[KL(q(a|s) πθprime(a|s))minus

Eq(a|s) log p(Rk = 1|s a))]

(30)

We can solve this optimization problem in closed formwhich gives us

qk(a|s) =πθprime(a|s) p(Rk = 1|s a)intπθprime(a|s) p(Rk = 1|s a) da

This solution weighs the actions based on their relativeimprovement likelihood p(Rk = 1|s a) for each objectiveWe define the likelihood of an improvement event Rk as

p(Rk = 1|s a) prop exp(Qk(s a)

αk

)

where αk is an objective-dependent temperature parameterthat controls how greedy the solution qk(a|s) is with respectto its associated objective Qk(s a) For example given abatch of state-action pairs at the extremes an αk of zerowould give all probability to the state-action pair with themaximum Q-value while an αk of positive infinity wouldgive the same probability to all state-action pairs In orderto automatically optimize for αk we plug the exponentialtransformation into (30) which leads to

qk(a|s) = argmaxq

intmicro(s)

intq(a|s)Qk(s a) dadsminus

αk

intmicro(s) KL(q(a|s) πθprime(a|s)) ds (31)

If we instead enforce the bound on KL divergence as a hardconstraint (which we do in practice) that leads to

qk(a|s) = argmaxq

intmicro(s)

intq(a|s)Qk(s a) da ds (32)

stintmicro(s) KL(q(a|s) πθprime(a|s)) ds lt εk

where as described in the main paper the parameter εkdefines preferences over objectives If we now assume thatqk(a|s) is a non-parametric sample-based distribution as in(Abdolmaleki et al 2018b) we can solve this constrained

optimization problem in closed form for each sampled states by setting

q(a|s) prop πθprime(a|s) exp(Qk(s a)

ηk

) (33)

where ηk is computed by solving the following convex dualfunction

ηk = argminη

ηεk+ (34)

η

intmicro(s) log

intπθprime(a|s) exp

(Qk(s a)

η

)da ds

Equations (32) (33) and (34) are equivalent to those givenfor the first step of multi-objective policy improvement inthe main paper (Sec 421) Please see refer to AppendixD2 in (Abdolmaleki et al 2018b) for details on derivationof the dual function

E2 M-Step

After obtaining the variational distributions qk(a|s) foreach objective k we have found a tight lower bound forsumNk=1 log pθ(Rk = 1) when θ = θprime We can now obtain

the parameters θ of the updated policy πθ(a|s) by optimiz-ing this lower bound After rearranging and dropping theterms in (27) that do not involve θ we obtain

θlowast = argmaxθ

Nsumk=1

intmicro(s)

intqk(a|s) log πθ(a|s) da ds

+ log p(θ)

This corresponds to the supervised learning step of policyimprovement in the main paper (Sec 422)

For the prior p(θ) on the parameters θ we enforce that thenew policy πθ(a|s) is close to the target policy πθprime(a|s)We therefore optimize for

θlowast = argmaxθ

Nsumk=1

intmicro(s)

intqk(a|s) log πθ(a|s) dads

stintmicro(s) KL(πθprime(a|s) πθ(a|s)) ds lt β

Because of the constraint on the change of the parametricpolicy we do not greedily optimize the M-step objectiveSee (Abdolmaleki et al 2018b) for more details


bution space thus making it invariant to the scale of rewardsIn principle our approach can be combined with any RLalgorithm regardless of whether it is off-policy or on-policyWe combine it with maximum a posteriori policy optimiza-tion (MPO) (Abdolmaleki et al 2018ab) an off-policyactor-critic RL algorithm and V-MPO (Song et al 2020)an on-policy variant of MPO We call these two algorithmsmulti-objective MPO (MO-MPO) and multi-objective V-MPO (MO-V-MPO) respectively

Our main contribution is providing a distributional view onMORL which enables scale-invariant encoding of prefer-ences We show that this is a theoretically-grounded ap-proach that arises from taking an RL-as-inference perspec-tive of MORL Empirically we analyze the mechanics ofMO-MPO and show it finds all Pareto-optimal policies in apopular MORL benchmark task Finally we demonstratethat MO-MPO and MO-V-MPO outperform scalarized ap-proaches on multi-objective tasks across several challenginghigh-dimensional continuous control domains (Fig 1)

2 Related Work21 Multi-Objective Reinforcement Learning

Multi-objective reinforcement learning (MORL) algorithmsare either single-policy or multiple-policy (Vamplew et al2011) Single-policy approaches seek to find the optimalpolicy for a given scalarization of the multi-objective prob-lem Often this scalarization is linear but other choices havealso been explored (Van Moffaert et al 2013)

However the scalarization may be unknown at training timeor it may change over time Multiple-policy approacheshandle this by finding a set of policies that approximatesthe true Pareto front Some approaches repeatedly call asingle-policy MORL algorithm with strategically-chosenscalarizations (Natarajan amp Tadepalli 2005 Roijers et al2014 Mossalam et al 2016 Zuluaga et al 2016) Otherapproaches learn a set of policies simultaneously by using amulti-objective variant of the Q-learning update rule (Barrettamp Narayanan 2008 Moffaert amp Nowe 2014 Reymond ampNowe 2019 Yang et al 2019) or by modifying gradient-based policy search (Parisi et al 2014 Pirotta et al 2015)

Most existing approaches for finding the Pareto front arelimited to discrete state and action spaces in which tabularalgorithms are sufficient Although recent work combiningMORL with deep RL handles high-dimensional observa-tions this is in domains with low-dimensional and usuallydiscrete action spaces (Mossalam et al 2016 van Seijenet al 2017 Friedman amp Fontaine 2018 Abels et al 2019Reymond amp Nowe 2019 Yang et al 2019 Nottinghamet al 2019) In contrast we evaluate our approach on con-

tinuous control tasks with more than 20 action dimensions1

A couple of recent works have applied deep MORL to findthe Pareto front in continuous control tasks these worksassume scalarization and rely on additionally learning ei-ther a meta-policy (Chen et al 2019) or inter-objectiverelationships (Zhan amp Cao 2019) We take an orthogonalapproach to existing approaches one encodes preferencesvia constraints on the influence of each objective on thepolicy update instead of via scalarization MO-MPO canbe run multiple times with different constraint settings tofind a Pareto front of policies

22 Constrained Reinforcement Learning

An alternate way of setting preferences is to enforce thatpolicies meet certain constraints For instance thresholdlexicographic ordering approaches optimize a (single) ob-jective while meeting specified threshold values on the otherobjectives (Gabor et al 1998) optionally with slack (Wrayet al 2015) Similarly safe RL is concerned with learningpolicies that optimize a scalar reward while not violatingsafety constraints (Achiam et al 2017 Chow et al 2018)this has also been studied in the off-policy batch RL setting(Le et al 2019) Related work minimizes costs while ensur-ing the policy meets a constraint on the minimum expectedreturn (Bohez et al 2019) but this requires that the desiredor achievable reward is known a priori In contrast MO-MPO does not require knowledge of the scale of rewardsIn fact often there is no easy way to specify constraints onobjectives eg it is difficult to figure out a priori how muchactuator effort a robot will need to use to perform a task

23 Multi-Task Reinforcement Learning

Multi-task reinforcement learning can also be cast as aMORL problem Generally these algorithms learn a separatepolicy for each task with shared learning across tasks (Tehet al 2017 Riedmiller et al 2018 Wulfmeier et al 2019)In particular Distral (Teh et al 2017) learns a shared priorthat regularizes the per-task policies to be similar to eachother and thus captures essential structure that is sharedacross tasks MO-MPO differs in that the goal is to learn asingle policy that must trade off across different objectives

Other multi-task RL algorithms seek to train a single agentto solve different tasks and thus need to handle differentreward scales across tasks Prior work uses adaptive normal-ization for the targets in value-based RL so that the agent

1MORL for continuous control tasks is difficult because thepolicy can no longer output arbitrary action distributions whichlimits how well it can compromise between competing objectivesIn state-of-the-art RL algorithms for continuous control policiestypically output a single action (eg D4PG (Barth-Maron et al2018)) or a Gaussian (eg PPO (Schulman et al 2017) MPO(Abdolmaleki et al 2018b) and SAC (Haarnoja et al 2018))





suminfint=0 γ



]







LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a












minφkN1

Nsumk=1

E(sa)simD


k (s aφk))2]







πoldk (s a)] ge




θnew = argminθ

Nsumk=1

Emicro(s)


)] (1)




maxqk

ints

micro(s)

inta


stints






ηk

) (3)


ηk = argminη

η εk + (4)

η

ints

micro(s) log

inta


η

)dads








θnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints









51 Simple World














00

05


left up right00

05

10





00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)








suminfint=0 γ



]







LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a












minφkN1

Nsumk=1

E(sa)simD


k (s aφk))2]







πoldk (s a)] ge




θnew = argminθ

Nsumk=1

Emicro(s)


)] (1)




maxqk

ints

micro(s)

inta


stints






ηk

) (3)


ηk = argminη

η εk + (4)

η

ints

micro(s) log

inta


η

)dads








θnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints









51 Simple World














00

05


left up right00

05

10





00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)







minφkN1

Nsumk=1

E(sa)simD


k (s aφk))2]







πoldk (s a)] ge




θnew = argminθ

Nsumk=1

Emicro(s)


)] (1)




maxqk

ints

micro(s)

inta


stints






ηk

) (3)


ηk = argminη

η εk + (4)

η

ints

micro(s) log

inta


η

)dads








θnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints









51 Simple World














00

05


left up right00

05

10





00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)








θnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints









51 Simple World














00

05


left up right00

05

10





00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)












00

05


left up right00

05

10





00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10][w1 w2]

00 05 10p(left)

0

1p(

righ

t)scalarized

[01]

[55]

[10]

00 05 10p(left)

0

1

p(righ

t)

scalarized

[01]

[55]

[10][1205981 1205982]

00 05 10p(left)

0

1distributional

[001]

[0101]

[010]

















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)
















3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)





3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty


5 0pain penalty

00

05

task

rewar

d

touch mid-training

5 0pain penalty

touch

5 0pain penalty

turn

0 100z

0

150

angl

e

orient

touch mid-training

1x penalty

action norm penalty

pain penalty

3000 00

400

800

task

rewar

d

mid-training

3000 0

action norm penalty

end of training

3000 0

normal scale

30000 0

10x penalty

























0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)





0 200000 400000

200

400


task reward

0 200000 400000

minus12

00minus

800

wrist force penalty






























































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)















































































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)






















































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)































Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)






















Adam ε 10minus3







Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)





Humanoid run






Adam ε 10minus8

Shadow Hand


Sawyer








Condition Settings

















Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)





Humanoid Mocap






A3 Humanoid Mocap





237

224

203196

161151140

115

82

07



B2 Shadow Hand





0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)






0 2 4 6impact force

minus6

minus4

minus2

0

pain

pen

alty






]m(s sprime) (6)




675 700 725zpeg

0

1

rew

ard

height objective

0 2d(θ θpeg)






minqisinQtarget



B3 Humanoid






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)






B4 Humanoid Mocap







com2)




vel2)






app2)




quat2)




rtrunc = 1minus 1

03(








r =1

2rtrunc +

1


(8)


B5 Sawyer







rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)








rinsertion = max(


)(9)



z ) (11)


xy lt d2 (12)




radicl)



















minφk


(st at))2] (15)

with


k(st at) +

Tsumj=t


cz

)δj







k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)








k= Qφk


LMNijk where




1Llog

(sumMj

1M

exp(Q

ijkηk


Qijkηk

)


sumLi

sumMj

sumNk q

ijk log πθ(a

ij |si)18 +sg(ν)

(β minus





sum(stat)isinτsimD

(Qφk (st at)minusQ

retk

)226 with Qret



prime = Qφk



cz = min


)


z=t+1 cz

)= 1


C3 Policy Fitting


optimization

πnew = argmaxθ

Nsumk=1

ints

micro(s)

inta


stints



L(θ ν) =

Nsumk=1

ints

micro(s)

inta


+ ν(β minus

ints



maxθ

minνgt0

L(θ ν)




(s)) (18)

πΣθ (a|s) = N

(amicroθold

(s)Σθ(s)) (19)



maxθ

Nsumk=1

ints

micro(s)

inta

qk(a|s)(


)da ds

stints











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)











minφ1k

Nsumk=1

Eτ[(G


φk(st))

2] (21)



(T )k (st at) =

sumTminus1`=t γ



k (st at) =


φk(st)







maxqk

intsa







ηk

) (23)


ηk = argminηk

[ηk εk + (24)

ηk log

intsa


η

)da ds

]






πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)





πnew = argmaxθ

Nsumk=1

intsa


stints







maxθ

pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)


maxθ

Nsumk=1








Nsumk=1

log pθ(Rk = 1) (28)

=

Nsumk=1


Nsumk=1



Nsumk=1

KL(qk(s a) pθ(Rk = 1 s a)) (29)

=

Nsumk=1

Eqk(sa)

[log

pθ(Rk = 1 s a)

qk(s a)

]=

Nsumk=1

Eqk(sa)

[log


qk(s a)

]






E1 E-Step





qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)






qk(a|s) = argminq

Emicro(s)


]= argmin

qEmicro(s)



(30)





αk

)


qk(a|s) = argmaxq

intmicro(s)


αk



qk(a|s) = argmaxq

intmicro(s)






ηk

) (33)


ηk = argminη

ηεk+ (34)

η

intmicro(s) log


(Qk(s a)

η

)da ds


E2 M-Step



θlowast = argmaxθ

Nsumk=1

intmicro(s)


+ log p(θ)



θlowast = argmaxθ

Nsumk=1

intmicro(s)




Documents

Abstract arXiv:2005.07513v1 [cs.LG] 15 May 2020