A Distributional View on Multi-Objective Policy Optimization
Abbas Abdolmaleki 1 Sandy H Huang 1 Leonard Hasenclever 1 Michael Neunert 1 H Francis Song 1
Martina Zambelli 1 Murilo F Martins 1 Nicolas Heess 1 Raia Hadsell 1 Martin Riedmiller 1
Abstract
Many real-world problems require trading off multiple competing objectives. However, these objectives are often in different units and/or scales, which can make it challenging for practitioners to express numerical preferences over objectives in their native units. In this paper we propose a novel algorithm for multi-objective reinforcement learning that enables setting desired preferences for objectives in a scale-invariant way. We propose to learn an action distribution for each objective, and we use supervised learning to fit a parametric policy to a combination of these distributions. We demonstrate the effectiveness of our approach on challenging high-dimensional real and simulated robotics tasks, and show that setting different preferences in our framework allows us to trace out the space of nondominated solutions.
1 Introduction

Reinforcement learning (RL) algorithms do an excellent job at training policies to optimize a single scalar reward function. Recent advances in deep RL have made it possible to train policies that exceed human-level performance on Atari (Mnih et al., 2015) and Go (Silver et al., 2016), perform complex robotic manipulation tasks (Zeng et al., 2019), learn agile locomotion (Tan et al., 2018), and even obtain reward in unanticipated ways (Amodei et al., 2016).
However, many real-world tasks involve multiple, possibly competing, objectives. For instance, choosing a financial portfolio requires trading off between risk and return, controlling energy systems requires trading off performance and cost, and autonomous cars must trade off fuel costs, efficiency, and safety. Multi-objective reinforcement learning (MORL) algorithms aim to tackle such problems (Roijers et al., 2013; Liu et al., 2015). A common approach is scalarization: based on preferences across objectives, transform the multi-objective reward vector into a single scalar reward (e.g., by taking a convex combination), and then use standard RL to optimize this scalar reward.

*Equal contribution. 1DeepMind. Correspondence to: Abbas Abdolmaleki <aabdolmaleki@google.com>, Sandy H. Huang <shhuang@google.com>.

Figure 1. We demonstrate our approach in four complex continuous control domains, in simulation and in the real world. Videos are at https://sites.google.com/view/mo-mpo.
It is tricky, though, for practitioners to pick the appropriate scalarization for a desired preference across objectives, because often objectives are defined in different units and/or scales. For instance, suppose we want an agent to complete a task while minimizing energy usage and mechanical wear-and-tear. Task completion may correspond to a sparse reward, or to the number of square feet a vacuuming robot has cleaned; reducing energy usage and mechanical wear-and-tear could be enforced by penalties on power consumption (in kWh) and actuator efforts (in N or Nm), respectively. Practitioners would need to resort to trial and error to select a scalarization that ensures the agent prioritizes actually doing the task (and thus being useful) over saving energy.
Motivated by this, we propose a scale-invariant approach for encoding preferences, derived from the RL-as-inference perspective. Instead of choosing a scalarization, practitioners set a constraint per objective. Based on these constraints, we learn an action distribution per objective that improves on the current policy. Then, to obtain a single updated policy that makes these trade-offs, we use supervised learning to fit a policy to the combination of these action distributions. The constraints control the influence of each objective on the policy, by constraining the KL-divergence between each objective-specific distribution and the current policy. The higher the constraint value, the more influence the objective has. Thus, a desired preference over objectives can be encoded as the relative magnitude of these constraint values.
arXiv:2005.07513v1 [cs.LG] 15 May 2020

Fundamentally, scalarization combines objectives in reward space, whereas our approach combines objectives in distribution space, thus making it invariant to the scale of rewards. In principle, our approach can be combined with any RL algorithm, regardless of whether it is off-policy or on-policy. We combine it with maximum a posteriori policy optimization (MPO) (Abdolmaleki et al., 2018a;b), an off-policy actor-critic RL algorithm, and V-MPO (Song et al., 2020), an on-policy variant of MPO. We call these two algorithms multi-objective MPO (MO-MPO) and multi-objective V-MPO (MO-V-MPO), respectively.
Our main contribution is providing a distributional view on MORL, which enables scale-invariant encoding of preferences. We show that this is a theoretically-grounded approach that arises from taking an RL-as-inference perspective of MORL. Empirically, we analyze the mechanics of MO-MPO and show it finds all Pareto-optimal policies in a popular MORL benchmark task. Finally, we demonstrate that MO-MPO and MO-V-MPO outperform scalarized approaches on multi-objective tasks across several challenging high-dimensional continuous control domains (Fig. 1).
2 Related Work

2.1 Multi-Objective Reinforcement Learning
Multi-objective reinforcement learning (MORL) algorithms are either single-policy or multiple-policy (Vamplew et al., 2011). Single-policy approaches seek to find the optimal policy for a given scalarization of the multi-objective problem. Often this scalarization is linear, but other choices have also been explored (Van Moffaert et al., 2013).
However, the scalarization may be unknown at training time, or it may change over time. Multiple-policy approaches handle this by finding a set of policies that approximates the true Pareto front. Some approaches repeatedly call a single-policy MORL algorithm with strategically-chosen scalarizations (Natarajan & Tadepalli, 2005; Roijers et al., 2014; Mossalam et al., 2016; Zuluaga et al., 2016). Other approaches learn a set of policies simultaneously, by using a multi-objective variant of the Q-learning update rule (Barrett & Narayanan, 2008; Moffaert & Nowe, 2014; Reymond & Nowe, 2019; Yang et al., 2019) or by modifying gradient-based policy search (Parisi et al., 2014; Pirotta et al., 2015).
Most existing approaches for finding the Pareto front are limited to discrete state and action spaces, in which tabular algorithms are sufficient. Although recent work combining MORL with deep RL handles high-dimensional observations, this is in domains with low-dimensional and usually discrete action spaces (Mossalam et al., 2016; van Seijen et al., 2017; Friedman & Fontaine, 2018; Abels et al., 2019; Reymond & Nowe, 2019; Yang et al., 2019; Nottingham et al., 2019). In contrast, we evaluate our approach on continuous control tasks with more than 20 action dimensions.¹
A couple of recent works have applied deep MORL to find the Pareto front in continuous control tasks; these works assume scalarization and rely on additionally learning either a meta-policy (Chen et al., 2019) or inter-objective relationships (Zhan & Cao, 2019). We take an approach orthogonal to existing ones: we encode preferences via constraints on the influence of each objective on the policy update, instead of via scalarization. MO-MPO can be run multiple times with different constraint settings to find a Pareto front of policies.
2.2 Constrained Reinforcement Learning
An alternate way of setting preferences is to enforce that policies meet certain constraints. For instance, threshold lexicographic ordering approaches optimize a (single) objective while meeting specified threshold values on the other objectives (Gabor et al., 1998), optionally with slack (Wray et al., 2015). Similarly, safe RL is concerned with learning policies that optimize a scalar reward while not violating safety constraints (Achiam et al., 2017; Chow et al., 2018); this has also been studied in the off-policy batch RL setting (Le et al., 2019). Related work minimizes costs while ensuring the policy meets a constraint on the minimum expected return (Bohez et al., 2019), but this requires that the desired or achievable reward is known a priori. In contrast, MO-MPO does not require knowledge of the scale of rewards. In fact, often there is no easy way to specify constraints on objectives; e.g., it is difficult to figure out a priori how much actuator effort a robot will need to use to perform a task.
2.3 Multi-Task Reinforcement Learning
Multi-task reinforcement learning can also be cast as a MORL problem. Generally, these algorithms learn a separate policy for each task, with shared learning across tasks (Teh et al., 2017; Riedmiller et al., 2018; Wulfmeier et al., 2019). In particular, Distral (Teh et al., 2017) learns a shared prior that regularizes the per-task policies to be similar to each other, and thus captures essential structure that is shared across tasks. MO-MPO differs in that the goal is to learn a single policy that must trade off across different objectives.
Other multi-task RL algorithms seek to train a single agent to solve different tasks, and thus need to handle different reward scales across tasks. Prior work uses adaptive normalization for the targets in value-based RL, so that the agent
¹ MORL for continuous control tasks is difficult because the policy can no longer output arbitrary action distributions, which limits how well it can compromise between competing objectives. In state-of-the-art RL algorithms for continuous control, policies typically output a single action (e.g., D4PG (Barth-Maron et al., 2018)) or a Gaussian (e.g., PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018b), and SAC (Haarnoja et al., 2018)).
cares equally about all tasks (van Hasselt et al., 2016; Hessel et al., 2019). Similarly, prior work in multi-objective optimization has dealt with objectives of different units and/or scales by normalizing objectives to have similar magnitudes (Marler & Arora, 2005; Grodzevich & Romanko, 2006; Kim & de Weck, 2006; Daneshmand et al., 2017; Ishibuchi et al., 2017). MO-MPO can also be seen as doing adaptive normalization, but for any preference over objectives, not just equal preferences.
In general, invariance to reparameterization of the function approximator has been investigated in the optimization literature, resulting in, for example, natural gradient methods (Martens, 2014). The common tool here is measuring distances in function space instead of parameter space, using the KL-divergence. Similarly, in this work, to achieve invariance to the scale of objectives, we use the KL-divergence over policies to encode preferences.
3 Background and Notation

Multi-Objective Markov Decision Process. In this paper we consider a multi-objective RL problem defined by a multi-objective Markov Decision Process (MO-MDP). The MO-MDP consists of states s ∈ S and actions a ∈ A, an initial state distribution p(s₀), transition probabilities p(s_{t+1}|s_t, a_t) which define the probability of changing from state s_t to s_{t+1} when taking action a_t, reward functions {r_k(s, a) ∈ ℝ}_{k=1}^N, one per objective k, and a discount factor γ ∈ [0, 1). We define our policy π_θ(a|s) as a state-conditional distribution over actions, parametrized by θ. Together with the transition probabilities, this gives rise to a state visitation distribution μ(s). We also consider per-objective action-value functions. The action-value function for objective k is defined as the expected return (i.e., cumulative discounted reward) from choosing action a in state s for objective k, and then following policy π:

$$Q^{\pi}_k(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r_k(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right].$$

We can represent this function using the recursive expression

$$Q^{\pi}_k(s_t, a_t) = \mathbb{E}_{p(s_{t+1}|s_t, a_t)}\!\left[r_k(s_t, a_t) + \gamma V^{\pi}_k(s_{t+1})\right],$$

where V^π_k(s) = E_π[Q^π_k(s, a)] is the value function of π for objective k.
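To make the definition concrete, the expected discounted return above can be estimated per objective by Monte Carlo. The following minimal sketch (the trajectory rewards are hypothetical numbers for illustration, not from the paper) computes one such per-objective return:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t: the return for one objective along one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Rewards r_k(s_t, a_t) along a single sampled trajectory, one row per
# objective k (hypothetical values for illustration).
rewards = np.array([[1.0, 1.0, 1.0],   # objective 1
                    [0.0, 2.0, 4.0]])  # objective 2
gamma = 0.9

# Monte Carlo estimates of Q_k(s_0, a_0), one per objective; averaging such
# returns over many trajectories approximates the expectation under pi.
q_hat = [discounted_return(r, gamma) for r in rewards]
print(q_hat)  # approximately [2.71, 5.04]
```

Each objective gets its own return, which is why the paper maintains a separate action-value function Q^π_k per objective rather than a single scalar one.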
Problem Statement. For any MO-MDP, there is a set of nondominated policies, i.e., the Pareto front. A policy is nondominated if there is no other policy that improves its expected return for an objective without reducing the expected return of at least one other objective. Given a preference setting, our goal is to find a nondominated policy π_θ that satisfies those preferences. In our approach, a setting of constraints does not directly correspond to a particular scalarization, but we show that by varying these constraint settings, we can indeed trace out a Pareto front of policies.
Algorithm 1 MO-MPO: one policy improvement step

1: given batch size L, number of actions to sample M, N Q-functions {Q^π_old_k(s, a)}_{k=1}^N, preferences {ε_k}_{k=1}^N, previous policy π_old, previous temperatures {η_k}_{k=1}^N, replay buffer D, first-order gradient-based optimizer O
2:
3: initialize π_θ from the parameters of π_old
4: repeat
5:   Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:     M actions a_ij ~ π_old(a|s_i) and Q_ijk = Q^π_old_k(s_i, a_ij)
7:
8:   // Compute action distribution for each objective
9:   for k = 1, ..., N do
10:    δ_{η_k} ← ∇_{η_k} [ η_k ε_k + η_k Σ_{i=1}^L (1/L) log ( Σ_{j=1}^M (1/M) exp(Q_ijk / η_k) ) ]
11:    Update η_k based on δ_{η_k}, using optimizer O
12:    q_ijk ∝ exp(Q_ijk / η_k)
13:  end for
14:
15:  // Update parametric policy
16:  δ_π ← −∇_θ Σ_{i=1}^L Σ_{j=1}^M Σ_{k=1}^N q_ijk log π_θ(a_ij | s_i)
17:    (subject to additional KL regularization; see Sec. 4.2.2)
18:  Update π_θ based on δ_π, using optimizer O
19:
20: until fixed number of steps
21: return π_old = π_θ
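Lines 9–13 of Algorithm 1 can be sketched numerically. The sketch below is a simplification: it solves each convex temperature dual exactly by ternary search rather than by the few warm-started gradient steps the paper uses, and since the M actions are sampled from π_old, the sample-based KL is measured against uniform weights. Array shapes and values are illustrative, not from the paper.

```python
import numpy as np

def dual(eta, Q_k, eps_k):
    """Sample-based convex dual from line 10 of Algorithm 1, for one objective."""
    z = Q_k / eta                                   # shape [L, M]
    zmax = z.max(axis=1, keepdims=True)
    log_mean_exp = zmax[:, 0] + np.log(np.exp(z - zmax).mean(axis=1))
    return eta * eps_k + eta * log_mean_exp.mean()

def solve_eta(Q_k, eps_k, lo=1e-4, hi=1e4, iters=200):
    """Minimize the dual over eta > 0 by ternary search (the paper instead
    takes a few gradient steps on eta, warm-started from the previous step)."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if dual(m1, Q_k, eps_k) < dual(m2, Q_k, eps_k):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def improved_weights(Q, epsilons):
    """Lines 9-13: per-objective weights q_ijk ∝ exp(Q_ijk / eta_k),
    normalized over the M actions sampled for each state."""
    q = []
    for Q_k, eps_k in zip(Q, epsilons):
        eta_k = solve_eta(Q_k, eps_k)
        w = np.exp((Q_k - Q_k.max(axis=1, keepdims=True)) / eta_k)
        q.append(w / w.sum(axis=1, keepdims=True))
    return np.stack(q)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 16))        # N=2 objectives, L=5 states, M=16 actions
q = improved_weights(Q, epsilons=[0.1, 0.01])
```

At the optimum, the KL constraint is active: the average sample-based KL of q_ijk against the uniform proposal weights equals ε_k, which is what makes ε_k a scale-free preference knob.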
4 Method

We propose a policy iteration algorithm for multi-objective RL. Policy iteration algorithms decompose the RL problem into two sub-problems and iterate until convergence:

1. Policy evaluation: estimate Q-functions given the policy.
2. Policy improvement: update the policy given the Q-functions.
Algorithm 1 summarizes this two-step multi-objective policy improvement procedure. In Appendix E, we explain how this can be derived from the "RL as inference" perspective.
We describe multi-objective MPO in this section, and explain multi-objective V-MPO in Appendix D. When there is only one objective, MO-(V-)MPO reduces to (V-)MPO.
4.1 Multi-Objective Policy Evaluation
In this step, we learn Q-functions to evaluate the previous policy π_old. We train a separate Q-function per objective, following the Q-decomposition approach (Russell & Zimdars, 2003). In principle, any Q-learning algorithm can be used, as long as the target Q-value is computed with respect to π_old.² In this paper, we use the Retrace objective (Munos et al., 2016) to learn a Q-function Q^π_old_k(s, a; φ_k) for each objective k, parameterized by φ_k, as follows:

$$\min_{\{\phi_k\}_{k=1}^{N}} \sum_{k=1}^{N} \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\big(Q^{\text{ret}}_k(s, a) - Q^{\pi_{\text{old}}}_k(s, a; \phi_k)\big)^2\right],$$

where Q^ret_k is the Retrace target for objective k and the previous policy π_old, and D is a replay buffer containing gathered transitions. See Appendix C for details.

² Russell & Zimdars (2003) prove that critics suffer from an "illusion of control" if they are trained with conventional Q-learning (Watkins, 1989). In other words, if each critic computes its target Q-value based on its own best action for the next state, then they overestimate Q-values, because in reality the parametric policy π_θ (which considers all critics' opinions) is in charge of choosing actions.
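The requirement that every critic bootstrap with respect to the same policy π_old can be illustrated with a minimal tabular sketch. It replaces the Retrace target used in the paper with a one-step TD target (a simplification), on a hypothetical two-state MDP with random rewards:

```python
import numpy as np

# Hypothetical two-state, two-action MDP with two reward functions; the
# paper uses neural critics with Retrace targets, so this tabular one-step
# version only illustrates the structure of per-objective evaluation.
n_obj, n_states, n_actions, gamma = 2, 2, 2, 0.9
rng = np.random.default_rng(0)
rewards = rng.uniform(size=(n_obj, n_states, n_actions))  # r_k(s, a)
next_state = np.array([[0, 1], [1, 0]])                   # deterministic dynamics
pi_old = np.full((n_states, n_actions), 0.5)              # policy being evaluated

Q = np.zeros((n_obj, n_states, n_actions))
for _ in range(2000):
    for k in range(n_obj):
        # Key point: the bootstrap uses pi_old for EVERY objective, not each
        # critic's own greedy action (which would cause "illusion of control").
        V = (pi_old * Q[k]).sum(axis=1)              # V_k(s) = E_{pi_old}[Q_k]
        target = rewards[k] + gamma * V[next_state]  # one-step TD target
        Q[k] += 0.5 * (target - Q[k])                # regress toward the target
```

At convergence, each Q[k] satisfies the per-objective Bellman equation for the same π_old, so the critics jointly evaluate one policy rather than N different greedy ones.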
4.2 Multi-Objective Policy Improvement

Given the previous policy π_old(a|s) and associated Q-functions {Q^π_old_k(s, a)}_{k=1}^N, our goal is to improve the previous policy for a given visitation distribution μ(s).³ To this end, we learn an action distribution for each Q-function and combine these to obtain the next policy π_new(a|s). This is a multi-objective variant of the two-step policy improvement procedure employed by MPO (Abdolmaleki et al., 2018b).

In the first step, for each objective k, we learn an improved action distribution q_k(a|s) such that E_{q_k(a|s)}[Q^π_old_k(s, a)] ≥ E_{π_old(a|s)}[Q^π_old_k(s, a)], where states s are drawn from a visitation distribution μ(s).

In the second step, we combine and distill the improved distributions q_k into a new parametric policy π_new (with parameters θ_new), by minimizing the KL-divergence between the distributions and the new parametric policy, i.e.,

$$\theta_{\text{new}} = \arg\min_{\theta} \sum_{k=1}^{N} \mathbb{E}_{\mu(s)}\!\left[\mathrm{KL}\big(q_k(a|s)\,\|\,\pi_\theta(a|s)\big)\right]. \qquad (1)$$

This is a supervised learning loss that performs maximum likelihood estimation of the distributions q_k. Next, we explain these two steps in more detail.
4.2.1 OBTAINING ACTION DISTRIBUTIONS PER OBJECTIVE (STEP 1)

To obtain the per-objective improved action distributions q_k(a|s), we optimize the standard RL objective for each objective Q_k:

$$\max_{q_k} \int_s \mu(s) \int_a q_k(a|s)\, Q_k(s, a)\, da\, ds \qquad (2)$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big(q_k(a|s)\,\|\,\pi_{\text{old}}(a|s)\big)\, ds < \epsilon_k,$$

where ε_k denotes the allowed expected KL-divergence for objective k. We use these ε_k to encode preferences over objectives. More concretely, ε_k defines the allowed influence of objective k on the change of the policy.
³ In practice, we use draws from the replay buffer to estimate expectations over the visitation distribution μ(s).
For nonparametric action distributions q_k(a|s), we can solve this constrained optimization problem in closed form for each state s sampled from μ(s) (Abdolmaleki et al., 2018b):

$$q_k(a|s) \propto \pi_{\text{old}}(a|s)\, \exp\!\left(\frac{Q_k(s, a)}{\eta_k}\right), \qquad (3)$$

where the temperature η_k is computed based on the corresponding ε_k, by solving the following convex dual function:

$$\eta_k = \arg\min_{\eta}\; \eta\, \epsilon_k + \eta \int_s \mu(s) \log \int_a \pi_{\text{old}}(a|s) \exp\!\left(\frac{Q_k(s, a)}{\eta}\right) da\, ds. \qquad (4)$$

In order to evaluate q_k(a|s) and the integrals in (4), we draw L states from the replay buffer and, for each state, sample M actions from the current policy π_old. In practice, we maintain one temperature parameter η_k per objective. We found that optimizing the dual function by performing a few steps of gradient descent on η_k is effective, and we initialize with the solution found in the previous policy iteration step. Since η_k should be positive, we use a projection operator after each gradient step to maintain η_k > 0. Please refer to Appendix C for derivation details.
Application to Other Deep RL Algorithms. Since the constraints ε_k in (2) encode the preferences over objectives, solving this optimization problem with good satisfaction of the constraints is key for learning a policy that satisfies the desired preferences. For nonparametric action distributions q_k(a|s), we can satisfy these constraints exactly. One could instead use any policy gradient method (e.g., Schulman et al., 2015; 2017; Heess et al., 2015; Haarnoja et al., 2018) to obtain q_k(a|s) in a parametric form. However, solving the constrained optimization for parametric q_k(a|s) is not exact, and the constraints may not be well satisfied, which impedes the use of ε_k to encode preferences. Moreover, assuming a parametric q_k(a|s) requires maintaining a function approximator (e.g., a neural network) per objective, which can significantly increase the complexity of the algorithm and limits scalability.
Choosing ε_k. It is more intuitive to encode preferences via ε_k than via scalarization weights, because the former is invariant to the scale of rewards. In other words, having a desired preference across objectives narrows down the range of reasonable choices for ε_k, but does not narrow down the range of reasonable choices for scalarization weights. In order to identify reasonable scalarization weights, an RL practitioner needs to additionally be familiar with the scale of rewards for each objective. In practice, we have found that learning performance is robust to a wide range of scales for ε_k. It is the relative scales of the ε_k that matter for encoding preferences over objectives: the larger a particular ε_k is with respect to the others, the more that objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no influence and will effectively be ignored. In Appendix A.1, we provide suggestions for setting ε_k given a desired preference over objectives.
4.2.2 FITTING A NEW PARAMETRIC POLICY (STEP 2)

In the previous section, for each objective k, we obtained an improved action distribution q_k(a|s). Next, we want to combine these distributions to obtain a single parametric policy that trades off the objectives according to the constraints ε_k that we set. For this, we solve a supervised learning problem that fits a parametric policy to the per-objective action distributions from step 1:

$$\theta_{\text{new}} = \arg\max_{\theta} \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds \qquad (5)$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\text{old}}(a|s)\,\|\,\pi_\theta(a|s)\big)\, ds < \beta,$$

where θ are the parameters of our policy (a neural network) and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy. The KL constraint in this step has a regularization effect that prevents the policy from overfitting to the sample-based action distributions; it therefore avoids premature convergence and improves the stability of learning (Schulman et al., 2015; Abdolmaleki et al., 2018a;b).

As in the first policy improvement step, we evaluate the integrals using the L states sampled from the replay buffer and the M actions per state sampled from the old policy. In order to optimize (5) using gradient descent, we employ Lagrangian relaxation, similar to MPO (Abdolmaleki et al., 2018a); see Appendix C for more detail.
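As a concrete instance of Eq. (5), the sketch below fits a categorical softmax policy to two hypothetical per-objective distributions q_k by plain gradient ascent on the summed cross-entropy; the trust region and its Lagrangian multiplier are omitted for brevity. Without the trust region, the optimum of Eq. (5) is simply the average of the q_k:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Two improved per-objective action distributions over three actions
# (hypothetical numbers, standing in for the q_k of step 1).
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

# Maximize sum_k E_{q_k}[log pi_theta] over softmax logits. The gradient of
# this objective with respect to the logits is sum_k (q_k - pi_theta).
logits = np.zeros(3)
for _ in range(5000):
    pi = softmax(logits)
    logits += 0.1 * (q - pi).sum(axis=0)

pi_new = softmax(logits)
print(pi_new)  # converges to (q[0] + q[1]) / 2 = [0.4, 0.25, 0.35]
```

In MO-MPO, the maximum-likelihood fit is additionally constrained by KL(π_old ‖ π_θ) < β, which keeps the parametric update inside a trust region rather than jumping straight to this unconstrained optimum.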
5 Experiments: Toy Domains

In the empirical evaluation that follows, we first demonstrate the mechanics and scale-invariance of MO-MPO in a single-state environment (Sec. 5.1), and then show that MO-MPO can find all Pareto-dominant policies in a popular MORL benchmark (Sec. 5.2). Finally, we show the benefit of using MO-MPO in high-dimensional continuous control domains, including on a real robot (Sec. 6). Appendices A and B contain a detailed description of all domains and tasks, experimental setups, and implementation details.
Baselines. The goal of our empirical evaluation is to analyze the benefit of using our proposed multi-objective policy improvement step (Sec. 4.2.2), which encodes preferences over objectives via constraints ε_k on expected KL-divergences, rather than via weights w_k. Thus, we primarily compare MO-MPO against scalarized MPO, which relies on linear scalarization weights w_k to encode preferences. The only difference between MO-MPO and scalarized MPO is the policy improvement step: for scalarized MPO, a single improved action distribution q(a|s) is computed based on Σ_k w_k Q_k(s, a), with a single KL constraint ε.

Figure 2. A simple environment in which the agent starts at state S0 and chooses to navigate to one of three terminal states. There are two objectives. Taking the left action, for instance, leads to a reward of 1 for the first objective and 4 for the second.
State-of-the-art approaches that combine MORL with deep RL assume linear scalarization as well, either learning a separate policy for each setting of weights (Mossalam et al., 2016) or learning a single policy conditioned on scalarization weights (Friedman & Fontaine, 2018; Abels et al., 2019). Scalarized MPO addresses the former problem, which is easier. The policy evaluation step in scalarized MPO is analogous to the scalarized Q-learning proposed by Mossalam et al. (2016). As we show later in Sec. 6, even learning an optimal policy for a single scalarization is difficult in high-dimensional continuous control domains.
5.1 Simple World

First, we examine the behavior of MO-MPO in a simple multi-armed bandit with three actions (up, right, and left) (Fig. 2), inspired by Russell & Zimdars (2003). We train policies with scalarized MPO and with MO-MPO. The policy evaluation step is exact, because the Q-value function for each objective is known: it is equal to the reward received for that objective after taking each action, as labeled in Fig. 2.⁴
We consider three possible desired preferences: equal preference for the two objectives, preferring the first, and preferring the second. Encoding preferences in scalarized MPO amounts to choosing appropriate linear scalarization weights, and in MO-MPO amounts to choosing appropriate ε's. We use the following weights and ε's:

• equal preference: weights [0.5, 0.5], or ε's [0.01, 0.01]
• prefer first: weights [0.9, 0.1], or ε's [0.01, 0.002]
• prefer second: weights [0.1, 0.9], or ε's [0.002, 0.01]
We set ε = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.

⁴ The policy improvement step can also be computed exactly, because solving for the optimal temperature η (or η₁ and η₂ in the MO-MPO case) is a convex optimization problem, and the KL-constrained policy update is also a convex optimization problem when there is only one possible state. We use CVXOPT (Andersen et al., 2020) as our convex optimization solver.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right), in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step, MO-MPO optimizes for a separate temperature η_k that scales each objective's Q-value function. This η_k is computed based on the corresponding allowed KL-divergence ε_k, so when the rewards for any objective k are multiplied by a factor but ε_k remains the same, the computed η_k ends up being scaled by that factor as well, neutralizing the effect of the scaling of rewards (see Eq. (4)).

Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of temperatures η_k per objective becomes even more essential (Sec. 6).
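This scale-invariance argument can be checked numerically. The sketch below uses a single-state bandit with a uniform π_old and hypothetical Q-values (loosely in the spirit of Fig. 2, not the paper's exact numbers), solving the dual of Eq. (4) by ternary search: multiplying the Q-values by 20 scales the optimal temperature by 20 and leaves the improved distribution unchanged.

```python
import numpy as np

def improved_dist(Q, eps, lo=1e-4, hi=1e6, iters=300):
    """Single-state version of Eqs. (3)-(4) with uniform pi_old:
    minimize eta*eps + eta*log mean_a exp(Q(a)/eta), then q ∝ exp(Q/eta)."""
    def dual(eta):
        z = Q / eta
        return eta * eps + eta * (z.max() + np.log(np.mean(np.exp(z - z.max()))))
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            hi = m2
        else:
            lo = m1
    eta = 0.5 * (lo + hi)
    w = np.exp((Q - Q.max()) / eta)
    return w / w.sum(), eta

Q = np.array([1.0, 2.0, 3.0])                 # Q-values for one objective
q1, eta1 = improved_dist(Q, eps=0.01)
q2, eta2 = improved_dist(20.0 * Q, eps=0.01)  # rewards scaled 20x

print(eta2 / eta1)               # approximately 20: the temperature absorbs the scale
print(np.allclose(q1, q2, atol=1e-5))         # same improved distribution
```

Substituting η → 20η in the dual for 20Q recovers 20 times the original dual, so the minimizer scales exactly with the rewards, which is the mechanism behind the striped-bar results discussed above.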
The scale of ε_k controls the amount that objective k can influence the policy's update. If we set ε₁ = 0.01 and sweep ε₂ over the range from 0 to 0.01, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies quickly converging to placing all probability on a single action (Fig. 4, left). We hypothesize this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
5.2 Deep Sea Treasure

An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s₀ = (0, 0)) and has a choice of four actions (moving one square up, right, down, and left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
We ran scalarized MPO with weightings [w, 1−w] and
Figure 3. When two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.
Figure 4. A visualization of policies during learning; each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1−x−y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
w ∈ {0, 0.01, 0.02, ..., 1}. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵
We ran MO-MPO on this task as well, for a range of ε: ε_time ∈ {0.01, 0.02, 0.05} and ε_treasure = c · ε_time, where c ∈ {0.5, 0.51, 0.52, ..., 1.5}. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than the exact settings: across all settings of ε_time, similar ratios of ε_treasure to ε_time result in similar policies, and as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).
⁵ Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for ε_k discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of ε_k settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of ε_k.
6 Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's rather than via weights is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid: We use the humanoid run task defined in the DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand: We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain." A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap: We consider a large-scale humanoid motion capture tracking task, similar to Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total, we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action, for three timesteps.
6.1 Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings εk and scalarization weights wk, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
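For intuition, the two-objective case of this metric can be computed directly by sorting the points and summing rectangle areas. The sketch below is a from-scratch illustration of the definition, not the DEAP implementation the paper uses; points and the reference are (objective-1, objective-2) pairs, with both objectives maximized:

```python
def hypervolume_2d(points, ref):
    """Area dominated by at least one point in `points` and dominating `ref`.

    Both objectives are maximized; `ref` is assumed dominated by all points.
    """
    # Keep only points that are at least as good as the reference.
    pts = [(x, y) for x, y in points if x >= ref[0] and y >= ref[1]]
    # Sort by the first objective, descending, then sweep and add rectangles.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, y_covered = 0.0, ref[1]
    for x, y in pts:
        if y > y_covered:  # this point extends the dominated region upward
            area += (x - ref[0]) * (y - y_covered)
            y_covered = y
    return area
```

For instance, the front {(3, 1), (2, 2), (1, 3)} with reference (0, 0) has hypervolume 6: the union of the three rectangles it dominates. Points dominated by another point in the set contribute nothing, as expected.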
⁶This task was developed concurrently by Anonymous (2020) and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2 Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO, with respect to the hypervolume metric (Table 1).⁷ In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation We also ran vanilla MPO on humanoid run, with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
⁷We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                          scalarized MPO   MO-MPO

Humanoid run, mid-training    1.1×10⁶          3.3×10⁶
Humanoid run                  6.4×10⁶          7.1×10⁶
Humanoid run, 1× penalty      5.0×10⁶          5.9×10⁶
Humanoid run, 10× penalty     2.6×10⁷          6.9×10⁷
Shadow touch, mid-training    1.43             1.56
Shadow touch                  1.62             1.64
Shadow turn                   1.44             1.54
Shadow orient                 2.8×10⁴          2.8×10⁴
Humanoid mocap                3.86×10⁻⁶        3.41×10⁻⁶

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).
6.3 Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.⁸ None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering, compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4 Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
⁸For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values, and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7 Conclusions and Future Work
In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments
The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References
Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41–47, 2008.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.

Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gábor, Z., Kalmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.

Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow Dexterous Hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.

van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A Experiment Details
In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform Aii ← log(1 + exp(Aii)), to ensure positive definiteness of the diagonal covariance matrix.
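A minimal pure-Python sketch of this covariance parametrization (the function name is ours; in the paper, the raw diagonal values would be network outputs):

```python
import math

def gaussian_policy_cov(raw_diag):
    """Map raw network outputs to a valid diagonal covariance Sigma = A A^T.

    The softplus transform log(1 + exp(x)) makes each diagonal Cholesky
    entry positive, so Sigma is positive definite; off-diagonal entries
    of a diagonal Cholesky factor yield a diagonal covariance.
    """
    diag = [math.log1p(math.exp(x)) for x in raw_diag]  # softplus
    n = len(diag)
    # Sigma = A A^T with A = diag(d): diagonal entries are d[i]**2.
    return [[diag[i] ** 2 if i == j else 0.0 for j in range(n)]
            for i in range(n)]
```

For a raw output of 0, the softplus gives log 2 ≈ 0.693, so the corresponding variance is about 0.48; strongly negative raw outputs give variances approaching zero without ever reaching it.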
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1 Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                                Default

policy network
  layer sizes                                 (300, 200)
  take tanh of action mean                    yes
  minimum variance                            10⁻¹²
  maximum variance                            unbounded

critic network(s)
  layer sizes                                 (400, 400, 300)
  take tanh of action                         yes
  Retrace sequence size                       8
  discount factor γ                           0.99

both policy and critic networks
  layer norm on first layer                   yes
  tanh on output of layer norm                yes
  activation (after each hidden layer)        ELU

MPO
  actions sampled per state                   20
  default ε                                   0.1
  KL-constraint on policy mean, βμ            10⁻³
  KL-constraint on policy covariance, βΣ      10⁻⁵
  initial temperature η                       1

training
  batch size                                  512
  replay buffer size                          10⁶
  Adam learning rate                          3×10⁻⁴
  Adam ε                                      10⁻³
  target network update period                200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.
Equal Preference When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if εk is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                               (400, 400, 300)
    take tanh of action mean                  no
    minimum variance                          10⁻⁸
  critic network(s)
    layer sizes                               (500, 500, 400)
    take tanh of action                       no
  MPO
    KL-constraint on policy mean, βμ          5×10⁻³
    KL-constraint on policy covariance, βΣ    10⁻⁶
  training
    Adam learning rate                        2×10⁻⁴
    Adam ε                                    10⁻⁸

Shadow Hand
  MPO
    actions sampled per state                 30
    default ε                                 0.01

Sawyer
  MPO
    actions sampled per state                 15
  training
    target network update period              100

Table 3. Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as εk is not set too high.
Unequal Preference When there is a difference in preferences across objectives, the relative scale of εk is what matters. The larger the scale of εk is relative to εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, then objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition             Settings

Humanoid run (one seed per setting)
  scalarized MPO      wtask = 1 − wpenalty, wpenalty ∈ linspace(0, 0.15, 100)
  MO-MPO              εtask = 0.1, εpenalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO      wtask = 1 − wpenalty, wpenalty ∈ {0.01, 0.05, 0.1}
  MO-MPO              εtask = 0.1, εpenalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO      wtask = 1 − wpenalty, wpenalty ∈ linspace(0, 0.9, 10)
  MO-MPO              εtask = 0.01, εpenalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO      wtouch = wheight = worientation = 1/3
  MO-MPO              εtouch = εheight = εorientation = 0.01

Table 4. Settings for εk and weights. (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y.)
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime, and the treasure that the policy converges to selecting, is essentially the same.
A.2 Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
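The one-hot action encoding and per-objective one-step TD target described above can be sketched as follows. This is a simplified illustration with plain functions standing in for the networks; the names are ours:

```python
GAMMA = 0.999  # discount used for Deep Sea Treasure

def one_hot(action, num_actions=4):
    """Encode a discrete action index as a one-hot vector.

    E.g., the up action (index 0) maps to [1, 0, 0, 0], which is
    concatenated with the state as input to each critic.
    """
    return [1.0 if i == action else 0.0 for i in range(num_actions)]

def td_target(reward_k, q_next_k, done):
    """One-step TD target for objective k's critic: r_k + gamma * Q_k(s', a').

    `q_next_k` is the critic's estimate at the next state-action pair;
    the bootstrap term is dropped at episode termination.
    """
    return reward_k + (0.0 if done else GAMMA * q_next_k)
```

Each objective's critic regresses toward its own target, so the time-penalty and treasure critics are learned independently before their action distributions are combined in the policy update.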
Humanoid Mocap

Scalarized V-MPO
  KL-constraint on policy mean, βμ            0.1
  KL-constraint on policy covariance, βΣ      10⁻⁵
  default ε                                   0.1
  initial temperature η                       1

Training
  Adam learning rate                          10⁻⁴
  batch size                                  128
  unroll length (for n-step return, Sec. D.1) 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3 Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer, per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B Experimental Domains
We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1 Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Treasure values in Fig. 8, from nearest to farthest: 0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7.]
Figure 8. Deep Sea Treasure environment, from Vamplew et al. (2011) with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
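The DST dynamics and reward vector above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the treasure coordinates and values below are placeholders (the true layout is in Fig. 8), and `step` is a hypothetical helper.

```python
import numpy as np

# Illustrative stand-in layout: (row, col) -> treasure value (true layout in Fig. 8).
TREASURES = {(1, 0): 0.7, (2, 1): 8.2, (3, 2): 11.5}
ACTIONS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left

def step(state, action, t, max_t=200):
    """One DST transition: returns (next_state, [time_reward, treasure_reward], done)."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    # Moves that would leave the grid (or hit the sea floor) have no effect.
    if not (0 <= r < 11 and 0 <= c < 10):
        r, c = state
    treasure = TREASURES.get((r, c), 0.0)
    done = treasure > 0 or t + 1 >= max_t
    return (r, c), np.array([-1.0, treasure]), done
```

Collecting a treasure of value v yields the reward vector [−1, v] and terminates the episode; all other transitions yield [−1, 0].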
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encodes end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing the block with greater than 5 N of force, or turning the dial to a fixed target angle, respectively. In both tasks the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
rpain(s, a, s′) = −[1 − aλ(m(s, s′))] m(s, s′),    (6)
where aλ(·) ∈ [0, 1] computes the acceptability of an impact force. aλ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use
aλ(m) = sigmoid(λ1 (−m + λ2)),    (7)
with λ = [2, 2]⊤. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
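Equations (6) and (7) can be computed directly. A minimal sketch, with the constants λ = [2, 2] from the text (function names are ours):

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)), Eq. (7), with lambda = [2, 2]."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(m):
    """r_pain = -(1 - a_lambda(m)) * m, Eq. (6), with m the impact force."""
    return -(1.0 - acceptability(m)) * m
```

Low impact forces are almost fully "acceptable", so the penalty is near zero; for large forces the acceptability vanishes and the penalty approaches −m, matching Fig. 10.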
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |ztarget − zpeg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Qtarget and the peg's orientation qpeg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
min_{q ∈ Qtarget} 1 − tanh(2 d(q, qpeg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) for the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
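The two shaped rewards above can be sketched as follows, assuming unit quaternions and taking d(·) as the rotation angle between the peg's orientation and the closest acceptable target (helper names are ours, not the paper's):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward from the text: 1 - tanh(50 * |z_target - z_peg|)."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def quat_angle(q1, q2):
    """Rotation angle (radians) between unit quaternions, i.e. the l2-norm of
    the axis-angle equivalent of their difference."""
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(min(dot, 1.0))

def orientation_reward(q_peg, q_targets):
    """1 - tanh(2 * d(q, q_peg)), evaluated at the closest acceptable target."""
    d = min(quat_angle(q, q_peg) for q in q_targets)
    return 1.0 - np.tanh(2.0 * d)
```

Both rewards equal 1 exactly at the target and decay quickly with distance, matching the curves in Fig. 11.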
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹ This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., rpenalty(a) = −‖a‖₂. For this objective we do not learn a Q-function; instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

rcom = exp(−10 ‖pcom − p^ref_com‖²),

where pcom and p^ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

rvel = exp(−0.1 ‖qvel − q^ref_vel‖²),

where qvel and q^ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

rapp = exp(−40 ‖papp − p^ref_app‖²),

where papp and p^ref_app are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

rquat = exp(−2 ‖qquat ⊖ q^ref_quat‖²),

where ⊖ denotes quaternion differences, and qquat and q^ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

¹⁰ This database is available at mocap.cs.cmu.edu.
Finally, the last reward component is based on the difference in the joint angles and in the Euclidean positions of a set of 13 body parts:

rtrunc = 1 − (1/0.3) (‖bpos − b^ref_pos‖₁ + ‖qpos − q^ref_pos‖₁),

where the term in parentheses is denoted ε, bpos and b^ref_pos correspond to the body positions of the simulated character and the mocap reference, and qpos and q^ref_pos correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that rtrunc ∈ [0, 1].
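The five tracking components above can be sketched as follows. This is illustrative, not the paper's implementation: we assume squared ℓ2-norms inside the exponentials (as in Peng et al., 2018) and take the quaternion difference as a precomputed axis-angle vector.

```python
import numpy as np

def tracking_rewards(p_com, p_com_ref, q_vel, q_vel_ref, p_app, p_app_ref,
                     quat_diff, b_pos, b_pos_ref, q_pos, q_pos_ref):
    """The five tracking reward components, with coefficients from the text."""
    r_com = np.exp(-10.0 * np.sum((p_com - p_com_ref) ** 2))
    r_vel = np.exp(-0.1 * np.sum((q_vel - q_vel_ref) ** 2))
    r_app = np.exp(-40.0 * np.sum((p_app - p_app_ref) ** 2))
    # quat_diff: precomputed axis-angle quaternion differences (our convention)
    r_quat = np.exp(-2.0 * np.sum(quat_diff ** 2))
    eps = np.sum(np.abs(b_pos - b_pos_ref)) + np.sum(np.abs(q_pos - q_pos_ref))
    r_trunc = 1.0 - eps / 0.3  # episode terminates early if eps > 0.3
    return {"com": r_com, "vel": r_vel, "app": r_app, "quat": r_quat, "trunc": r_trunc}
```

In the MO-V-MPO experiments each of these components is treated as a separate objective; for V-MPO they are scalarized as in Eq. (8).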
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

r = (1/2) rtrunc + (1/2) (0.1 rcom + λ rvel + 0.15 rapp + 0.65 rquat),    (8)

varying λ as described in the main paper (Sec. 6.3).
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper; the agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

rinsertion = max(0.2 rapproach, rinserted),    (9)
rapproach = s(ppeg, popening),    (10)
rinserted = raligned · s(p^peg_z, p^bottom_z),    (11)
raligned = 1[‖p^peg_xy − p^opening_xy‖ < d/2],    (12)

where ppeg is the peg position, popening the position of the hole opening, pbottom the bottom position of the hole, and d the diameter of the hole.
s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²( (atanh(√l) / ε) ‖p1 − p2‖₂ ),    (13)

which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose lapproach = 0.95, εapproach = 0.01, linserted = 0.5, and εinserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that rinserted = 0.5 if ppeg = popening. As a result, if the peg is in the approach position, rinserted is dominant over rapproach in rinsertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
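Equations (9)–(13) can be sketched as below. Argument conventions and helper names are ours; positions are xyz vectors.

```python
import numpy as np

def shape(p1, p2, l, eps):
    """s(p1, p2) = 1 - tanh^2(atanh(sqrt(l)) / eps * ||p1 - p2||_2), Eq. (13);
    by construction this equals 1 - l when the points are at distance eps."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

def insertion_reward(p_peg, p_opening, p_bottom, hole_diameter):
    """r_insertion of Eqs. (9)-(12)."""
    r_approach = shape(p_peg, p_opening, l=0.95, eps=0.01)
    aligned = np.linalg.norm(p_peg[:2] - p_opening[:2]) < hole_diameter / 2
    r_inserted = float(aligned) * shape([p_peg[2]], [p_bottom[2]], l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)
```

With the peg at the opening and the hole bottom 0.04 m (one peg length) below, the inserted term evaluates to 0.5, dominating the 0.2-scaled approach term, as the text describes.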
The reward for the secondary objective, minimizing wrist force measurements, is defined as

rforce = −‖F‖₁,    (14)

where F = (Fx, Fy, Fz) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy πold.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {rk(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, . . . , T do
7:     at ∼ πθ(·|st)
8:     Execute action and determine rewards:
9:     r = [r1(st, at), . . . , rN(st, at)]
10:    τ ← τ ∪ {(st, at, r, πθ(at|st))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Qφk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy πold. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s0, a0, r0, s1), . . . , (sT, aT, rT, sT+1)}, where rt denotes a reward vector {rk(st, at)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:

min_{φk} E_{τ∼D} [ (Q^ret_k(st, at) − Qφk(st, at))² ],    (15)

with

Q^ret_k(st, at) = Qφ′k(st, at) + Σ_{j=t}^T γ^{j−t} ( Π_{z=t+1}^j cz ) δj,
δj = rk(sj, aj) + γ V(sj+1) − Qφ′k(sj, aj),
V(sj+1) = E_{πold(a|sj+1)} [ Qφ′k(sj+1, a) ].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {εk}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {ηk}_{k=1}^N and ν, target networks, and online networks, such that πθ′ = πθ and Qφ′k = Qφk
3: repeat
4:   repeat
5:     Collect dataset {si, aij, Q^ij_k}_{i,j,k}^{L,M,N}, where si ∼ D, aij ∼ πθ′(a|si), and Q^ij_k = Qφ′k(si, aij)
6:     // Compute action distribution for each objective
7:     for k = 1, . . . , N do
8:       δηk ← ∇ηk [ ηk εk + ηk Σ_i^L (1/L) log ( Σ_j^M (1/M) exp(Q^ij_k / ηk) ) ]
9:       Update ηk based on δηk
10:      q_ijk ∝ exp(Q^ij_k / ηk)
11:    end for
12:    // Update parametric policy with trust region (sg denotes stop-gradient)
13:    δπ ← −∇θ Σ_i^L Σ_j^M Σ_k^N q_ijk log πθ(aij|si) + sg(ν) (β − Σ_i^L KL(πθ′(a|si) ‖ πθ(a|si)))
14:    δν ← ∇ν [ ν (β − Σ_i^L KL(πθ′(a|si) ‖ π_sg(θ)(a|si))) ]
15:    Update πθ based on δπ; update ν based on δν
16:    // Update Q-functions
17:    for k = 1, . . . , N do
18:      δφk ← ∇φk Σ_{(st,at)∈τ∼D} (Qφk(st, at) − Q^ret_k)², with Q^ret_k as in Eq. (15)
19:      Update φk based on δφk
20:    end for
21:  until fixed number of steps
22:  Update target networks: πθ′ = πθ, Qφ′k = Qφk
23: until convergence
The importance weights cz are defined as

cz = min(1, πold(az|sz) / b(az|sz)),

where b(az|sz) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set (Π_{z=t+1}^j cz) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
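The Retrace target above can be sketched for a single objective k, given per-step target-network Q-values, bootstrap values, rewards, and truncated importance weights; the array conventions here are our assumption, not the paper's code:

```python
import numpy as np

def retrace_target(q_tgt, v_tgt, rewards, c, gamma=0.99):
    """Q^ret_k for one trajectory snippet (Eq. 15). For steps j = t..T:
    q_tgt[j] = Q_phi'_k(s_j, a_j), v_tgt[j] = E_pi_old[Q_phi'_k(s_{j+1}, .)],
    c[j] = min(1, pi_old(a_j|s_j) / b(a_j|s_j))."""
    T = len(rewards)
    delta = rewards + gamma * v_tgt - q_tgt          # TD errors delta_j
    q_ret = q_tgt.astype(float).copy()
    for t in range(T):
        acc = 1.0                                    # product of c_z, equals 1 at j = t
        for j in range(t, T):
            if j > t:
                acc *= c[j]
            q_ret[t] += gamma ** (j - t) * acc * delta[j]
    return q_ret
```

The regression targets `q_ret` would then be used in the squared loss of Eq. (15) for the online network Qφk.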
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

πnew = argmax_θ Σ_{k=1}^N ∫_s μ(s) ∫_a qk(a|s) log πθ(a|s) da ds    (16)
s.t. ∫_s μ(s) KL(πold(a|s) ‖ πθ(a|s)) ds < β.
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^N ∫_s μ(s) ∫_a qk(a|s) log πθ(a|s) da ds + ν ( β − ∫_s μ(s) KL(πold(a|s) ‖ πθ(a|s)) ds ),    (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (βμ) and covariance (βΣ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π^μ_θ(a|s) = N(a; μθ(s), Σθold(s)),    (18)
π^Σ_θ(a|s) = N(a; μθold(s), Σθ(s)).    (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
max_θ Σ_{k=1}^N ∫_s μ(s) ∫_a qk(a|s) ( log π^μ_θ(a|s) π^Σ_θ(a|s) ) da ds    (20)
s.t. ∫_s μ(s) KL(πold(a|s) ‖ π^μ_θ(a|s)) ds < βμ,
     ∫_s μ(s) KL(πold(a|s) ‖ π^Σ_θ(a|s)) ds < βΣ.
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to preserve exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
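The two decoupled KL terms for a Gaussian policy can be sketched as below: `kl_mean` is the KL from πold to π^μ_θ (only the mean moves, Eq. 18) and `kl_cov` the KL to π^Σ_θ (only the covariance moves, Eq. 19). Helper names are ours.

```python
import numpy as np

def kl_mean(mu_old, cov_old, mu):
    """KL(pi_old || pi^mu_theta), pi^mu_theta = N(mu, Sigma_old): penalizes mean movement."""
    diff = mu - mu_old
    return 0.5 * diff @ np.linalg.solve(cov_old, diff)

def kl_cov(mu_old, cov_old, cov):
    """KL(pi_old || pi^Sigma_theta), pi^Sigma_theta = N(mu_old, Sigma): penalizes
    covariance change."""
    d = len(mu_old)
    trace_term = np.trace(np.linalg.solve(cov, cov_old))
    log_det = np.log(np.linalg.det(cov) / np.linalg.det(cov_old))
    return 0.5 * (trace_term - d + log_det)
```

In Eq. (20), `kl_mean` is bounded by βμ and `kl_cov` by βΣ, with βΣ set much smaller.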
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions qk(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature ηk, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-Objective Policy Evaluation in the On-Policy Setting

In the on-policy setting, to evaluate the previous policy πold, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s0, a0, r0), . . . , (sT, aT, rT)}, where rt denotes a reward vector {rk(st, at)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φk by optimizing the following objective:

min_{φ1,...,N} Σ_{k=1}^N E_τ [ (G^(T)_k(st, at) − V^πold_φk(st))² ].    (21)

Here G^(T)_k(st, at) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G^(T)_k(st, at) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} rk(sℓ, aℓ) + γ^{T−t} V^πold_φk(sT).

The advantages are then estimated as

Â^πold_k(st, at) = G^(T)_k(st, at) − V^πold_φk(st).
D.2. Multi-Objective Policy Improvement in the On-Policy Setting

Given the previous policy πold(a|s) and the estimated advantages {Â^πold_k(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution qk(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy πnew(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution qk(s, a) rather than local policies qk(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions qk(s, a), we optimize the RL objective for each objective's advantages Âk:

max_{qk} ∫_{s,a} qk(s, a) Âk(s, a) da ds    (22)
s.t. KL(qk(s, a) ‖ pold(s, a)) < εk,

where the KL divergence is computed over all (s, a), εk denotes the allowed expected KL divergence, and pold(s, a) = μ(s) πold(a|s) is the state-action distribution associated with πold.
As in MO-MPO, we use these εk to define the preferences over objectives. More concretely, εk defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular εk is with respect to the others, the more objective k is preferred. On the other hand, if εk is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

qk(s, a) ∝ pold(s, a) exp( Âk(s, a) / ηk ),    (23)
where the temperature ηk is computed based on the constraint εk, by solving the following convex dual problem:

ηk = argmin_η [ η εk + η log ∫_{s,a} pold(s, a) exp( Âk(s, a) / η ) da ds ].    (24)
We can perform this optimization along with the loss, by taking a gradient descent step on ηk; we initialize ηk with the solution found in the previous policy iteration step. Since ηk must be positive, we use a projection operator after each gradient step to maintain ηk > 0.
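The gradient step with projection can be sketched as below, estimating the dual of Eq. (24) by Monte Carlo over advantage samples from pold. The numerical gradient is for illustration only; a real implementation would differentiate the dual analytically or via autodiff.

```python
import numpy as np

def eta_dual(eta, advantages, epsilon):
    """Monte-Carlo estimate of the dual of Eq. (24): eta*eps + eta*log E[exp(A/eta)]."""
    z = advantages / eta
    # log-sum-exp trick: subtract the max before exponentiating
    return eta * epsilon + eta * (np.log(np.mean(np.exp(z - z.max()))) + z.max())

def eta_step(eta, advantages, epsilon, lr=0.01, eta_min=1e-8):
    """One gradient-descent step on the dual (numerical gradient, for illustration),
    followed by a projection that keeps eta_k strictly positive."""
    h = min(1e-4, 0.5 * eta)
    grad = (eta_dual(eta + h, advantages, epsilon)
            - eta_dual(eta - h, advantages, epsilon)) / (2.0 * h)
    return max(eta - lr * grad, eta_min)
```

The `max(..., eta_min)` at the end plays the role of the projection operator mentioned above.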
Top-k advantages. As in Song et al. (2020), in practice we used the samples corresponding to the top 50% of advantages in each batch of data.
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy πnew(a|s) that favors all of the objectives, according to the preferences specified by εk. For this we solve a supervised learning problem that fits a parametric policy as follows:
πnew = argmax_θ Σ_{k=1}^N ∫_{s,a} qk(s, a) log πθ(a|s) da ds    (25)
s.t. ∫_s KL(πold(a|s) ‖ πθ(a|s)) ds < β,
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy πold, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events {Rk}_{k=1}^N, one for each objective. Rk = 1 indicates that our policy has improved for objective k, while Rk = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {Rk = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {Rk = 1}_{k=1}^N:
max_θ pθ(R1 = 1, R2 = 1, . . . , RN = 1) p(θ),    (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events Rk, and using log probabilities, leads to the following objective:

max_θ Σ_{k=1}^N log pθ(Rk = 1) + log p(θ).    (27)
The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that πθ stays close to the target policy during each policy improvement step, to improve the stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^N log pθ(Rk = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution qk(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^N log pθ(Rk = 1) as follows:
Σ_{k=1}^N log pθ(Rk = 1)    (28)
= Σ_{k=1}^N KL(qk(s, a) ‖ pθ(s, a | Rk = 1)) − Σ_{k=1}^N KL(qk(s, a) ‖ pθ(Rk = 1, s, a)).
The second term of this decomposition is expanded as:

Σ_{k=1}^N KL(qk(s, a) ‖ pθ(Rk = 1, s, a))    (29)
= Σ_{k=1}^N E_{qk(s,a)} [ log ( pθ(Rk = 1, s, a) / qk(s, a) ) ]
= Σ_{k=1}^N E_{qk(s,a)} [ log ( p(Rk = 1 | s, a) πθ(a|s) μ(s) / qk(s, a) ) ],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(Rk = 1 | s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^N log pθ(Rk = 1). πθ(a|s) and qk(a|s) = qk(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate qk(a|s) for each objective k, by minimizing the first term given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {qk(a|s)}_{k=1}^N such that the lower bound on Σ_{k=1}^N log pθ(Rk = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose qk to minimize this term. We can optimize for each variational distribution qk independently:
qk(a|s) = argmin_q E_{μ(s)} [ KL(q(a|s) ‖ pθ′(s, a | Rk = 1)) ]
= argmin_q E_{μ(s)} [ KL(q(a|s) ‖ πθ′(a|s)) − E_{q(a|s)} [ log p(Rk = 1 | s, a) ] ].    (30)
We can solve this optimization problem in closed form, which gives us

qk(a|s) = πθ′(a|s) p(Rk = 1 | s, a) / ∫ πθ′(a|s) p(Rk = 1 | s, a) da.

This solution weighs the actions based on their relative improvement likelihood p(Rk = 1 | s, a) for each objective. We define the likelihood of an improvement event Rk as
p(Rk = 1 | s, a) ∝ exp( Qk(s, a) / αk ),
where αk is an objective-dependent temperature parameter that controls how greedy the solution qk(a|s) is with respect to its associated objective Qk(s, a). For example, given a batch of state-action pairs, at the extremes, an αk of zero would give all probability to the state-action pair with the maximum Q-value, while an αk of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for αk, we plug the exponential transformation into (30), which leads to
qk(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Qk(s, a) da ds − αk ∫ μ(s) KL(q(a|s) ‖ πθ′(a|s)) ds.    (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to:
qk(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Qk(s, a) da ds    (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ πθ′(a|s)) ds < εk,
where, as described in the main paper, the parameter εk defines preferences over objectives. If we now assume that qk(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting
q(a|s) ∝ πθ′(a|s) exp( Qk(s, a) / ηk ),    (33)
where ηk is computed by solving the following convex dual function:

ηk = argmin_η [ η εk + η ∫ μ(s) log ∫ πθ′(a|s) exp( Qk(s, a) / η ) da ds ].    (34)

Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
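Given M actions sampled per state from the target policy, the sample-based solution (33) reduces to a softmax over the actions' Q-values with temperature ηk. A minimal sketch (names ours):

```python
import numpy as np

def nonparametric_weights(q_values, eta):
    """Sample-based q_k(a|s) from Eq. (33): weights over M actions a_j ~ pi_theta'(.|s),
    proportional to exp(Q_k(s, a_j) / eta_k). Smaller eta_k -> greedier weights."""
    z = np.asarray(q_values) / eta
    w = np.exp(z - z.max())  # subtract max for numerical stability
    return w / w.sum()
```

Each objective contributes its own weighting of the same sampled actions; these per-objective weights are what the M-step's supervised distillation fits the parametric policy to.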
E.2. M-Step

After obtaining the variational distributions qk(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^N log pθ(Rk = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy πθ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain:
θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ qk(a|s) log πθ(a|s) da ds + log p(θ).

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy πθ(a|s) is close to the target policy πθ′(a|s). We therefore optimize for:
θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ qk(a|s) log πθ(a|s) da ds
s.t. ∫ μ(s) KL(πθ′(a|s) ‖ πθ(a|s)) ds < β.
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
bution space, thus making it invariant to the scale of rewards. In principle, our approach can be combined with any RL algorithm, regardless of whether it is off-policy or on-policy. We combine it with maximum a posteriori policy optimization (MPO) (Abdolmaleki et al., 2018a;b), an off-policy actor-critic RL algorithm, and V-MPO (Song et al., 2020), an on-policy variant of MPO. We call these two algorithms multi-objective MPO (MO-MPO) and multi-objective V-MPO (MO-V-MPO), respectively.
Our main contribution is providing a distributional view on MORL, which enables scale-invariant encoding of preferences. We show that this is a theoretically-grounded approach that arises from taking an RL-as-inference perspective of MORL. Empirically, we analyze the mechanics of MO-MPO and show it finds all Pareto-optimal policies in a popular MORL benchmark task. Finally, we demonstrate that MO-MPO and MO-V-MPO outperform scalarized approaches on multi-objective tasks across several challenging high-dimensional continuous control domains (Fig. 1).
2. Related Work

2.1. Multi-Objective Reinforcement Learning

Multi-objective reinforcement learning (MORL) algorithms are either single-policy or multiple-policy (Vamplew et al., 2011). Single-policy approaches seek to find the optimal policy for a given scalarization of the multi-objective problem. Often this scalarization is linear, but other choices have also been explored (Van Moffaert et al., 2013).
However, the scalarization may be unknown at training time, or it may change over time. Multiple-policy approaches handle this by finding a set of policies that approximates the true Pareto front. Some approaches repeatedly call a single-policy MORL algorithm with strategically-chosen scalarizations (Natarajan & Tadepalli, 2005; Roijers et al., 2014; Mossalam et al., 2016; Zuluaga et al., 2016). Other approaches learn a set of policies simultaneously, by using a multi-objective variant of the Q-learning update rule (Barrett & Narayanan, 2008; Moffaert & Nowé, 2014; Reymond & Nowé, 2019; Yang et al., 2019) or by modifying gradient-based policy search (Parisi et al., 2014; Pirotta et al., 2015).
Most existing approaches for finding the Pareto front are limited to discrete state and action spaces, in which tabular algorithms are sufficient. Although recent work combining MORL with deep RL handles high-dimensional observations, this is in domains with low-dimensional and usually discrete action spaces (Mossalam et al., 2016; van Seijen et al., 2017; Friedman & Fontaine, 2018; Abels et al., 2019; Reymond & Nowé, 2019; Yang et al., 2019; Nottingham et al., 2019). In contrast, we evaluate our approach on continuous control tasks with more than 20 action dimensions.¹
A couple of recent works have applied deep MORL to find the Pareto front in continuous control tasks; these works assume scalarization and rely on additionally learning either a meta-policy (Chen et al., 2019) or inter-objective relationships (Zhan & Cao, 2019). We take an orthogonal approach to existing work: ours encodes preferences via constraints on the influence of each objective on the policy update, instead of via scalarization. MO-MPO can be run multiple times with different constraint settings to find a Pareto front of policies.
2.2. Constrained Reinforcement Learning
An alternate way of setting preferences is to enforce that policies meet certain constraints. For instance, threshold lexicographic ordering approaches optimize a (single) objective while meeting specified threshold values on the other objectives (Gabor et al., 1998), optionally with slack (Wray et al., 2015). Similarly, safe RL is concerned with learning policies that optimize a scalar reward while not violating safety constraints (Achiam et al., 2017; Chow et al., 2018); this has also been studied in the off-policy batch RL setting (Le et al., 2019). Related work minimizes costs while ensuring the policy meets a constraint on the minimum expected return (Bohez et al., 2019), but this requires that the desired or achievable reward is known a priori. In contrast, MO-MPO does not require knowledge of the scale of rewards. In fact, often there is no easy way to specify constraints on objectives; e.g., it is difficult to figure out a priori how much actuator effort a robot will need to use to perform a task.
2.3. Multi-Task Reinforcement Learning
Multi-task reinforcement learning can also be cast as a MORL problem. Generally these algorithms learn a separate policy for each task, with shared learning across tasks (Teh et al., 2017; Riedmiller et al., 2018; Wulfmeier et al., 2019). In particular, Distral (Teh et al., 2017) learns a shared prior that regularizes the per-task policies to be similar to each other, and thus captures essential structure that is shared across tasks. MO-MPO differs in that the goal is to learn a single policy that must trade off across different objectives.
Other multi-task RL algorithms seek to train a single agent to solve different tasks, and thus need to handle different reward scales across tasks. Prior work uses adaptive normalization for the targets in value-based RL, so that the agent cares equally about all tasks (van Hasselt et al., 2016; Hessel et al., 2019). Similarly, prior work in multi-objective optimization has dealt with objectives of different units and/or scales by normalizing objectives to have similar magnitudes (Marler & Arora, 2005; Grodzevich & Romanko, 2006; Kim & de Weck, 2006; Daneshmand et al., 2017; Ishibuchi et al., 2017). MO-MPO can also be seen as doing adaptive normalization, but for any preference over objectives, not just equal preferences.

¹MORL for continuous control tasks is difficult because the policy can no longer output arbitrary action distributions, which limits how well it can compromise between competing objectives. In state-of-the-art RL algorithms for continuous control, policies typically output a single action (e.g., D4PG (Barth-Maron et al., 2018)) or a Gaussian (e.g., PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018b), and SAC (Haarnoja et al., 2018)).
In general, invariance to reparameterization of the function approximator has been investigated in the optimization literature, resulting in, for example, natural gradient methods (Martens, 2014). The common tool here is measuring distances in function space instead of parameter space, using the KL-divergence. Similarly, in this work, to achieve invariance to the scale of objectives, we use KL-divergences over policies to encode preferences.
3. Background and Notation

Multi-Objective Markov Decision Process. In this paper we consider a multi-objective RL problem defined by a multi-objective Markov Decision Process (MO-MDP). The MO-MDP consists of states s ∈ S and actions a ∈ A, an initial state distribution p(s₀), transition probabilities p(s_{t+1}|s_t, a_t), which define the probability of changing from state s_t to s_{t+1} when taking action a_t, reward functions {r_k(s, a)}_{k=1}^N, one per objective k, and a discount factor γ ∈ [0, 1). We define our policy π_θ(a|s) as a state-conditional distribution over actions, parametrized by θ. Together with the transition probabilities, this gives rise to a state visitation distribution μ(s). We also consider per-objective action-value functions. The action-value function for objective k is defined as the expected return (i.e., cumulative discounted reward) from choosing action a in state s for objective k, and then following policy π:

    Q^{\pi}_k(s, a) = \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^t r_k(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a \Big].

We can represent this function using the recursive expression

    Q^{\pi}_k(s_t, a_t) = \mathbb{E}_{p(s_{t+1}|s_t, a_t)}\big[ r_k(s_t, a_t) + \gamma V^{\pi}_k(s_{t+1}) \big],

where V^{\pi}_k(s) = \mathbb{E}_{\pi}[Q^{\pi}_k(s, a)] is the value function of π for objective k.
Problem Statement. For any MO-MDP, there is a set of nondominated policies, i.e., the Pareto front. A policy is nondominated if there is no other policy that improves its expected return for an objective without reducing the expected return of at least one other objective. Given a preference setting, our goal is to find a nondominated policy π_θ that satisfies those preferences. In our approach, a setting of constraints does not directly correspond to a particular scalarization, but we show that by varying these constraint settings, we can indeed trace out a Pareto front of policies.
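As a concrete illustration of nondominance (our own sketch, not part of the paper; policies are summarized here by their vectors of expected returns, one entry per objective):

```python
import numpy as np

def dominates(r_a, r_b):
    """True if return vector r_a Pareto-dominates r_b: at least as good
    on every objective and strictly better on at least one."""
    r_a, r_b = np.asarray(r_a), np.asarray(r_b)
    return bool(np.all(r_a >= r_b) and np.any(r_a > r_b))

def pareto_front(returns):
    """Indices of the nondominated return vectors in a list of policies."""
    returns = [np.asarray(r) for r in returns]
    return [i for i, r in enumerate(returns)
            if not any(dominates(o, r) for j, o in enumerate(returns) if j != i)]
```

For example, `pareto_front([[1, 4], [3, 1], [2, 2], [0, 0]])` keeps the first three policies and discards `[0, 0]`, which is dominated by all of them.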
Algorithm 1 MO-MPO: One policy improvement step

1: given batch size L, number of actions to sample M, N Q-functions {Q^{π_old}_k(s, a)}_{k=1}^N, preferences {ε_k}_{k=1}^N, previous policy π_old, previous temperatures {η_k}_{k=1}^N, replay buffer D, first-order gradient-based optimizer O
2: initialize π_θ from the parameters of π_old
3: repeat
4:   Collect dataset {s_i, a_{ij}, Q_{ijk}}_{i,j,k}^{L,M,N}, where the
5:     M actions a_{ij} ~ π_old(a|s_i) and Q_{ijk} = Q^{π_old}_k(s_i, a_{ij})
6:   // Compute action distribution for each objective
7:   for k = 1, ..., N do
8:     δη_k ← ∇_{η_k} [ η_k ε_k + η_k (1/L) Σ_i log( (1/M) Σ_j exp(Q_{ijk} / η_k) ) ]
9:     update η_k based on δη_k, using optimizer O
10:    q_{ijk} ∝ exp(Q_{ijk} / η_k)
11:  end for
12:  // Update parametric policy
13:  δπ ← −∇_θ Σ_i Σ_j Σ_k q_{ijk} log π_θ(a_{ij}|s_i)
14:    (subject to additional KL regularization; see Sec. 4.2.2)
15:  update π_θ based on δπ, using optimizer O
16: until fixed number of steps
17: return π_old = π_θ
4. Method

We propose a policy iteration algorithm for multi-objective RL. Policy iteration algorithms decompose the RL problem into two sub-problems and iterate until convergence:

1. Policy evaluation: estimate Q-functions given the policy.
2. Policy improvement: update the policy given the Q-functions.
Algorithm 1 summarizes this two-step multi-objective policy improvement procedure. In Appendix E, we explain how this can be derived from the "RL-as-inference" perspective. We describe multi-objective MPO in this section, and explain multi-objective V-MPO in Appendix D. When there is only one objective, MO-(V-)MPO reduces to (V-)MPO.
4.1. Multi-Objective Policy Evaluation
In this step, we learn Q-functions to evaluate the previous policy π_old. We train a separate Q-function per objective, following the Q-decomposition approach (Russell & Zimdars, 2003). In principle any Q-learning algorithm can be used, as long as the target Q-value is computed with respect to π_old.² In this paper we use the Retrace objective (Munos et al., 2016) to learn a Q-function Q^{π_old}_k(s, a; φ_k) for each objective k, parameterized by φ_k, as follows:

    \min_{\phi_1, \ldots, \phi_N} \frac{1}{N} \sum_{k=1}^{N} \mathbb{E}_{(s,a)\sim\mathcal{D}} \Big[ \big( Q^{\text{ret}}_k(s, a) - Q^{\pi_{\text{old}}}_k(s, a; \phi_k) \big)^2 \Big],

where Q^ret_k is the Retrace target for objective k and the previous policy π_old, and D is a replay buffer containing gathered transitions. See Appendix C for details.

²Russell & Zimdars (2003) prove critics suffer from an "illusion of control" if they are trained with conventional Q-learning (Watkins, 1989). In other words, if each critic computes its target Q-value based on its own best action for the next state, then they overestimate Q-values, because in reality the parametric policy π_θ (that considers all critics' opinions) is in charge of choosing actions.
4.2. Multi-Objective Policy Improvement
Given the previous policy π_old(a|s) and associated Q-functions {Q^{π_old}_k(s, a)}_{k=1}^N, our goal is to improve the previous policy for a given visitation distribution μ(s).³ To this end, we learn an action distribution for each Q-function, and combine these to obtain the next policy π_new(a|s). This is a multi-objective variant of the two-step policy improvement procedure employed by MPO (Abdolmaleki et al., 2018b).

In the first step, for each objective k, we learn an improved action distribution q_k(a|s) such that

    \mathbb{E}_{q_k(a|s)}\big[ Q^{\pi_{\text{old}}}_k(s, a) \big] \geq \mathbb{E}_{\pi_{\text{old}}(a|s)}\big[ Q^{\pi_{\text{old}}}_k(s, a) \big],

where states s are drawn from a visitation distribution μ(s).

In the second step, we combine and distill the improved distributions q_k into a new parametric policy π_new (with parameters θ_new), by minimizing the KL-divergence between the distributions and the new parametric policy, i.e.,

    \theta_{\text{new}} = \arg\min_{\theta} \sum_{k=1}^{N} \mathbb{E}_{\mu(s)}\Big[ \mathrm{KL}\big( q_k(a|s) \,\|\, \pi_{\theta}(a|s) \big) \Big].   (1)
This is a supervised learning loss that performs maximum-likelihood estimation of the distributions q_k. Next, we explain these two steps in more detail.
4.2.1. Obtaining Action Distributions per Objective (Step 1)
To obtain the per-objective improved action distributions q_k(a|s), we optimize the standard RL objective for each objective's Q-function Q_k:

    \max_{q_k} \int_s \mu(s) \int_a q_k(a|s)\, Q_k(s, a)\, da\, ds   (2)

    \text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big( q_k(a|s) \,\|\, \pi_{\text{old}}(a|s) \big)\, ds < \epsilon_k,

where ε_k denotes the allowed expected KL-divergence for objective k. We use these ε_k to encode preferences over objectives. More concretely, ε_k defines the allowed influence of objective k on the change of the policy.
³In practice, we use draws from the replay buffer to estimate expectations over the visitation distribution μ(s).
For nonparametric action distributions q_k(a|s), we can solve this constrained optimization problem in closed form for each state s sampled from μ(s) (Abdolmaleki et al., 2018b):

    q_k(a|s) \propto \pi_{\text{old}}(a|s)\, \exp\!\Big( \frac{Q_k(s, a)}{\eta_k} \Big),   (3)
where the temperature η_k is computed based on the corresponding ε_k, by solving the following convex dual problem:

    \eta_k = \arg\min_{\eta}\; \eta\, \epsilon_k + \eta \int_s \mu(s) \log \int_a \pi_{\text{old}}(a|s) \exp\!\Big( \frac{Q_k(s, a)}{\eta} \Big)\, da\, ds.   (4)
In order to evaluate q_k(a|s) and the integrals in (4), we draw L states from the replay buffer, and for each state sample M actions from the current policy π_old. In practice we maintain one temperature parameter η_k per objective. We found that optimizing the dual function by performing a few steps of gradient descent on η_k is effective, and we initialize it with the solution found in the previous policy iteration step. Since η_k should be positive, we use a projection operator after each gradient step to maintain η_k > 0. Please refer to Appendix C for derivation details.
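A minimal numpy sketch of this step (ours; the shapes, learning rate, and the finite-difference gradient are illustrative, and the paper uses a first-order optimizer on the analytic gradient): because the M actions are drawn from π_old, the π_old factor in Eq. (3) is absorbed into the sampling, so the sample-based weights reduce to a softmax over Q/η, and η_k is found by descending the sample-based dual of Eq. (4):

```python
import numpy as np

def improved_weights(q_values, eta):
    """Sample-based improved distribution q_k over M actions per state (Eq. 3).
    q_values: (L, M) array of Q_k(s_i, a_ij) for actions drawn from pi_old.
    eta: positive temperature for objective k."""
    z = q_values / eta
    z = z - z.max(axis=1, keepdims=True)         # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)      # each row sums to 1

def dual_loss(eta, q_values, epsilon):
    """Sample-based dual of Eq. (4), estimated from L states and M actions."""
    lse = np.log(np.mean(np.exp(q_values / eta), axis=1))   # per-state log-mean-exp
    return eta * epsilon + eta * np.mean(lse)

def solve_eta(q_values, epsilon, eta0=1.0, lr=0.01, steps=500):
    """Gradient descent on the convex dual, projecting to keep eta positive."""
    eta, h = eta0, 1e-4
    for _ in range(steps):
        grad = (dual_loss(eta + h, q_values, epsilon)
                - dual_loss(eta - h, q_values, epsilon)) / (2 * h)
        eta = max(eta - lr * grad, 1e-6)         # projection operator: eta > 0
    return eta
```

Note that rescaling all Q-values by a factor c rescales the dual's minimizer η_k by the same factor, leaving the softmax weights unchanged; this is the source of MO-MPO's scale invariance.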
Application to Other Deep RL Algorithms. Since the constraints ε_k in (2) encode the preferences over objectives, solving this optimization problem with good satisfaction of the constraints is key for learning a policy that satisfies the desired preferences. For nonparametric action distributions q_k(a|s), we can satisfy these constraints exactly. One could instead use any policy gradient method (e.g., Schulman et al., 2015; 2017; Heess et al., 2015; Haarnoja et al., 2018) to obtain q_k(a|s) in parametric form. However, solving the constrained optimization for parametric q_k(a|s) is not exact, and the constraints may not be well satisfied, which impedes the use of ε_k to encode preferences. Moreover, assuming a parametric q_k(a|s) requires maintaining a function approximator (e.g., a neural network) per objective, which can significantly increase the complexity of the algorithm and limits scalability.
Choosing ε_k. It is more intuitive to encode preferences via ε_k rather than via scalarization weights, because the former is invariant to the scale of rewards. In other words, having a desired preference across objectives narrows down the range of reasonable choices for ε_k, but does not narrow down the range of reasonable choices for scalarization weights. In order to identify reasonable scalarization weights, an RL practitioner needs to additionally be familiar with the scale of rewards for each objective. In practice, we have found that learning performance is robust to a wide range of scales for ε_k. It is the relative scales of the ε_k that matter for encoding preferences over objectives: the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no influence and will effectively be ignored. In Appendix A.1 we provide suggestions for setting ε_k, given a desired preference over objectives.
4.2.2. Fitting a New Parametric Policy (Step 2)
In the previous section, for each objective k, we obtained an improved action distribution q_k(a|s). Next, we want to combine these distributions to obtain a single parametric policy that trades off the objectives according to the constraints ε_k that we set. For this, we solve a supervised learning problem that fits a parametric policy to the per-objective action distributions from step 1:

    \theta_{\text{new}} = \arg\max_{\theta} \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_{\theta}(a|s)\, da\, ds   (5)

    \text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big( \pi_{\text{old}}(a|s) \,\|\, \pi_{\theta}(a|s) \big)\, ds < \beta,
where θ are the parameters of our policy (a neural network), and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy. The KL constraint in this step has a regularization effect that prevents the policy from overfitting to the sample-based action distributions, and therefore avoids premature convergence and improves the stability of learning (Schulman et al., 2015; Abdolmaleki et al., 2018a;b).
Similar to the first policy improvement step, we evaluate the integrals by using the L states sampled from the replay buffer and the M actions per state sampled from the old policy. In order to optimize (5) using gradient descent, we employ Lagrangian relaxation, similar to in MPO (Abdolmaleki et al., 2018a) (see Appendix C for more detail).
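For intuition, the sample-based objective in Eq. (5) is a weighted maximum-likelihood (cross-entropy) loss over the sampled actions. A sketch of ours for a categorical policy over discrete actions, with the trust-region term omitted (the paper's policies are Gaussian neural networks, and the shapes here are illustrative):

```python
import numpy as np

def distill_loss(logits, weights):
    """Sample-based distillation objective of Eq. (5), KL penalty omitted.
    logits:  (L, A) categorical policy logits, one row per sampled state.
    weights: (N, L, A) improved per-objective weights q_k (rows sum to 1)."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.sum(weights * logp[None]) / weights.shape[1]   # mean over L states
```

Minimizing this loss moves π_θ toward the average of the per-objective distributions q_k, which is how the objectives are combined into a single policy.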
5. Experiments: Toy Domains

In the empirical evaluation that follows, we will first demonstrate the mechanics and scale-invariance of MO-MPO in a single-state environment (Sec. 5.1), and then show that MO-MPO can find all Pareto-dominant policies in a popular MORL benchmark (Sec. 5.2). Finally, we show the benefit of using MO-MPO in high-dimensional continuous control domains, including on a real robot (Sec. 6). Appendices A and B contain a detailed description of all domains and tasks, experimental setups, and implementation details.
Baselines. The goal of our empirical evaluation is to analyze the benefit of using our proposed multi-objective policy improvement step (Sec. 4.2.2), which encodes preferences over objectives via constraints ε_k on expected KL-divergences, rather than via weights w_k. Thus, we primarily compare MO-MPO against scalarized MPO, which relies on linear scalarization weights w_k to encode preferences. The only difference between MO-MPO and scalarized MPO is the policy improvement step: for scalarized MPO, a single improved action distribution q(a|s) is computed based on Σ_k w_k Q_k(s, a) and a single KL constraint ε.

Figure 2. A simple environment in which the agent starts at state S0 and chooses to navigate to one of three terminal states. There are two objectives. Taking the left action, for instance, leads to a reward of 1 for the first objective and 4 for the second.
State-of-the-art approaches that combine MORL with deep RL assume linear scalarization as well, either learning a separate policy for each setting of weights (Mossalam et al., 2016) or learning a single policy conditioned on scalarization weights (Friedman & Fontaine, 2018; Abels et al., 2019). Scalarized MPO addresses the former problem, which is easier. The policy evaluation step in scalarized MPO is analogous to the scalarized Q-learning proposed by Mossalam et al. (2016). As we show later in Sec. 6, even learning an optimal policy for a single scalarization is difficult in high-dimensional continuous control domains.
5.1. Simple World
First, we will examine the behavior of MO-MPO in a simple multi-armed bandit with three actions (up, right, and left) (Fig. 2), inspired by Russell & Zimdars (2003). We train policies with scalarized MPO and with MO-MPO. The policy evaluation step is exact, because the Q-value function for each objective is known: it is equal to the reward received for that objective after taking each action, as labeled in Fig. 2.⁴
We consider three possible desired preferences: equal preference for the two objectives, preferring the first, and preferring the second. Encoding preferences in scalarized MPO amounts to choosing appropriate linear scalarization weights, and in MO-MPO amounts to choosing appropriate ε's. We use the following weights and ε's:
• equal preference: weights [0.5, 0.5], or ε's [0.01, 0.01]
• prefer first: weights [0.9, 0.1], or ε's [0.01, 0.002]
• prefer second: weights [0.1, 0.9], or ε's [0.002, 0.01]
We set ε = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.

⁴The policy improvement step can also be computed exactly, because solving for the optimal temperature η (or η₁ and η₂ in the MO-MPO case) is a convex optimization problem, and the KL-constrained policy update is also a convex optimization problem when there is only one possible state. We use CVXOPT (Andersen et al., 2020) as our convex optimization solver.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), we see that the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right) in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step, MO-MPO optimizes for a separate temperature η_k that scales each objective's Q-value function. This η_k is computed based on the corresponding allowed KL-divergence ε_k, so when the rewards for any objective k are multiplied by a factor but ε_k remains the same, the computed η_k ends up being scaled by that factor as well, neutralizing the effect of the scaling of rewards (see Eq. (4)).
Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of temperatures η_k per objective becomes more essential (Sec. 6).
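The neutralizing effect can be checked directly in the softmax form of Eq. (3): if the Q-values are scaled by a factor c and the temperature recovered by the dual scales by the same c, the improved distribution is unchanged. A small numeric sketch of ours (the values are illustrative, not the task's):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 4.0])     # Q-values of two actions for one objective
eta = 0.5                    # temperature found for some preference epsilon_k
w = softmax(q / eta)

scale = 20.0                 # rewards (hence Q-values) rescaled, as in Sec. 5.1
w_scaled = softmax((scale * q) / (scale * eta))   # eta scales with the rewards
assert np.allclose(w, w_scaled)   # improved distribution is unchanged
```

With fixed scalarization weights there is no such compensating factor, so the rescaled objective dominates the update.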
The scale of ε_k controls the amount that objective k can influence the policy's update. If we set ε₁ = 0.01 and sweep over the range from 0 to 0.01 for ε₂, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies quickly converging to placing all probability on a single action (Fig. 4, left). We hypothesize this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
5.2. Deep Sea Treasure
An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s₀ = (0, 0)) and has a choice of four actions (moving one square up, right, down, or left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
We ran scalarized MPO with weightings [w, 1−w] and w ∈ {0, 0.01, 0.02, ..., 1}. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵

Figure 3. When two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.

Figure 4. A visualization of policies during learning; each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1 − x − y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
We ran MO-MPO on this task as well, for a range of ε: ε_time ∈ {0.01, 0.02, 0.05} and ε_treasure = c · ε_time, where c ∈ {0.5, 0.51, 0.52, ..., 1.5}. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than their exact settings: across all settings of ε_time, similar ratios of ε_treasure to ε_time result in similar policies, and as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).

⁵Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for ε_k discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of ε_k settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of ε_k.
6. Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's, rather than via weights, is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid. We use the humanoid run task defined in the DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand. We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain". A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap. We consider a large-scale humanoid motion capture tracking task, similar to in Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total, we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole. We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole, while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole, and increases for insertion; the penalty is the ℓ1-norm of the Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action, for three timesteps.
6.1. Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings ε_k and scalarization weights w_k, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
⁶This task was developed concurrently by Anonymous (2020) and does not constitute a contribution of this paper.
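For two objectives, the hypervolume reduces to an area and can be computed with a simple sweep. This sketch is ours (the paper uses DEAP's implementation) and assumes every point dominates the reference point:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by a set of 2-objective return vectors, relative to a
    reference point ref that is dominated by all of them."""
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(-pts[:, 0])]        # sweep by first objective, descending
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                       # nondominated point: extends the front
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area
```

For example, `hypervolume_2d([[3, 1], [1, 4]], (0, 0))` is 6.0: the union of the two dominated rectangles has area 3 + 4 − 1. Dominated points contribute nothing to the sum.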
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2. Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO, with respect to the hypervolume metric (Table 1).⁷ In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sitesgooglecom/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly-scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation. We also ran vanilla MPO on humanoid run, with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function, which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than a magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).

⁷We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.

Table 1. Hypervolume measurements across tasks and approaches.

Task                          scalarized MPO    MO-MPO
Humanoid run, mid-training    1.1×10⁶           3.3×10⁶
Humanoid run                  6.4×10⁶           7.1×10⁶
Humanoid run, 1× penalty      5.0×10⁶           5.9×10⁶
Humanoid run, 10× penalty     2.6×10⁷           6.9×10⁷
Shadow touch, mid-training    14.3              15.6
Shadow touch                  16.2              16.4
Shadow turn                   14.4              15.4
Shadow orient                 2.8×10⁴           2.8×10⁴
Humanoid Mocap                3.86×10⁻⁶         3.41×10⁻⁶
6.3. Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.⁸ None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for the joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering, compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4. Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
8 For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7. Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed that these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvári, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References

Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41–47, 2008.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gábor, Z., Kalmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields–MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing – solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations (ICLR), 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform Aii ← log(1 + exp(Aii)), to ensure positive definiteness of the diagonal covariance matrix.
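The covariance parametrization above can be sketched in a few lines of numpy. This is our own minimal illustration (the helper names `softplus` and `build_covariance` are assumed, not from the paper's code); it shows how the softplus transform guarantees positive diagonal Cholesky factors, and hence a positive-definite Σ = AAᵀ:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), the transform applied to the raw diagonal outputs
    return np.log1p(np.exp(x))

def build_covariance(raw_diag):
    """Build a positive-definite diagonal covariance from unconstrained outputs.

    `raw_diag` stands in for the network's unconstrained per-dimension
    outputs; softplus makes every diagonal entry of A strictly positive.
    """
    A = np.diag(softplus(raw_diag))  # diagonal Cholesky factor A(s)
    sigma = A @ A.T                  # Σ = A Aᵀ (here: diag of squared entries)
    return A, sigma
```

Even for very negative raw outputs, softplus keeps the variance strictly positive (it only approaches zero asymptotically), which is why the table below can specify a tiny minimum variance such as 10⁻¹².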
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice, we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                                Default

policy network
  layer sizes                                 (300, 200)
  take tanh of action mean                    yes
  minimum variance                            10⁻¹²
  maximum variance                            unbounded

critic network(s)
  layer sizes                                 (400, 400, 300)
  take tanh of action                         yes
  Retrace sequence size                       8
  discount factor γ                           0.99

both policy and critic networks
  layer norm on first layer                   yes
  tanh on output of layer norm                yes
  activation (after each hidden layer)        ELU

MPO
  actions sampled per state                   20
  default ε                                   0.1
  KL-constraint on policy mean, βμ            10⁻³
  KL-constraint on policy covariance, βΣ      10⁻⁵
  initial temperature η                       1

training
  batch size                                  512
  replay buffer size                          10⁶
  Adam learning rate                          3×10⁻⁴
  Adam ε                                      10⁻³
  target network update period                200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if εk is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                               (400, 400, 300)
    take tanh of action mean                  no
    minimum variance                          10⁻⁸
  critic network(s)
    layer sizes                               (500, 500, 400)
    take tanh of action                       no
  MPO
    KL-constraint on policy mean, βμ          5×10⁻³
    KL-constraint on policy covariance, βΣ    10⁻⁶
  training
    Adam learning rate                        2×10⁻⁴
    Adam ε                                    10⁻⁸

Shadow Hand
  MPO
    actions sampled per state                 30
    default ε                                 0.01

Sawyer
  MPO
    actions sampled per state                 15
  training
    target network update period              100

Table 3. Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually, the learning will converge to more or less the same policy, though, as long as εk is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of εk is what matters. The larger the scale of εk is relative to εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, that objective will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal-preference
Condition and settings

Humanoid run (one seed per setting)
  scalarized MPO:  wtask = 1 − wpenalty,
                   wpenalty ∈ linspace(0, 0.15, 100)
  MO-MPO:          εtask = 0.1,
                   εpenalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO:  wtask = 1 − wpenalty,
                   wpenalty ∈ {0.01, 0.05, 0.1}
  MO-MPO:          εtask = 0.1,
                   εpenalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO:  wtask = 1 − wpenalty,
                   wpenalty ∈ linspace(0, 0.9, 10)
  MO-MPO:          εtask = 0.01,
                   εpenalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO:  wtouch = wheight = worientation = 1/3
  MO-MPO:          εtouch = εheight = εorientation = 0.01

Table 4. Settings for εk and weights; linspace(x, y, z) denotes a set of z evenly-spaced values between x and y.
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right): in the Deep Sea Treasure domain, regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
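The critic input described above can be illustrated with a minimal sketch (the helper names `one_hot` and `critic_input` are ours, not from the released implementation):

```python
import numpy as np

NUM_ACTIONS = 4  # up, right, down, left

def one_hot(action_index):
    """Encode a discrete action index as a one-hot vector of size |A|."""
    v = np.zeros(NUM_ACTIONS)
    v[action_index] = 1.0
    return v

def critic_input(state_xy, action_index):
    """Concatenate the (x, y) grid state with the one-hot action,
    matching the critic input format described above."""
    return np.concatenate([np.asarray(state_xy, dtype=float),
                           one_hot(action_index)])
```

For example, the up action (index 0) taken at grid cell (3, 5) produces the six-dimensional critic input [3, 5, 1, 0, 0, 0].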
Humanoid Mocap

Scalarized V-MPO
  KL-constraint on policy mean, βμ            0.1
  KL-constraint on policy covariance, βΣ      10⁻⁵
  default ε                                   0.1
  initial temperature η                       1

Training
  Adam learning rate                          10⁻⁴
  batch size                                  128
  unroll length (for n-step return, Sec. D.1) 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations with a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, and 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
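The two-objective reward vector above can be sketched directly (a hypothetical helper of our own naming; the treasure value is passed only on the step it is collected):

```python
def dst_reward(treasure_value=None):
    """Two-objective reward vector [time_penalty, treasure_reward] for DST.

    The time-penalty objective pays -1 every step; the treasure objective
    pays the treasure's value only on the collecting step, else 0.
    """
    time_penalty = -1.0
    treasure = float(treasure_value) if treasure_value is not None else 0.0
    return [time_penalty, treasure]
```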
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
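The sensor simplification is a plain spatial sum; a sketch with an assumed helper name:

```python
import numpy as np

def pool_touch_sensor(raw):
    """Sum a per-fingertip 4x4x3 touch reading over its spatial dimensions.

    `raw` has shape (4, 4, 3): a 4x4 spatial grid with 3 force channels
    (normal, x-tangential, y-tangential). The result has shape (3,),
    i.e. the simplified 1x1x3 observation per fingertip.
    """
    raw = np.asarray(raw)
    assert raw.shape == (4, 4, 3)
    return raw.sum(axis=(0, 1))
```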
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing the block with greater than 5 N of force, or turning the dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:

rpain(s, a, s′) = −[1 − aλ(m(s, s′))] · m(s, s′),   (6)

where aλ(·) ∈ [0, 1] computes the acceptability of an impact force. aλ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use

aλ(m) = sigmoid(λ1(−m + λ2)),   (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
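Eqs. (6) and (7) can be evaluated directly. A sketch with λ = [2, 2]ᵀ as the default (the function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pain_penalty(impact_force, lam=(2.0, 2.0)):
    """Pain penalty of Eqs. (6)-(7): r_pain = -(1 - a_lam(m)) * m,
    with acceptability a_lam(m) = sigmoid(lam1 * (-m + lam2))."""
    m = impact_force
    acceptability = sigmoid(lam[0] * (-m + lam[1]))
    return -(1.0 - acceptability) * m
```

With this choice of λ, small impacts are mostly "acceptable" and incur a near-zero penalty, while large impacts are penalized by roughly their full magnitude, matching the shape plotted in Fig. 10.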
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 · |ztarget − zpeg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Qtarget and the peg's orientation qpeg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

min over q ∈ Qtarget of 1 − tanh(2 · d(q, qpeg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
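The shaped rewards above can be sketched as follows. The quaternion distance here uses the standard relative-rotation angle, which is one common way to realize d(·, ·); the function names are ours, and the orientation reward is evaluated with respect to the closest valid target, per the stated intent:

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 * |z_target - z_peg|), in meters."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def quat_angle(q1, q2):
    """Rotation angle (radians) between two unit quaternions (w, x, y, z);
    abs() folds the q/-q double cover onto the shorter rotation."""
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def orientation_reward(q_peg, q_targets):
    """Shaped orientation reward: 1 - tanh(2 * d) for the closest of the
    equally-valid target orientations."""
    d = min(quat_angle(q, q_peg) for q in q_targets)
    return 1.0 - np.tanh(2.0 * d)
```

Both shapings saturate quickly: the reward is near 1 only in a narrow band around the target, which is visible in the curves of Fig. 11.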
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).9 The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is in meters per second. For this objective, we learn a Q-function.
9 This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the negative squared ℓ2-norm of the action vector, i.e., rpenalty(a) = −‖a‖². For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
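Both objectives are cheap to evaluate; a sketch with helper names of our own:

```python
import numpy as np

def run_task_reward(horizontal_speed):
    """Shaped run reward: min(h / 10, 1) for horizontal speed h in m/s."""
    return min(horizontal_speed / 10.0, 1.0)

def action_norm_penalty(action):
    """Energy-usage penalty: negative squared l2-norm of the action vector,
    computable directly from the action (no Q-function needed)."""
    a = np.asarray(action, dtype=float)
    return -float(np.dot(a, a))
```

Because the penalty depends only on the action, it can be plugged straight into the policy improvement step, which is why no critic is trained for it.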
B4 Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips; for this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.

The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:
r_com = exp( −10 ‖p_com − p^ref_com‖² ),

where p_com and p^ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

r_vel = exp( −0.1 ‖q_vel − q^ref_vel‖² ),

where q_vel and q^ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The
¹⁰This database is available at mocap.cs.cmu.edu.
third reward component is based on the difference in end-effector positions:

r_app = exp( −40 ‖p_app − p^ref_app‖² ),

where p_app and p^ref_app are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp( −2 ‖q_quat ⊖ q^ref_quat‖² ),

where ⊖ denotes quaternion difference, and q_quat and q^ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) ( ‖b_pos − b^ref_pos‖₁ + ‖q_pos − q^ref_pos‖₁ ),

where the term in parentheses is denoted ε, b_pos and b^ref_pos correspond to the body positions of the simulated character and the mocap reference, and q_pos and q^ref_pos correspond to the joint angles.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].

In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form
r = (1/2) r_trunc + (1/2) ( 0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat ),   (8)

varying λ as described in the main paper (Sec. 6.3).
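The five components and the scalarized reward of Eq. (8) can be sketched as follows. The coefficients are those stated above; the function signatures are illustrative stand-ins, not the paper's implementation, and the quaternion and L1 error terms are assumed to be precomputed.

```python
import numpy as np

def sq(x):
    # Squared L2 norm of an error vector.
    return float(np.sum(np.square(x)))

def mocap_rewards(p_com, p_com_ref, q_vel, q_vel_ref,
                  p_app, p_app_ref, quat_err, b_err_l1, q_err_l1):
    # quat_err: stacked quaternion differences; b_err_l1 / q_err_l1: L1
    # errors in body positions and joint angles (their sum is eps).
    r_com = np.exp(-10.0 * sq(p_com - p_com_ref))
    r_vel = np.exp(-0.1 * sq(q_vel - q_vel_ref))
    r_app = np.exp(-40.0 * sq(p_app - p_app_ref))
    r_quat = np.exp(-2.0 * sq(quat_err))
    r_trunc = 1.0 - (b_err_l1 + q_err_l1) / 0.3
    return r_com, r_vel, r_app, r_quat, r_trunc

def vmpo_reward(r_com, r_vel, r_app, r_quat, r_trunc, lam):
    # Eq. (8): half truncation reward, half weighted sum of the rest.
    return 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel
                                  + 0.15 * r_app + 0.65 * r_quat)
```

With λ = 0.1 the weights inside the parentheses sum to 1, so perfect tracking yields a total reward of exactly 1.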
B5 Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x and y) or 5 N on the vertical axis (z), the episode is terminated early.

There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.

The reward for the task objective is defined as
r_insertion = max( 0.2 r_approach, r_inserted ),   (9)
r_approach = s(p_peg, p_opening),   (10)
r_inserted = r_aligned · s(p^peg_z, p^bottom_z),   (11)
r_aligned = ‖p^peg_xy − p^opening_xy‖ < d/2,   (12)
where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.

s(p_1, p_2) is a shaping operator:

s(p_1, p_2) = 1 − tanh²( (atanh(√l) / ε) ‖p_1 − p_2‖₂ ),   (13)

which gives a reward of 1 − l if p_1 and p_2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted dominates r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped-reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to insert the peg from the side.
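A minimal sketch of the shaping operator in Eq. (13), with `l` and `eps` corresponding to the per-term constants above:

```python
import numpy as np

def shape(p1, p2, l, eps):
    # Eq. (13): equals 1 at zero distance and exactly 1 - l when p1 and p2
    # are a Euclidean distance eps apart.
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2
```

For example, with l_approach = 0.95 and ε_approach = 0.01, the approach reward has decayed to 0.05 at a distance of 1 cm.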
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,   (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C Algorithmic Details

C1 General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor
1: given N reward functions {r_k(s, a)}^N_{k=1}, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, . . . , T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), . . . , r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C2 Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_φk(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), . . . , (s_T, a_T, r_T, s_T+1)}, where r_t denotes a reward vector {r_k(s_t, a_t)}^N_{k=1} that consists of a scalar reward for each of the N objectives, the Retrace objective is as follows:
min_{φ_k} E_{τ∼D} [ ( Q^ret_k(s_t, a_t) − Q_φk(s_t, a_t) )² ],   (15)

with

Q^ret_k(s_t, a_t) = Q_φ′k(s_t, a_t) + Σ^T_{j=t} γ^{j−t} ( Π^j_{z=t+1} c_z ) δ_j,
δ_j = r_k(s_j, a_j) + γ V(s_j+1) − Q_φ′k(s_j, a_j),
V(s_j+1) = E_{π_old(a|s_j+1)} [ Q_φ′k(s_j+1, a) ].
Algorithm 3 MO-MPO: Asynchronous Learner
1: given batch size (L), number of actions (M), (N) Q-functions, (N) preferences {ε_k}^N_{k=1}, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}^N_{k=1} and ν, target networks, and online networks such that π_θ′ = π_θ and Q_φ′k = Q_φk
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}^{L,M,N}_{i,j,k}, where
6:     s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_φ′k(s_i, a_ij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, . . . , N do
10:      δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ^L_i (1/L) log ( Σ^M_j (1/M) exp(Q_ijk / η_k) ) ]
11:      Update η_k based on δ_ηk
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ Σ^L_i Σ^M_j Σ^N_k q_ijk log π_θ(a_ij | s_i)
18:         + sg(ν) ( β − Σ^L_i KL( π_θ′(a|s_i) ‖ π_θ(a|s_i) ) )
19:    δ_ν ← ∇_ν ν ( β − Σ^L_i KL( π_θ′(a|s_i) ‖ π_sg(θ)(a|s_i) ) )
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, . . . , N do
25:      δ_φk ← ∇_φk Σ_{(s_t,a_t)∈τ∼D} ( Q_φk(s_t, a_t) − Q^ret_k )²,
26:      with Q^ret_k as in Eq. 15
27:      Update φ_k based on δ_φk
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:  π_θ′ = π_θ, Q_φ′k = Q_φk
33: until convergence
The importance weights c_z are defined as

c_z = min( 1, π_old(a_z|s_z) / b(a_z|s_z) ),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set ( Π^j_{z=t+1} c_z ) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
C3 Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

π_new = argmax_θ Σ^N_{k=1} ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
s.t. ∫_s μ(s) KL( π_old(a|s) ‖ π_θ(a|s) ) ds < β.   (16)
We first write the generalized Lagrangian, i.e.,

L(θ, ν) = Σ^N_{k=1} ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds   (17)
        + ν ( β − ∫_s μ(s) KL( π_old(a|s) ‖ π_θ(a|s) ) ds ),
where ν is the Lagrange multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This procedure can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies, one for the mean and one for the covariance:

π^μ_θ(a|s) = N( a; μ_θ(s), Σ_θold(s) ),   (18)
π^Σ_θ(a|s) = N( a; μ_θold(s), Σ_θ(s) ).   (19)
Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

max_θ Σ^N_{k=1} ∫_s μ(s) ∫_a q_k(a|s) log( π^μ_θ(a|s) π^Σ_θ(a|s) ) da ds
s.t. ∫_s μ(s) KL( π_old(a|s) ‖ π^μ_θ(a|s) ) ds < β_μ,
     ∫_s μ(s) KL( π_old(a|s) ‖ π^Σ_θ(a|s) ) ds < β_Σ.   (20)
As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, in order to preserve exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
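The decoupled update only requires evaluating the KL terms of the two auxiliary Gaussians in Eqs. (18)-(19). A minimal sketch, assuming full-covariance Gaussians (illustrative, not the paper's code):

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    # KL( N(mu0, cov0) || N(mu1, cov1) ) for full-covariance Gaussians.
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def decoupled_kls(mu_old, cov_old, mu_new, cov_new):
    # KL to pi^mu (online mean, target covariance) isolates the mean
    # update; KL to pi^Sigma (target mean, online covariance) isolates the
    # covariance update, so each can get its own bound (beta_mu, beta_Sigma).
    kl_mean = gauss_kl(mu_old, cov_old, mu_new, cov_old)
    kl_cov = gauss_kl(mu_old, cov_old, mu_old, cov_new)
    return kl_mean, kl_cov
```

Changing only the mean leaves the covariance KL at zero, and vice versa, which is exactly what allows the two constraints in Eq. (20) to be tightened independently.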
C4 Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).

D MO-V-MPO: Multi-Objective On-Policy MPO

In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D1 Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), . . . , (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}^N_{k=1} that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
min_{φ_1:N} Σ^N_{k=1} E_τ [ ( G^(T)_k(s_t, a_t) − V^πold_φk(s_t) )² ].   (21)

Here G^(T)_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: G^(T)_k(s_t, a_t) = Σ^{T−1}_{ℓ=t} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V^πold_φk(s_T). The advantages are then estimated as A^πold_k(s_t, a_t) = G^(T)_k(s_t, a_t) − V^πold_φk(s_t).
D2 Multi-objective policy improvement in the on-policy setting

Given the previous policy π_old(a|s) and estimated advantages {A^πold_k(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s) because, without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective's advantages A_k:
max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds   (22)
s.t. KL( q_k(s, a) ‖ p_old(s, a) ) < ε_k,

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.
As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form
q_k(s, a) ∝ p_old(s, a) exp( A_k(s, a) / η_k ),   (23)
where the temperature η_k is computed based on the constraint ε_k, by solving the following convex dual problem:

η_k = argmin_η η ε_k + η log ∫_{s,a} p_old(s, a) exp( A_k(s, a) / η ) da ds.   (24)
We can perform this optimization alongside the main loss, by taking a gradient descent step on η_k, and we initialize η_k with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
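The temperature update can be sketched as projected gradient descent on a sample-based estimate of the dual in Eq. (24). This is an illustrative sketch (a numerical gradient is used purely for brevity; in practice the analytic gradient and an adaptive optimizer would be used):

```python
import numpy as np

def dual(eta, advantages, eps):
    # Sample-based dual: eta * eps + eta * log mean exp(A / eta),
    # where `advantages` are A_k(s, a) over sampled state-action pairs.
    return eta * eps + eta * np.log(np.mean(np.exp(advantages / eta)))

def fit_temperature(advantages, eps, eta=1.0, lr=0.01, steps=500):
    for _ in range(steps):
        h = 1e-5  # central-difference estimate of d(dual)/d(eta)
        grad = (dual(eta + h, advantages, eps)
                - dual(eta - h, advantages, eps)) / (2 * h)
        eta = max(eta - lr * grad, 1e-6)  # projection keeps eta > 0
    return eta
```

Because the dual is convex in η, small gradient steps monotonically decrease it, and the projection enforces the positivity constraint.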
Top-k advantages: As in Song et al. (2020), in practice we use the samples corresponding to the top 50% of advantages in each batch of data.
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives, according to the preferences specified by ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:
π_new = argmax_θ Σ^N_{k=1} ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
s.t. ∫_s KL( π_old(a|s) ‖ π_θ(a|s) ) ds < β,   (25)

where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.

In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation of the policy improvement algorithm in (single-objective) MPO (Abdolmaleki et al., 2018a, appendix) to the multi-objective case.

We assume there are observable binary improvement events {R_k}^N_{k=1}, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {R_k = 1}^N_{k=1}, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}^N_{k=1}:
max_θ p_θ(R_1 = 1, R_2 = 1, . . . , R_N = 1) p(θ),   (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k and using log probabilities leads to the following objective:

max_θ Σ^N_{k=1} log p_θ(R_k = 1) + log p(θ).   (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E2).

We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ^N_{k=1} log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ^N_{k=1} log p_θ(R_k = 1) as follows:
Σ^N_{k=1} log p_θ(R_k = 1)   (28)
= Σ^N_{k=1} KL( q_k(s, a) ‖ p_θ(s, a | R_k = 1) ) + Σ^N_{k=1} E_{q_k(s,a)} [ log ( p_θ(R_k = 1, s, a) / q_k(s, a) ) ].

The second term of this decomposition can be expanded as

Σ^N_{k=1} E_{q_k(s,a)} [ log ( p_θ(R_k = 1, s, a) / q_k(s, a) ) ]   (29)
= Σ^N_{k=1} E_{q_k(s,a)} [ log ( p(R_k = 1 | s, a) π_θ(a|s) μ(s) / q_k(s, a) ) ],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.

The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ^N_{k=1} log p_θ(R_k = 1). π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k, by minimizing the first term given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}^N_{k=1} such that the lower bound on Σ^N_{k=1} log p_θ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize each variational distribution q_k independently:
q_k(a|s) = argmin_q E_{μ(s)} [ KL( q(a|s) ‖ p_θ′(s, a | R_k = 1) ) ]
         = argmin_q E_{μ(s)} [ KL( q(a|s) ‖ π_θ′(a|s) ) − E_{q(a|s)} [ log p(R_k = 1 | s, a) ] ].   (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_θ′(a|s) p(R_k = 1 | s, a) / ∫ π_θ′(a|s) p(R_k = 1 | s, a) da.
This solution weights the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp( Q_k(s, a) / α_k ),

where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to optimize for α_k automatically, we plug this exponential transformation into (30), which leads to
q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL( q(a|s) ‖ π_θ′(a|s) ) ds.   (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds   (32)
s.t. ∫ μ(s) KL( q(a|s) ‖ π_θ′(a|s) ) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in Abdolmaleki et al. (2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_θ′(a|s) exp( Q_k(s, a) / η_k ),   (33)
where η_k is computed by solving the following convex dual function:

η_k = argmin_η η ε_k + η ∫ μ(s) log ∫ π_θ′(a|s) exp( Q_k(s, a) / η ) da ds.   (34)
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in Abdolmaleki et al. (2018b) for details on the derivation of the dual function.
E2 M-Step
After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ^N_{k=1} log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ^N_{k=1} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).

For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) stays close to the target policy π_θ′(a|s). We therefore optimize

θ* = argmax_θ Σ^N_{k=1} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds
s.t. ∫ μ(s) KL( π_θ′(a|s) ‖ π_θ(a|s) ) ds < β.
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See Abdolmaleki et al. (2018b) for more details.
cares equally about all tasks (van Hasselt et al., 2016; Hessel et al., 2019). Similarly, prior work in multi-objective optimization has dealt with objectives of different units and/or scales by normalizing objectives to have similar magnitudes (Marler & Arora, 2005; Grodzevich & Romanko, 2006; Kim & de Weck, 2006; Daneshmand et al., 2017; Ishibuchi et al., 2017). MO-MPO can also be seen as doing adaptive normalization, but for any preference over objectives, not just equal preferences.

In general, invariance to reparameterization of the function approximator has been investigated in the optimization literature, resulting in, for example, natural gradient methods (Martens, 2014). The common tool here is measuring distances in function space instead of parameter space, using the KL divergence. Similarly, in this work, to achieve invariance to the scale of objectives, we use KL divergences over policies to encode preferences.
3 Background and Notation

Multi-Objective Markov Decision Process. In this paper, we consider a multi-objective RL problem defined by a multi-objective Markov Decision Process (MO-MDP). The MO-MDP consists of: states s ∈ S and actions a ∈ A; an initial state distribution p(s_0); transition probabilities p(s_t+1 | s_t, a_t), which define the probability of changing from state s_t to s_t+1 when taking action a_t; reward functions {r_k(s, a) ∈ R}^N_{k=1}, one per objective k; and a discount factor γ ∈ [0, 1). We define our policy π_θ(a|s) as a state-conditional distribution over actions, parametrized by θ. Together with the transition probabilities, this gives rise to a state visitation distribution μ(s). We also consider per-objective action-value functions. The action-value function for objective k is defined as the expected return (i.e., cumulative discounted reward) from choosing action a in state s for objective k and then following policy π: Q^π_k(s, a) = E_π[ Σ^∞_{t=0} γ^t r_k(s_t, a_t) | s_0 = s, a_0 = a ]. We can represent this function using the recursive expression Q^π_k(s_t, a_t) = E_{p(s_t+1 | s_t, a_t)}[ r_k(s_t, a_t) + γ V^π_k(s_t+1) ], where V^π_k(s) = E_π[ Q^π_k(s, a) ] is the value function of π for objective k.
Problem Statement. For any MO-MDP, there is a set of nondominated policies, i.e., the Pareto front. A policy is nondominated if there is no other policy that improves its expected return for an objective without reducing the expected return of at least one other objective. Given a preference setting, our goal is to find a nondominated policy π_θ that satisfies those preferences. In our approach, a setting of constraints does not directly correspond to a particular scalarization, but we show that by varying these constraint settings, we can indeed trace out a Pareto front of policies.
Algorithm 1 MO-MPO: One policy improvement step
1: given batch size (L), number of actions to sample (M), (N) Q-functions {Q^πold_k(s, a)}^N_{k=1}, preferences {ε_k}^N_{k=1}, previous policy π_old, previous temperatures {η_k}^N_{k=1}, replay buffer D, first-order gradient-based optimizer O
2:
3: initialize π_θ from the parameters of π_old
4: repeat
5:   Collect dataset {s_i, a_ij, Q_ijk}^{L,M,N}_{i,j,k}, where
6:   M actions a_ij ∼ π_old(a|s_i) and Q_ijk = Q^πold_k(s_i, a_ij)
7:
8:   // Compute action distribution for each objective
9:   for k = 1, . . . , N do
10:    δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ^L_i (1/L) log ( Σ^M_j (1/M) exp(Q_ijk / η_k) ) ]
11:    Update η_k based on δ_ηk, using optimizer O
12:    q_ijk ∝ exp(Q_ijk / η_k)
13:  end for
14:
15:  // Update parametric policy
16:  δ_π ← −∇_θ Σ^L_i Σ^M_j Σ^N_k q_ijk log π_θ(a_ij | s_i)
17:  (subject to additional KL regularization; see Sec. 4.2.2)
18:  Update π_θ based on δ_π, using optimizer O
19:
20: until fixed number of steps
21: return π_old = π_θ
4 Method

We propose a policy iteration algorithm for multi-objective RL. Policy iteration algorithms decompose the RL problem into two sub-problems and iterate until convergence:

1. Policy evaluation: estimate Q-functions given the current policy
2. Policy improvement: update the policy given the Q-functions

Algorithm 1 summarizes this two-step multi-objective policy improvement procedure. In Appendix E, we explain how this procedure can be derived from the "RL as inference" perspective.

We describe multi-objective MPO in this section, and explain multi-objective V-MPO in Appendix D. When there is only one objective, MO-(V-)MPO reduces to (V-)MPO.
41 Multi-Objective Policy Evaluation
In this step, we learn Q-functions to evaluate the previous policy π_old. We train a separate Q-function per objective, following the Q-decomposition approach (Russell & Zimdars, 2003). In principle, any Q-learning algorithm can be used, as long as the target Q-value is computed with respect to π_old.² In this paper, we use the Retrace objective (Munos
²Russell & Zimdars (2003) prove that critics suffer from an "illusion of control" if they are trained with conventional Q-learning (Watkins, 1989). In other words, if each critic computes its target Q-value based on its own best action for the next state, then the critics overestimate Q-values, because in reality the parametric policy π_θ (which considers all critics' opinions) is in charge of choosing actions.
et al., 2016) to learn a Q-function Q^πold_k(s, a; φ_k) for each objective k, parameterized by φ_k, as follows:

min_{φ_1:N} Σ^N_{k=1} E_{(s,a)∼D} [ ( Q^ret_k(s, a) − Q^πold_k(s, a; φ_k) )² ],

where Q^ret_k is the Retrace target for objective k and the previous policy π_old, and D is a replay buffer containing gathered transitions. See Appendix C for details.
42 Multi-Objective Policy Improvement
Given the previous policy π_old(a|s) and associated Q-functions {Q^πold_k(s, a)}^N_{k=1}, our goal is to improve the previous policy for a given visitation distribution μ(s).³ To this end, we learn an action distribution for each Q-function and combine these to obtain the next policy π_new(a|s). This is a multi-objective variant of the two-step policy improvement procedure employed by MPO (Abdolmaleki et al., 2018b).

In the first step, for each objective k, we learn an improved action distribution q_k(a|s) such that E_{q_k(a|s)}[Q^πold_k(s, a)] ≥ E_{π_old(a|s)}[Q^πold_k(s, a)], where states s are drawn from the visitation distribution μ(s).

In the second step, we combine and distill the improved distributions q_k into a new parametric policy π_new (with parameters θ_new), by minimizing the KL divergence between these distributions and the new parametric policy, i.e.,
θ_new = argmin_θ Σ^N_{k=1} E_{μ(s)} [ KL( q_k(a|s) ‖ π_θ(a|s) ) ].   (1)

This is a supervised learning loss that performs maximum likelihood estimation of the distributions q_k. Next, we explain these two steps in more detail.
421 OBTAINING ACTION DISTRIBUTIONS PER OBJECTIVE (STEP 1)

To obtain the per-objective improved action distributions q_k(a|s), we optimize the standard RL objective for each objective's Q-function Q_k:
max_{q_k} ∫_s μ(s) ∫_a q_k(a|s) Q_k(s, a) da ds   (2)
s.t. ∫_s μ(s) KL( q_k(a|s) ‖ π_old(a|s) ) ds < ε_k,
where ε_k denotes the allowed expected KL divergence for objective k. We use these ε_k to encode preferences over objectives. More concretely, ε_k defines the allowed influence of objective k on the change of the policy.

³In practice, we use draws from the replay buffer to estimate expectations over the visitation distribution μ(s).

For nonparametric action distributions q_k(a|s), we can solve this constrained optimization problem in closed form for each state s sampled from μ(s) (Abdolmaleki et al., 2018b):
q_k(a|s) ∝ π_old(a|s) exp( Q_k(s, a) / η_k ),   (3)
where the temperature η_k is computed based on the corresponding ε_k, by solving the following convex dual function:

η_k = argmin_η η ε_k + η ∫_s μ(s) log ∫_a π_old(a|s) exp( Q_k(s, a) / η ) da ds.   (4)
In order to evaluate q_k(a|s) and the integrals in (4), we draw L states from the replay buffer, and for each state we sample M actions from the current policy π_old. In practice, we maintain one temperature parameter η_k per objective. We found that optimizing the dual function by performing a few steps of gradient descent on η_k is effective, and we initialize it with the solution found in the previous policy iteration step. Since η_k should be positive, we use a projection operator after each gradient step to maintain η_k > 0. Please refer to Appendix C for derivation details.
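In sample-based form (L states, M actions per state), the dual of Eq. (4) reduces to a per-state log-mean-exp and the weights of Eq. (3) to a per-state softmax over the sampled actions. A sketch under those assumptions:

```python
import numpy as np

def dual_loss(eta_k, q_vals, eps_k):
    # Sample-based dual of Eq. (4). q_vals has shape [L, M]: Q-values of
    # M actions sampled from pi_old in each of L sampled states.
    inner = np.log(np.mean(np.exp(q_vals / eta_k), axis=1))  # log-mean-exp per state
    return eta_k * eps_k + eta_k * np.mean(inner)

def improved_action_weights(q_vals, eta_k):
    # Eq. (3): q_k(a|s) proportional to pi_old(a|s) exp(Q_k / eta_k); with
    # actions drawn from pi_old, this is a per-state softmax over samples.
    w = np.exp((q_vals - q_vals.max(axis=1, keepdims=True)) / eta_k)
    return w / w.sum(axis=1, keepdims=True)
```

A larger ε_k lets the dual settle at a smaller η_k, which makes the weights greedier toward high-Q actions for that objective; a very large η_k yields near-uniform weights.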
Application to Other Deep RL Algorithms Since theconstraints εk in (2) encode the preferences over objectivessolving this optimization problem with good satisfactionof constraints is key for learning a policy that satisfies thedesired preferences For nonparametric action distributionsqk(a|s) we can satisfy these constraints exactly One coulduse any policy gradient method (eg Schulman et al 20152017 Heess et al 2015 Haarnoja et al 2018) to obtainqk(a|s) in a parametric form instead However solving theconstrained optimization for parametric qk(a|s) is not exactand the constraints may not be well satisfied which impedesthe use of εk to encode preferences Moreover assuming aparametric qk(a|s) requires maintaining a function approx-imator (eg a neural network) per objective which cansignificantly increase the complexity of the algorithm andlimits scalability
Choosing εk. It is more intuitive to encode preferences via εk rather than via scalarization weights, because the former is invariant to the scale of rewards. In other words, having a desired preference across objectives narrows down the range of reasonable choices for εk, but does not narrow down the range of reasonable choices for scalarization weights. In order to identify reasonable scalarization weights, an RL practitioner needs to additionally be familiar with the scale of rewards for each objective. In practice, we have found that learning performance is robust to a wide range of scales for εk. It is the relative scales of the εk that matter for encoding preferences over objectives: the larger a particular εk is with respect to others, the more that objective k is preferred. On the other hand, if εk = 0, then objective k will have
no influence and will effectively be ignored. In Appendix A.1, we provide suggestions for setting εk, given a desired preference over objectives.
4.2.2 FITTING A NEW PARAMETRIC POLICY (STEP 2)
In the previous section, for each objective k, we obtained an improved action distribution qk(a|s). Next, we want to combine these distributions to obtain a single parametric policy that trades off the objectives according to the constraints εk that we set. For this, we solve a supervised learning problem that fits a parametric policy to the per-objective action distributions from step 1:
θnew = argmaxθ Σk=1..N ∫s μ(s) ∫a qk(a|s) log πθ(a|s) da ds
s.t. ∫s μ(s) KL(πold(a|s) ‖ πθ(a|s)) ds < β,    (5)
where θ are the parameters of our policy (a neural network), and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy. The KL constraint in this step has a regularization effect that prevents the policy from overfitting to the sample-based action distributions; it therefore avoids premature convergence and improves stability of learning (Schulman et al., 2015; Abdolmaleki et al., 2018a;b).
Similar to the first policy improvement step, we evaluate the integrals by using the L states sampled from the replay buffer and the M actions per state sampled from the old policy. In order to optimize (5) using gradient descent, we employ Lagrangian relaxation, similar to in MPO (Abdolmaleki et al., 2018a) (see Appendix C for more detail).
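For intuition, the fitting step for a single state and a discrete action set reduces to weighted maximum likelihood. The sketch below is our own simplification: it drops the KL trust region of (5), which the paper enforces via Lagrangian relaxation, and fits a tabular softmax policy to the combined per-objective distributions by gradient ascent.

```python
import numpy as np

def fit_policy(q_distributions, logits, lr=0.5, steps=300):
    """Fit a tabular softmax policy to per-objective action distributions.

    q_distributions: list of N arrays of shape (M,), each summing to 1 --
        the improved distributions q_k(a|s) for one state.
    logits: initial policy parameters theta, shape (M,).
    """
    target = np.sum(q_distributions, axis=0)  # sum_k q_k(a|s)
    for _ in range(steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # gradient of sum_k sum_a q_k(a) log pi_theta(a) w.r.t. the logits
        logits = logits + lr * (target - target.sum() * p)
    return logits
```

Without the trust region, the maximizer of (5) in this tabular case is simply the normalized sum of the qk, so the fitted policy places mass on actions favored by any objective, in proportion to how strongly they are favored.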
5 Experiments: Toy Domains
In the empirical evaluation that follows, we will first demonstrate the mechanics and scale-invariance of MO-MPO in a single-state environment (Sec. 5.1), and then show that MO-MPO can find all Pareto-dominant policies in a popular MORL benchmark (Sec. 5.2). Finally, we show the benefit of using MO-MPO in high-dimensional continuous control domains, including on a real robot (Sec. 6). Appendices A and B contain a detailed description of all domains and tasks, experimental setups, and implementation details.
Baselines. The goal of our empirical evaluation is to analyze the benefit of using our proposed multi-objective policy improvement step (Sec. 4.2.2), which encodes preferences over objectives via constraints εk on expected KL-divergences, rather than via weights wk. Thus, we primarily compare MO-MPO against scalarized MPO, which relies on linear scalarization weights wk to encode preferences. The only difference between MO-MPO and scalarized MPO is the policy improvement step: for scalarized MPO, a single
Figure 2. A simple environment in which the agent starts at state S0 and chooses to navigate to one of three terminal states. There are two objectives. Taking the left action, for instance, leads to a reward of 1 for the first objective and 4 for the second.
improved action distribution q(a|s) is computed, based on Σk wk Qk(s, a) and a single KL constraint ε.
State-of-the-art approaches that combine MORL with deep RL assume linear scalarization as well, either learning a separate policy for each setting of weights (Mossalam et al., 2016) or learning a single policy conditioned on scalarization weights (Friedman & Fontaine, 2018; Abels et al., 2019). Scalarized MPO addresses the former problem, which is easier. The policy evaluation step in scalarized MPO is analogous to scalarized Q-learning, proposed by Mossalam et al. (2016). As we show later in Sec. 6, even learning an optimal policy for a single scalarization is difficult in high-dimensional continuous control domains.
5.1 Simple World
First, we will examine the behavior of MO-MPO in a simple multi-armed bandit with three actions (up, right, and left) (Fig. 2), inspired by Russell & Zimdars (2003). We train policies with scalarized MPO and with MO-MPO. The policy evaluation step is exact, because the Q-value function for each objective is known: it is equal to the reward received for that objective after taking each action, as labeled in Fig. 2.⁴
We consider three possible desired preferences: equal preference for the two objectives, preferring the first, and preferring the second. Encoding preferences in scalarized MPO amounts to choosing appropriate linear scalarization weights, and in MO-MPO amounts to choosing appropriate ε's. We use the following weights and ε's:
• equal preference: weights [0.5, 0.5], or ε's [0.01, 0.01]
• prefer first: weights [0.9, 0.1], or ε's [0.01, 0.002]
• prefer second: weights [0.1, 0.9], or ε's [0.002, 0.01]
We set ε = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.
⁴The policy improvement step can also be computed exactly, because solving for the optimal temperature η (or η1 and η2 in the MO-MPO case) is a convex optimization problem, and the KL-constrained policy update is also a convex optimization problem when there is only one possible state. We use CVXOPT (Andersen et al., 2020) as our convex optimization solver.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), we see that the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right) in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step, MO-MPO optimizes for a separate temperature ηk that scales each objective's Q-value function. This ηk is computed based on the corresponding allowed KL-divergence εk, so when the rewards for any objective k are multiplied by a factor but εk remains the same, the computed ηk ends up being scaled by that factor as well, neutralizing the effect of the scaling of rewards (see Eq. (4)).
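This neutralization is easy to verify numerically. In the sketch below, the Q-values for the bandit are partly hypothetical: only the left action's rewards [1, 4] are stated in Fig. 2, the remaining values are our assumptions. Rescaling the first objective's Q-values by 20 rescales the optimal temperature η by the same factor and leaves the per-objective action distribution unchanged.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

# Assumed Q-values for actions (left, up, right); left matches Fig. 2.
Q1 = np.array([1.0, 2.0, 3.0])  # first objective
Q2 = np.array([4.0, 2.0, 1.0])  # second objective

def temperature(q, eps):
    """Minimize the single-state dual of Eq. (4) under a uniform pi_old."""
    dual = lambda eta: eta * eps + eta * (logsumexp(q / eta) - np.log(len(q)))
    return minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x

def improved(q, eps):
    """Eq. (3): softmax of Q / eta, starting from a uniform policy."""
    eta = temperature(q, eps)
    return np.exp(q / eta - logsumexp(q / eta))

eps = 0.01
w_before = improved(Q1, eps)
w_after = improved(20.0 * Q1, eps)  # first objective's rewards scaled by 20x
# eta absorbs the factor of 20, so the two distributions are near-identical
print(np.abs(w_before - w_after).max())
```

A scalarized update, by contrast, computes a single softmax of Σk wk Qk, so rescaling one objective's rewards shifts the resulting action distribution unless the weights are re-tuned.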
Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of temperatures ηk per objective becomes more essential (Sec. 6).
The scale of εk controls the amount that objective k can influence the policy's update. If we set ε1 = 0.01 and sweep over the range from 0 to 0.01 for ε2, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies quickly converging to placing all probability on a single action (Fig. 4, left). We hypothesize this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
5.2 Deep Sea Treasure
An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s0 = (0, 0)) and has a choice of four actions (moving one square up, right, down, and left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
We ran scalarized MPO with weightings [w, 1 − w] and
Figure 3. When two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.
Figure 4. A visualization of policies during learning: each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1 − x − y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
w ∈ {0, 0.01, 0.02, . . . , 1}. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵
We ran MO-MPO on this task as well, for a range of ε: εtime ∈ {0.01, 0.02, 0.05} and εtreasure = c · εtime, where c ∈ {0.5, 0.51, 0.52, . . . , 1.5}. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than the exact settings: across all settings of εtime, similar ratios of εtreasure to εtime result in similar policies, and as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).
⁵Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights, we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for εk discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of εk settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of εk.
6 Experiments: Continuous Control Domains
The advantage of encoding preferences via ε's, rather than via weights, is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains, in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid. We use the humanoid run task defined in DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand. We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain." A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap. We consider a large-scale humanoid motion capture tracking task, similar to in Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total, we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole. We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action, for three timesteps.
6.1 Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings εk and scalarization weights wk, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
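For two objectives, the hypervolume has a simple closed form: sort the nondominated points by the first objective and sum the rectangle each one adds over the reference point. The paper uses DEAP's general implementation; the helper below is our own minimal two-objective sketch.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` and dominating `ref`, with both
    objectives maximized.

    points: iterable of (obj1, obj2) pairs, each assumed to dominate
        the reference point `ref`.
    """
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(-pts[:, 0])]  # sort by first objective, descending
    area, best_second = 0.0, ref[1]
    for first, second in pts:
        if second > best_second:  # nondominated so far
            area += (first - ref[0]) * (second - best_second)
            best_second = second
    return area
```

For example, the front {(3, 1), (1, 3)} with reference (0, 0) covers the union of a 3×1 and a 1×3 rectangle, an area of 5; dominated points contribute nothing.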
⁶This task was developed concurrently by Anonymous (2020), and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2 Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO, with respect to the hypervolume metric (Table 1).⁷ In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation. We also ran vanilla MPO on humanoid run, with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
⁷We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                           scalarized MPO   MO-MPO
Humanoid run (mid-training)    1.1×10⁶          3.3×10⁶
Humanoid run                   6.4×10⁶          7.1×10⁶
Humanoid run, 1× penalty       5.0×10⁶          5.9×10⁶
Humanoid run, 10× penalty      2.6×10⁷          6.9×10⁷
Shadow touch (mid-training)    1.43             1.56
Shadow touch                   1.62             1.64
Shadow turn                    1.44             1.54
Shadow orient                  2.8×10⁴          2.8×10⁴
Humanoid Mocap                 3.86×10⁻⁶        3.41×10⁻⁶

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).
6.3 Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.⁸ None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering, compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4 Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
⁸For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values, and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7 Conclusions and Future Work
In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments
The authors would like to thank Guillaume Desjardins, Csaba Szepesvári, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References
Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.
Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.
Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22-31, 2017.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.
Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.
Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41-47, 2008.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.
Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.
Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.
Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092-8101, 2018.
Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977-2991, 2017.
Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171-2175, July 2012.
Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.
Gábor, Z., Kalmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197-205, 1998.
Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861-1870, 2018.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796-3803, 2019.
Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.
Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279-294, 2017.
Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105-116, 2006.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.
Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703-3712, 2019.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1-40, 2016.
Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385-398, 2015.
Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551-570, 2005.
Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663-3692, 2014.
Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054-1062, 2016.
Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601-608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323-2330, 2014.
Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.
Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.
Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.
Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67-113, 2013.
Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297-1304, 2014.
Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656-663, 2003.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-503, 2016.
Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.
Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026-5033, 2012.
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51-80, July 2011.
van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287-4295, 2016.
Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191-199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.
Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.
Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.
Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.
Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. Tossingbot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.
Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.
Zuluaga, M., Krause, A., and Püschel, M. ε-pal: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., π_θ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
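As an illustration, this mean/diagonal-Cholesky parametrization can be sketched in numpy; here `net_mean` and `net_diag` are hypothetical stand-ins for the policy network's two output heads:

```python
import numpy as np

def gaussian_policy_params(net_mean, net_diag, state):
    """Build N(mu, Sigma) with Sigma = A A^T for a diagonal Cholesky factor A.

    net_mean, net_diag: callables standing in for the network heads that
    output the mean and the unconstrained diagonal entries of A.
    """
    mu = net_mean(state)
    raw = net_diag(state)
    # Softplus keeps the diagonal of A strictly positive, so
    # Sigma = A A^T is positive definite.
    a_diag = np.log1p(np.exp(raw))
    sigma = np.diag(a_diag ** 2)  # diagonal A implies Sigma = diag(a_diag^2)
    return mu, sigma
```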
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of ε_k settings, and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate ε_k to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of ε_k, for each objective k. At first glance this may seem daunting. However, in practice we have found that encoding preferences via ε_k is often more intuitive than doing so via scalarization. In this subsection we seek to give an intuition on how to set ε_k for different desired preferences. Recall that each ε_k controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                              Default

policy network
  layer sizes                               (300, 200)
  take tanh of action mean                  yes
  minimum variance                          10⁻¹²
  maximum variance                          unbounded

critic network(s)
  layer sizes                               (400, 400, 300)
  take tanh of action                       yes
  Retrace sequence size                     8
  discount factor γ                         0.99

both policy and critic networks
  layer norm on first layer                 yes
  tanh on output of layer norm              yes
  activation (after each hidden layer)      ELU

MPO
  actions sampled per state                 20
  default ε                                 0.1
  KL-constraint on policy mean, β_μ         10⁻³
  KL-constraint on policy covariance, β_Σ   10⁻⁵
  initial temperature η                     1

training
  batch size                                512
  replay buffer size                        10⁶
  Adam learning rate                        3×10⁻⁴
  Adam ε                                    10⁻³
  target network update period              200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose ε_k in the range of 0.001 to 0.1.
Equal Preference: When all objectives are equally important, the general rule is to set all ε_k to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all ε_k to the same value, what should this value be? The larger ε_k is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting ε_k too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if ε_k is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                             (400, 400, 300)
    take tanh of action mean                no
    minimum variance                        10⁻⁸
  critic network(s)
    layer sizes                             (500, 500, 400)
    take tanh of action                     no
  MPO
    KL-constraint on policy mean, β_μ       5×10⁻³
    KL-constraint on policy covariance, β_Σ 10⁻⁶
  training
    Adam learning rate                      2×10⁻⁴
    Adam ε                                  10⁻⁸

Shadow Hand
  MPO
    actions sampled per state               30
    default ε                               0.01

Sawyer
  MPO
    actions sampled per state               15
  training
    target network update period            100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as ε_k is not set too high.
Unequal Preference: When there is a difference in preferences across objectives, the relative scale of the ε_k is what matters. The larger ε_k is relative to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near-zero for objective l, that objective will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of ε_k has a similar effect as in the equal-preference
Condition                 Settings

Humanoid run (one seed per setting)
  scalarized MPO          w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO                  ε_task = 0.1, ε_penalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO          w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO                  ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO          w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO                  ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO          w_touch = w_height = w_orientation = 1/3
  MO-MPO                  ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, so that their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
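The critic input described above can be sketched as a minimal numpy helper (the function name is ours):

```python
import numpy as np

def critic_input(state, action_index, num_actions=4):
    # Concatenate the state with a one-hot encoding of the discrete action,
    # e.g. the "up" action (index 0) maps to [1, 0, 0, 0].
    one_hot = np.zeros(num_actions)
    one_hot[action_index] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])
```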
Humanoid Mocap

Scalarized V-MPO
  KL-constraint on policy mean, β_μ         0.1
  KL-constraint on policy covariance, β_Σ   10⁻⁵
  default ε                                 0.1
  initial temperature η                     1

Training
  Adam learning rate                        10⁻⁴
  batch size                                128
  unroll length (for n-step return, Sec. D.1)  32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases we use n-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Figure 8: the treasure values in the grid are 0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, and 23.7.]

Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:

    r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] · m(s, s′),    (6)
where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use

    a_λ(m) = sigmoid(λ₁(−m + λ₂)),    (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
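Eqs. 6 and 7 can be sketched directly in numpy (function names are ours; λ = [2, 2]ᵀ as above):

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    # Eq. 7: a_lambda(m) = sigmoid(lam1 * (-m + lam2)), a value in [0, 1]
    # that decreases monotonically with the impact force m.
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact_force):
    # Eq. 6: near-zero penalty for acceptable impacts, approximately
    # -impact_force for large (unacceptable) impacts.
    return -(1.0 - acceptability(impact_force)) * impact_force
```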
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation is shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

    min_{q ∈ Q_target} 1 − tanh(2 · d(q, q_peg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to that of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

    r_com = exp(−10 ‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

    r_vel = exp(−0.1 ‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

    r_app = exp(−40 ‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

    r_quat = exp(−2 ‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes quaternion differences, and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

    r_trunc = 1 − ε / 0.3,    with    ε = ‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁,

where b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].

¹⁰This database is available at mocap.cs.cmu.edu.
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

    r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
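The five reward components can be sketched as one numpy helper (names are ours; the vector of quaternion differences `d_quat` is assumed to be precomputed by the caller):

```python
import numpy as np

def mocap_rewards(p_com, p_com_ref, q_vel, q_vel_ref,
                  p_app, p_app_ref, d_quat,
                  b_pos, b_pos_ref, q_pos, q_pos_ref):
    """The five tracking reward terms; d_quat stands for the quaternion
    differences q_quat (-) q_quat_ref, assumed supplied by the caller."""
    r_com = np.exp(-10.0 * np.sum((p_com - p_com_ref) ** 2))
    r_vel = np.exp(-0.1 * np.sum((q_vel - q_vel_ref) ** 2))
    r_app = np.exp(-40.0 * np.sum((p_app - p_app_ref) ** 2))
    r_quat = np.exp(-2.0 * np.sum(d_quat ** 2))
    # eps is the l1 tracking error that also drives early termination.
    eps = np.sum(np.abs(b_pos - b_pos_ref)) + np.sum(np.abs(q_pos - q_pos_ref))
    r_trunc = 1.0 - eps / 0.3
    return r_com, r_vel, r_app, r_quat, r_trunc, eps
```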
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

    r_insertion = max(0.2 r_approach, r_inserted),    (9)
    r_approach = s(p_peg, p_opening),    (10)
    r_inserted = r_aligned · s(p_z^peg, p_z^bottom),    (11)
    r_aligned = ‖p_xy^peg − p_xy^opening‖ < d/2,    (12)

where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.

s(p₁, p₂) is a shaping operator,

    s(p₁, p₂) = 1 − tanh²( (atanh(√l) / ε) ‖p₁ − p₂‖₂ ),    (13)

which gives a reward of 1 − l if p₁ and p₂ are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped-reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
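Eqs. 9–13 can be sketched in numpy (function names are ours; note that `shape` evaluates to exactly 1 − l when the two points are at distance ε):

```python
import numpy as np

def shape(p1, p2, l, eps):
    # Eq. 13: tanh^2(atanh(sqrt(l))) == l, so s == 1 - l at distance eps.
    d = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

def insertion_reward(p_peg, p_opening, z_bottom, hole_diameter):
    # Eqs. 9-12, with the constants quoted in the text.
    r_approach = shape(p_peg, p_opening, l=0.95, eps=0.01)
    aligned = np.linalg.norm(
        np.asarray(p_peg[:2]) - np.asarray(p_opening[:2])) < hole_diameter / 2
    r_inserted = float(aligned) * shape([p_peg[2]], [z_bottom], l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)
```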
The reward for the secondary objective, minimizing wrist force measurements, is defined as

    r_force = −‖F‖₁,    (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_φk(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
    min_{φ_k} E_{τ∼D} [ (Q_k^ret(s_t, a_t) − Q_φk(s_t, a_t))² ],    (15)

with

    Q_k^ret(s_t, a_t) = Q_φ′k(s_t, a_t) + Σ_{j=t}^T γ^{j−t} ( Π_{z=t+1}^j c_z ) δ_j,
    δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q_φ′k(s_j, a_j),
    V(s_{j+1}) = E_{π_old(a|s_{j+1})} [ Q_φ′k(s_{j+1}, a) ].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks, such that π_θ′ = π_θ and Q_φ′k = Q_φk
3: repeat
4:   repeat
5:     Collect dataset {s^i, a^{ij}, Q_k^{ij}}_{i,j,k}^{L,M,N}, where
6:       s^i ∼ D, a^{ij} ∼ π_θ′(a|s^i), and Q_k^{ij} = Q_φ′k(s^i, a^{ij})
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ_i^L (1/L) log ( Σ_j^M (1/M) exp(Q_k^{ij} / η_k) ) ]
11:      Update η_k based on δ_ηk
12:      q^{ij}_k ∝ exp(Q_k^{ij} / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q^{ij}_k log π_θ(a^{ij}|s^i)
18:         + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s^i) ‖ π_θ(a|s^i)) )
19:    δ_ν ← ∇_ν ν ( β − Σ_i^L KL(π_θ′(a|s^i) ‖ π_sg(θ)(a|s^i)) )
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_φk ← ∇_φk Σ_{(s_t,a_t)∈τ∼D} ( Q_φk(s_t, a_t) − Q_k^ret )²,
26:        with Q_k^ret as in Eq. 15
27:      Update φ_k based on δ_φk
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_φ′k = Q_φk
33: until convergence
The importance weights c_z are defined as

    c_z = min ( 1, π_old(a_z|s_z) / b(a_z|s_z) ),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set ( Π_{z=t+1}^j c_z ) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
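For a single objective and a length-T sequence, the Retrace target above can be sketched in numpy; this is a naive O(T²) version, with the TD errors and importance weights assumed to be precomputed from the target networks:

```python
import numpy as np

def retrace_targets(q, v_next, rewards, c, gamma):
    """Q^ret(s_t, a_t) for each t of a length-T sequence (cf. Eq. 15).

    q:       Q_target(s_j, a_j) for j = 0..T-1
    v_next:  E_{pi_old}[Q_target(s_{j+1}, .)] for j = 0..T-1
    rewards: r_k(s_j, a_j) for one objective k
    c:       truncated importance weights c_j = min(1, pi_old / b)
    """
    T = len(q)
    deltas = rewards + gamma * v_next - q  # TD errors delta_j
    targets = np.array(q, dtype=float)
    for t in range(T):
        weight = 1.0  # the product over c_z is 1 when j == t
        for j in range(t, T):
            if j > t:
                weight *= c[j]
            targets[t] += gamma ** (j - t) * weight * deltas[j]
    return targets
```

If the Q-function already matches the rewards and bootstrap values, all TD errors vanish and the targets equal the Q-values, which gives a quick sanity check.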
C.3. Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

    π_new = argmax_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
    s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds < β.    (16)

We first write the generalized Lagrangian equation, i.e.,

    L(θ, ν) = Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
            + ν ( β − ∫_s μ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds ),    (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

    max_θ min_{ν>0} L(θ, ν).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies we decouple the update rules to op-timize for mean and covariance independently as in (Ab-dolmaleki et al 2018b) This allows for setting differentKL bounds for the mean (βmicro) and covariance (βΣ) whichresults in more stable learning To do this we first separateout the following two policies for mean and covariance
πmicroθ (a|s) = N(amicroθ(s)Σθold
(s)) (18)
πΣθ (a|s) = N
(amicroθold
(s)Σθ(s)) (19)
Policy πmicroθ (a|s) takes the mean from the online policy net-work and covariance from the target policy network andpolicy πΣ
θ (a|s) takes the mean from the target policy net-work and covariance from online policy network Now ouroptimization problem has the following form
max_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a q_k(a|s) (log π^μ_θ(a|s) π^Σ_θ(a|s)) da ds
s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π^μ_θ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(π_old(a|s) ‖ π^Σ_θ(a|s)) ds < β_Σ.    (20)
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
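For intuition, the decoupled KL terms in (20) can be evaluated in closed form for Gaussians. The sketch below (with made-up means and covariances) shows that the mean constraint sees only the change in the mean, and the covariance constraint only the change in the covariance:

```python
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL(N(mu1, cov1) || N(mu2, cov2)) for full-covariance Gaussians."""
    d = mu1.shape[0]
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(cov2_inv @ cov1)
                  + diff @ cov2_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

# The "mean policy" swaps in the old covariance and the "covariance policy"
# swaps in the old mean, so each KL isolates exactly one factor.
mu_old, cov_old = np.zeros(2), np.eye(2)
mu_new, cov_new = np.array([0.3, -0.1]), 1.5 * np.eye(2)

kl_mean = gaussian_kl(mu_old, cov_old, mu_new, cov_old)  # only the mean moves
kl_cov  = gaussian_kl(mu_old, cov_old, mu_old, cov_new)  # only the covariance moves
```

Bounding kl_mean by β_μ and kl_cov by β_Σ (with β_Σ ≪ β_μ) is what lets the covariance, and hence exploration noise, shrink much more slowly than the mean moves.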
A Distributional View on Multi-Objective Policy Optimization
C4 Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D MO-V-MPO: Multi-Objective On-Policy MPO
In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D1 Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy π_old we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), …, (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^{N} that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
min_{φ_{1:N}} Σ_{k=1}^{N} E_τ[(G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t))²].    (21)

Here G^(T)_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: G^(T)_k(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V^{π_old}_{φ_k}(s_T). The advantages are then estimated as A^{π_old}_k(s_t, a_t) = G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t).
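A minimal sketch of the T-step targets and advantages for one objective, written in the equivalent recursive form (function and variable names are ours):

```python
import numpy as np

def n_step_targets_and_advantages(rewards, values, boot_value, gamma):
    """T-step return targets and advantages for one objective.

    rewards:    r_k(s_0, a_0), ..., r_k(s_{T-1}, a_{T-1})
    values:     V(s_0), ..., V(s_{T-1}) under the current value function
    boot_value: V(s_T), used to bootstrap the tail of the return
    """
    T = len(rewards)
    targets = np.empty(T)
    g = boot_value
    for t in reversed(range(T)):      # G_t = r_t + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        targets[t] = g
    advantages = targets - np.asarray(values)
    return targets, advantages

targets, advs = n_step_targets_and_advantages(
    [1.0, 0.0, 2.0], [0.0, 0.0, 0.0], boot_value=10.0, gamma=0.5)
```

In MO-V-MPO this computation is repeated once per objective k, with a separate value network φ_k supplying the bootstrap values.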
D2 Multi-objective policy improvement in the on-policy setting
Given the previous policy π_old(a|s) and estimated advantages {A^{π_old}_k(s, a)}_{k=1,…,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s) because, without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds    (22)
s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.
As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

q_k(s, a) ∝ p_old(s, a) exp(A_k(s, a) / η_k),    (23)
where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

η_k = argmin_η [η ε_k + η log ∫_{s,a} p_old(s, a) exp(A_k(s, a) / η) da ds].    (24)
We perform this optimization alongside the main loss, by taking a gradient descent step on η_k, and we initialize it with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
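A sketch of this procedure on a sample-based estimate of the dual, using a finite-difference gradient in place of autodiff and a max-projection to keep η positive (all names are ours, not from the paper):

```python
import numpy as np

def dual(eta, advantages, epsilon):
    """Sample-based dual g(eta) = eta*epsilon + eta*log mean exp(A/eta)."""
    return eta * epsilon + eta * np.log(np.mean(np.exp(advantages / eta)))

def solve_eta(advantages, epsilon, eta=1.0, lr=0.05, steps=5000, eta_min=1e-6):
    for _ in range(steps):
        # finite-difference gradient; an autodiff framework would be used in practice
        g = (dual(eta + 1e-5, advantages, epsilon)
             - dual(eta - 1e-5, advantages, epsilon)) / 2e-5
        eta = max(eta_min, eta - lr * g)   # projection keeps eta > 0
    return eta
```

Because the dual is convex in η, simple gradient descent with the positivity projection converges, and at the minimizer the KL of the induced weighting matches ε.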
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
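This filtering step amounts to thresholding at the batch median, e.g.:

```python
import numpy as np

def top_half(advantages):
    """Keep the samples whose advantage is in the top 50% of the batch."""
    threshold = np.median(advantages)
    return advantages >= threshold  # boolean mask over the batch

mask = top_half(np.array([0.3, -1.0, 2.0, 0.1]))  # keeps 0.3 and 2.0
```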
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this we solve a supervised learning problem that fits a parametric policy as follows:
π_new = argmax_θ Σ_{k=1}^{N} ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
s.t. ∫_s KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,    (25)
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the procedure employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference
In the main paper we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events {R_k}_{k=1}^{N}, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^{N}, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^{N}:

max_θ p_θ(R_1 = 1, R_2 = 1, …, R_N = 1) p(θ),    (26)
where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k and using log probabilities leads to the following objective:

max_θ Σ_{k=1}^{N} log p_θ(R_k = 1) + log p(θ).    (27)
The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^{N} log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^{N} log p_θ(R_k = 1) as follows:
Σ_{k=1}^{N} log p_θ(R_k = 1)    (28)
= Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(s, a | R_k = 1)) − Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a)).
The second term of this decomposition is expanded as

−Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a))    (29)
= Σ_{k=1}^{N} E_{q_k(s,a)}[log (p_θ(R_k = 1, s, a) / q_k(s, a))]
= Σ_{k=1}^{N} E_{q_k(s,a)}[log (p(R_k = 1 | s, a) π_θ(a|s) μ(s) / q_k(s, a))],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^{N} log p_θ(R_k = 1). π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^{N} such that the lower bound on Σ_{k=1}^{N} log p_θ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:
q_k(a|s) = argmin_q E_{μ(s)}[KL(q(a|s) ‖ p_{θ′}(s, a | R_k = 1))]
= argmin_q E_{μ(s)}[KL(q(a|s) ‖ π_{θ′}(a|s)) − E_{q(a|s)}[log p(R_k = 1 | s, a)]].    (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_{θ′}(a|s) p(R_k = 1 | s, a) / ∫ π_{θ′}(a|s) p(R_k = 1 | s, a) da.
This solution weighs the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp(Q_k(s, a) / α_k),
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to
q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds.    (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to
q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds    (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_{θ′}(a|s) exp(Q_k(s, a) / η_k),    (33)
where η_k is computed by solving the following convex dual function:

η_k = argmin_η [η ε_k + η ∫ μ(s) log ∫ π_{θ′}(a|s) exp(Q_k(s, a) / η) da ds].    (34)
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
E2 M-Step
After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^{N} log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_{θ′}(a|s). We therefore optimize for

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds
s.t. ∫ μ(s) KL(π_{θ′}(a|s) ‖ π_θ(a|s)) ds < β.
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
et al., 2016) to learn a Q-function Q^{π_old}_k(s, a; φ_k) for each objective k, parameterized by φ_k, as follows:

min_{φ_{1:N}} Σ_{k=1}^{N} E_{(s,a)∼D}[(Q^ret_k(s, a) − Q^{π_old}_k(s, a; φ_k))²],
where Q^ret_k is the Retrace target for objective k and the previous policy π_old, and D is a replay buffer containing the gathered transitions. See Appendix C for details.
42 Multi-Objective Policy Improvement
Given the previous policy π_old(a|s) and the associated Q-functions {Q^{π_old}_k(s, a)}_{k=1}^{N}, our goal is to improve the previous policy for a given visitation distribution μ(s).³ To this end, we learn an action distribution for each Q-function and combine these to obtain the next policy π_new(a|s). This is a multi-objective variant of the two-step policy improvement procedure employed by MPO (Abdolmaleki et al., 2018b).

In the first step, for each objective k, we learn an improved action distribution q_k(a|s) such that E_{q_k(a|s)}[Q^{π_old}_k(s, a)] ≥ E_{π_old(a|s)}[Q^{π_old}_k(s, a)], where states s are drawn from a visitation distribution μ(s).
In the second step, we combine and distill the improved distributions q_k into a new parametric policy π_new (with parameters θ_new) by minimizing the KL divergence between the distributions and the new parametric policy, i.e.,

θ_new = argmin_θ Σ_{k=1}^{N} E_{μ(s)}[KL(q_k(a|s) ‖ π_θ(a|s))].    (1)
This is a supervised learning loss that performs maximum likelihood estimation of the distributions q_k. Next we explain these two steps in more detail.
421 OBTAINING ACTION DISTRIBUTIONS PER OBJECTIVE (STEP 1)
To obtain the per-objective improved action distributions q_k(a|s), we optimize the standard RL objective for each objective Q_k:

max_{q_k} ∫_s μ(s) ∫_a q_k(a|s) Q_k(s, a) da ds    (2)
s.t. ∫_s μ(s) KL(q_k(a|s) ‖ π_old(a|s)) ds < ε_k,
where ε_k denotes the allowed expected KL divergence for objective k. We use these ε_k to encode preferences over objectives. More concretely, ε_k defines the allowed influence of objective k on the change of the policy.
³In practice, we use draws from the replay buffer to estimate expectations over the visitation distribution μ(s).
For nonparametric action distributions q_k(a|s), we can solve this constrained optimization problem in closed form for each state s sampled from μ(s) (Abdolmaleki et al., 2018b):

q_k(a|s) ∝ π_old(a|s) exp(Q_k(s, a) / η_k),    (3)
where the temperature η_k is computed based on the corresponding ε_k by solving the following convex dual function:

η_k = argmin_η [η ε_k + η ∫_s μ(s) log ∫_a π_old(a|s) exp(Q_k(s, a) / η) da ds].    (4)
In order to evaluate q_k(a|s) and the integrals in (4), we draw L states from the replay buffer and, for each state, sample M actions from the current policy π_old. In practice, we maintain one temperature parameter η_k per objective. We found that optimizing the dual function by performing a few steps of gradient descent on η_k is effective, and we initialize it with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0. Please refer to Appendix C for derivation details.
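A sketch of this sample-based evaluation: given an (L, M) array of Q-values for M actions sampled from π_old at each of L states, the nonparametric q_k is a per-state softmax, and the dual (4) becomes a log-mean-exp over the sampled actions (the function names are ours):

```python
import numpy as np

def nonparametric_q_weights(q_values, eta):
    """Per-state weights for q_k(a|s) proportional to pi_old(a|s) exp(Q_k(s,a)/eta).

    q_values: array of shape (L, M) -- Q_k for M actions sampled from pi_old
    at each of L states; a softmax over actions gives the sample-based q_k.
    """
    z = q_values / eta
    z = z - z.max(axis=1, keepdims=True)         # for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def dual_value(q_values, eta, epsilon):
    """Sample-based estimate of the dual g(eta) from Eq. (4)."""
    z = q_values / eta
    m = z.max(axis=1, keepdims=True)
    log_mean_exp = m.squeeze(1) + np.log(np.mean(np.exp(z - m), axis=1))
    return eta * epsilon + eta * np.mean(log_mean_exp)
```

Because the M actions are drawn from π_old itself, the inner integral over π_old reduces to a plain mean over the sampled actions, which is why the softmax weights need no extra importance correction.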
Application to Other Deep RL Algorithms. Since the constraints ε_k in (2) encode the preferences over objectives, solving this optimization problem with good satisfaction of the constraints is key for learning a policy that satisfies the desired preferences. For nonparametric action distributions q_k(a|s), we can satisfy these constraints exactly. One could instead use any policy gradient method (e.g., Schulman et al., 2015; 2017; Heess et al., 2015; Haarnoja et al., 2018) to obtain q_k(a|s) in a parametric form. However, solving the constrained optimization for parametric q_k(a|s) is not exact, and the constraints may not be well satisfied, which impedes the use of ε_k to encode preferences. Moreover, assuming a parametric q_k(a|s) requires maintaining a function approximator (e.g., a neural network) per objective, which can significantly increase the complexity of the algorithm and limit scalability.
Choosing ε_k. It is more intuitive to encode preferences via ε_k than via scalarization weights, because the former is invariant to the scale of the rewards. In other words, having a desired preference across objectives narrows down the range of reasonable choices for ε_k, but does not narrow down the range of reasonable choices for scalarization weights. In order to identify reasonable scalarization weights, an RL practitioner needs to additionally be familiar with the scale of the rewards for each objective. In practice, we have found that learning performance is robust to a wide range of scales for ε_k. It is the relative scales of the ε_k that matter for encoding preferences over objectives: the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no influence and will effectively be ignored. In Appendix A1 we provide suggestions for setting ε_k, given a desired preference over objectives.
422 FITTING A NEW PARAMETRIC POLICY (STEP 2)
In the previous section, for each objective k we obtained an improved action distribution q_k(a|s). Next we want to combine these distributions to obtain a single parametric policy that trades off the objectives according to the constraints ε_k that we set. For this, we solve a supervised learning problem that fits a parametric policy to the per-objective action distributions from step 1:
θ_new = argmax_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,    (5)
where θ are the parameters of our policy (a neural network), and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy. The KL constraint in this step has a regularization effect that prevents the policy from overfitting to the sample-based action distributions, and therefore avoids premature convergence and improves stability of learning (Schulman et al., 2015; Abdolmaleki et al., 2018a;b).
Similar to the first policy improvement step, we evaluate the integrals by using the L states sampled from the replay buffer and the M actions per state sampled from the old policy. In order to optimize (5) using gradient descent, we employ Lagrangian relaxation, similar to in MPO (Abdolmaleki et al., 2018a) (see Appendix C for more detail).
5 Experiments: Toy Domains

In the empirical evaluation that follows, we first demonstrate the mechanics and scale-invariance of MO-MPO in a single-state environment (Sec. 5.1), and then show that MO-MPO can find all Pareto-dominant policies in a popular MORL benchmark (Sec. 5.2). Finally, we show the benefit of using MO-MPO in high-dimensional continuous control domains, including on a real robot (Sec. 6). Appendices A and B contain a detailed description of all domains and tasks, experimental setups, and implementation details.
Baselines. The goal of our empirical evaluation is to analyze the benefit of using our proposed multi-objective policy improvement step (Sec. 4.2.2), which encodes preferences over objectives via constraints ε_k on expected KL divergences, rather than via weights w_k. Thus we primarily compare MO-MPO against scalarized MPO, which relies on linear scalarization weights w_k to encode preferences. The only difference between MO-MPO and scalarized MPO is the policy improvement step: for scalarized MPO, a single
Figure 2. A simple environment in which the agent starts at state S0 and chooses to navigate to one of three terminal states. There are two objectives. Taking the left action, for instance, leads to a reward of 1 for the first objective and 4 for the second.
improved action distribution q(a|s) is computed based on Σ_k w_k Q_k(s, a) and a single KL constraint ε.
State-of-the-art approaches that combine MORL with deep RL assume linear scalarization as well, either learning a separate policy for each setting of weights (Mossalam et al., 2016) or learning a single policy conditioned on scalarization weights (Friedman & Fontaine, 2018; Abels et al., 2019). Scalarized MPO addresses the former problem, which is easier. The policy evaluation step in scalarized MPO is analogous to scalarized Q-learning, proposed by Mossalam et al. (2016). As we show later in Sec. 6, even learning an optimal policy for a single scalarization is difficult in high-dimensional continuous control domains.
51 Simple World
First, we examine the behavior of MO-MPO in a simple multi-armed bandit with three actions (up, right, and left) (Fig. 2), inspired by Russell & Zimdars (2003). We train policies with scalarized MPO and with MO-MPO. The policy evaluation step is exact, because the Q-value function for each objective is known: it is equal to the reward received for that objective after taking each action, as labeled in Fig. 2.⁴
We consider three possible desired preferences: equal preference for the two objectives, preferring the first, and preferring the second. Encoding preferences in scalarized MPO amounts to choosing appropriate linear scalarization weights, and in MO-MPO it amounts to choosing appropriate ε's. We use the following weights and ε's:
• equal preference: weights [0.5, 0.5], or ε's [0.01, 0.01]
• prefer first: weights [0.9, 0.1], or ε's [0.01, 0.002]
• prefer second: weights [0.1, 0.9], or ε's [0.002, 0.01]
We set ε = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.

⁴The policy improvement step can also be computed exactly, because solving for the optimal temperature η (or η₁ and η₂ in the MO-MPO case) is a convex optimization problem, and the KL-constrained policy update is also a convex optimization problem when there is only one possible state. We use CVXOPT (Andersen et al., 2020) as our convex optimization solver.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), we see that the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right) in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step MO-MPO optimizes for a separate temperature η_k that scales each objective's Q-value function. This η_k is computed based on the corresponding allowed KL divergence ε_k, so when the rewards for an objective k are multiplied by a factor but ε_k remains the same, the computed η_k ends up being scaled by that factor as well, neutralizing the effect of the reward scaling (see Eq. (4)).
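This scale-invariance can be checked numerically. In the hypothetical sketch below, we solve the single-state dual for a bandit's Q-values and again for the same Q-values scaled by 20, with the same ε; the solved temperature scales by roughly 20 and the induced action distribution is unchanged (the Q-values and the grid-search solver are our own illustration, not the paper's CVXOPT setup):

```python
import numpy as np

def log_mean_exp(x):
    m = np.max(x)
    return m + np.log(np.mean(np.exp(x - m)))

def solve_eta(q_vals, epsilon):
    """Minimize the convex dual g(eta) = eta*eps + eta*log E_{pi_old}[exp(Q/eta)]
    by a dense grid search -- crude, but sufficient for this illustration
    (pi_old is uniform over the actions here)."""
    etas = np.geomspace(1e-3, 1e3, 20000)
    duals = np.array([eta * epsilon + eta * log_mean_exp(q_vals / eta)
                      for eta in etas])
    return etas[np.argmin(duals)]

q = np.array([1.0, 4.0, 3.0])    # illustrative Q-values of three actions
eps = 0.01
eta1 = solve_eta(q, eps)
eta2 = solve_eta(20.0 * q, eps)  # rewards for this objective scaled by 20x

w1 = np.exp(q / eta1)
w1 = w1 / w1.sum()
w2 = np.exp(20.0 * q / eta2)
w2 = w2 / w2.sum()
# eta2/eta1 is ~20, and w1 is ~equal to w2: the induced q_k is unchanged
```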
Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of the temperatures η_k per objective becomes even more essential (Sec. 6).
The scale of ε_k controls the amount that objective k can influence the policy's update. If we set ε₁ = 0.01 and sweep ε₂ over the range from 0 to 0.01, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies that quickly converge to placing all probability on a single action (Fig. 4, left). We hypothesize that this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
52 Deep Sea Treasure
An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s₀ = (0, 0)) and has a choice of four actions (moving one square up, right, down, or left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
We ran scalarized MPO with weightings [w, 1 − w] and
Figure 3. When the two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.
Figure 4. A visualization of policies during learning: each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1 − x − y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
w ∈ {0, 0.01, 0.02, …, 1}. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵
We ran MO-MPO on this task as well, for a range of ε: ε_time ∈ {0.01, 0.02, 0.05} and ε_treasure = c · ε_time, where c ∈ {0.5, 0.51, 0.52, …, 1.5}. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than their exact settings: across all settings of ε_time, similar ratios of ε_treasure to ε_time result in similar policies, and as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).
⁵Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for ε_k discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of ε_k settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of ε_k.
6 Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's rather than via weights is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid. We use the humanoid run task defined in the DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand. We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain." A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap. We consider a large-scale humanoid motion capture tracking task, similar to that in Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole. We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of the Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware: if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action, for three timesteps.
61 Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings ε_k and scalarization weights w_k, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
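For two objectives (both maximized), the hypervolume reduces to the area of a union of rectangles and can be computed with a simple sweep; the following is an illustrative sketch, not the DEAP implementation:

```python
def hypervolume_2d(points, ref):
    """Hypervolume dominated by `points` with respect to reference point `ref`.

    Both objectives are maximized; `ref` must be dominated by every point.
    """
    # Sort by the first objective descending, then sweep the second objective
    # upward, accumulating the rectangle each point adds beyond the previous one.
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:           # dominated points add no area and are skipped
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, ref=(0.0, 0.0))
```

A larger hypervolume indicates a front that is both closer to the true Pareto front and more spread out, which is why the metric rewards coverage as well as optimality.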
⁶This task was developed concurrently by Anonymous (2020) and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2 Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO, with respect to the hypervolume metric (Table 1).7 In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation. We also ran vanilla MPO on humanoid run with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
7 We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                            scalarized MPO    MO-MPO
Humanoid run (mid-training)     1.1×10⁶           3.3×10⁶
Humanoid run                    6.4×10⁶           7.1×10⁶
Humanoid run, 1× penalty        5.0×10⁶           5.9×10⁶
Humanoid run, 10× penalty       2.6×10⁷           6.9×10⁷
Shadow touch (mid-training)     1.43              1.56
Shadow touch                    1.62              1.64
Shadow turn                     1.44              1.54
Shadow orient                   2.8×10⁴           2.8×10⁴
Humanoid mocap                  3.86×10⁻⁶         3.41×10⁻⁶

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).
6.3 Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.8 None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
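The dominance comparisons used throughout this section can be made precise with a short sketch (illustrative code, not from the paper; both function names are ours): a reward vector dominates another if it is at least as good in every objective and strictly better in at least one, and the Pareto front is the nondominated subset.

```python
def dominates(a, b):
    """True if reward vector `a` Pareto-dominates `b` (maximization):
    at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(outcomes):
    """Return the nondominated subset of a list of reward vectors."""
    return [a for a in outcomes
            if not any(dominates(b, a) for b in outcomes if b != a)]
```

For instance, among the outcomes (1, 5), (2, 4), (3, 3), and (2, 2), only (2, 2) is dominated (by (2, 4)), so the other three form the Pareto front.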
6.4 Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
8 For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7 Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed that these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References

Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41–47, 2008.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gábor, Z., Kalmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements enforced by the softplus transform Aii ← log(1 + exp(Aii)), to ensure positive definiteness of the diagonal covariance matrix.
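The covariance parametrization described above can be sketched as follows (a NumPy illustration of the transform, not the authors' code; the function names are ours):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), elementwise and always positive
    return np.log1p(np.exp(x))

def gaussian_policy_params(raw_mean, raw_diag):
    """Map raw network outputs to (mu, Sigma) with Sigma = A A^T,
    where A is diagonal with softplus-transformed entries."""
    diag = softplus(raw_diag)   # enforce A_ii > 0
    A = np.diag(diag)
    Sigma = A @ A.T             # positive-definite diagonal covariance
    return raw_mean, Sigma
```

Because A is diagonal, Σ is simply the diagonal matrix of squared softplus outputs, which guarantees positive definiteness for any raw network output.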
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1 Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                                 Default

policy network
  layer sizes                                  (300, 200)
  take tanh of action mean                     yes
  minimum variance                             10⁻¹²
  maximum variance                             unbounded

critic network(s)
  layer sizes                                  (400, 400, 300)
  take tanh of action                          yes
  Retrace sequence size                        8
  discount factor γ                            0.99

both policy and critic networks
  layer norm on first layer                    yes
  tanh on output of layer norm                 yes
  activation (after each hidden layer)         ELU

MPO
  actions sampled per state                    20
  default ε                                    0.1
  KL-constraint on policy mean βμ              10⁻³
  KL-constraint on policy covariance βΣ        10⁻⁵
  initial temperature η                        1

training
  batch size                                   512
  replay buffer size                           10⁶
  Adam learning rate                           3×10⁻⁴
  Adam ε                                       10⁻³
  target network update period                 200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence will lead the policy in the wrong direction. On the other hand, if εk is set too low, then learning slows down, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                                (400, 400, 300)
    take tanh of action mean                   no
    minimum variance                           10⁻⁸
  critic network(s)
    layer sizes                                (500, 500, 400)
    take tanh of action                        no
  MPO
    KL-constraint on policy mean βμ            5×10⁻³
    KL-constraint on policy covariance βΣ      10⁻⁶
  training
    Adam learning rate                         2×10⁻⁴
    Adam ε                                     10⁻⁸

Shadow Hand
  MPO
    actions sampled per state                  30
    default ε                                  0.01

Sawyer
  MPO
    actions sampled per state                  15
  training
    target network update period               100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as εk is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of εk is what matters. The larger the relative scale of εk is compared to εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, then objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition and settings

Humanoid run (one seed per setting)
  scalarized MPO: wtask = 1 − wpenalty, wpenalty ∈ linspace(0, 0.15, 100)
  MO-MPO: εtask = 0.1, εpenalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: wtask = 1 − wpenalty, wpenalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: εtask = 0.1, εpenalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: wtask = 1 − wpenalty, wpenalty ∈ linspace(0, 0.9, 10)
  MO-MPO: εtask = 0.01, εpenalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: wtouch = wheight = worientation = 1/3
  MO-MPO: εtouch = εheight = εorientation = 0.01

Table 4. Settings for εk and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A.2 Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
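The critic input described above might be assembled as in the following sketch (illustrative only; the helper name and constant are ours):

```python
import numpy as np

ACTIONS = ("up", "right", "down", "left")

def critic_input(state, action_index):
    """Concatenate the (x, y) state with a one-hot encoding of the
    chosen action, e.g. `up` (index 0) maps to [1, 0, 0, 0]."""
    one_hot = np.zeros(len(ACTIONS))
    one_hot[action_index] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])
```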
Humanoid mocap

Scalarized V-MPO
  KL-constraint on policy mean βμ              0.1
  KL-constraint on policy covariance βΣ        10⁻⁵
  default ε                                    0.1
  initial temperature η                        1

Training
  Adam learning rate                           10⁻⁴
  batch size                                   128
  unroll length (for n-step return, Sec. D.1)  32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3 Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1 Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Figure 8 grid: treasure values 23.7, 22.4, 20.3, 19.6, 16.1, 15.1, 14.0, 11.5, 8.2, 0.7]
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
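That two-objective reward can be written directly as a small helper (a hypothetical sketch, not the authors' environment code):

```python
def dst_reward(treasure_value=None):
    """Per-step reward vector [time penalty, treasure value] for Deep Sea
    Treasure; `treasure_value` is None unless a treasure is collected."""
    return [-1, treasure_value if treasure_value is not None else 0]
```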
B.2 Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30 and 30. The target angle of the dial is 0. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encodes end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
$$r_{\text{pain}}(s, a, s') = -\big[1 - a_\lambda(m(s, s'))\big]\, m(s, s') \tag{6}$$
where aλ(·) ∈ [0, 1] computes the acceptability of an impact force. aλ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use
$$a_\lambda(m) = \mathrm{sigmoid}\big(\lambda_1(-m + \lambda_2)\big) \tag{7}$$
with λ = [2, 2]⊤. The relationship between pain penalty and impact force is plotted in Fig. 10.
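Eqs. (6) and (7) can be implemented directly; the following sketch (our own code, with λ = [2, 2]⊤ as above) reproduces the shape plotted in Fig. 10:

```python
import math

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)), Eq. (7).

    Monotonically decreasing in the impact force m, with values in [0, 1].
    """
    return 1.0 / (1.0 + math.exp(-lam1 * (-m + lam2)))

def pain_penalty(m):
    """r_pain = -[1 - a_lambda(m)] * m, Eq. (6)."""
    return -(1.0 - acceptability(m)) * m
```

For a small impact force the acceptability is near 1, so the penalty is near zero; for a large impact force the acceptability vanishes and the penalty approaches −m.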
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
$$\min_{q \in Q_{\text{target}}} 1 - \tanh\big(2\, d(q, q_{\text{peg}})\big)$$
where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
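A sketch of these two shaped rewards (our own helper functions, not the paper's code; the quaternion distance uses the standard axis-angle angle between unit quaternions, and the reward is evaluated at the closest acceptable target):

```python
import math

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 * |z_target - z_peg|)."""
    return 1.0 - math.tanh(50.0 * abs(z_target - z_peg))

def quat_angle(q1, q2):
    """Axis-angle distance (radians) between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))

def orientation_reward(q_peg, target_quats):
    """Shaped orientation reward: 1 - tanh(2 d), with d the distance to the
    closest of the (eight) acceptable target quaternions."""
    d = min(quat_angle(q, q_peg) for q in target_quats)
    return 1.0 - math.tanh(2.0 * d)
```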
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).9 The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is in meters per second. For this objective, we learn a Q-function.
9This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖2. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly to evaluate a given action, in a state-independent way, during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020), and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.10
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:
$$r_{\text{com}} = \exp\big(-10\, \|p_{\text{com}} - p^{\text{ref}}_{\text{com}}\|^2\big)$$
where p_com and p^ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:
$$r_{\text{vel}} = \exp\big(-0.1\, \|q_{\text{vel}} - q^{\text{ref}}_{\text{vel}}\|^2\big)$$
where q_vel and q^ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:
$$r_{\text{app}} = \exp\big(-40\, \|p_{\text{app}} - p^{\text{ref}}_{\text{app}}\|^2\big)$$
where p_app and p^ref_app are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:
$$r_{\text{quat}} = \exp\big(-2\, \|q_{\text{quat}} \ominus q^{\text{ref}}_{\text{quat}}\|^2\big)$$
where ⊖ denotes quaternion differences, and q_quat and q^ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

10This database is available at mocap.cs.cmu.edu.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:
$$r_{\text{trunc}} = 1 - \frac{1}{0.3} \underbrace{\big(\|b_{\text{pos}} - b^{\text{ref}}_{\text{pos}}\|_1 + \|q_{\text{pos}} - q^{\text{ref}}_{\text{pos}}\|_1\big)}_{=\,\varepsilon}$$
where b_pos and b^ref_pos correspond to the body positions of the simulated character and the mocap reference, and q_pos and q^ref_pos correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
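The five components and the termination test above can be summarized in one function (a sketch with our own naming, not the paper's code; the squared quaternion-difference norm is passed in precomputed, since the quaternion algebra is orthogonal to the reward shaping):

```python
import math

def sq_norm(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def l1_norm(x, y):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mocap_rewards(p_com, p_com_ref, q_vel, q_vel_ref,
                  p_app, p_app_ref, quat_dist_sq,
                  b_pos, b_pos_ref, q_pos, q_pos_ref):
    """The five tracking reward components, plus the early-termination flag."""
    r_com = math.exp(-10.0 * sq_norm(p_com, p_com_ref))
    r_vel = math.exp(-0.1 * sq_norm(q_vel, q_vel_ref))
    r_app = math.exp(-40.0 * sq_norm(p_app, p_app_ref))
    r_quat = math.exp(-2.0 * quat_dist_sq)
    eps = l1_norm(b_pos, b_pos_ref) + l1_norm(q_pos, q_pos_ref)
    r_trunc = 1.0 - eps / 0.3
    terminate = eps > 0.3  # keeps r_trunc within [0, 1]
    return r_com, r_vel, r_app, r_quat, r_trunc, terminate
```

With perfect tracking, all five components evaluate to 1 and the episode continues.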
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form
$$r = \frac{1}{2} r_{\text{trunc}} + \frac{1}{2}\big(0.1\, r_{\text{com}} + \lambda\, r_{\text{vel}} + 0.15\, r_{\text{app}} + 0.65\, r_{\text{quat}}\big) \tag{8}$$
varying λ as described in the main paper (Sec. 6.3).
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as
$$r_{\text{insertion}} = \max\big(0.2\, r_{\text{approach}},\; r_{\text{inserted}}\big) \tag{9}$$
$$r_{\text{approach}} = s(p^{\text{peg}}, p^{\text{opening}}) \tag{10}$$
$$r_{\text{inserted}} = r_{\text{aligned}} \cdot s(p^{\text{peg}}_z, p^{\text{bottom}}_z) \tag{11}$$
$$r_{\text{aligned}} = \|p^{\text{peg}}_{xy} - p^{\text{opening}}_{xy}\| < d/2 \tag{12}$$
where p^peg is the peg position, p^opening the position of the hole opening, p^bottom the bottom position of the hole, and d the diameter of the hole.
s(p1, p2) is a shaping operator:
$$s(p_1, p_2) = 1 - \tanh^2\Big(\frac{\mathrm{atanh}(\sqrt{l})}{\epsilon}\, \|p_1 - p_2\|_2\Big) \tag{13}$$
which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p^peg = p^opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
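Eqs. (9)–(13) can be combined into a short sketch (our own function names; positions are (x, y, z) tuples):

```python
import math

def shape(p1, p2, l, eps):
    """Shaping operator s(p1, p2), Eq. (13): equals 1 - l when
    ||p1 - p2|| == eps, and approaches 1 as the distance goes to 0."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
    return 1.0 - math.tanh(math.atanh(math.sqrt(l)) / eps * dist) ** 2

def insertion_reward(p_peg, p_opening, p_bottom, d):
    """r_insertion = max(0.2 * r_approach, r_inserted), Eqs. (9)-(12)."""
    r_approach = shape(p_peg, p_opening, l=0.95, eps=0.01)
    aligned = math.hypot(p_peg[0] - p_opening[0],
                         p_peg[1] - p_opening[1]) < d / 2.0
    # vertical shaping toward the bottom of the hole, gated on alignment
    r_inserted = float(aligned) * shape((p_peg[2],), (p_bottom[2],),
                                        l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)
```

With the peg exactly at the opening of a 0.04 m deep hole, this recovers r_inserted = 0.5, matching the choice of ε_inserted discussed above.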
The reward for the secondary objective, minimizing wrist force measurements, is defined as
$$r_{\text{force}} = -\|F\|_1 \tag{14}$$
where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps, by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy πold.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor
1: given N reward functions {rk(s, a)}Nk=1, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   // Collect trajectory from environment
5:   τ = {}
6:   for t = 0, . . . , T do
7:     at ∼ πθ(·|st)
8:     // Execute action and determine rewards
9:     r = [r1(st, at), . . . , rN(st, at)]
10:    τ ← τ ∪ {(st, at, r, πθ(at|st))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Qφk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy πold. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s0, a0, r0, s1), . . . , (sT, aT, rT, sT+1)}, where rt denotes a reward vector {rk(st, at)}Nk=1 that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
$$\min_{\phi_k} \; \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\big(Q^{\text{ret}}_k(s_t, a_t) - Q_{\phi_k}(s_t, a_t)\big)^2\Big] \tag{15}$$
with
$$Q^{\text{ret}}_k(s_t, a_t) = Q_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big) \delta_j$$
$$\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - Q_{\phi'_k}(s_j, a_j)$$
$$V(s_{j+1}) = \mathbb{E}_{\pi_{\text{old}}(a|s_{j+1})}\big[Q_{\phi'_k}(s_{j+1}, a)\big]$$
Algorithm 3 MO-MPO: Asynchronous Learner
1: given batch size L, number of actions M, N Q-functions, N preferences {εk}Nk=1, policy networks, and replay buffer D
2: initialize Lagrangians {ηk}Nk=1 and ν, target networks, and online networks, such that πθ′ = πθ and Qφ′k = Qφk
3: repeat
4:   repeat
5:     Collect dataset {si, aij, Qijk} of size L × M × N, where
6:       si ∼ D, aij ∼ πθ′(a|si), and Qijk = Qφ′k(si, aij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, . . . , N do
10:      δηk ← ∇ηk [ ηk εk + ηk Σi (1/L) log ( Σj (1/M) exp(Qijk / ηk) ) ]
11:      Update ηk based on δηk
12:      qijk ∝ exp(Qijk / ηk)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δπ ← −∇θ Σi Σj Σk qijk log πθ(aij | si)
18:           + sg(ν) ( β − Σi KL(πθ′(a|si) ‖ πθ(a|si)) )
19:    δν ← ∇ν ν ( β − Σi KL(πθ′(a|si) ‖ π_sg(θ)(a|si)) )
20:    Update πθ based on δπ
21:    Update ν based on δν
22:
23:    // Update Q-functions
24:    for k = 1, . . . , N do
25:      δφk ← ∇φk Σ_{(st,at)∈τ∼D} ( Qφk(st, at) − Q^ret_k )²,
26:        with Q^ret_k as in Eq. (15)
27:      Update φk based on δφk
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    πθ′ = πθ, Qφ′k = Qφk
33: until convergence
The importance weights c_z are defined as
$$c_z = \min\Big(1, \frac{\pi_{\text{old}}(a_z|s_z)}{b(a_z|s_z)}\Big),$$
where b(az|sz) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set the product $\prod_{z=t+1}^{j} c_z = 1$.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
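A direct (unvectorized) sketch of the Retrace target above, for a single objective and trajectory (our own naming; in practice this is computed in a batched, vectorized form):

```python
def retrace_targets(rewards, q_old, v_old, c, gamma=0.99):
    """Compute Q^ret_k(s_t, a_t) for every t in a trajectory snippet,
    for one objective k (Eq. 15 and the recursion below it).

    rewards[t] = r_k(s_t, a_t); q_old[t] = Q_{phi'_k}(s_t, a_t);
    v_old[t]   = E_{pi_old}[Q_{phi'_k}(s_t, .)], with one extra entry for s_{T+1};
    c[t]       = truncated importance weight c_t = min(1, pi_old / b).
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        q_ret = q_old[t]
        prod_c = 1.0  # empty product for j = t
        for j in range(t, T):
            if j > t:
                prod_c *= c[j]
            delta = rewards[j] + gamma * v_old[j + 1] - q_old[j]
            q_ret += gamma ** (j - t) * prod_c * delta
        targets.append(q_ret)
    return targets
```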
C.3. Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:
$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s) \, da \, ds$$
$$\text{s.t.} \int_s \mu(s) \, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds < \beta \tag{16}$$
We first write the generalized Lagrangian equation, i.e.,
$$\mathcal{L}(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s) \, da \, ds + \nu\Big(\beta - \int_s \mu(s) \, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds\Big) \tag{17}$$
where ν is the Lagrangian multiplier. Now we solve the following primal problem:
$$\max_\theta \min_{\nu > 0} \mathcal{L}(\theta, \nu).$$
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows for setting different KL bounds for the mean (βμ) and covariance (βΣ), which results in more stable learning. To do this, we first separate out the following two policies, for mean and covariance:
$$\pi^\mu_\theta(a|s) = \mathcal{N}\big(a;\, \mu_\theta(s), \Sigma_{\theta_{\text{old}}}(s)\big) \tag{18}$$
$$\pi^\Sigma_\theta(a|s) = \mathcal{N}\big(a;\, \mu_{\theta_{\text{old}}}(s), \Sigma_\theta(s)\big) \tag{19}$$
Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
$$\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s)\, \log\big(\pi^\mu_\theta(a|s)\, \pi^\Sigma_\theta(a|s)\big) \, da \, ds$$
$$\text{s.t.} \int_s \mu(s) \, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi^\mu_\theta(a|s)\big) \, ds < \beta_\mu,$$
$$\int_s \mu(s) \, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi^\Sigma_\theta(a|s)\big) \, ds < \beta_\Sigma \tag{20}$$
As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
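The decoupled objective can be sketched for the special case of diagonal Gaussians (our own simplification; the paper uses full covariances produced by neural networks):

```python
import numpy as np

def gauss_logpdf(a, mu, var):
    """Log-density of a diagonal Gaussian N(mu, diag(var)) at actions a."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (a - mu) ** 2 / var, axis=-1)

def decoupled_objective(actions, weights, mu, var, mu_old, var_old):
    """Sample-based objective of Eq. (20), for one state:
    the mean term pi^mu uses (online mu, target var) and the covariance
    term pi^Sigma uses (target mu, online var); weights are the
    nonparametric q_k(a|s) values for the sampled actions."""
    log_pi_mu = gauss_logpdf(actions, mu, var_old)     # pi^mu_theta
    log_pi_sigma = gauss_logpdf(actions, mu_old, var)  # pi^Sigma_theta
    return np.sum(weights * (log_pi_mu + log_pi_sigma))
```

When the online and target parameters coincide, the decoupled objective reduces to twice the ordinary weighted log-likelihood, as expected from the product inside the log.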
C.4. Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions qk(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature ηk, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO
In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy πold, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s0, a0, r0), . . . , (sT, aT, rT)}, where rt denotes a reward vector {rk(st, at)}Nk=1 that consists of rewards for all N objectives, we find value function parameters φk by optimizing the following objective:
$$\min_{\phi_1, \ldots, \phi_N} \sum_{k=1}^{N} \mathbb{E}_\tau\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)\big)^2\Big] \tag{21}$$
Here G^(T)_k(st, at) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:
$$G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t}\, r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_{\text{old}}}_{\phi_k}(s_T).$$
The advantages are then estimated as
$$A^{\pi_{\text{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t).$$
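A direct, unvectorized sketch of the T-step targets and advantages for one objective (our own naming):

```python
def nstep_returns_and_advantages(rewards, values, gamma=0.99):
    """T-step targets G^(T)_k and advantages for one objective k.

    rewards[t] = r_k(s_t, a_t) for t = 0..T-1;
    values[t]  = V_{phi_k}(s_t) for t = 0..T (last entry used to bootstrap).
    """
    T = len(rewards)
    targets, advantages = [], []
    for t in range(T):
        g = gamma ** (T - t) * values[T]  # bootstrap from the value function
        for ell in range(t, T):
            g += gamma ** (ell - t) * rewards[ell]
        targets.append(g)
        advantages.append(g - values[t])
    return targets, advantages
```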
D.2. Multi-objective policy improvement in the on-policy setting
Given the previous policy πold(a|s) and estimated advantages {A^πold_k(s, a)}k=1,...,N associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution qk(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy πnew(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution qk(s, a) rather than local policies qk(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions qk(s, a), we optimize the RL objective for each objective Ak:
$$\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a) \, da \, ds \tag{22}$$
$$\text{s.t.} \; \mathrm{KL}\big(q_k(s, a) \,\|\, p_{\text{old}}(s, a)\big) < \epsilon_k,$$
where the KL divergence is computed over all (s, a), εk denotes the allowed expected KL divergence, and pold(s, a) = μ(s) πold(a|s) is the state-action distribution associated with πold.
As in MO-MPO, we use these εk to define the preferences over objectives. More concretely, εk defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular εk is with respect to the others, the more that objective k is preferred. On the other hand, if εk = 0, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:
$$q_k(s, a) \propto p_{\text{old}}(s, a) \exp\Big(\frac{A_k(s, a)}{\eta_k}\Big) \tag{23}$$
where the temperature ηk is computed based on the constraint εk, by solving the following convex dual problem:
$$\eta_k = \arg\min_{\eta} \Big[\eta\, \epsilon_k + \eta \log \int_{s,a} p_{\text{old}}(s, a) \exp\Big(\frac{A_k(s, a)}{\eta}\Big) \, da \, ds\Big] \tag{24}$$
We can perform this optimization along with the loss, by taking a gradient descent step on ηk, and we initialize it with the solution found in the previous policy iteration step. Since ηk must be positive, we use a projection operator after each gradient step to maintain ηk > 0.
Top-k advantages. As in Song et al. (2020), in practice we used the samples corresponding to the top 50% of advantages in each batch of data.
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy πnew(a|s) that favors all of the objectives according to the preferences specified by εk. For this, we solve a supervised learning problem that fits a parametric policy, as follows:
$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s) \, da \, ds$$
$$\text{s.t.} \int_s \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds < \beta \tag{25}$$
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy πold, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events {Rk}Nk=1, one for each objective. Rk = 1 indicates that our policy has improved for objective k, while Rk = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {Rk = 1}Nk=1, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {Rk = 1}Nk=1:
$$\max_\theta \; p_\theta(R_1 = 1, R_2 = 1, \ldots, R_N = 1)\; p(\theta) \tag{26}$$
where the probability of the improvement events depends on θ. Assuming independence between the improvement events Rk, and using log probabilities, leads to the following objective:
$$\max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta) \tag{27}$$
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that πθ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution qk(s, a) per objective, and use it to decompose the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) \tag{28}$$
$$= \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(s, a \,|\, R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big)$$
The (negated) second term of this decomposition can be expanded as
$$-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) \tag{29}$$
$$= \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\Big] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p(R_k = 1 \,|\, s, a)\; \pi_\theta(a|s)\; \mu(s)}{q_k(s, a)}\Big],$$
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(Rk = 1|s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. Both πθ(a|s) and qk(a|s) = qk(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate qk(a|s) for each objective k, by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {qk(a|s)}Nk=1 such that the lower bound on $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose qk to minimize this term. We can optimize each variational distribution qk independently:
$$q_k(a|s) = \arg\min_{q} \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s) \,\|\, p_{\theta'}(s, a \,|\, R_k = 1)\big)\Big]$$
$$= \arg\min_{q} \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\big[\log p(R_k = 1 \,|\, s, a)\big]\Big] \tag{30}$$
We can solve this optimization problem in closed form, which gives us
$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\; p(R_k = 1 \,|\, s, a)}{\int \pi_{\theta'}(a|s)\; p(R_k = 1 \,|\, s, a) \, da}.$$
This solution weighs the actions based on their relative improvement likelihood p(Rk = 1|s, a) for each objective. We define the likelihood of an improvement event Rk as
$$p(R_k = 1 \,|\, s, a) \propto \exp\Big(\frac{Q_k(s, a)}{\alpha_k}\Big),$$
where αk is an objective-dependent temperature parameter that controls how greedy the solution qk(a|s) is with respect to its associated objective Qk(s, a). For example, given a batch of state-action pairs, at the extremes, an αk of zero would give all probability to the state-action pair with the maximum Q-value, while an αk of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for αk, we plug the exponential transformation into (30), which leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a) \, da \, ds \;-\; \alpha_k \int \mu(s) \, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) \, ds \tag{31}$$
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a) \, da \, ds \tag{32}$$
$$\text{s.t.} \int \mu(s) \, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) \, ds < \epsilon_k,$$
where, as described in the main paper, the parameter εk defines preferences over objectives. If we now assume that qk(a|s) is a non-parametric sample-based distribution, as in Abdolmaleki et al. (2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting
$$q(a|s) \propto \pi_{\theta'}(a|s) \exp\Big(\frac{Q_k(s, a)}{\eta_k}\Big), \tag{33}$$
where ηk is computed by solving the following convex dual function:
$$\eta_k = \arg\min_{\eta} \Big[\eta\, \epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s) \exp\Big(\frac{Q_k(s, a)}{\eta}\Big) \, da \, ds\Big] \tag{34}$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in Abdolmaleki et al. (2018b) for details on the derivation of the dual function.
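A sample-based sketch of Eqs. (33) and (34) for a batch of per-state, per-action Q-values (our own code; a simple grid search over η stands in for the gradient-based temperature updates used in practice):

```python
import numpy as np

def dual(eta, q_values, eps):
    """Sample-based dual of Eq. (34): g(eta) = eta*eps +
    eta * mean_s log mean_a exp(Q/eta), for M actions drawn from pi_old.
    q_values has shape (num_states, num_actions)."""
    z = q_values / eta
    # log-mean-exp over the action axis, numerically stabilized
    lme = np.log(np.mean(np.exp(z - z.max()), axis=-1)) + z.max()
    return eta * eps + eta * np.mean(lme)

def improved_weights(q_values, eps, etas=np.geomspace(1e-3, 1e3, 200)):
    """Pick the eta minimizing the dual on a grid, then form the
    per-state weights q(a|s) proportional to exp(Q/eta), Eq. (33)."""
    eta = min(etas, key=lambda e: dual(e, q_values, eps))
    w = np.exp((q_values - q_values.max(axis=-1, keepdims=True)) / eta)
    return w / w.sum(axis=-1, keepdims=True), eta
```

A larger εk permits a larger deviation from πold, so the resulting weights concentrate more sharply on the high-Q actions; a small εk keeps them close to uniform over the sampled actions.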
E.2. M-Step
After obtaining the variational distributions qk(a|s) for each objective k, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ when θ = θ′. We can now obtain the parameters θ of the updated policy πθ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s) \, da \, ds + \log p(\theta).$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy πθ(a|s) is close to the target policy πθ′(a|s). We therefore optimize
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s) \, da \, ds$$
$$\text{s.t.} \int \mu(s) \, \mathrm{KL}\big(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds < \beta.$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See Abdolmaleki et al. (2018b) for more details.
no influence and will effectively be ignored. In Appendix A.1, we provide suggestions for setting εk, given a desired preference over objectives.
4.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
In the previous section, for each objective k, we obtained an improved action distribution qk(a|s). Next, we want to combine these distributions, to obtain a single parametric policy that trades off the objectives according to the constraints εk that we set. For this, we solve a supervised learning problem that fits a parametric policy to the per-objective action distributions from step 1:
$$\theta_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s) \, da \, ds$$
$$\text{s.t.} \int_s \mu(s) \, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds < \beta \tag{5}$$
where θ are the parameters of our policy (a neural network), and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy. The KL constraint in this step has a regularization effect that prevents the policy from overfitting to the sample-based action distributions, and therefore avoids premature convergence and improves stability of learning (Schulman et al., 2015; Abdolmaleki et al., 2018a;b).
Similar to the first policy improvement step, we evaluate the integrals by using the L states sampled from the replay buffer and the M actions per state sampled from the old policy. In order to optimize (5) using gradient descent, we employ Lagrangian relaxation, similar to in MPO (Abdolmaleki et al., 2018a) (see Appendix C for more detail).
5. Experiments: Toy Domains

In the empirical evaluation that follows, we will first demonstrate the mechanics and scale-invariance of MO-MPO in a single-state environment (Sec. 5.1), and then show that MO-MPO can find all Pareto-dominant policies in a popular MORL benchmark (Sec. 5.2). Finally, we show the benefit of using MO-MPO in high-dimensional continuous control domains, including on a real robot (Sec. 6). Appendices A and B contain a detailed description of all domains and tasks, experimental setups, and implementation details.
Baselines. The goal of our empirical evaluation is to analyze the benefit of using our proposed multi-objective policy improvement step (Sec. 4.2.2), which encodes preferences over objectives via constraints εk on expected KL divergences, rather than via weights wk. Thus, we primarily compare MO-MPO against scalarized MPO, which relies on linear scalarization weights wk to encode preferences. The only difference between MO-MPO and scalarized MPO is the policy improvement step: for scalarized MPO, a single
Figure 2. A simple environment in which the agent starts at state S0 and chooses to navigate to one of three terminal states. There are two objectives. Taking the left action, for instance, leads to a reward of 1 for the first objective and 4 for the second.
improved action distribution q(a|s) is computed, based on Σk wk Qk(s, a) and a single KL constraint ε.
State-of-the-art approaches that combine MORL with deep RL assume linear scalarization as well, either learning a separate policy for each setting of weights (Mossalam et al., 2016) or learning a single policy conditioned on scalarization weights (Friedman & Fontaine, 2018; Abels et al., 2019). Scalarized MPO addresses the former problem, which is easier. The policy evaluation step in scalarized MPO is analogous to scalarized Q-learning, proposed by Mossalam et al. (2016). As we show later in Sec. 6, even learning an optimal policy for a single scalarization is difficult in high-dimensional continuous control domains.
5.1. Simple World
First, we will examine the behavior of MO-MPO in a simple multi-armed bandit with three actions (up, right, and left) (Fig. 2), inspired by Russell & Zimdars (2003). We train policies with scalarized MPO and with MO-MPO. The policy evaluation step is exact, because the Q-value function for each objective is known: it is equal to the reward received for that objective after taking each action, as labeled in Fig. 2.4
We consider three possible desired preferences: equal preference for the two objectives, preferring the first, and preferring the second. Encoding preferences in scalarized MPO amounts to choosing appropriate linear scalarization weights, and in MO-MPO amounts to choosing appropriate ε's. We use the following weights and ε's:
• equal preference: weights [0.5, 0.5], or ε's [0.01, 0.01]
• prefer first: weights [0.9, 0.1], or ε's [0.01, 0.002]
• prefer second: weights [0.1, 0.9], or ε's [0.002, 0.01]
We set ε = 0.01 for scalarized MPO. If we start with a uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.

⁴The policy improvement step can also be computed exactly, because solving for the optimal temperature η (or η1 and η2 in the MO-MPO case) is a convex optimization problem, and the KL-constrained policy update is also a convex optimization problem when there is only one possible state. We use CVXOPT (Andersen et al., 2020) as our convex optimization solver.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), we see that the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right) in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step, MO-MPO optimizes for a separate temperature ηk that scales each objective's Q-value function. This ηk is computed based on the corresponding allowed KL-divergence εk, so when the rewards for any objective k are multiplied by a factor but εk remains the same, the computed ηk ends up being scaled by that factor as well, neutralizing the effect of the scaling of rewards (see Eq. (4)).
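This scale-invariance can be checked numerically in the single-state bandit setting. The sketch below is our own illustration, not the paper's code: the Q-values are hypothetical, and the convex dual for the temperature η is minimized by ternary search rather than CVXOPT. It verifies that scaling the rewards by 20 scales the solved η by 20, leaving the improved distribution q(a) ∝ π(a) exp(Q(a)/η) unchanged.

```python
import math

def dual(eta, q_vals, probs, eps):
    """KL-constrained E-step dual: g(eta) = eta*eps + eta*log sum_a pi(a) exp(Q(a)/eta)."""
    m = max(q_vals)  # subtract the max for numerical stability (log-sum-exp trick)
    s = sum(p * math.exp((q - m) / eta) for q, p in zip(q_vals, probs))
    return eta * eps + m + eta * math.log(s)

def solve_eta(q_vals, probs, eps, lo=1e-6, hi=1e6, iters=300):
    """Ternary search for the minimizer of the convex dual over eta > 0."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if dual(m1, q_vals, probs, eps) < dual(m2, q_vals, probs, eps):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def improved_dist(q_vals, probs, eps):
    """Non-parametric improved distribution q(a) proportional to pi(a) exp(Q(a)/eta)."""
    eta = solve_eta(q_vals, probs, eps)
    m = max(q_vals)
    w = [p * math.exp((q - m) / eta) for q, p in zip(q_vals, probs)]
    z = sum(w)
    return eta, [x / z for x in w]

# Hypothetical Q-values for (left, up, right); uniform initial policy; eps = 0.01 as in the text.
pi = [1.0 / 3.0] * 3
eta1, q1 = improved_dist([1.0, 2.0, 3.0], pi, 0.01)
eta2, q2 = improved_dist([20.0, 40.0, 60.0], pi, 0.01)  # the same rewards, scaled by 20
```

Substituting η → 20η in the scaled dual recovers 20 times the unscaled dual, so the minimizer scales exactly by 20 and q is unchanged.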
Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of temperatures ηk per objective becomes more essential (Sec. 6).
The scale of εk controls the amount that objective k can influence the policy's update. If we set ε1 = 0.01 and sweep over the range from 0 to 0.01 for ε2, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies quickly converging to placing all probability on a single action (Fig. 4, left). We hypothesize this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
5.2. Deep Sea Treasure
An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s0 = (0, 0)) and has a choice of four actions (moving one square up, right, down, and left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
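Checking whether a learned policy lies on a Pareto front reduces to a nondominated-point filter over the policies' objective vectors. A minimal helper (our own sketch, not from the paper; both objectives are treated as maximization, so the time penalty enters as a negative number):

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the nondominated subset of a list of objective tuples (maximization)."""
    # dominates(p, p) is always False, so each point is only tested against genuine rivals.
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Example: (treasure value, negated time penalty) pairs for four hypothetical policies.
front = pareto_front([(1.0, -1.0), (2.0, -3.0), (0.5, -2.0), (2.0, -1.0)])
```

Here only (2.0, −1.0) survives, since it dominates the other three points.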
We ran scalarized MPO with weightings [w, 1 − w] and
Figure 3. When two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.
Figure 4. A visualization of policies during learning: each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1 − x − y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
w ∈ [0, 0.01, 0.02, ..., 1]. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵
We ran MO-MPO on this task as well, for a range of ε: εtime ∈ [0.01, 0.02, 0.05] and εtreasure = c · εtime, where c ∈ [0.5, 0.51, 0.52, ..., 1.5]. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than the exact settings: across all settings of εtime, similar ratios of εtreasure to εtime result in similar policies; as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).
⁵Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights, we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for εk discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of εk settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of εk.
6. Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's rather than via weights is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO, in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid: We use the humanoid run task defined in DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
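As a concrete reading of this two-objective reward structure, the per-objective rewards might be computed as below. This is a hedged sketch of our own, not the paper's code: we return the ℓ2-norm penalty as a negative reward so that both objectives are maximized, which is one of several equivalent sign conventions.

```python
import math

def humanoid_run_rewards(horizontal_speed, action):
    """Two-objective reward vector for the humanoid run task.

    horizontal_speed: horizontal speed in m/s.
    action: the (21-dimensional) action vector.
    """
    task_reward = min(horizontal_speed / 10.0, 1.0)          # saturates at 10 m/s
    energy_penalty = -math.sqrt(sum(a * a for a in action))  # negated l2-norm of the action
    return task_reward, energy_penalty
```

For example, a humanoid running at 5 m/s with action [3.0, 4.0] (in a toy 2-D action space) would receive the reward vector (0.5, −5.0).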
Shadow Hand: We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain." A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap: We consider a large-scale humanoid motion capture tracking task, similar to Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total, we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole: We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole, while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action for three timesteps.
6.1. Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings εk and scalarization weights wk, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
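In the two-objective case, the hypervolume has a simple closed form as a sum of rectangle areas, which makes the metric easy to sanity-check. Below is a minimal maximization-convention sketch of our own (not DEAP's API); the reference point must be dominated by every point in the set.

```python
def hypervolume_2d(points, ref):
    """Area of the region dominated by `points` and dominating `ref`,
    for two maximized objectives."""
    hv = 0.0
    best_y = ref[1]
    # Sweep from the largest x; each point contributes the rectangle
    # between the reference x and its own x, above the best y seen so far.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

# Three mutually nondominated points, reference at the origin: area 3 + 2 + 1 = 6.
hv = hypervolume_2d([(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)], (0.0, 0.0))
```

Dominated points are skipped automatically by the sweep (their y never exceeds `best_y`), so the input need not already be a Pareto front.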
⁶This task was developed concurrently by Anonymous (2020), and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2. Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO, with respect to the hypervolume metric (Table 1).⁷ In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation: We also ran vanilla MPO on humanoid run, with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
⁷We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                          scalarized MPO    MO-MPO
Humanoid run, mid-training    1.1×10⁶           3.3×10⁶
Humanoid run                  6.4×10⁶           7.1×10⁶
Humanoid run, 1× penalty      5.0×10⁶           5.9×10⁶
Humanoid run, 10× penalty     2.6×10⁷           6.9×10⁷
Shadow touch, mid-training    1.43              1.56
Shadow touch                  1.62              1.64
Shadow turn                   1.44              1.54
Shadow orient                 2.8×10⁴           2.8×10⁴
Humanoid Mocap                3.86×10⁻⁶         3.41×10⁻⁶

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).
6.3. Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.⁸ None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering, compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4. Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
⁸For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values, and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7. Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvári, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References

Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels, A., Roijers, D. M., Lenaerts, T., Nowé, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41–47, 2008.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gábor, Z., Kalmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements enforced by the softplus transform Aii ← log(1 + exp(Aii)), to ensure positive definiteness of the diagonal covariance matrix.
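The head of such a policy network can be sketched as follows. This is an illustrative plain-Python version of our own (the network body producing `mean_raw` and `diag_raw` is omitted, and the tanh on the action mean follows the Table 2 default):

```python
import math

def softplus(x):
    """Numerically safe log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def gaussian_policy_head(mean_raw, diag_raw):
    """Map raw network outputs to (mean, diagonal covariance).

    The diagonal Cholesky factor A gets positive entries via softplus,
    so Sigma = A A^T (here just Sigma_ii = A_ii^2) is positive definite.
    """
    mean = [math.tanh(m) for m in mean_raw]   # squash the action mean
    a_diag = [softplus(d) for d in diag_raw]  # positive diagonal of A
    cov_diag = [a * a for a in a_diag]        # Sigma_ii = A_ii^2
    return mean, cov_diag
```

A raw diagonal output of 0.0, for example, yields a standard deviation of softplus(0) = log 2 ≈ 0.693, and the covariance stays strictly positive even for very negative raw outputs.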
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
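The first-layer structure just described (linear, then layer norm, then tanh) can be sketched in plain Python; the weights below are hypothetical, and a real implementation would use a deep learning framework.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return [(xi - m) / math.sqrt(v + eps) for xi in x]

def first_layer(x, weights, bias):
    """First layer of the policy/critic networks: linear -> layer norm -> tanh."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [math.tanh(h) for h in layer_norm(pre)]

# Toy 2-D input through a 3-unit first layer with made-up weights.
h = first_layer([1.0, 2.0],
                [[0.5, -0.1], [0.3, 0.7], [-0.2, 0.4]],
                [0.0, 0.1, -0.1])
```

Because the pre-activations are normalized before the tanh, the first layer's outputs stay well inside the saturation region regardless of the input scale, which is one plausible reading of why this combination helps stability.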
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of ε_k for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via ε_k is often more intuitive than doing so via scalarization. In this subsection we seek to give an intuition on how to set ε_k for different desired preferences. Recall that each ε_k controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).

  policy network:
    layer sizes                              (300, 200)
    take tanh of action mean                 yes
    minimum variance                         10^-12
    maximum variance                         unbounded

  critic network(s):
    layer sizes                              (400, 400, 300)
    take tanh of action                      yes
    Retrace sequence size                    8
    discount factor γ                        0.99

  both policy and critic networks:
    layer norm on first layer                yes
    tanh on output of layer norm             yes
    activation (after each hidden layer)     ELU

  MPO:
    actions sampled per state                20
    default ε                                0.1
    KL-constraint on policy mean, βµ         10^-3
    KL-constraint on policy covariance, βΣ   10^-5
    initial temperature η                    1

  training:
    batch size                               512
    replay buffer size                       10^6
    Adam learning rate                       3×10^-4
    Adam ε                                   10^-3
    target network update period             200
and the current policy (Sec. 4.1). We generally choose ε_k in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all ε_k to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all ε_k to the same value, what should this value be? The larger ε_k is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting ε_k too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if ε_k is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Table 3. Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.

  Humanoid run:
    policy network layer sizes               (400, 400, 300)
    take tanh of action mean                 no
    minimum variance                         10^-8
    critic network layer sizes               (500, 500, 400)
    take tanh of action                      no
    KL-constraint on policy mean, βµ         5×10^-3
    KL-constraint on policy covariance, βΣ   10^-6
    Adam learning rate                       2×10^-4
    Adam ε                                   10^-8

  Shadow Hand:
    actions sampled per state                30
    default ε                                0.01

  Sawyer:
    actions sampled per state                15
    target network update period             100
Eventually, though, the learning will converge to more or less the same policy, as long as ε_k is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of the ε_k is what matters. The larger the scale of ε_k is relative to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near-zero for objective l, that objective will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g. an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) rather than doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Table 4. Settings for ε_k and weights; linspace(x, y, z) denotes a set of z evenly-spaced values between x and y.

  Humanoid run (one seed per setting):
    scalarized MPO   w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
    MO-MPO           ε_task = 0.1, ε_penalty ∈ linspace(10^-6, 0.15, 100)

  Humanoid run, normal vs. scaled (three seeds per setting):
    scalarized MPO   w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
    MO-MPO           ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

  Shadow Hand, touch and turn (three seeds per setting):
    scalarized MPO   w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
    MO-MPO           ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

  Shadow Hand, orient (ten seeds per setting):
    scalarized MPO   w_touch = w_height = w_orientation = 1/3
    MO-MPO           ε_touch = ε_height = ε_orientation = 0.01
case. If the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) for the Deep Sea Treasure domain: regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e. unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g. the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
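A minimal sketch of these discrete-action adjustments (function names are illustrative):

```python
import numpy as np

def one_hot(action, num_actions=4):
    """Encodes a discrete action as a one-hot vector (e.g. 'up' -> [1, 0, 0, 0])."""
    v = np.zeros(num_actions)
    v[action] = 1.0
    return v

def categorical_probs(logits):
    """Turns a vector of unnormalized log-probabilities (the policy network's
    output) into a categorical distribution."""
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

def critic_input(state, action, num_actions=4):
    """Critic input: the state concatenated with the one-hot action."""
    return np.concatenate([state, one_hot(action, num_actions)])
```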
Table 5. Hyperparameters for the humanoid mocap experiments.

  Scalarized V-MPO:
    KL-constraint on policy mean, βµ              0.1
    KL-constraint on policy covariance, βΣ        10^-5
    default ε                                     0.1
    initial temperature η                         1

  Training:
    Adam learning rate                            10^-4
    batch size                                    128
    unroll length (for n-step return, Sec. D.1)   32
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations with a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, and 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e. the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
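This reward structure can be sketched as follows; the treasure coordinates here are illustrative stand-ins (the real layout follows Fig. 8), and the sea-floor collision check is omitted for brevity:

```python
import numpy as np

# Illustrative treasure layout; the real environment places deeper
# treasures farther to the right, with the values shown in Fig. 8.
TREASURES = {(0, 1): 0.7, (1, 2): 8.2, (2, 3): 11.5}

MOVES = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}

def step(pos, action, width=11, height=10):
    """One Deep Sea Treasure transition: returns (next_pos, reward_vector, done).

    The reward vector is [time penalty, treasure value]: always -1 for the
    time objective, and the treasure's value (else 0) for the treasure
    objective. Out-of-bounds moves have no effect.
    """
    dx, dy = MOVES[action]
    nx, ny = pos[0] + dx, pos[1] + dy
    if not (0 <= nx < width and 0 <= ny < height):
        nx, ny = pos                       # invalid move: stay in place
    treasure = TREASURES.get((nx, ny), 0.0)
    done = treasure > 0                    # episode ends on pickup
    return (nx, ny), np.array([-1.0, treasure]), done
```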
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30 and 30 degrees. The target angle of the dial is 0 degrees. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target poses of the peg remain the same across episodes. The pose of the block is included in the agent's observation, encoded as the (x, y, z) positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here (pain penalty vs. impact force). For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e. the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
    r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] m(s, s′),    (6)
where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use
    a_λ(m) = sigmoid(λ₁(−m + λ₂)),    (7)
with λ = [2, 2]ᵀ. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
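Eqs. (6) and (7) transcribe directly into NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pain_penalty(impact_force, lam=(2.0, 2.0)):
    """Pain penalty from Eqs. (6)-(7): r_pain = -(1 - a(m)) * m,
    with acceptability a(m) = sigmoid(lam1 * (-m + lam2))."""
    acceptability = sigmoid(lam[0] * (-impact_force + lam[1]))
    return -(1.0 - acceptability) * impact_force
```

With λ = [2, 2]ᵀ, an impact force of 1 N is mostly "acceptable" and incurs a penalty near zero, while a 6 N impact is penalized at almost its full magnitude.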
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here (reward vs. z_peg for the height objective, and reward vs. d(θ, θ_peg) for the orientation objective).
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50|z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
    min_{q ∈ Q_target} 1 − tanh(2 d(q, q_peg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the negative ℓ2-norm of the action vector, i.e. r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function; instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

    r_com = exp(−10 ‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

    r_vel = exp(−0.1 ‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

    r_app = exp(−40 ‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

    r_quat = exp(−2 ‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes quaternion differences and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and in the Euclidean positions of a set of 13 body parts:

    r_trunc = 1 − (1/0.3) (‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁),

where the term in parentheses is denoted ε, b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].

¹⁰This database is available at mocap.cs.cmu.edu.
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

    r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as

    r_insertion = max(0.2 r_approach, r_inserted),    (9)
    r_approach = s(p_peg, p_opening),    (10)
    r_inserted = r_aligned · s(p_peg^z, p_bottom^z),    (11)
    r_aligned = ‖p_peg^xy − p_opening^xy‖ < d/2,    (12)

where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.
s(p₁, p₂) is a shaping operator:

    s(p₁, p₂) = 1 − tanh²( (atanh(√l) / ε) ‖p₁ − p₂‖₂ ),    (13)

which gives a reward of 1 − l when p₁ and p₂ are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
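Eq. (13) transcribes directly into NumPy:

```python
import numpy as np

def shape(p1, p2, l, eps):
    """Shaping operator from Eq. (13): returns exactly 1 - l when
    ||p1 - p2||_2 = eps, and 1 when the points coincide."""
    dist = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * dist) ** 2
```

At distance ε_approach = 0.01 with l_approach = 0.95, this returns exactly 1 − 0.95 = 0.05, matching the stated property.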
The reward for the secondary objective, minimizing wrist force measurements, is defined as

    r_force = −‖F‖₁,    (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for the Q-function associated with each objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

 1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
 2: repeat
 3:   Fetch current policy parameters θ from learner
 4:   Collect trajectory from environment
 5:   τ = {}
 6:   for t = 0, ..., T do
 7:     a_t ∼ π_θ(·|s_t)
 8:     Execute action and determine rewards
 9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:     τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:   end for
12:   Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q̂_φk(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of the N objectives, the Retrace objective is as follows:

    min_{φk} E_{τ∼D} [ (Q_k^ret(s_t, a_t) − Q̂_φk(s_t, a_t))² ],    (15)

with

    Q_k^ret(s_t, a_t) = Q̂_φ′k(s_t, a_t) + Σ_{j=t}^T γ^{j−t} (Π_{z=t+1}^j c_z) δ_j,
    δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q̂_φ′k(s_j, a_j),
    V(s_{j+1}) = E_{π_old(a|s_{j+1})} [Q̂_φ′k(s_{j+1}, a)].
Algorithm 3 MO-MPO: Asynchronous Learner

 1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
 2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks such that π_θ′ = π_θ and Q̂_φ′k = Q̂_φk
 3: repeat
 4:   repeat
 5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
 6:       s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q̂_φ′k(s_i, a_ij)
 7:
 8:     Compute action distribution for each objective
 9:     for k = 1, ..., N do
10:       δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ_i (1/L) log( Σ_j (1/M) exp(Q_ijk / η_k) ) ]
11:       Update η_k based on δ_ηk
12:       q_ijk ∝ exp(Q_ijk / η_k)
13:     end for
14:
15:     Update parametric policy with trust region
16:     (sg denotes a stop-gradient)
17:     δ_π ← −∇_θ Σ_i Σ_j Σ_k q_ijk log π_θ(a_ij|s_i)
18:            + sg(ν) (β − Σ_i KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)))
19:     δ_ν ← ∇_ν ν (β − Σ_i KL(π_θ′(a|s_i) ‖ π_sg(θ)(a|s_i)))
20:     Update π_θ based on δ_π
21:     Update ν based on δ_ν
22:
23:     Update Q-functions
24:     for k = 1, ..., N do
25:       δ_φk ← ∇_φk Σ_{(s_t, a_t) ∈ τ∼D} (Q̂_φk(s_t, a_t) − Q_k^ret)², with Q_k^ret as in Eq. 15
26:       Update φ_k based on δ_φk
27:     end for
28:   until fixed number of steps
29:   Update target networks: π_θ′ = π_θ, Q̂_φ′k = Q̂_φk
30: until convergence
The importance weights c_z are defined as

    c_z = min(1, π_old(a_z|s_z) / b(a_z|s_z)),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set (Π_{z=t+1}^j c_z) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
C.3. Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

    π_new = argmax_θ Σ_{k=1}^N ∫_s µ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
    s.t. ∫_s µ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds < β.    (16)
We first write the generalized Lagrangian, i.e.

    L(θ, ν) = Σ_{k=1}^N ∫_s µ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
              + ν (β − ∫_s µ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds),    (17)

where ν is the Lagrange multiplier. Now we solve the following primal problem:

    max_θ min_{ν>0} L(θ, ν).
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows for setting different KL bounds for the mean (βµ) and covariance (βΣ), which results in more stable learning. To do this, we first separate out the following two policies, one for the mean and one for the covariance:

    π_θ^µ(a|s) = N(a; µ_θ(s), Σ_θold(s)),    (18)
    π_θ^Σ(a|s) = N(a; µ_θold(s), Σ_θ(s)).    (19)

Policy π_θ^µ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π_θ^Σ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
    max_θ Σ_{k=1}^N ∫_s µ(s) ∫_a q_k(a|s) (log π_θ^µ(a|s) π_θ^Σ(a|s)) da ds
    s.t. ∫_s µ(s) KL(π_old(a|s) ‖ π_θ^µ(a|s)) ds < βµ,
         ∫_s µ(s) KL(π_old(a|s) ‖ π_θ^Σ(a|s)) ds < βΣ.    (20)
As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
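A sketch of this decoupling for diagonal Gaussians, with a closed-form KL helper for checking the two separate bounds βµ and βΣ (names are illustrative):

```python
import numpy as np

def decoupled_policies(mu_online, cov_online, mu_target, cov_target):
    """Builds the two decoupled Gaussians of Eqs. (18)-(19): pi_mu moves only
    the mean (covariance frozen at the target), pi_cov moves only the
    covariance (mean frozen at the target)."""
    pi_mu = (mu_online, cov_target)
    pi_cov = (mu_target, cov_online)
    return pi_mu, pi_cov

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))) in closed form."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                        - 1.0 + np.log(var1 / var0))
```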
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO
In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:

    min_{φ_1, ..., φ_N} Σ_{k=1}^N E_τ [ (G_k^(T)(s_t, a_t) − V_φk^πold(s_t))² ].    (21)

Here G_k^(T)(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

    G_k^(T)(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V_φk^πold(s_T).

The advantages are then estimated as A_k^πold(s_t, a_t) = G_k^(T)(s_t, a_t) − V_φk^πold(s_t).
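The T-step target and per-objective advantage, as a small illustrative sketch:

```python
def n_step_return(rewards, v_boot, gamma=0.95):
    """T-step target G^(T): discounted rewards r_t..r_{T-1} plus the
    discounted bootstrap value V(s_T) from the current value function."""
    T = len(rewards)
    return sum(gamma ** l * rewards[l] for l in range(T)) + gamma ** T * v_boot

def advantage(rewards, v_boot, v_t, gamma=0.95):
    """A(s_t, a_t) = G^(T)(s_t, a_t) - V(s_t), computed per objective."""
    return n_step_return(rewards, v_boot, gamma) - v_t
```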
D.2. Multi-objective policy improvement in the on-policy setting
Given the previous policy π_old(a|s) and the estimated advantages {A_k^πold(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective, using its advantages A_k:

    max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds    (22)
    s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,

where the KL-divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = µ(s) π_old(a|s) is the state-action distribution associated with π_old.
As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

    q_k(s, a) ∝ p_old(s, a) exp(A_k(s, a) / η_k),    (23)
where the temperature η_k is computed based on the constraint ε_k, by solving the following convex dual problem:

    η_k = argmin_η [ η ε_k + η log ∫_{s,a} p_old(s, a) exp(A_k(s, a) / η) da ds ].    (24)
We can perform this optimization along with the loss, by taking a gradient descent step on η_k, and we initialize η_k with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
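A sketch of this projected gradient step on η_k, using a sample-based estimate of the dual in Eq. (24); the finite-difference gradient and the learning rate are illustrative stand-ins for what autodiff and tuned hyperparameters would provide:

```python
import numpy as np

def dual_loss(eta, advantages, epsilon):
    """Sample-based dual of Eq. (24): eta*eps + eta*log E[exp(A/eta)],
    with the expectation over p_old approximated by sampled advantages."""
    a = advantages / eta
    lme = a.max() + np.log(np.mean(np.exp(a - a.max())))   # stable log-mean-exp
    return eta * epsilon + eta * lme

def eta_gradient_step(eta, advantages, epsilon, lr=0.01, eta_min=1e-6):
    """One gradient-descent step on eta, followed by a projection onto
    eta > 0 (here: clipping at a small positive floor)."""
    h = 1e-5
    grad = (dual_loss(eta + h, advantages, epsilon)
            - dual_loss(eta - h, advantages, epsilon)) / (2 * h)
    return max(eta - lr * grad, eta_min)
```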
Top-k advantages. As in Song et al. (2020), in practice we use the samples corresponding to the top 50% of advantages in each batch of data.
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives, according to the preferences specified by the ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:
    π_new = argmax_θ Σ_{k=1}^N ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
    s.t. ∫_s KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,    (25)
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to that employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events {Rk}, k = 1, ..., N, one for each objective. Rk = 1 indicates that our policy has improved for objective k, while Rk = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., Rk = 1 for all k, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {Rk = 1}, k = 1, ..., N:
$$\max_{\theta} \; p_\theta(R_1 = 1, R_2 = 1, \dots, R_N = 1) \, p(\theta) \qquad (26)$$
where the probability of improvement events depends on θ. Assuming independence between improvement events Rk, and using log probabilities, leads to the following objective:
$$\max_{\theta} \; \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta) \qquad (27)$$
The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that πθ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution qk(s, a) per objective, and use this to decompose the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) = \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(s, a \mid R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) \qquad (28)$$
The second term of this decomposition is expanded as:
$$-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\!\left[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\right] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\!\left[\log \frac{p(R_k = 1 \mid s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\right] \qquad (29)$$
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(Rk = 1|s, a) is the likelihood of the improvement event for objective k if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. πθ(a|s) and qk(a|s) = qk(s, a)/μ(s) are unknown, even though μ(s) is given.
In the E-step, we estimate qk(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then, in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {qk(a|s)}, k = 1, ..., N, such that the lower bound on $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose qk to minimize this term. We can optimize for each variational distribution qk independently:
$$q_k(a|s) = \operatorname*{argmin}_{q} \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s) \,\|\, p_{\theta'}(s, a \mid R_k = 1)\big)\Big] = \operatorname*{argmin}_{q} \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\big[\log p(R_k = 1 \mid s, a)\big]\Big] \qquad (30)$$
We can solve this optimization problem in closed form, which gives us
$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)\, da}$$
This solution weighs the actions based on their relative improvement likelihood p(Rk = 1|s, a) for each objective. We define the likelihood of an improvement event Rk as
$$p(R_k = 1 \mid s, a) \propto \exp\left(\frac{Q_k(s, a)}{\alpha_k}\right)$$
where αk is an objective-dependent temperature parameter that controls how greedy the solution qk(a|s) is with respect to its associated objective Qk(s, a). For example, given a batch of state-action pairs, at the extremes, an αk of zero would give all probability to the state-action pair with the maximum Q-value, while an αk of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for αk, we plug the exponential transformation into (30), which leads to
$$q_k(a|s) = \operatorname*{argmax}_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a) \, da \, ds - \alpha_k \int \mu(s)\, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) \, ds \qquad (31)$$
If we instead enforce the bound on KL divergence as a hard constraint (which we do in practice), that leads to:
$$q_k(a|s) = \operatorname*{argmax}_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a) \, da \, ds \qquad (32)$$
$$\text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) \, ds < \varepsilon_k$$
where, as described in the main paper, the parameter εk defines preferences over objectives. If we now assume that qk(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting
$$q_k(a|s) \propto \pi_{\theta'}(a|s) \exp\left(\frac{Q_k(s, a)}{\eta_k}\right) \qquad (33)$$
where ηk is computed by solving the following convex dual function:
$$\eta_k = \operatorname*{argmin}_{\eta} \; \eta\, \varepsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s) \exp\left(\frac{Q_k(s, a)}{\eta}\right) da \, ds \qquad (34)$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
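Putting Equations (33) and (34) together for a batch of sampled states and actions gives the following sketch (the brute-force grid search stands in for the gradient-based dual minimization actually used; the grid bounds are assumptions):

```python
import numpy as np


def solve_eta(q_vals, eps_k, grid=None):
    # Minimize the convex dual of Eq. (34) by 1-D search over a
    # log-spaced grid of candidate temperatures.
    # q_vals: [num_states, num_sampled_actions] Q-values of actions drawn
    # from pi_old, so the integrals become sample means.
    if grid is None:
        grid = np.geomspace(1e-3, 1e3, 500)

    def dual(eta):
        m = q_vals.max(axis=1, keepdims=True)  # for numerical stability
        lse = m.squeeze(1) / eta + np.log(np.mean(np.exp((q_vals - m) / eta), axis=1))
        return eta * eps_k + eta * lse.mean()

    return min(grid, key=dual)


def nonparametric_weights(q_vals, eps_k):
    # Eq. (33): q_k(a|s) ∝ pi_old(a|s) exp(Q_k/eta_k); with actions sampled
    # from pi_old, the per-state weights are softmax(Q/eta_k) over samples.
    eta = solve_eta(q_vals, eps_k)
    w = np.exp((q_vals - q_vals.max(axis=1, keepdims=True)) / eta)
    return w / w.sum(axis=1, keepdims=True), eta
```

A smaller εk yields a larger ηk and hence flatter weights (a less greedy update for that objective); a larger εk yields more sharply peaked weights.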
E.2 M-Step
After obtaining the variational distributions qk(a|s) for each objective k, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ when θ = θ′. We can now obtain the parameters θ of the updated policy πθ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain:
$$\theta^{*} = \operatorname*{argmax}_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s) \, da \, ds + \log p(\theta)$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy πθ(a|s) is close to the target policy πθ′(a|s). We therefore optimize for:
$$\theta^{*} = \operatorname*{argmax}_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s) \, da \, ds$$
$$\text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s)\big) \, ds < \beta$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
uniform policy and run MPO with β = 0.001 until the policy converges, scalarized MPO and MO-MPO result in similar policies (Fig. 3, solid bars): up for equal preference, right for prefer first, and left for prefer second.
However, if we make the rewards imbalanced by multiplying the rewards obtained for the first objective by 20 (e.g., left now obtains a reward of [20, 4]), we see that the policies learned by scalarized MPO shift to preferring the optimal action for the first objective (right) in both the equal preference and prefer second cases (Fig. 3, striped bars). In contrast, the final policies for MO-MPO are the same as for balanced rewards, because in each policy improvement step MO-MPO optimizes for a separate temperature ηk that scales each objective's Q-value function. This ηk is computed based on the corresponding allowed KL-divergence εk, so when the rewards for any objective k are multiplied by a factor but εk remains the same, the computed ηk ends up being scaled by that factor as well, neutralizing the effect of the scaling of rewards (see Eq. (4)).
Even in this simple environment, we see that MO-MPO's scale-invariant way of encoding preferences is valuable. In more complex domains, in which the Q-value functions must be learned in parallel with the policy, the (automatic) dynamic adjustment of temperatures ηk per objective becomes more essential (Sec. 6).
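The scale-invariance argument can be illustrated numerically: if all Q-values for an objective are multiplied by a constant, the temperature found by minimizing the convex dual scales by the same constant, leaving the nonparametric action weights unchanged (a toy sketch; the grid search and its bounds are our own assumptions):

```python
import numpy as np


def solve_eta(q, eps, grid=np.geomspace(1e-2, 1e3, 2000)):
    # Brute-force minimization of the convex temperature dual for a single
    # state, with actions sampled from a uniform old policy.
    def dual(eta):
        m = q.max()
        return eta * eps + eta * (m / eta + np.log(np.mean(np.exp((q - m) / eta))))
    return min(grid, key=dual)


rng = np.random.default_rng(0)
q = rng.normal(size=64)              # Q-values of sampled actions
eta1 = solve_eta(q, eps=0.1)
eta2 = solve_eta(20.0 * q, eps=0.1)  # same objective, rewards scaled by 20

w1 = np.exp((q - q.max()) / eta1)
w1 /= w1.sum()
w2 = np.exp((20.0 * q - 20.0 * q.max()) / eta2)
w2 /= w2.sum()
# eta2 ≈ 20 * eta1, so w1 and w2 agree up to the grid resolution.
```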
The scale of εk controls the amount that objective k can influence the policy's update. If we set ε1 = 0.01 and sweep over the range from 0 to 0.01 for ε2, the resulting policies go from always picking right, to splitting probability across right and up, to always picking up (Fig. 4, right). In contrast, setting weights leads to policies quickly converging to placing all probability on a single action (Fig. 4, left). We hypothesize this limits the ability of scalarized MPO to explore and find compromise policies (that perform well with respect to all objectives) in more challenging domains.
5.2 Deep Sea Treasure
An important quality of any MORL approach is the ability to find a variety of policies on the true Pareto front (Roijers et al., 2013). We demonstrate this in Deep Sea Treasure (DST) (Vamplew et al., 2011), a popular benchmark for testing MORL approaches. DST consists of an 11×10 grid world with ten treasure locations. The agent starts in the upper left corner (s0 = (0, 0)) and has a choice of four actions (moving one square up, right, down, and left). When the agent picks up a treasure, the episode terminates. The agent has two objectives: treasure value and time penalty. The time penalty is −1 for each time step that the agent takes, and farther-away treasures have higher treasure values. We use the treasure values in Yang et al. (2019).
We ran scalarized MPO with weightings [w, 1 − w] and
Figure 3. When two objectives have comparable reward scales (solid bars), scalarized MPO (first row) and MO-MPO (second row) learn similar policies across three different preferences. However, when the scale of the first objective is much higher (striped bars), scalarized MPO shifts to always preferring the first objective. In contrast, MO-MPO is scale-invariant and still learns policies that satisfy the preferences. The y-axis denotes action probability.
Figure 4. A visualization of policies during learning; each curve corresponds to a particular setting of weights (left) or ε's (right). Policies are initialized as uniform (the blue dot) and are trained until convergence. Each point (x, y) corresponds to the policy with p(left) = x, p(right) = y, and p(up) = 1 − x − y. The top left and bottom right blue stars denote the optimal policy for the first and second objectives, respectively.
w ∈ {0, 0.01, 0.02, ..., 1}. All policies converged to a point on the true Pareto front, and all but three found the optimal policy for that weighting. In terms of coverage, policies were found for eight out of ten points on the Pareto front.⁵
We ran MO-MPO on this task as well, for a range of ε: εtime ∈ {0.01, 0.02, 0.05} and εtreasure = c · εtime, where c ∈ {0.5, 0.51, 0.52, ..., 1.5}. All runs converged to a policy on the true Pareto front, and MO-MPO finds policies for all ten points on the front (Fig. 5, left). Note that it is the ratio of ε's that matters, rather than the exact settings: across all settings of εtime, similar ratios of εtreasure to εtime result in similar policies, and as this ratio increases, policies tend to prefer higher-value treasures (Fig. 5, right).
⁵Since scalarized MPO consistently finds the optimal policy for any given weighting in this task, with a more strategic selection of weights we expect policies for all ten points would be found.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for εk discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of εk settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of εk.
6. Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's rather than via weights is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid. We use the humanoid run task defined in DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand. We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain". A sparse task reward of 1.0 is given for pressing the block with greater than 5N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached, or after 5 seconds; the orient task terminates after 10 seconds.
Humanoid Mocap. We consider a large-scale humanoid motion capture tracking task, similar to in Peng et al. (2018), in which policies must learn to follow motion capture reference data.⁶ There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total, we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole. We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action for three timesteps.
6.1 Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings εk and scalarization weights wk, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
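For the two-objective case, the metric reduces to an area and can be computed with a simple sweep (a stand-in sketch for the DEAP routine; both objectives are maximized and `ref` is the dominated reference point):

```python
def hypervolume_2d(points, ref):
    # Area dominated by at least one point in `points` while dominating
    # the reference point `ref`, with both objectives maximized.
    pts = sorted(points, key=lambda p: p[0], reverse=True)  # desc. by obj. 1
    area, y_prev = 0.0, ref[1]
    for x, y in pts:
        if x <= ref[0]:
            break  # remaining points do not dominate ref
        if y > y_prev:  # point extends the front upward
            area += (x - ref[0]) * (y - y_prev)
            y_prev = y
    return area
```

For example, the front {(3, 1), (2, 2), (1, 3)} with reference (0, 0) has hypervolume 6, and adding a dominated policy such as (1, 1) leaves it unchanged.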
⁶This task was developed concurrently by Anonymous (2020), and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2 Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO with respect to the hypervolume metric (Table 1).⁷ In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation. We also ran vanilla MPO on humanoid run with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
⁷We use hypervolume reference points of [0, −10⁴] for run, [0, −10⁵] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                          scalarized MPO    MO-MPO
Humanoid run, mid-training    1.1×10⁶           3.3×10⁶
Humanoid run                  6.4×10⁶           7.1×10⁶
Humanoid run, 1× penalty      5.0×10⁶           5.9×10⁶
Humanoid run, 10× penalty     2.6×10⁷           6.9×10⁷
Shadow touch, mid-training    14.3              15.6
Shadow touch                  16.2              16.4
Shadow turn                   14.4              15.4
Shadow orient                 2.8×10⁴           2.8×10⁴
Humanoid Mocap                3.86×10⁻⁶         3.41×10⁻⁶

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10⁴). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than a magnitude worse than that of scalarized MPO and MO-MPO (2.2×10⁶, versus 2.6×10⁷ and 6.9×10⁷, respectively).
6.3 Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.⁸ None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by a MO-V-MPO policy are those with extreme values for joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4 Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
⁸For MO-V-MPO, we set all εk = 0.01. Also, for each objective, we set εk = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values, and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7. Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References

Abdolmaleki A, Springenberg J T, Degrave J, Bohez S, Tassa Y, Belov D, Heess N, and Riedmiller M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki A, Springenberg J T, Tassa Y, Munos R, Heess N, and Riedmiller M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels A, Roijers D M, Lenaerts T, Nowe A, and Steckelmacher D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam J, Held D, Tamar A, and Abbeel P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, and Mane D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen M S, Dahl J, and Vandenberghe L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett L and Narayanan S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41–47, 2008.

Barth-Maron G, Hoffman M W, Budden D, Dabney W, Horgan D, TB D, Muldal A, Heess N, and Lillicrap T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez S, Abdolmaleki A, Neunert M, Buchli J, Heess N, and Hadsell R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen X, Ghadirzadeh A, Bjorkman M, and Jensfelt P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez N, Muller M, Macklin M, Makoviychuk V, and Jeschke S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow Y, Nachum O, Duenez-Guzman E, and Ghavamzadeh M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand M, Tale Masouleh M, Saadatzi M H, Ozcinar C, and Anbarjafari G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin F-A, De Rainville F-M, Gardner M-A, Parizeau M, and Gagne C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman E and Fontaine F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gabor Z, Kalmar Z, and Szepesvari C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich O and Romanko O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja T, Zhou A, Abbeel P, and Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess N, Wayne G, Silver D, Lillicrap T, Erez T, and Tassa Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel M, Soyer H, Espeholt L, Czarnecki W, Schmitt S, and van Hasselt H. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang S H, Zambelli M, Kay J, Martins M F, Tassa Y, Pilarski P M, and Hadsell R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi H, Doi K, and Nojima Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim I and de Weck O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma D P and Ba J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le H, Voloshin C, and Yue Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine S, Finn C, Darrell T, and Abbeel P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu C, Xu X, and Hu D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler R T and Arora J S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel J, Ahuja A, Pham V, Tunyasuvunakool S, Liu S, Tirumala D, Heess N, and Wayne G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, Graves A, Riedmiller M, Fidjeland A K, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert K V and Nowe A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam H, Assael Y M, Roijers D M, and Whiteson S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos R, Stepleton T, Harutyunyan A, and Bellemare M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan S and Tadepalli P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.
Nottingham K, Balakrishnan A, Deshmukh J, Christopherson C, and Wingate D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi S, Pirotta M, Smacchia N, Bascetta L, and Restelli M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng X B, Abbeel P, Levine S, and van de Panne M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta M, Parisi S, and Restelli M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond M and Nowe A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller M, Hafner R, Lampe T, Neunert M, Degrave J, Van de Wiele T, Mnih V, Heess N, and Springenberg J T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers D M, Vamplew P, Whiteson S, and Dazeley R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers D M, Whiteson S, and Oliehoek F A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell S and Zimdars A L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman J, Levine S, Abbeel P, Jordan M, and Moritz P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman J, Wolski F, Dhariwal P, Radford A, and Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, and Hassabis D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song H F, Abdolmaleki A, Springenberg J T, Clark A, Soyer H, Rae J W, Noury S, Ahuja A, Liu S, Tirumala D, Heess N, Belov D, Riedmiller M, and Botvinick M M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton R S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan J, Zhang T, Coumans E, Iscen A, Bai Y, Hafner D, Bohez S, and Vanhoucke V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa Y, Doron Y, Muldal A, Erez T, Li Y, de Las Casas D, Budden D, Abdolmaleki A, Merel J, Lefrancq A, Lillicrap T, and Riedmiller M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Teh Y W, Bapst V, Czarnecki W M, Quan J, Kirkpatrick J, Hadsell R, Heess N, and Pascanu R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov E, Erez T, and Tassa Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew P, Dazeley R, Berry A, Issabekov R, and Dekker E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, Jul 2011.

van Hasselt H P, Guez A, Hessel M, Mnih V, and Silver D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert K, Drugan M M, and Nowe A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-pal: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(µ, Σ). The policy is parametrized by a neural network, which outputs the mean µ = µ(s) and diagonal Cholesky factors A = A(s), such that Σ = AA^T. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
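The softplus-parametrized covariance described above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; `build_covariance` and the raw-output vector are illustrative names standing in for the network head.

```python
import numpy as np

def build_covariance(raw_diag):
    """Build Sigma = A A^T from raw network outputs, where the diagonal
    Cholesky factor A uses a softplus transform so its diagonal entries
    are strictly positive (illustrative sketch)."""
    diag = np.log1p(np.exp(raw_diag))  # softplus: A_ii = log(1 + exp(raw))
    A = np.diag(diag)                  # diagonal Cholesky factor
    return A @ A.T                     # positive-definite diagonal Sigma

# Raw network outputs may be any real numbers; the variances stay positive.
sigma = build_covariance(np.array([-2.0, 0.0, 3.0]))
```

Because A is diagonal, Σ is diagonal with entries softplus(raw)², so even large negative raw outputs yield small but strictly positive variances.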
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A2), and finally we describe implementation details for humanoid mocap (Sec. A3).
A1 Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
policy network:
  layer sizes: (300, 200)
  take tanh of action mean: yes
  minimum variance: 10^-12
  maximum variance: unbounded

critic network(s):
  layer sizes: (400, 400, 300)
  take tanh of action: yes
  Retrace sequence size: 8
  discount factor γ: 0.99

both policy and critic networks:
  layer norm on first layer: yes
  tanh on output of layer norm: yes
  activation (after each hidden layer): ELU

MPO:
  actions sampled per state: 20
  default ε: 0.1
  KL-constraint on policy mean, β_µ: 10^-3
  KL-constraint on policy covariance, β_Σ: 10^-5
  initial temperature η: 1

training:
  batch size: 512
  replay buffer size: 10^6
  Adam learning rate: 3×10^-4
  Adam ε: 10^-3
  target network update period: 200

Table 2: Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if εk is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run:
  policy network layer sizes: (400, 400, 300)
  take tanh of action mean: no
  minimum variance: 10^-8
  critic network layer sizes: (500, 500, 400)
  take tanh of action: no
  KL-constraint on policy mean, β_µ: 5×10^-3
  KL-constraint on policy covariance, β_Σ: 10^-6
  Adam learning rate: 2×10^-4
  Adam ε: 10^-8

Shadow Hand:
  actions sampled per state: 30
  default ε: 0.01

Sawyer:
  actions sampled per state: 15
  target network update period: 100

Table 3: Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as εk is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of the εk is what matters. The larger εk is relative to εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Humanoid run (one seed per setting):
  scalarized MPO: w_task = 1 − w_penalty; w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1; ε_penalty ∈ linspace(10^-6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting):
  scalarized MPO: w_task = 1 − w_penalty; w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1; ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting):
  scalarized MPO: w_task = 1 − w_penalty; w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01; ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting):
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4: Settings for εk and weights. (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y.)
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A2 Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]^T). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
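The one-hot critic input described above can be sketched as follows (a minimal sketch with illustrative names; the actual networks follow Table 2):

```python
import numpy as np

NUM_ACTIONS = 4  # up, right, down, left

def one_hot(action_index):
    """One-hot encoding of a discrete action, e.g. up -> [1, 0, 0, 0]."""
    v = np.zeros(NUM_ACTIONS)
    v[action_index] = 1.0
    return v

def critic_input(state_xy, action_index):
    """Critic input for Deep Sea Treasure: the (x, y) state concatenated
    with the one-hot action vector (illustrative sketch)."""
    return np.concatenate([state_xy, one_hot(action_index)])

x = critic_input(np.array([3.0, 5.0]), 0)  # up action at position (3, 5)
```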
Humanoid Mocap:
  Scalarized V-MPO:
    KL-constraint on policy mean, β_µ: 0.1
    KL-constraint on policy covariance, β_Σ: 10^-5
    default ε: 0.1
    initial temperature η: 1
  Training:
    Adam learning rate: 10^-4
    batch size: 128
    unroll length (for n-step return, Sec. D1): 32

Table 5: Hyperparameters for the humanoid mocap experiments.
A3 Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B1 Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Figure 8: treasure values, from nearest to farthest: 0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7]
Figure 8: Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
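The per-step reward vector can be sketched directly (a minimal sketch; `dst_reward` is an illustrative name):

```python
def dst_reward(treasure_value):
    """Two-objective DST reward vector: a -1 time penalty every step,
    plus the value of the collected treasure (0 if none was collected)."""
    return [-1.0, float(treasure_value)]

r_step = dst_reward(0)       # ordinary step
r_pickup = dst_reward(23.7)  # collecting the farthest treasure
```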
B2 Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9: These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10: In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B21 BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s'), which is the increase in force from state s to the next state s'. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
r_pain(s, a, s') = −[1 − a_λ(m(s, s'))] m(s, s'),    (6)

where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use

a_λ(m) = sigmoid(λ1 (−m + λ2)),    (7)

with λ = [2, 2]^T. The relationship between pain penalty and impact force is plotted in Fig. 10.
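Eqs. (6) and (7) translate directly into code. The following is a minimal NumPy sketch (illustrative function names; λ = [2, 2] as above):

```python
import numpy as np

def acceptability(m, lam=(2.0, 2.0)):
    """a_lambda(m) = sigmoid(lambda_1 * (-m + lambda_2)), Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-lam[0] * (-m + lam[1])))

def pain_penalty(impact_force):
    """r_pain = -(1 - a_lambda(m)) * m, Eq. (6): near-zero for small
    impacts, approximately -m for large ones."""
    return -(1.0 - acceptability(impact_force)) * impact_force
```

Small impacts are "acceptable" (a_λ ≈ 1) and barely penalized, while large impacts are penalized almost at full magnitude, matching the curve in Fig. 10.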
Figure 11: In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B22 ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation is shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

min_{q ∈ Q_target} 1 − tanh(2 d(q, q_peg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
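The two shaped rewards above can be sketched as follows (illustrative names; the axis-angle distances to the eight valid target orientations are assumed to be precomputed):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 * |z_target - z_peg|).
    Maximal (1.0) at the 7 cm target height, decaying quickly away from it."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def orientation_reward(axis_angle_distances):
    """Shaped orientation reward: 1 - tanh(2 * d) for the closest of the
    eight equally valid target orientations (distances in radians)."""
    return 1.0 - np.tanh(2.0 * min(axis_angle_distances))
```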
B3 Humanoid
We make use of the humanoid run task from Tassa et al. (2018). (This task is available at github.com/deepmind/dm_control.) The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖². For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B4 Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10 ‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

r_vel = exp(−0.1 ‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

r_app = exp(−40 ‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp(−2 ‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes quaternion differences, and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively. (The CMU motion capture database is available at mocap.cs.cmu.edu.)
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) (‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁),

where the term in parentheses is denoted ε, b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
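The scalarized reward of Eq. (8) is a simple weighted sum (illustrative function name):

```python
def vmpo_mocap_reward(r_trunc, r_com, r_vel, r_app, r_quat, lam):
    """Scalarized tracking reward for the V-MPO baseline, Eq. (8):
    r = 0.5 * r_trunc + 0.5 * (0.1 r_com + lam r_vel + 0.15 r_app + 0.65 r_quat)."""
    return 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel
                                  + 0.15 * r_app + 0.65 * r_quat)
```

Note that with λ = 0.1 the inner weights sum to 1, so when all components equal 1 the total reward is 1; for other λ the inner sum is rescaled, which is exactly the kind of weight-tuning sensitivity that MO-V-MPO's εk formulation avoids.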
B5 Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

r_insertion = max(0.2 r_approach, r_inserted),    (9)
r_approach = s(p_peg, p_opening),    (10)
r_inserted = r_aligned s(p_peg^z, p_bottom^z),    (11)
r_aligned = ‖p_peg^xy − p_opening^xy‖ < d/2,    (12)

where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.

s(p1, p2) is a shaping operator,

s(p1, p2) = 1 − tanh²( (atanh(√l) / ε) ‖p1 − p2‖₂ ),    (13)

which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
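The shaping operator of Eq. (13) can be sketched as follows (illustrative name). The atanh(√l)/ε factor is chosen so that the reward is exactly 1 − l at distance ε:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """s(p1, p2) = 1 - tanh^2( (atanh(sqrt(l)) / eps) * ||p1 - p2||_2 ),
    Eq. (13): reward 1 at distance 0, exactly 1 - l at distance eps."""
    dist = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * dist) ** 2
```

For example, with l_approach = 0.95 and ε_approach = 0.01, the approach reward is 1 at the opening and drops to 0.05 at 1 cm away, giving a sharp gradient near the hole.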
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,    (14)

where F = (Fx, Fy, Fz) are the forces measured by the wrist sensor.
C Algorithmic Details

C1 General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps, by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper, we refer to the target policy network as the old policy πold.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2: MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^{N}, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   // Collect trajectory from environment
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ πθ(·|s_t)
8:     // Execute action and determine rewards
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, πθ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C2 Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_φk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy πold. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^{N} that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:

min_{φk} E_{τ∼D} [ (Q_k^ret(s_t, a_t) − Q_φk(s_t, a_t))² ],    (15)

with

Q_k^ret(s_t, a_t) = Q_φ′k(s_t, a_t) + Σ_{j=t}^{T} γ^{j−t} ( Π_{z=t+1}^{j} c_z ) δ_j,
δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q_φ′k(s_j, a_j),
V(s_{j+1}) = E_{πold(a|s_{j+1})} [ Q_φ′k(s_{j+1}, a) ].
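For a single trajectory, the Retrace target can be sketched as a nested sum over the TD errors δ_j. This is a minimal, unoptimized NumPy sketch; the target-network Q-values, expected next-state values, and truncated importance weights are all assumed to be precomputed, and a real implementation would batch and vectorize this:

```python
import numpy as np

def retrace_targets(q_target, v_target, rewards, c, gamma=0.99):
    """Retrace regression targets for one objective on a single trajectory.

    q_target[t] : Q_phi'(s_t, a_t)  (target-network Q-value)
    v_target[t] : E_{pi_old}[Q_phi'(s_{t+1}, .)]  (expected next-state value)
    c[t]        : truncated importance weight at step t; only c[t] for
                  t >= 1 is used, since the product runs over z = t+1..j.
    """
    T = len(rewards)
    deltas = rewards + gamma * v_target - q_target  # TD errors delta_j
    targets = np.zeros(T)
    for t in range(T):
        weight = 1.0  # gamma^{j-t} * prod_{z=t+1..j} c_z; equals 1 at j = t
        acc = 0.0
        for j in range(t, T):
            if j > t:
                weight *= gamma * c[j]
            acc += weight * deltas[j]
        targets[t] = q_target[t] + acc
    return targets
```

Truncating the importance weights at 1 (Eq. below, defining c_z) keeps the product bounded, so the off-policy correction never blows up however long the trajectory is.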
Algorithm 3: MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^{N}, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^{N} and ν, target networks, and online networks, such that πθ′ = πθ and Q_φ′k = Q_φk
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:       s_i ∼ D, a_ij ∼ πθ′(a|s_i), and Q_ijk = Q_φ′k(s_i, a_ij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ_{i}^{L} (1/L) log ( Σ_{j}^{M} (1/M) exp(Q_ijk / η_k) ) ]
11:      Update η_k based on δ_ηk
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ Σ_{i}^{L} Σ_{j}^{M} Σ_{k}^{N} q_ijk log πθ(a_ij|s_i)
18:           + sg(ν) ( β − Σ_{i}^{L} KL(πθ′(a|s_i) ‖ πθ(a|s_i)) )
19:    δ_ν ← ∇_ν [ ν ( β − Σ_{i}^{L} KL(πθ′(a|s_i) ‖ π_sg(θ)(a|s_i)) ) ]
20:    Update πθ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_φk ← ∇_φk Σ_{(s_t, a_t) ∈ τ∼D} ( Q_φk(s_t, a_t) − Q_k^ret )²,
26:        with Q_k^ret as in Eq. 15
27:      Update φk based on δ_φk
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    πθ′ = πθ, Q_φ′k = Q_φk
33: until convergence
The importance weights c_z are defined as

c_z = min( 1, πold(a_z|s_z) / b(a_z|s_z) ),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set ( Π_{z=t+1}^{j} c_z ) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
C3 Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

π_new = argmax_θ Σ_{k=1}^{N} ∫_s µ(s) ∫_a q_k(a|s) log πθ(a|s) da ds,
s.t. ∫_s µ(s) KL(πold(a|s) ‖ πθ(a|s)) ds < β.    (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^{N} ∫_s µ(s) ∫_a q_k(a|s) log πθ(a|s) da ds + ν ( β − ∫_s µ(s) KL(πold(a|s) ‖ πθ(a|s)) ds ),    (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π^μ_θ(a|s) = N(a; μ_θ(s), Σ_θold(s)),   (18)
π^Σ_θ(a|s) = N(a; μ_θold(s), Σ_θ(s)).   (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
max_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a q_k(a|s) log(π^μ_θ(a|s) π^Σ_θ(a|s)) da ds

s.t. ∫_s μ(s) KL(πold(a|s) ∥ π^μ_θ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(πold(a|s) ∥ π^Σ_θ(a|s)) ds < β_Σ.   (20)
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
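The effect of the decoupling is that each KL term isolates one part of the Gaussian. A small sketch for the diagonal-covariance case (an assumption made here for brevity; the paper's policies may use full covariances):

```python
import numpy as np

def gaussian_kl_diag(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) )."""
    return 0.5 * float(np.sum(np.log(var1 / var0)
                              + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0))

def decoupled_kl_terms(mu_old, var_old, mu_online, var_online):
    """KL terms for the decoupled update of Eqs. 18-20 (diagonal case).

    pi^mu uses the online mean with the old covariance, so only the
    mean-difference term survives; pi^Sigma uses the old mean with the
    online covariance, so only the covariance term survives.
    """
    kl_mean = gaussian_kl_diag(mu_old, var_old, mu_online, var_old)
    kl_cov = gaussian_kl_diag(mu_old, var_old, mu_old, var_online)
    return kl_mean, kl_cov
```

Because the two terms are constrained separately, β_Σ can be set much smaller than β_μ without slowing down movement of the mean.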
C.4. Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO
In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy πold, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s0, a0, r0), ..., (sT, aT, rT)}, where rt denotes a reward vector {r_k(st, at)}_{k=1}^{N} that consists of the rewards for all N objectives, we find the value function parameters φ_k by optimizing the following objective:

min_{φ_1, ..., φ_N} Σ_{k=1}^{N} E_τ[(G^(T)_k(st, at) − V^πold_{φ_k}(st))²].   (21)

Here, G^(T)_k(st, at) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G^(T)_k(st, at) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(sℓ, aℓ) + γ^{T−t} V^πold_{φ_k}(s_{t+T}).

The advantages are then estimated as A^πold_k(st, at) = G^(T)_k(st, at) − V^πold_{φ_k}(st).
D.2. Multi-objective policy improvement in the on-policy setting
Given the previous policy πold(a|s) and the estimated advantages {A^πold_k(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy πnew(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective's advantages A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds   (22)

s.t. KL(q_k(s, a) ∥ pold(s, a)) < ε_k,
where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and pold(s, a) = μ(s) πold(a|s) is the state-action distribution associated with πold.

As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

q_k(s, a) ∝ pold(s, a) exp(A_k(s, a) / η_k),   (23)

where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

η_k = argmin_η [η ε_k + η log ∫_{s,a} pold(s, a) exp(A_k(s, a) / η) da ds].   (24)
We perform this optimization along with the main loss, by taking a gradient descent step on η_k, and we initialize η_k with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
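A sketch of such a projected gradient step, using a sample-based version of the dual in Eq. 24 (the integral over pold becomes a mean over samples drawn from the previous policy). The numerical derivative and the function names are illustrative assumptions; in practice the gradient would come from autodiff.

```python
import numpy as np

def dual_loss(eta, eps_k, advantages):
    """Sample-based dual of Eq. 24: g(eta) = eta*eps_k + eta*log E[exp(A/eta)],
    with the expectation replaced by a mean over (s, a) samples."""
    a_max = advantages.max()  # log-sum-exp trick for numerical stability
    return eta * eps_k + eta * (np.log(np.mean(np.exp((advantages - a_max) / eta)))
                                + a_max / eta)

def update_temperature(eta, eps_k, advantages, lr=0.005, eta_min=1e-8):
    """One projected gradient step on eta_k (clamped to stay positive)."""
    d = 1e-4
    grad = (dual_loss(eta + d, eps_k, advantages)
            - dual_loss(eta - d, eps_k, advantages)) / (2 * d)
    return max(eta_min, eta - lr * grad)
```

At the minimizer of this dual, the weights exp(A/η) (normalized over the batch) have KL to the sample distribution approximately equal to ε_k, which is a useful check on an implementation.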
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy πnew(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this we solve a supervised learning problem that fits a parametric policy as follows:
πnew = argmax_θ Σ_{k=1}^{N} ∫_{s,a} q_k(s, a) log πθ(a|s) da ds

s.t. ∫_s KL(πold(a|s) ∥ πθ(a|s)) ds < β,   (25)
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy πold, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.

In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.

We assume there are observable binary improvement events {R_k}_{k=1}^{N}, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^{N}, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^{N}:
max_θ pθ(R_1 = 1, R_2 = 1, ..., R_N = 1) p(θ),   (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k, and using log probabilities, leads to the following objective:
max_θ Σ_{k=1}^{N} log pθ(R_k = 1) + log p(θ).   (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that πθ stays close to the target policy during each policy improvement step, to improve the stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^{N} log pθ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^{N} log pθ(R_k = 1) as follows:
Σ_{k=1}^{N} log pθ(R_k = 1) = Σ_{k=1}^{N} KL(q_k(s, a) ∥ pθ(s, a | R_k = 1)) − Σ_{k=1}^{N} KL(q_k(s, a) ∥ pθ(R_k = 1, s, a)).   (28)
The second term of this decomposition expands as

−Σ_{k=1}^{N} KL(q_k(s, a) ∥ pθ(R_k = 1, s, a)) = Σ_{k=1}^{N} E_{q_k(s,a)}[log (pθ(R_k = 1, s, a) / q_k(s, a))]   (29)

= Σ_{k=1}^{N} E_{q_k(s,a)}[log (p(R_k = 1 | s, a) πθ(a|s) μ(s) / q_k(s, a))],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always non-negative, so the second term is a lower bound on the log-evidence Σ_{k=1}^{N} log pθ(R_k = 1). Here, πθ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then, in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^{N} such that the lower bound on Σ_{k=1}^{N} log pθ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:
q_k(a|s) = argmin_q E_{μ(s)}[KL(q(a|s) ∥ pθ′(s, a | R_k = 1))]

= argmin_q E_{μ(s)}[KL(q(a|s) ∥ πθ′(a|s)) − E_{q(a|s)}[log p(R_k = 1 | s, a)]].   (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = πθ′(a|s) p(R_k = 1 | s, a) / ∫ πθ′(a|s) p(R_k = 1 | s, a) da.

This solution weights the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp(Q_k(s, a) / α_k),
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL(q(a|s) ∥ πθ′(a|s)) ds.   (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds

s.t. ∫ μ(s) KL(q(a|s) ∥ πθ′(a|s)) ds < ε_k,   (32)
where, as described in the main paper, the parameter ε_k defines the preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ πθ′(a|s) exp(Q_k(s, a) / η_k),   (33)
where η_k is computed by solving the following convex dual function:

η_k = argmin_η [η ε_k + η ∫ μ(s) log ∫ πθ′(a|s) exp(Q_k(s, a) / η) da ds].   (34)

Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
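A sample-based sketch of this step: assuming Q-values for a handful of actions sampled from πθ′ in each state (so the inner integral in Eq. 34 becomes a mean over action samples), the dual can be minimized with an off-the-shelf scalar optimizer, and Eq. 33 then gives per-state action weights. Function names and the bounded search range are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def solve_temperature(q_vals, eps_k, eta_max=100.0):
    """Minimize the sample-based dual of Eq. 34. q_vals is an (S, A) array of
    Q_k values for A actions sampled from pi_theta' in each of S states."""
    def dual(eta):
        m = q_vals.max()  # log-sum-exp trick for numerical stability
        lse = np.log(np.mean(np.exp((q_vals - m) / eta), axis=1)) + m / eta
        return eta * eps_k + eta * float(np.mean(lse))
    res = minimize_scalar(dual, bounds=(1e-6, eta_max), method='bounded')
    return float(res.x)

def nonparametric_q(q_vals, eta):
    """Per-state weights q_k(a|s) from Eq. 33; pi_theta' is implicit because
    the actions were sampled from it."""
    w = np.exp((q_vals - q_vals.max(axis=1, keepdims=True)) / eta)
    return w / w.sum(axis=1, keepdims=True)
```

A smaller ε_k yields a larger temperature η_k, i.e., a less greedy q_k that stays closer to πθ′; this is exactly the mechanism by which the ε's encode preferences in a scale-invariant way.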
E.2. M-Step
After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^{N} log pθ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy πθ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log πθ(a|s) da ds + log p(θ).

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy πθ(a|s) stays close to the target policy πθ′(a|s). We therefore optimize

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log πθ(a|s) da ds

s.t. ∫ μ(s) KL(πθ′(a|s) ∥ πθ(a|s)) ds < β.

Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
Figure 5. Left: Blue stars mark the true Pareto front for Deep Sea Treasure. MO-MPO with a variety of settings for ε_k discovers all points on the true Pareto front. The area of the orange circles is proportional to the number of ε_k settings that converged to that point. Right: As more preference is given to the treasure objective (i.e., as x increases), policies tend to prefer higher-value treasures. Each orange dot in the scatterplot corresponds to a particular setting of ε_k.
6. Experiments: Continuous Control Domains

The advantage of encoding preferences via ε's rather than via weights is apparent in more complex domains. We compared our approaches, MO-MPO and MO-V-MPO, against scalarized MPO and V-MPO in four high-dimensional continuous control domains in MuJoCo (Todorov et al., 2012) and on a real robot. The domains we consider are:
Humanoid. We use the humanoid run task defined in the DeepMind Control Suite (Tassa et al., 2018). Policies must optimize for horizontal speed h while minimizing energy usage. The task reward is min(h/10, 1), where h is in meters per second, and the energy usage penalty is the action ℓ2-norm. The humanoid has 21 degrees of freedom, and the observation consists of joint angles, joint velocities, head height, hand and feet positions, torso vertical orientation, and center-of-mass velocity, for a total of 67 dimensions.
Shadow Hand. We consider three tasks on the Shadow Dexterous Hand: touch, turn, and orient. In the touch and turn tasks, policies must complete the task while minimizing "pain". A sparse task reward of 1.0 is given for pressing the block with greater than 5 N of force, or for turning the dial from a random initial location to the target location. The pain penalty penalizes the robot for colliding with objects at high speed; this penalty is defined as in Huang et al. (2019). In the orient task, there are three aligned objectives: touching the rectangular peg, lifting it to a given height, and orienting it to be perpendicular to the ground. All three rewards are between 0 and 1. The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. The touch and turn tasks terminate when the goal is reached or after 5 seconds, and the orient task terminates after 10 seconds.
Humanoid Mocap. We consider a large-scale humanoid motion capture tracking task, similar to that in Peng et al. (2018), in which policies must learn to follow motion capture reference data.6 There are five objectives, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target: joint orientations, joint velocities, hand and feet positions, center-of-mass positions, and certain body positions and joint angles. These objectives are described in detail in Appendix B.4. In order to balance these multiple objectives, prior work relied on heavily-tuned reward functions (e.g., Peng et al., 2018). The humanoid has 56 degrees of freedom, and the observation is 1021-dimensional, consisting of proprioceptive observations as well as six steps of motion capture reference frames. In total we use about 40 minutes of locomotion mocap data, making this an extremely challenging domain.
Sawyer Peg-in-Hole. We train a Rethink Robotics Sawyer robot arm to insert a cylindrical peg into a hole while minimizing wrist forces. The task reward is shaped toward positioning the peg directly above the hole and increases for insertion, and the penalty is the ℓ1-norm of the Cartesian forces measured by the wrist force-torque sensor. The latter implicitly penalizes contacts and impacts, as well as excessive directional change (due to the gripper's inertia inducing forces when accelerating). We impose a force threshold to protect the hardware; if this threshold is exceeded, the episode is terminated. The action space is the end effector's Cartesian velocity, and the observation is 102-dimensional, consisting of Cartesian position, joint position and velocity, wrist force-torque, and joint action, for three timesteps.
6.1. Evaluation Metric
We run MO-(V-)MPO and scalarized (V-)MPO with a wide range of constraint settings ε_k and scalarization weights w_k, respectively, corresponding to a wide range of possible desired preferences. (The exact settings are provided in Appendix A.) For tasks with two objectives, we plot the Pareto front found by each approach. We also compute the hypervolume of each found Pareto front; this metric is commonly used for evaluating MORL algorithms (Vamplew et al., 2011). Given a set of policies Π and a reference policy r that is dominated by all policies in this set, this metric is the hypervolume of the space of all policies that dominate r and are dominated by at least one policy in Π. We use DEAP (Fortin et al., 2012) to compute hypervolumes.
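For the two-objective case (both maximized), the hypervolume has a simple direct computation; a sketch is below. This is an illustrative reimplementation of the metric just described, not the DEAP code actually used in the experiments.

```python
import numpy as np

def pareto_front(points):
    """Non-dominated subset when both objectives are maximized, returned
    sorted by the first objective in descending order."""
    pts = sorted(map(tuple, points), key=lambda p: (-p[0], -p[1]))
    front, best_y = [], -np.inf
    for x, y in pts:
        if y > best_y:  # strictly improves the second objective -> non-dominated
            front.append((x, y))
            best_y = y
    return front

def hypervolume_2d(points, ref):
    """Area dominated by `points` and dominating the reference point `ref`
    (two objectives, both maximized): a union of axis-aligned rectangles."""
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto_front(points):
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return hv
```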
6 This task was developed concurrently by Anonymous (2020) and does not constitute a contribution of this paper.
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2. Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to getting task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO with respect to the hypervolume metric (Table 1).7 In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., that are in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.

Ablation. We also ran vanilla MPO on humanoid run, with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
7 We use hypervolume reference points of [0, −10^4] for run, [0, −10^5] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                          scalarized MPO    MO-MPO
Humanoid run, mid-training    1.1×10^6          3.3×10^6
Humanoid run                  6.4×10^6          7.1×10^6
Humanoid run, 1× penalty      5.0×10^6          5.9×10^6
Humanoid run, 10× penalty     2.6×10^7          6.9×10^7
Shadow touch, mid-training    1.43              1.56
Shadow touch                  1.62              1.64
Shadow turn                   1.44              1.54
Shadow orient                 2.8×10^4          2.8×10^4
Humanoid Mocap                3.86×10^−6        3.41×10^−6

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10^4). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10^6, versus 2.6×10^7 and 6.9×10^7, respectively).
6.3. Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.8 None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity objective, the only policies found by V-MPO that are not dominated by an MO-V-MPO policy are those with extreme values for the joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4. Results: Sawyer Peg-in-Hole

In this task, we would like the robot to prioritize successful task completion while minimizing wrist forces. With this
8 For MO-V-MPO, we set all ε_k = 0.01; we also ran, for each objective, a setting with ε_k = 0.001 for that objective and 0.01 for all others. For V-MPO, we fix the weights of four objectives to reasonable values and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set ε_task = 0.1 and ε_force = 0.05, and for scalarized MPO we try [w_task, w_force] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7. Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed that these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
ReferencesAbdolmaleki A Springenberg J T Degrave J Bohez S
Tassa Y Belov D Heess N and Riedmiller M Rela-tive entropy regularized policy iteration arXiv preprintarXiv181202256 2018a
Abdolmaleki A Springenberg J T Tassa Y Munos RHeess N and Riedmiller M Maximum a posterioripolicy optimisation In Proceedings of the Sixth Interna-tional Conference on Learning Representations (ICLR)2018b
Abels A Roijers D M Lenaerts T Nowe A andSteckelmacher D Dynamic weights in multi-objectivedeep reinforcement learning In Proceedings of the 36thInternational Conference on Machine Learning (ICML)2019
Achiam J Held D Tamar A and Abbeel P Constrainedpolicy optimization In Proceedings of the 34th Inter-national Conference on Machine Learning (ICML) pp22ndash31 2017
Amodei D Olah C Steinhardt J Christiano P Schul-man J and Mane D Concrete problems in AI safetyarXiv preprint arXiv160606565 2016
Andersen M S Dahl J and Vandenberghe L CVXOPTA Python package for convex optimization version 12httpscvxoptorg 2020
Anonymous CoMic Co-Training and Mimicry forReusable Skills 2020
Barrett L and Narayanan S Learning all optimal policieswith multiple criteria In Proceedings of the 25th Inter-national Conference on Machine Learning (ICML) pp41ndash47 2008
Barth-Maron G Hoffman M W Budden D Dabney WHorgan D TB D Muldal A Heess N and LillicrapT Distributed distributional deterministic policy gradi-ents In Proceedings of the Sixth International Conferenceon Learning Representations (ICLR) 2018
Bohez S Abdolmaleki A Neunert M Buchli J HeessN and Hadsell R Value constrained model-free contin-uous control arXiv preprint arXiv190204623 2019
Chen X Ghadirzadeh A Bjorkman M and Jensfelt PMeta-learning for multi-objective reinforcement learningarXiv preprint arXiv181103376 2019
Chentanez N Muller M Macklin M Makoviychuk Vand Jeschke S Physics-based motion capture imitationwith deep reinforcement learning In International Con-ference on Motion Interaction and Games (MIG) ACM2018
A Distributional View on Multi-Objective Policy Optimization
Chow Y Nachum O Duenez-Guzman E andGhavamzadeh M A Lyapunov-based approach to safereinforcement learning In Proceedings of the 31st Inter-national Conference on Neural Information ProcessingSystems (NeurIPS) pp 8092ndash8101 2018
Daneshmand M Tale Masouleh M Saadatzi M HOzcinar C and Anbarjafari G A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization Scientia Iranica24(6)2977ndash2991 2017
Fortin F-A De Rainville F-M Gardner M-A ParizeauM and Gagne C DEAP Evolutionary algorithms madeeasy Journal of Machine Learning Research 132171ndash2175 July 2012
Friedman E and Fontaine F Generalizing across multi-objective reward functions in deep reinforcement learningarXiv preprint arXiv180906364 2018
Gabor Z Kalmar Z and Szepesvari C Multi-criteriareinforcement learning In Proceedings of the FifteenthInternational Conference on Machine Learning (ICML)pp 197ndash205 1998
Grodzevich O and Romanko O Normalization and othertopics in multi-objective optimization In Proceedings ofthe Fields-MITACS Industrial Problems Workshop 2006
Haarnoja T Zhou A Abbeel P and Levine S Softactor-critic Off-policy maximum entropy deep reinforce-ment learning with a stochastic actor In Proceedings ofthe 35th International Conference on Machine Learning(ICML) pp 1861ndash1870 2018
Heess N Wayne G Silver D Lillicrap T Erez T andTassa Y Learning continuous control policies by stochas-tic value gradients In Advances in Neural InformationProcessing Systems 28 (NIPS) 2015
Hessel M Soyer H Espeholt L Czarnecki W SchmittS and van Hasselt H Multi-task deep reinforcementlearning with popart In Proceedings of the AAAI Confer-ence on Artificial Intelligence volume 33 pp 3796ndash38032019
Huang S H Zambelli M Kay J Martins M F TassaY Pilarski P M and Hadsell R Learning gentle objectmanipulation with curiosity-driven deep reinforcementlearning arXiv preprint arXiv190308542 2019
Ishibuchi H Doi K and Nojima Y On the effect ofnormalization in MOEAD for multi-objective and many-objective optimization Complex amp Intelligent Systems 3(4)279ndash294 2017
Kim I and de Weck O Adaptive weighted sum methodfor multiobjective optimization A new method for paretofront generation Structural and Multidisciplinary Opti-mization 31(2)105ndash116 2006
Kingma D P and Ba J Adam A method for stochasticoptimization In Proceedings of the Third InternationalConference on Learning Representations (ICLR) 2015
Le H Voloshin C and Yue Y Batch policy learningunder constraints In Proceedings of the 36th Inter-national Conference on Machine Learning (ICML) pp3703ndash3712 2019
Levine S Finn C Darrell T and Abbeel P End-to-endtraining of deep visuomotor policies Journal of MachineLearning Research 17(39)1ndash40 2016
Liu C Xu X and Hu D Multiobjective reinforcementlearning A comprehensive overview IEEE Transactionson Systems Man and Cybernetics Systems 45(3)385ndash398 2015
Marler R T and Arora J S Function-transformationmethods for multi-objective optimization EngineeringOptimization 37(6)551ndash570 2005
Martens J New perspectives on the natural gradient methodarXiv preprint arXiv14121193 2014
Merel J Ahuja A Pham V Tunyasuvunakool S LiuS Tirumala D Heess N and Wayne G Hierarchi-cal visuomotor control of humanoids In InternationalConference on Learning Representations (ICLR) 2019
Mnih V Kavukcuoglu K Silver D Rusu A A VenessJ Bellemare M G Graves A Riedmiller M Fidje-land A K Ostrovski G et al Human-level controlthrough deep reinforcement learning Nature 518(7540)529 2015
Moffaert K V and Nowe A Multi-objective reinforcementlearning using sets of Pareto dominating policies Journalof Machine Learning Research 153663ndash3692 2014
Mossalam H Assael Y M Roijers D M and WhitesonS Multi-objective deep reinforcement learning arXivpreprint arXiv161002707 2016
Munos R Stepleton T Harutyunyan A and BellemareM Safe and efficient off-policy reinforcement learningIn Proceedings of the 29th International Conference onNeural Information Processing Systems pp 1054ndash10622016
Natarajan S and Tadepalli P Dynamic preferences inmulti-criteria reinforcement learning In Proceedings ofthe 22nd International Conference on Machine Learning(ICML) pp 601ndash608 2005
A Distributional View on Multi-Objective Policy Optimization
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.
Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.
Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.
Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.
Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.
Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.
Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.
Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.
Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.
Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, Jul 2011.
van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.
Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.
Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.
Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.
Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.
Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.
Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.
Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(µ, Σ). The policy is parametrized by a neural network, which outputs the mean µ = µ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
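For illustration, the policy head described above can be sketched as follows (a minimal NumPy sketch; the function name and the flat-output layout are our own, not taken from the paper's implementation):

```python
import numpy as np

def gaussian_policy_head(net_out, action_dim):
    """Split a flat network output into a mean and a diagonal covariance.

    The softplus transform guarantees strictly positive diagonal entries
    of the Cholesky factor A, so Sigma = A A^T is positive definite.
    """
    mu = net_out[:action_dim]                      # mean mu(s)
    raw_diag = net_out[action_dim:2 * action_dim]  # unconstrained diagonal of A
    a_diag = np.log1p(np.exp(raw_diag))            # softplus: A_ii <- log(1 + exp(A_ii))
    cov = np.diag(a_diag ** 2)                     # Sigma = A A^T for diagonal A
    return mu, cov
```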
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings, and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                              Default

policy network
  layer sizes                               (300, 200)
  take tanh of action mean                  yes
  minimum variance                          10⁻¹²
  maximum variance                          unbounded

critic network(s)
  layer sizes                               (400, 400, 300)
  take tanh of action                       yes
  Retrace sequence size                     8
  discount factor γ                         0.99

both policy and critic networks
  layer norm on first layer                 yes
  tanh on output of layer norm              yes
  activation (after each hidden layer)      ELU

MPO
  actions sampled per state                 20
  default ε                                 0.1
  KL-constraint on policy mean, βµ          10⁻³
  KL-constraint on policy covariance, βΣ    10⁻⁵
  initial temperature η                     1

training
  batch size                                512
  replay buffer size                        10⁶
  Adam learning rate                        3×10⁻⁴
  Adam ε                                    10⁻³
  target network update period              200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if εk is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                             (400, 400, 300)
    take tanh of action mean                no
    minimum variance                        10⁻⁸
  critic network(s)
    layer sizes                             (500, 500, 400)
    take tanh of action                     no
  MPO
    KL-constraint on policy mean, βµ        5×10⁻³
    KL-constraint on policy covariance, βΣ  10⁻⁶
  training
    Adam learning rate                      2×10⁻⁴
    Adam ε                                  10⁻⁸

Shadow Hand
  MPO
    actions sampled per state               30
    default ε                               0.01

Sawyer
  MPO
    actions sampled per state               15
  training
    target network update period            100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as εk is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scales of the εk are what matters. The larger εk is relative to εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) rather than doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition and settings:

Humanoid run (one seed per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1, ε_penalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for εk and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right): in the Deep Sea Treasure domain, regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
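The one-hot critic input described above might be assembled as follows (a sketch; the helper name is ours):

```python
import numpy as np

def critic_input(state, action_index, num_actions=4):
    # One-hot action encoding: e.g., the "up" action (index 0) -> [1, 0, 0, 0]
    one_hot = np.zeros(num_actions)
    one_hot[action_index] = 1.0
    # The critic sees the state concatenated with the chosen action's one-hot vector
    return np.concatenate([state, one_hot])
```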
Humanoid Mocap

Scalarized V-MPO
  KL-constraint on policy mean, βµ          0.1
  KL-constraint on policy covariance, βΣ    10⁻⁵
  default ε                                 0.1
  initial temperature η                     1

Training
  Adam learning rate                        10⁻⁴
  batch size                                128
  unroll length (for n-step return, Sec. D.1)  32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Figure: layout of the Deep Sea Treasure grid world; treasure values, from nearest to farthest: 0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7.]

Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
[Figure: pain penalty (y-axis) vs. impact force (x-axis).]

Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
rpain(s, a, s′) = −[1 − aλ(m(s, s′))] m(s, s′) ,   (6)
where aλ(·) ∈ [0, 1] computes the acceptability of an impact force. aλ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use

aλ(m) = sigmoid(λ1(−m + λ2)) ,   (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
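Eqs. (6) and (7) transcribe directly into code (a sketch; the function names are ours):

```python
import math

def acceptability(m, lam=(2.0, 2.0)):
    # Eq. (7): a_lambda(m) = sigmoid(lambda1 * (-m + lambda2)), with lambda = [2, 2]
    return 1.0 / (1.0 + math.exp(-lam[0] * (-m + lam[1])))

def pain_penalty(impact_force):
    # Eq. (6): the negative impact force, scaled by how unacceptable it is
    return -(1.0 - acceptability(impact_force)) * impact_force
```

Small impacts are mostly "acceptable" and incur a near-zero penalty; large impacts are penalized by roughly their full magnitude, matching the shape of Fig. 10.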
[Figure: shaped reward vs. peg z-position (left, height objective) and vs. orientation distance d(θ, θpeg) (right, orientation objective).]

Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation is shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50|ztarget − zpeg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Qtarget and the peg's orientation qpeg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

min_{q∈Qtarget} 1 − tanh(2·d(q, qpeg)) ,

where d(·, ·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
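The two shaped rewards can be sketched as follows (for illustration only; `d_min` stands for the axis-angle distance to the closest of the eight valid target orientations, which we assume is computed elsewhere):

```python
import math

def height_reward(z_peg, z_target=0.07):
    # Shaped height reward: 1 at the 7cm target height, decaying with tanh
    return 1.0 - math.tanh(50.0 * abs(z_target - z_peg))

def orientation_reward(d_min):
    # Shaped orientation reward, given the axis-angle distance (in radians)
    # to the closest acceptable target quaternion
    return 1.0 - math.tanh(2.0 * d_min)
```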
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., rpenalty(a) = −‖a‖2. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly to evaluate a given action, in a state-independent way, during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to that of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018).

The first reward component is based on the difference in the center of mass:

rcom = exp(−10 ‖pcom − p_com^ref‖²) ,

where pcom and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

rvel = exp(−0.1 ‖qvel − q_vel^ref‖²) ,

where qvel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

rapp = exp(−40 ‖papp − p_app^ref‖²) ,

where papp and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

rquat = exp(−2 ‖qquat ⊖ q_quat^ref‖²) ,

where ⊖ denotes quaternion differences, and qquat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

rtrunc = 1 − (1/0.3) ( ‖bpos − b_pos^ref‖₁ + ‖qpos − q_pos^ref‖₁ ) ,

where the term in parentheses is denoted ε, bpos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and qpos and q_pos^ref correspond to the joint angles.

¹⁰This database is available at mocap.cs.cmu.edu.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that rtrunc ∈ [0, 1].
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

r = (1/2) rtrunc + (1/2) (0.1 rcom + λ rvel + 0.15 rapp + 0.65 rquat) ,   (8)

varying λ as described in the main paper (Sec. 6.3).
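The five components and the scalarized baseline reward of Eq. (8) can be sketched as follows (for illustration only: poses are plain arrays, and the quaternion difference ⊖ is approximated by a vector difference):

```python
import numpy as np

def tracking_rewards(pose, ref):
    """Per-component mocap tracking rewards, following the formulas above."""
    sq = lambda x: float(np.sum(np.square(x)))
    l1 = lambda x: float(np.sum(np.abs(x)))
    r = {
        "com": np.exp(-10.0 * sq(pose["com"] - ref["com"])),
        "vel": np.exp(-0.1 * sq(pose["qvel"] - ref["qvel"])),
        "app": np.exp(-40.0 * sq(pose["app"] - ref["app"])),
        # stand-in for the quaternion difference
        "quat": np.exp(-2.0 * sq(pose["quat"] - ref["quat"])),
    }
    eps = l1(pose["bpos"] - ref["bpos"]) + l1(pose["qpos"] - ref["qpos"])
    r["trunc"] = 1.0 - eps / 0.3  # episode terminates early when eps > 0.3
    return r, eps

def scalarized_reward(r, lam):
    # Eq. (8): the single-objective baseline reward
    return 0.5 * r["trunc"] + 0.5 * (0.1 * r["com"] + lam * r["vel"]
                                     + 0.15 * r["app"] + 0.65 * r["quat"])
```

With perfect tracking, every component equals 1, so the scalarized reward is 0.5 + 0.5 (0.9 + λ).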
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis), or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as

rinsertion = max ( 0.2 rapproach, rinserted ) ,   (9)
rapproach = s(ppeg, popening) ,   (10)
rinserted = raligned · s(ppeg_z, pbottom_z) ,   (11)
raligned = ‖ppeg_xy − popening_xy‖ < d/2 ,   (12)

where ppeg is the peg position, popening the position of the hole opening, pbottom the bottom position of the hole, and d the diameter of the hole.

s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²( (atanh(√l)/ε) ‖p1 − p2‖₂ ) ,   (13)

which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose lapproach = 0.95, εapproach = 0.01, linserted = 0.5, and εinserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that rinserted = 0.5 if ppeg = popening. As a result, if the peg is in the approach position, rinserted is dominant over rapproach in rinsertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
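The shaping operator of Eq. (13) is easy to verify numerically (a sketch; the function name is ours):

```python
import math

def shaping(p1, p2, l, eps):
    # Eq. (13): s(p1, p2) = 1 - tanh^2(atanh(sqrt(l))/eps * ||p1 - p2||_2),
    # constructed so that the reward is exactly 1 - l at distance eps
    dist = math.dist(p1, p2)
    return 1.0 - math.tanh(math.atanh(math.sqrt(l)) / eps * dist) ** 2
```

For the approach term, shaping(p, q, 0.95, 0.01) returns 0.05 whenever p and q are 1 cm apart, matching l_approach = 0.95 and ε_approach = 0.01.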
The reward for the secondary objective, minimizing wrist force measurements, is defined as

rforce = −‖F‖₁ ,   (14)

where F = (Fx, Fy, Fz) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy πold.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {rk(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   // Collect trajectory from environment
5:   τ = {}
6:   for t = 0, …, T do
7:     at ∼ πθ(·|st)
8:     // Execute action and determine rewards
9:     r = [r1(st, at), …, rN(st, at)]
10:    τ ← τ ∪ {(st, at, r, πθ(at|st))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Qφk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy πold. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer D containing trajectory snippets τ = {(s0, a0, r0, s1), …, (sT, aT, rT, sT+1)}, where rt denotes a reward vector {rk(st, at)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
min_{φk} E_{τ∼D} [ ( Q_k^ret(st, at) − Qφk(st, at) )² ] ,   (15)

with

Q_k^ret(st, at) = Qφ′k(st, at) + Σ_{j=t}^{T} γ^{j−t} ( Π_{z=t+1}^{j} cz ) δj ,
δj = rk(sj, aj) + γ V(sj+1) − Qφ′k(sj, aj) ,
V(sj+1) = E_{πold(a|sj+1)} [ Qφ′k(sj+1, a) ] .
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {εk}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {ηk}_{k=1}^N and ν, target networks, and online networks such that πθ′ = πθ and Qφ′k = Qφk
3: repeat
4:   repeat
5:     // Collect dataset {si, aij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:     //   si ∼ D, aij ∼ πθ′(a|si), and Q_ijk = Qφ′k(si, aij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, …, N do
10:      δηk ← ∇ηk [ ηk εk + ηk Σ_i (1/L) log ( Σ_j (1/M) exp(Q_ijk / ηk) ) ]
11:      Update ηk based on δηk
12:      q_ijk ∝ exp(Q_ijk / ηk)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δπ ← −∇θ [ Σ_i Σ_j Σ_k q_ijk log πθ(aij|si)
18:          + sg(ν) ( β − Σ_i KL(πθ′(a|si) ‖ πθ(a|si)) ) ]
19:    δν ← ∇ν [ ν ( β − Σ_i KL(πθ′(a|si) ‖ π_sg(θ)(a|si)) ) ]
20:    Update πθ based on δπ
21:    Update ν based on δν
22:
23:    // Update Q-functions
24:    for k = 1, …, N do
25:      δφk ← ∇φk Σ_{(st,at)∈τ∼D} ( Qφk(st, at) − Q_k^ret )² ,
26:        with Q_k^ret as in Eq. (15)
27:      Update φk based on δφk
28:    end for
29:
30:  until fixed number of steps
31:  // Update target networks
32:  πθ′ = πθ, Qφ′k = Qφk
33: until convergence
The importance weights cz are defined as

cz = min ( 1, πold(az|sz) / b(az|sz) ) ,

where b(az|sz) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set ( Π_{z=t+1}^{j} cz ) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
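A minimal sketch of the Retrace target computation for a single objective, assuming the per-step quantities have already been evaluated under the target critic and old policy (array names are ours):

```python
import numpy as np

def retrace_targets(q, v_next, rewards, pi_old, behavior, gamma=0.99):
    """Q^ret targets over one trajectory snippet, as used in Eq. (15).

    q[t]      = Q(s_t, a_t) under the target critic
    v_next[t] = E_{pi_old}[Q(s_{t+1}, .)], e.g. estimated by sampling
    pi_old[t], behavior[t] = probabilities of a_t under pi_old and the behavior policy
    """
    T = len(rewards)
    c = np.minimum(1.0, np.asarray(pi_old) / np.asarray(behavior))  # truncated weights c_z
    delta = np.asarray(rewards) + gamma * np.asarray(v_next) - np.asarray(q)  # TD errors
    targets = np.array(q, dtype=float)
    for t in range(T):
        acc, disc, prod = 0.0, 1.0, 1.0
        for j in range(t, T):
            if j > t:  # the product over c_z is defined to be 1 when j = t
                disc *= gamma
                prod *= c[j]
            acc += disc * prod * delta[j]
        targets[t] += acc
    return targets
```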
C.3. Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

πnew = argmax_θ Σ_{k=1}^{N} ∫_s µ(s) ∫_a qk(a|s) log πθ(a|s) da ds
s.t. ∫_s µ(s) KL(πold(a|s) ‖ πθ(a|s)) ds < β .   (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^{N} ∫_s µ(s) ∫_a qk(a|s) log πθ(a|s) da ds + ν ( β − ∫_s µ(s) KL(πold(a|s) ‖ πθ(a|s)) ds ) ,   (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν) .
To obtain the parameters θ for the updated policy we solvefor ν and θ by alternating between 1) fixing ν to its currentvalue and optimizing for θ and 2) fixing θ to its currentvalue and optimizing for ν This can be applied for anypolicy output distribution
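To picture this alternating scheme, the toy sketch below (a hypothetical one-dimensional problem with made-up step sizes, not the paper's implementation) maximizes a simple concave objective under a quadratic constraint by alternating gradient steps on θ and on the multiplier ν, projecting ν to stay non-negative:

```python
import numpy as np

# Toy instance: maximize f(theta) = -(theta - 2)^2 subject to theta^2 < beta,
# via the Lagrangian L(theta, nu) = f(theta) + nu * (beta - theta^2).
beta, lr = 1.0, 0.01
theta, nu = 0.0, 1.0
for _ in range(5000):
    # (1) with nu fixed, ascend L in theta
    grad_theta = -2.0 * (theta - 2.0) - 2.0 * nu * theta
    theta += lr * grad_theta
    # (2) with theta fixed, descend L in nu; project to keep nu >= 0
    grad_nu = beta - theta ** 2
    nu = max(0.0, nu - lr * grad_nu)
# The unconstrained optimum theta = 2 violates the constraint, so the
# iterates settle on the boundary: theta -> 1, with multiplier nu -> 1.
```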
For Gaussian policies, we decouple the update rules to optimize for the mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π^μ_θ(a|s) = N(a; μ_θ(s), Σ_θold(s)),   (18)
π^Σ_θ(a|s) = N(a; μ_θold(s), Σ_θ(s)).   (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, while policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
max_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) ( log π^μ_θ(a|s) π^Σ_θ(a|s) ) da ds

s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π^μ_θ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(π_old(a|s) ‖ π^Σ_θ(a|s)) ds < β_Σ.   (20)
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, in order to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
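To make the decoupled bounds concrete, the sketch below (an illustrative numpy example with made-up parameter values, not the paper's code) evaluates the two KL terms of Eq. (20) for a diagonal Gaussian: the mean bound sees only the mean change, and the covariance bound only the covariance change:

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL(N(mu1, diag(var1)) || N(mu2, diag(var2))) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

mu_old, var_old = np.zeros(2), np.ones(2)
mu_new, var_new = np.array([0.3, -0.1]), np.array([0.8, 1.2])

# Decoupled KL terms, cf. Eqs. (18)-(20):
kl_mean = kl_diag_gauss(mu_old, var_old, mu_new, var_old)  # KL(pi_old || pi^mu_theta)
kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_new)   # KL(pi_old || pi^Sigma_theta)
```

Each term can then be compared against its own bound (β_μ or β_Σ), which is what makes the independent trust regions possible.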
C4 Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D MO-V-MPO: Multi-Objective On-Policy MPO
In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D1 Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = (s_0, a_0, r_0), ..., (s_T, a_T, r_T), where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:

min_{φ_{1:N}} Σ_{k=1}^N E_τ[ ( G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t) )² ].   (21)
Here, G^(T)_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G^(T)_k(s_t, a_t) = Σ_{l=t}^{T−1} γ^{l−t} r_k(s_l, a_l) + γ^{T−t} V^{π_old}_{φ_k}(s_T).

The advantages are then estimated as A^{π_old}_k(s_t, a_t) = G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t).
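The T-step targets and advantages above can be computed with a single backward recursion. The following numpy sketch (shapes and reward/value numbers are hypothetical) illustrates the computation for one objective:

```python
import numpy as np

def n_step_targets(rewards, values, gamma):
    """T-step value targets G_t = sum_{l=t}^{T-1} gamma^{l-t} r_l + gamma^{T-t} V(s_T),
    bootstrapping from the value at the final state.

    rewards: [T] rewards for one objective; values: [T+1], i.e. V(s_0)..V(s_T).
    """
    T = len(rewards)
    targets = np.empty(T)
    g = values[T]                      # bootstrap from V(s_T)
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g     # accumulate discounted return backwards
        targets[t] = g
    return targets

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.5, 0.5, 10.0])
targets = n_step_targets(rewards, values, gamma=0.9)
advantages = targets - values[:-1]     # A_k(s_t, a_t) = G_t - V(s_t)
```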
D2 Multi-objective policy improvement in the on-policy setting
Given the previous policy π_old(a|s) and the estimated advantages {A^{π_old}_k(s, a)}_{k=1}^N associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds   (22)

s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.
As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k = 0, then objective k makes no contribution to the change of the policy and is effectively ignored.
Equation (22) can be solved in closed form:

q_k(s, a) ∝ p_old(s, a) exp( A_k(s, a) / η_k ),   (23)

where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

η_k = argmin_η [ η ε_k + η log ∫_{s,a} p_old(s, a) exp( A_k(s, a) / η ) da ds ].   (24)
We can perform this optimization alongside the main loss, by taking a gradient descent step on η_k; we initialize η_k with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
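One way to picture this procedure is the toy sketch below, which takes projected gradient steps on a sample-based version of the dual in Eq. (24), i.e. η ε + η log-mean-exp(A/η) over a batch of advantages. The advantages, step size, and the use of a numerical gradient are all made up purely for illustration:

```python
import numpy as np

def eta_dual_step(eta, eps, advantages, lr=0.1, eta_min=1e-6):
    """One gradient-descent step on the sample-based dual
    g(eta) = eta * eps + eta * log mean exp(A / eta),
    followed by a projection to keep eta strictly positive."""
    def dual(e):
        a = advantages / e
        m = a.max()                       # stabilized log-mean-exp
        return e * eps + e * (m + np.log(np.mean(np.exp(a - m))))
    h = 1e-5
    grad = (dual(eta + h) - dual(eta - h)) / (2.0 * h)   # numerical gradient
    return max(eta_min, eta - lr * grad)                  # projected step

eta = 1.0
adv = np.array([0.1, -0.2, 0.5, 0.0])
for _ in range(500):
    eta = eta_dual_step(eta, eps=0.1, advantages=adv)
```

Since the dual is convex in η, gradient descent with the projection converges to the optimal temperature for the given batch.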
Top-k advantages. As in Song et al. (2020), in practice we use the samples corresponding to the top 50% of advantages in each batch of data.
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives, according to the preferences specified by ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:
π_new = argmax_θ Σ_{k=1}^N ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds

s.t. ∫_s KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,   (25)
where θ are the parameters of our function approximator (a neural network), initialized from the weights of the previous policy π_old. The KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, which improves the stability of learning. As in MPO, the KL constraint in this step also has a regularization effect: it prevents the policy from overfitting to the local policies and therefore avoids premature convergence.
In order to optimize Equation (25) we employ Lagrangianrelaxation similar to the one employed for ordinary V-MPO (Song et al 2020)
E Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation of the policy improvement algorithm in (single-objective) MPO, found in the appendix of Abdolmaleki et al. (2018a), to the multi-objective case.
We assume there are observable binary improvement events {R_k}_{k=1}^N, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^N:

max_θ p_θ(R_1 = 1, R_2 = 1, ..., R_N = 1) p(θ),   (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k, and using log probabilities, leads to the following objective:

max_θ Σ_{k=1}^N log p_θ(R_k = 1) + log p(θ).   (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve the stability of learning (Sec. E2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^N log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use it to decompose the log-evidence Σ_{k=1}^N log p_θ(R_k = 1) as follows:

Σ_{k=1}^N log p_θ(R_k = 1)   (28)
= Σ_{k=1}^N KL( q_k(s, a) ‖ p_θ(s, a | R_k = 1) ) − Σ_{k=1}^N KL( q_k(s, a) ‖ p_θ(R_k = 1, s, a) ).
The second term of this decomposition can be expanded as

−Σ_{k=1}^N KL( q_k(s, a) ‖ p_θ(R_k = 1, s, a) )   (29)
= Σ_{k=1}^N E_{q_k(s,a)}[ log ( p_θ(R_k = 1, s, a) / q_k(s, a) ) ]
= Σ_{k=1}^N E_{q_k(s,a)}[ log ( p(R_k = 1 | s, a) π_θ(a|s) μ(s) / q_k(s, a) ) ],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^N log p_θ(R_k = 1). Here π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then, in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^N such that the lower bound on Σ_{k=1}^N log p_θ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize each variational distribution q_k independently:

q_k(a|s) = argmin_q E_{μ(s)}[ KL( q(a|s) ‖ p_θ′(a|s, R_k = 1) ) ]
         = argmin_q E_{μ(s)}[ KL( q(a|s) ‖ π_θ′(a|s) ) − E_{q(a|s)}[ log p(R_k = 1 | s, a) ] ].   (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_θ′(a|s) p(R_k = 1 | s, a) / ∫ π_θ′(a|s) p(R_k = 1 | s, a) da.

This solution weights the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp( Q_k(s, a) / α_k ),
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL( q(a|s) ‖ π_θ′(a|s) ) ds.   (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds   (32)

s.t. ∫ μ(s) KL( q(a|s) ‖ π_θ′(a|s) ) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_θ′(a|s) exp( Q_k(s, a) / η_k ),   (33)
where η_k is computed by solving the following convex dual function:

η_k = argmin_η [ η ε_k + η ∫ μ(s) log ∫ π_θ′(a|s) exp( Q_k(s, a) / η ) da ds ].   (34)
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
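In the sample-based setting, Eq. (33) amounts to a per-state softmax over the M sampled actions with temperature η_k, computed independently for each objective. A small numpy sketch of this weighting (the Q-values and temperatures below are hypothetical):

```python
import numpy as np

def per_objective_weights(q_values, etas):
    """Nonparametric action weights q_k(a_ij|s_i), cf. Eq. (33):
    a softmax over the M sampled actions of Q_k / eta_k, computed
    per state and per objective.

    q_values: [N, L, M] Q-values for N objectives, L states, M actions.
    etas: [N] temperatures, one per objective.
    """
    logits = q_values / np.asarray(etas, dtype=float)[:, None, None]
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

q = np.array([[[1.0, 2.0, 3.0]],    # objective 0 prefers the last action
              [[3.0, 2.0, 1.0]]])   # objective 1 prefers the first action
w = per_objective_weights(q, etas=[0.5, 5.0])
```

A small η_k (objective 0) makes the weighting greedy; a large η_k (objective 1) makes it nearly uniform, which is exactly the role the temperature plays in the dual.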
E2 M-Step
After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound on Σ_{k=1}^N log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).
This corresponds to the supervised learning step of policyimprovement in the main paper (Sec 422)
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) stays close to the target policy π_θ′(a|s). We therefore optimize

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds

s.t. ∫ μ(s) KL( π_θ′(a|s) ‖ π_θ(a|s) ) ds < β.
Because of the constraint on the change of the parametricpolicy we do not greedily optimize the M-step objectiveSee (Abdolmaleki et al 2018b) for more details
Figure 6. Pareto fronts found by MO-MPO and scalarized MPO for humanoid run (top row) and the Shadow Hand tasks. Each dot represents a single trained policy. Corresponding hypervolumes are in Table 1. Task reward is discounted for touch and turn, with a discount factor of 0.99. For orient, the x and y axes are the total reward for the lift and orientation objectives, respectively.
6.2 Results: Humanoid and Shadow Hand
In the run, touch, and turn tasks, the two objectives are competing: a very high preference for minimizing action norm or pain, as opposed to obtaining task reward, will result in a policy that always chooses zero-valued actions. Across these three tasks, the Pareto front found by MO-MPO is superior to the one found by scalarized MPO with respect to the hypervolume metric (Table 1).7 In particular, MO-MPO finds more policies that perform well with respect to both objectives, i.e., that lie in the upper right portion of the Pareto front. MO-MPO also speeds up learning on run and touch (Fig. 6). Qualitatively, MO-MPO trains run policies that look more natural and "human-like"; videos are at https://sites.google.com/view/mo-mpo.
When we scale the action norm penalty by 10× for run, scalarized MPO policies no longer achieve high task reward, whereas MO-MPO policies do (Fig. 6, top right). This supports that MO-MPO's encoding of preferences is indeed scale-invariant. When the objectives are aligned and have similarly scaled rewards, as in the orient task, MO-MPO and scalarized MPO perform similarly, as expected.
Ablation. We also ran vanilla MPO on humanoid run with the same range of weight settings as for scalarized MPO, to investigate how useful Q-decomposition is. In vanilla MPO, we train a single critic on the scalarized reward function,
7. We use hypervolume reference points of [0, −10^4] for run, [0, −10^5] for run with 10× penalty, [0, −20] for touch and turn, and the zero vector for orient and humanoid mocap.
Task                         | scalarized MPO | MO-MPO
-----------------------------|----------------|---------
Humanoid run, mid-training   | 1.1×10^6       | 3.3×10^6
Humanoid run                 | 6.4×10^6       | 7.1×10^6
Humanoid run, 1× penalty     | 5.0×10^6       | 5.9×10^6
Humanoid run, 10× penalty    | 2.6×10^7       | 6.9×10^7
Shadow touch, mid-training   | 1.43           | 1.56
Shadow touch                 | 1.62           | 1.64
Shadow turn                  | 1.44           | 1.54
Shadow orient                | 2.8×10^4       | 2.8×10^4
Humanoid mocap               | 3.86×10^−6     | 3.41×10^−6

Table 1. Hypervolume measurements across tasks and approaches.
which is equivalent to removing Q-decomposition from scalarized MPO. Vanilla MPO trains policies that achieve similar task reward (up to 800), but with twice the action norm penalty (up to −10^4). As a result, the hypervolume of the Pareto front that vanilla MPO finds is more than an order of magnitude worse than that of scalarized MPO and MO-MPO (2.2×10^6, versus 2.6×10^7 and 6.9×10^7, respectively).
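For reference, a two-objective hypervolume like those reported in Table 1 can be computed with a simple sweep over the front. The sketch below is a generic implementation (not necessarily the one used in the paper); it measures the area dominated by a set of points relative to a reference point, assuming both objectives are maximized:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by a set of 2D points w.r.t. a reference point,
    with both objectives maximized. Dominated points contribute nothing."""
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    if not pts:
        return 0.0
    pts = np.asarray(pts, dtype=float)
    pts = pts[np.argsort(-pts[:, 0])]        # sort by first objective, descending
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                       # only nondominated points add area
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area
```

For example, the staircase front {(3, 1), (2, 2), (1, 3)} with reference point (0, 0) has hypervolume 6; adding a dominated point such as (1, 1) leaves the result unchanged.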
6.3 Results: Humanoid Mocap
The objectives in this task are mostly aligned. In contrast to the other experiments, we use V-MPO (Song et al., 2020) as the base algorithm, because it outperforms MPO in learning this task. In addition, since V-MPO is an on-policy variant of MPO, this enables us to evaluate our approach in the on-policy setting. Each training run is very computationally expensive, so we train only a handful of policies each.8 None of the MO-V-MPO policies are dominated by those found by V-MPO. In fact, although the weights span a wide range of "preferences" for the joint velocity, the only policies found by V-MPO that are not dominated by an MO-V-MPO policy are those with extreme values for the joint velocity reward (either ≤ 0.006 or ≥ 0.018), whereas it is between 0.0103 and 0.0121 for MO-V-MPO policies.
Although the hypervolume of the Pareto front found by V-MPO is higher than that of MO-V-MPO (Table 1), finding policies that over- or under-prioritize any objective is undesirable. Qualitatively, the policies trained with MO-V-MPO look more similar to the mocap reference data: they exhibit less feet jittering compared to those trained with scalarized V-MPO; this can be seen in the corresponding video.
6.4 Results: Sawyer Peg-in-Hole
In this task, we would like the robot to prioritize successful task completion, while minimizing wrist forces. With this
8. For MO-V-MPO, we set all ε_k = 0.01. Also, for each objective, we set ε_k = 0.001 and set all others to 0.01. For V-MPO, we fix the weights of four objectives to reasonable values and try weights of [0.1, 0.3, 0.6, 1, 5] for matching joint velocities.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set ε_task = 0.1 and ε_force = 0.05, and for scalarized MPO we try [w_task, w_force] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7 Conclusions and Future Work

In this paper, we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed that these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select the ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
ReferencesAbdolmaleki A Springenberg J T Degrave J Bohez S
Tassa Y Belov D Heess N and Riedmiller M Rela-tive entropy regularized policy iteration arXiv preprintarXiv181202256 2018a
Abdolmaleki A Springenberg J T Tassa Y Munos RHeess N and Riedmiller M Maximum a posterioripolicy optimisation In Proceedings of the Sixth Interna-tional Conference on Learning Representations (ICLR)2018b
Abels A Roijers D M Lenaerts T Nowe A andSteckelmacher D Dynamic weights in multi-objectivedeep reinforcement learning In Proceedings of the 36thInternational Conference on Machine Learning (ICML)2019
Achiam J Held D Tamar A and Abbeel P Constrainedpolicy optimization In Proceedings of the 34th Inter-national Conference on Machine Learning (ICML) pp22ndash31 2017
Amodei D Olah C Steinhardt J Christiano P Schul-man J and Mane D Concrete problems in AI safetyarXiv preprint arXiv160606565 2016
Andersen M S Dahl J and Vandenberghe L CVXOPTA Python package for convex optimization version 12httpscvxoptorg 2020
Anonymous CoMic Co-Training and Mimicry forReusable Skills 2020
Barrett L and Narayanan S Learning all optimal policieswith multiple criteria In Proceedings of the 25th Inter-national Conference on Machine Learning (ICML) pp41ndash47 2008
Barth-Maron G Hoffman M W Budden D Dabney WHorgan D TB D Muldal A Heess N and LillicrapT Distributed distributional deterministic policy gradi-ents In Proceedings of the Sixth International Conferenceon Learning Representations (ICLR) 2018
Bohez S Abdolmaleki A Neunert M Buchli J HeessN and Hadsell R Value constrained model-free contin-uous control arXiv preprint arXiv190204623 2019
Chen X Ghadirzadeh A Bjorkman M and Jensfelt PMeta-learning for multi-objective reinforcement learningarXiv preprint arXiv181103376 2019
Chentanez N Muller M Macklin M Makoviychuk Vand Jeschke S Physics-based motion capture imitationwith deep reinforcement learning In International Con-ference on Motion Interaction and Games (MIG) ACM2018
Chow Y Nachum O Duenez-Guzman E andGhavamzadeh M A Lyapunov-based approach to safereinforcement learning In Proceedings of the 31st Inter-national Conference on Neural Information ProcessingSystems (NeurIPS) pp 8092ndash8101 2018
Daneshmand M Tale Masouleh M Saadatzi M HOzcinar C and Anbarjafari G A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization Scientia Iranica24(6)2977ndash2991 2017
Fortin F-A De Rainville F-M Gardner M-A ParizeauM and Gagne C DEAP Evolutionary algorithms madeeasy Journal of Machine Learning Research 132171ndash2175 July 2012
Friedman E and Fontaine F Generalizing across multi-objective reward functions in deep reinforcement learningarXiv preprint arXiv180906364 2018
Gabor Z Kalmar Z and Szepesvari C Multi-criteriareinforcement learning In Proceedings of the FifteenthInternational Conference on Machine Learning (ICML)pp 197ndash205 1998
Grodzevich O and Romanko O Normalization and othertopics in multi-objective optimization In Proceedings ofthe Fields-MITACS Industrial Problems Workshop 2006
Haarnoja T Zhou A Abbeel P and Levine S Softactor-critic Off-policy maximum entropy deep reinforce-ment learning with a stochastic actor In Proceedings ofthe 35th International Conference on Machine Learning(ICML) pp 1861ndash1870 2018
Heess N Wayne G Silver D Lillicrap T Erez T andTassa Y Learning continuous control policies by stochas-tic value gradients In Advances in Neural InformationProcessing Systems 28 (NIPS) 2015
Hessel M Soyer H Espeholt L Czarnecki W SchmittS and van Hasselt H Multi-task deep reinforcementlearning with popart In Proceedings of the AAAI Confer-ence on Artificial Intelligence volume 33 pp 3796ndash38032019
Huang S H Zambelli M Kay J Martins M F TassaY Pilarski P M and Hadsell R Learning gentle objectmanipulation with curiosity-driven deep reinforcementlearning arXiv preprint arXiv190308542 2019
Ishibuchi H Doi K and Nojima Y On the effect ofnormalization in MOEAD for multi-objective and many-objective optimization Complex amp Intelligent Systems 3(4)279ndash294 2017
Kim I and de Weck O Adaptive weighted sum methodfor multiobjective optimization A new method for paretofront generation Structural and Multidisciplinary Opti-mization 31(2)105ndash116 2006
Kingma D P and Ba J Adam A method for stochasticoptimization In Proceedings of the Third InternationalConference on Learning Representations (ICLR) 2015
Le H Voloshin C and Yue Y Batch policy learningunder constraints In Proceedings of the 36th Inter-national Conference on Machine Learning (ICML) pp3703ndash3712 2019
Levine S Finn C Darrell T and Abbeel P End-to-endtraining of deep visuomotor policies Journal of MachineLearning Research 17(39)1ndash40 2016
Liu C Xu X and Hu D Multiobjective reinforcementlearning A comprehensive overview IEEE Transactionson Systems Man and Cybernetics Systems 45(3)385ndash398 2015
Marler R T and Arora J S Function-transformationmethods for multi-objective optimization EngineeringOptimization 37(6)551ndash570 2005
Martens J New perspectives on the natural gradient methodarXiv preprint arXiv14121193 2014
Merel J Ahuja A Pham V Tunyasuvunakool S LiuS Tirumala D Heess N and Wayne G Hierarchi-cal visuomotor control of humanoids In InternationalConference on Learning Representations (ICLR) 2019
Mnih V Kavukcuoglu K Silver D Rusu A A VenessJ Bellemare M G Graves A Riedmiller M Fidje-land A K Ostrovski G et al Human-level controlthrough deep reinforcement learning Nature 518(7540)529 2015
Moffaert K V and Nowe A Multi-objective reinforcementlearning using sets of Pareto dominating policies Journalof Machine Learning Research 153663ndash3692 2014
Mossalam H Assael Y M Roijers D M and WhitesonS Multi-objective deep reinforcement learning arXivpreprint arXiv161002707 2016
Munos R Stepleton T Harutyunyan A and BellemareM Safe and efficient off-policy reinforcement learningIn Proceedings of the 29th International Conference onNeural Information Processing Systems pp 1054ndash10622016
Natarajan S and Tadepalli P Dynamic preferences inmulti-criteria reinforcement learning In Proceedings ofthe 22nd International Conference on Machine Learning(ICML) pp 601ndash608 2005
Nottingham K Balakrishnan A Deshmukh J Christo-pherson C and Wingate D Using logical specificationsof objectives in multi-objective reinforcement learningarXiv preprint arXiv191001723 2019
Parisi S Pirotta M Smacchia N Bascetta L andRestelli M Policy gradient approaches for multi-objective sequential decision making In Proceedingsof the International Joint Conference on Neural Networks(IJCNN) pp 2323ndash2330 2014
Peng X B Abbeel P Levine S and van de Panne MDeepmimic Example-guided deep reinforcement learn-ing of physics-based character skills ACM Transactionson Graphics 37(4) 2018
Pirotta M Parisi S and Restelli M Multi-objectivereinforcement learning with continuous Pareto frontierapproximation In Proceedings of the Twenty-Ninth AAAIConference on Artificial Intelligence (AAAI) 2015
Reymond M and Nowe A Pareto-DQN Approximat-ing the Pareto front in complex multi-objective decisionproblems In Proceedings of the Adaptive and LearningAgents Workshop at the 18th International Conferenceon Autonomous Agents and MultiAgent Systems AAMAS2019
Riedmiller M Hafner R Lampe T Neunert M De-grave J Van de Wiele T Mnih V Heess N and Sprin-genberg J T Learning by playing - solving sparse rewardtasks from scratch arXiv preprint arXiv1802105672018
Roijers D M Vamplew P Whiteson S and Dazeley RA survey of multi-objective sequential decision-makingJournal of Artificial Intelligence Research 48(1)67ndash1132013
Roijers D M Whiteson S and Oliehoek F A Linearsupport for multi-objective coordination graphs In Pro-ceedings of the International Conference on AutonomousAgents and Multi-Agent Systems (AAMAS) pp 1297ndash1304 2014
Russell S and Zimdars A L Q-decomposition for rein-forcement learning agents In Proceedings of the Twenti-eth International Conference on International Conferenceon Machine Learning (ICML) pp 656ndash663 2003
Schulman J Levine S Abbeel P Jordan M and MoritzP Trust region policy optimization In Proceedings ofthe 32nd International Conference on Machine Learning(ICML) 2015
Schulman J Wolski F Dhariwal P Radford A andKlimov O Proximal policy optimization algorithmsarXiv preprint arXiv170706347 2017
Shadow Robot Company Shadow dexterous handhttpswwwshadowrobotcomproductsdexterous-hand 2020
Silver D Huang A Maddison C J Guez A Sifre Lvan den Driessche G Schrittwieser J Antonoglou IPanneershelvam V Lanctot M Dieleman S GreweD Nham J Kalchbrenner N Sutskever I Lillicrap TLeach M Kavukcuoglu K Graepel T and HassabisD Mastering the game of Go with deep neural networksand tree search Nature 529484ndash503 2016
Song H F Abdolmaleki A Springenberg J T ClarkA Soyer H Rae J W Noury S Ahuja A LiuS Tirumala D Heess N Belov D Riedmiller Mand Botvinick M M V-MPO On-policy maximum aposteriori policy optimization for discrete and continu-ous control In International Conference on LearningRepresentations 2020
Sutton R S Learning to predict by the methods of temporaldifferences Machine learning 3(1)9ndash44 1988
Tan J Zhang T Coumans E Iscen A Bai Y HafnerD Bohez S and Vanhoucke V Sim-to-real Learningagile locomotion for quadruped robots In Proceedingsof Robotics Science and Systems (RSS) 2018
Tassa Y Doron Y Muldal A Erez T Li Yde Las Casas D Budden D Abdolmaleki A Merel JLefrancq A Lillicrap T and Riedmiller M Deepmindcontrol suite arXiv preprint arXiv180100690 2018
Teh Y W Bapst V Czarnecki W M Quan J Kirk-patrick J Hadsell R Heess N and Pascanu R DistralRobust multitask reinforcement learning In Proceedingsof the 31st International Conference on Neural Informa-tion Processing Systems (NeurIPS) 2017
Todorov E Erez T and Tassa Y Mujoco A physicsengine for model-based control In Proceedings of theIEEERSJ International Conference on Intelligent Robotsand Systems (IROS) pp 5026ndash5033 2012
Vamplew P Dazeley R Berry A Issabekov R andDekker E Empirical evaluation methods for multiobjec-tive reinforcement learning algorithms Machine Learn-ing 84(1)51ndash80 Jul 2011
van Hasselt H P Guez A Hessel M Mnih V and Sil-ver D Learning values across many orders of magnitudeIn Advances in Neural Information Processing Systemspp 4287ndash4295 2016
Van Moffaert K Drugan M M and Nowe A Scalar-ized multi-objective reinforcement learning Novel de-sign techniques In Proceedings of the IEEE Symposiumon Adaptive Dynamic Programming and ReinforcementLearning (ADPRL) pp 191ndash199 2013
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., πθ(a|s) = N(µ, Σ). The policy is parametrized by a neural network, which outputs the mean µ = µ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform Aᵢᵢ ← log(1 + exp(Aᵢᵢ)), to ensure positive definiteness of the diagonal covariance matrix.
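As a minimal sketch of this parametrization (the helper names below are ours, not from the paper), the raw network outputs can be mapped to a mean and a positive-definite diagonal covariance:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def gaussian_policy_params(raw_mean, raw_diag):
    """Map raw network outputs to (mu, Sigma): the diagonal Cholesky
    factor A gets positive entries via softplus, and Sigma = A A^T."""
    mu = np.asarray(raw_mean, dtype=float)
    a_diag = softplus(np.asarray(raw_diag, dtype=float))  # A_ii > 0
    A = np.diag(a_diag)
    sigma = A @ A.T  # positive-definite diagonal covariance
    return mu, sigma
```

Since Σ is diagonal, the diagonal entries of A are simply the per-dimension standard deviations.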
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer, followed by a hyperbolic tangent (tanh), is important for the stability of the algorithm.

In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of εk settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.

In the following subsections, we first give suggestions for choosing appropriate εk to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε

Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of εk for each objective k. At first glance, this may seem daunting. However, in practice, we have found that encoding preferences via εk is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set εk for different desired preferences. Recall that each εk controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter: Default

policy network
  layer sizes: (300, 200)
  take tanh of action mean: yes
  minimum variance: 10⁻¹²
  maximum variance: unbounded

critic network(s)
  layer sizes: (400, 400, 300)
  take tanh of action: yes
  Retrace sequence size: 8
  discount factor γ: 0.99

both policy and critic networks
  layer norm on first layer: yes
  tanh on output of layer norm: yes
  activation (after each hidden layer): ELU

MPO
  actions sampled per state: 20
  default ε: 0.1
  KL-constraint on policy mean, βµ: 10⁻³
  KL-constraint on policy covariance, βΣ: 10⁻⁵
  initial temperature η: 1

training
  batch size: 512
  replay buffer size: 10⁶
  Adam learning rate: 3×10⁻⁴
  Adam ε: 10⁻³
  target network update period: 200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose εk in the range of 0.001 to 0.1.

Equal Preference. When all objectives are equally important, the general rule is to set all εk to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.

When setting all εk to the same value, what should this value be? The larger εk is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting εk too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence will lead the policy in the wrong direction. On the other hand, if εk is set too low, then learning slows down, because each per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes: (400, 400, 300)
    take tanh of action mean: no
    minimum variance: 10⁻⁸
  critic network(s)
    layer sizes: (500, 500, 400)
    take tanh of action: no
  MPO
    KL-constraint on policy mean, βµ: 5×10⁻³
    KL-constraint on policy covariance, βΣ: 10⁻⁶
  training
    Adam learning rate: 2×10⁻⁴
    Adam ε: 10⁻⁸

Shadow Hand
  MPO
    actions sampled per state: 30
    default ε: 0.01

Sawyer
  MPO
    actions sampled per state: 15
  training
    target network update period: 100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as εk is not set too high.

Unequal Preference. When there is a difference in preferences across objectives, the relative scale of εk is what matters. The larger the scale of εk is compared to that of εl, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when εl is near-zero for objective l, objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.

One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.

The scale of εk has a similar effect as in the equal-preference
Condition: Settings

Humanoid run (one seed per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1, ε_penalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for εk and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of εk is too high or too low, then the same issues arise as discussed for equal preferences. If all εk increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure

In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
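For instance, the critic input described above can be assembled as follows (a sketch; the action ordering follows the text, with up first):

```python
import numpy as np

ACTIONS = ("up", "right", "down", "left")  # up corresponds to [1, 0, 0, 0]^T

def critic_input(state_xy, action_index):
    """Concatenate the (x, y) grid position with a four-dimensional
    one-hot encoding of the chosen action."""
    one_hot = np.zeros(len(ACTIONS))
    one_hot[action_index] = 1.0
    return np.concatenate([np.asarray(state_xy, dtype=float), one_hot])
```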
Humanoid Mocap

Scalarized V-MPO
  KL-constraint on policy mean, βµ: 0.1
  KL-constraint on policy covariance, βΣ: 10⁻⁵
  default ε: 0.1
  initial temperature η: 1

training
  Adam learning rate: 10⁻⁴
  batch size: 128
  unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap

For the humanoid mocap experiments, we used the following architecture for both MO-V-MPO and V-MPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For V-MPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-V-MPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use n-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure

Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives, time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
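The per-step reward vector follows directly from this description (a minimal sketch, not the full environment):

```python
def dst_reward(treasure_value=None):
    """Deep Sea Treasure reward vector [time penalty, treasure value]:
    -1 per step for the time objective; the collected treasure's value
    (or 0 if none was collected) for the treasure objective."""
    time_penalty = -1.0
    treasure = float(treasure_value) if treasure_value is not None else 0.0
    return [time_penalty, treasure]
```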
B.2. Shadow Hand

Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.

In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30 and 30 degrees. The target angle of the dial is 0. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the peg is included in the agent's observation, encoded as the xyz positions of four corners of the peg (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN

In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing the block with greater than 5 N of force, or for turning the dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.

The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] m(s, s′),   (6)

where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use

a_λ(m) = sigmoid(λ₁(−m + λ₂)),   (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
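Equations 6 and 7 can be computed directly; the sketch below uses λ = [2, 2]ᵀ as in the text:

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) from Eq. 7: sigmoid(lam1 * (-m + lam2)), a monotonically
    decreasing function of the impact force m."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(m):
    """r_pain from Eq. 6: near-zero for small (acceptable) impact forces,
    approaching the negative impact force for large ones."""
    return -(1.0 - acceptability(m)) * m
```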
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES

In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.

The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

min_{q ∈ Q_target} 1 − tanh(2 d(q, q_peg)),

where d(·, ·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
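The two shaped rewards can be sketched as follows (here the axis-angle distances d(q, q_peg) to the eight target orientations are assumed precomputed):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 |z_target - z_peg|)."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def orientation_reward(axis_angle_distances):
    """Shaped orientation reward, computed with respect to the closest of
    the equally-valid target orientations: 1 - tanh(2 * d_min)."""
    return 1.0 - np.tanh(2.0 * min(axis_angle_distances))
```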
B.3. Humanoid

We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.

⁹ This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the negative ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap

The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10 ‖p_com − p_ref_com‖²),

where p_com and p_ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

r_vel = exp(−0.1 ‖q_vel − q_ref_vel‖²),

where q_vel and q_ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The
¹⁰ This database is available at mocap.cs.cmu.edu.
third reward component is based on the difference in end-effector positions:

r_app = exp(−40 ‖p_app − p_ref_app‖²),

where p_app and p_ref_app are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp(−2 ‖q_quat ⊖ q_ref_quat‖²),

where ⊖ denotes quaternion differences, and q_quat and q_ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) (‖b_pos − b_ref_pos‖₁ + ‖q_pos − q_ref_pos‖₁),

where the sum in parentheses is the quantity denoted ε, b_pos and b_ref_pos correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_ref_pos correspond to the joint angles.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),   (8)

varying λ as described in the main paper (Sec. 6.3).
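Putting the components together, a sketch of the tracking rewards and the scalarization in Eq. 8 (we write the exponentiated terms with squared norms, following Peng et al. (2018); the quaternion difference is assumed already reduced to a norm, and the argument names are ours):

```python
import numpy as np

def tracking_rewards(p_com, p_com_ref, qvel, qvel_ref, p_app, p_app_ref,
                     quat_dist, b_err_l1, q_err_l1):
    """The five tracking reward components from Sec. B.4."""
    r_com = np.exp(-10.0 * np.linalg.norm(p_com - p_com_ref) ** 2)
    r_vel = np.exp(-0.1 * np.linalg.norm(qvel - qvel_ref) ** 2)
    r_app = np.exp(-40.0 * np.linalg.norm(p_app - p_app_ref) ** 2)
    r_quat = np.exp(-2.0 * quat_dist ** 2)
    eps = b_err_l1 + q_err_l1      # termination quantity: episode ends if eps > 0.3
    r_trunc = 1.0 - eps / 0.3      # in [0, 1] while eps <= 0.3
    return r_com, r_vel, r_app, r_quat, r_trunc

def scalarized_reward(r_com, r_vel, r_app, r_quat, r_trunc, lam):
    """The V-MPO scalarization in Eq. 8."""
    return 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel
                                  + 0.15 * r_app + 0.65 * r_quat)
```

In the MO-V-MPO experiments each of the five components is instead kept as its own objective.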
B.5. Sawyer

The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x- and y-axis) or 5 N on the vertical axis (z-axis), the episode is terminated early.

There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.

The reward for the task objective is defined as
r_insertion = max(0.2 r_approach, r_inserted),   (9)
r_approach = s(p_peg, p_opening),   (10)
r_inserted = r_aligned · s(p_peg,z, p_bottom,z),   (11)
r_aligned = 1(‖p_peg,xy − p_opening,xy‖ < d/2),   (12)

where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.
Here, s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²( (atanh(√l) / ε) ‖p1 − p2‖₂ ),   (13)

which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
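The shaping operator in Eq. 13 is easy to check numerically: by construction, it equals 1 at zero distance and 1 − l at distance exactly ε:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """s(p1, p2) from Eq. 13: 1 at zero distance, 1 - l at distance eps."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

# With the approach parameters l = 0.95, eps = 0.01, the reward drops
# from 1 (at the target) to 0.05 at a distance of 1 cm.
```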
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,   (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps, by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper, we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^{N}, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_φk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^{N} that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:

min_{φk} E_{τ∼D} [ (Q_k^ret(s_t, a_t) − Q_φk(s_t, a_t))² ],   (15)

with

Q_k^ret(s_t, a_t) = Q_φ′k(s_t, a_t) + Σ_{j=t}^{T} γ^{j−t} (Π_{z=t+1}^{j} c_z) δ_j,
δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q_φ′k(s_j, a_j),
V(s_{j+1}) = E_{π_old(a|s_{j+1})} [Q_φ′k(s_{j+1}, a)].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^{N}, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^{N} and ν, target networks, and online networks, such that π_θ′ = π_θ and Q_φ′k = Q_φk
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:       s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_φ′k(s_i, a_ij)
7:
8:     Compute action distribution for each objective:
9:     for k = 1, ..., N do
10:      δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ_{i=1}^{L} (1/L) log ( Σ_{j=1}^{M} (1/M) exp(Q_ijk / η_k) ) ]
11:      Update η_k based on δ_ηk
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    Update parametric policy with trust region
16:    (sg denotes a stop-gradient):
17:    δ_π ← −∇_θ Σ_{i}^{L} Σ_{j}^{M} Σ_{k}^{N} q_ijk log π_θ(a_ij|s_i) + sg(ν) (β − Σ_{i}^{L} KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)))
18:    δ_ν ← ∇_ν [ ν (β − Σ_{i}^{L} KL(π_θ′(a|s_i) ‖ π_sg(θ)(a|s_i))) ]
19:    Update π_θ based on δ_π
20:    Update ν based on δ_ν
21:
22:    Update Q-functions:
23:    for k = 1, ..., N do
24:      δ_φk ← ∇_φk Σ_{(s_t, a_t)∈τ∼D} (Q_φk(s_t, a_t) − Q_k^ret)², with Q_k^ret as in Eq. 15
25:      Update φ_k based on δ_φk
26:    end for
27:  until fixed number of steps
28:  Update target networks: π_θ′ = π_θ, Q_φ′k = Q_φk
29: until convergence
The importance weights c_z are defined as

c_z = min(1, π_old(a_z|s_z) / b(a_z|s_z)),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set (Π_{z=t+1}^{j} c_z) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
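A plain-Python sketch of the Retrace target in Eq. 15 for a single objective, given precomputed target-network values and truncated importance weights (array names are ours):

```python
import numpy as np

def retrace_targets(r, q_target, v_target, c, gamma=0.99):
    """Compute Q^ret_k(s_t, a_t) for every t in a trajectory snippet.
    r[t]: reward r_k(s_t, a_t); q_target[t]: Q'(s_t, a_t) from the target
    network; v_target[t]: V(s_{t+1}) = E_{pi_old}[Q'(s_{t+1}, a)];
    c[t]: truncated importance weight min(1, pi_old / b)."""
    r, q_target, v_target, c = map(np.asarray, (r, q_target, v_target, c))
    T = len(r)
    delta = r + gamma * v_target - q_target  # TD errors delta_j
    q_ret = np.zeros(T)
    for t in range(T):
        acc, cprod = 0.0, 1.0  # the product over z in (t, j] is 1 when j = t
        for j in range(t, T):
            if j > t:
                cprod *= c[j]
            acc += gamma ** (j - t) * cprod * delta[j]
        q_ret[t] = q_target[t] + acc
    return q_ret
```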
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

π_new = argmax_θ Σ_{k=1}^{N} ∫_s µ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds,
s.t. ∫_s µ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds < β.   (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^{N} ∫_s µ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds + ν (β − ∫_s µ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds),   (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows for setting different KL bounds for the mean (βµ) and covariance (βΣ), which results in more stable learning. To do this, we first separate out the following two policies, for mean and covariance:

π_θ^µ(a|s) = N(a; µ_θ(s), Σ_θold(s)),   (18)
π_θ^Σ(a|s) = N(a; µ_θold(s), Σ_θ(s)).   (19)

Policy π_θ^µ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π_θ^Σ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
max_θ Σ_{k=1}^{N} ∫_s µ(s) ∫_a q_k(a|s) log ( π_θ^µ(a|s) π_θ^Σ(a|s) ) da ds,
s.t. ∫_s µ(s) KL(π_old(a|s) ‖ π_θ^µ(a|s)) ds < βµ,
     ∫_s µ(s) KL(π_old(a|s) ‖ π_θ^Σ(a|s)) ds < βΣ.   (20)

As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
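For diagonal Gaussians, the two decoupled KL terms in Eq. 20 can be sketched as follows; each constraint isolates either the mean or the covariance parameters (variances here stand in for diagonal covariances):

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1)))."""
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def decoupled_kls(mu_old, var_old, mu_online, var_online):
    """KL(pi_old || pi^mu) depends only on the online mean (Eq. 18);
    KL(pi_old || pi^Sigma) depends only on the online covariance (Eq. 19)."""
    kl_mean = kl_diag_gauss(mu_old, var_old, mu_online, var_old)
    kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_online)
    return kl_mean, kl_cov
```

Because each KL sees only one set of online parameters, βµ and βΣ constrain the mean and covariance updates separately.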
C.4. Derivation of the Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-Objective Policy Evaluation in the On-Policy Setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^{N} that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:

min_{φ_1,...,φ_N} Σ_{k=1}^{N} E_τ [ (G_k^(T)(s_t, a_t) − V_φk^πold(s_t))² ].   (21)

Here G_k^(T)(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G_k^(T)(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V_φk^πold(s_{t+T}).

The advantages are then estimated as A_k^πold(s_t, a_t) = G_k^(T)(s_t, a_t) − V_φk^πold(s_t).
D.2. Multi-Objective Policy Improvement in the On-Policy Setting

Given the previous policy π_old(a|s) and estimated advantages {A_k^πold(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions $q_k(s, a)$, we optimize the RL objective for each objective $A_k$:
$$\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds \tag{22}$$
$$\text{s.t.}\;\; \mathrm{KL}\big(q_k(s, a)\,\|\,p_{\text{old}}(s, a)\big) < \epsilon_k,$$
where the KL-divergence is computed over all $(s, a)$, $\epsilon_k$ denotes the allowed expected KL divergence, and $p_{\text{old}}(s, a) = \mu(s)\, \pi_{\text{old}}(a|s)$ is the state-action distribution associated with $\pi_{\text{old}}$.
As in MO-MPO, we use these $\epsilon_k$ to define the preferences over objectives. More concretely, $\epsilon_k$ defines the allowed contribution of objective $k$ to the change of the policy. Therefore, the larger a particular $\epsilon_k$ is with respect to the others, the more objective $k$ is preferred. On the other hand, if $\epsilon_k = 0$, then objective $k$ will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:
$$q_k(s, a) \propto p_{\text{old}}(s, a) \exp\left(\frac{A_k(s, a)}{\eta_k}\right), \tag{23}$$
where the temperature $\eta_k$ is computed based on the constraint $\epsilon_k$ by solving the following convex dual problem:
$$\eta_k = \arg\min_{\eta} \left[\eta\, \epsilon_k + \eta \log \int_{s,a} p_{\text{old}}(s, a) \exp\left(\frac{A_k(s, a)}{\eta}\right) da\, ds\right]. \tag{24}$$
We can perform this optimization alongside the main loss by taking a gradient descent step on $\eta_k$, and we initialize $\eta_k$ with the solution found in the previous policy iteration step. Since $\eta_k$ must be positive, we use a projection operator after each gradient step to maintain $\eta_k > 0$.
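For instance, a sample-based estimate of the dual in Equation (24) and one projected gradient step on $\eta_k$ might look like the following sketch (function and variable names are ours; a finite-difference gradient stands in for autodiff):

```python
import numpy as np

def eta_dual(eta, advantages, epsilon):
    """Monte-Carlo estimate of g(eta) = eta*eps + eta*log E_{p_old}[exp(A/eta)],
    using log-mean-exp over samples for numerical stability."""
    a = advantages / eta
    m = np.max(a)
    log_mean_exp = m + np.log(np.mean(np.exp(a - m)))
    return eta * epsilon + eta * log_mean_exp

def eta_step(eta, advantages, epsilon, lr=1e-2, eta_min=1e-8):
    """One gradient-descent step on eta, projected to stay positive."""
    h = 1e-5
    grad = (eta_dual(eta + h, advantages, epsilon)
            - eta_dual(eta - h, advantages, epsilon)) / (2 * h)
    return max(eta - lr * grad, eta_min)
```

The projection here is a simple clamp; any operator that keeps $\eta_k$ strictly positive serves the same purpose.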
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
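In a sample-based implementation, this filtering is just a boolean mask over the batch (a sketch with our own naming):

```python
import numpy as np

def top_half_mask(advantages):
    """Boolean mask selecting the half of the batch with the largest advantages."""
    k = len(advantages) // 2
    mask = np.zeros(len(advantages), dtype=bool)
    mask[np.argsort(advantages)[-k:]] = True
    return mask
```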
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy $\pi_{\text{new}}(a|s)$ that favors all of the objectives according to the preferences specified by $\epsilon_k$. For this, we solve a supervised learning problem that fits a parametric policy as follows:
$$\pi_{\text{new}} = \arg\max_{\theta} \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds \tag{25}$$
$$\text{s.t.}\;\; \int_{s} \mathrm{KL}\big(\pi_{\text{old}}(a|s)\,\|\,\pi_\theta(a|s)\big)\, ds < \beta,$$
where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi_{\text{old}}$, and the KL constraint enforces a trust region of size $\beta$ that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
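A sample-based sketch of the distillation loss in Equation (25), with the KL constraint handled by a Lagrange multiplier (the names and the fixed-multiplier simplification are ours; in practice the multiplier is itself optimized):

```python
import numpy as np

def mstep_loss(log_pi_new, q_weights, kl_old_new, beta, lam):
    """Cross-entropy of pi_theta to each per-objective distribution q_k,
    plus a Lagrangian penalty for the trust region KL(pi_old || pi_theta) < beta.

    log_pi_new: [B] log pi_theta(a_i | s_i) on sampled state-action pairs
    q_weights:  [N, B] normalized weights representing each q_k
    kl_old_new: [B] per-state KL(pi_old(.|s_i) || pi_theta(.|s_i))
    """
    distill = -np.sum(q_weights * log_pi_new[None, :])   # sum over objectives and samples
    return distill + lam * (np.mean(kl_old_new) - beta)  # KL trust-region penalty
```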
E. Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a $Q$-function for each objective is given, and we would like to improve our policy with respect to these $Q$-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events $\{R_k\}_{k=1}^{N}$, one for each objective. $R_k = 1$ indicates that our policy has improved for objective $k$, while $R_k = 0$ indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^{N}$, what would the parameters $\theta$ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^{N}$:
$$\max_{\theta}\; p_\theta(R_1 = 1, R_2 = 1, \ldots, R_N = 1)\; p(\theta), \tag{26}$$
where the probability of the improvement events depends on $\theta$. Assuming independence between the improvement events $R_k$, and using log probabilities, leads to the following objective:
$$\max_{\theta} \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta). \tag{27}$$
The prior distribution over parameters $p(\theta)$ is fixed during the policy improvement step. We set the prior such that $\pi_\theta$ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of $\theta$ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution $q_k(s, a)$ per objective, and use this to decompose the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) = \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\,\|\,p_\theta(s, a | R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\,\|\,p_\theta(R_k = 1, s, a)\big). \tag{28}$$
The second term of this decomposition is expanded as:
$$-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\,\|\,p_\theta(R_k = 1, s, a)\big) = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\left[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\right] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\left[\log \frac{p(R_k = 1 | s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\right], \tag{29}$$
where $\mu(s)$ is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, $\mu(s)$ is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 | s, a)$ is the likelihood of the improvement event for objective $k$, if our policy chose action $a$ in state $s$.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. $\pi_\theta(a|s)$ and $q_k(a|s) = \frac{q_k(s, a)}{\mu(s)}$ are unknown, even though $\mu(s)$ is given. In the E-step, we estimate $q_k(a|s)$ for each objective $k$ by minimizing the first term, given $\theta = \theta'$ (the parameters of the target policy). Then, in the M-step, we find a new $\theta$ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^{N}$ such that the lower bound on $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ is as tight as possible when $\theta = \theta'$, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose $q_k$ to minimize this term. We can optimize for each variational distribution $q_k$ independently:
$$q_k(a|s) = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\,\|\,p_{\theta'}(s, a | R_k = 1)\big)\Big] = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\log p(R_k = 1 | s, a)\Big]. \tag{30}$$
We can solve this optimization problem in closed form, which gives us
$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 | s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 | s, a)\, da}.$$
This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 | s, a)$ for each objective. We define the likelihood of an improvement event $R_k$ as
$$p(R_k = 1 | s, a) \propto \exp\left(\frac{Q_k(s, a)}{\alpha_k}\right),$$
where $\alpha_k$ is an objective-dependent temperature parameter that controls how greedy the solution $q_k(a|s)$ is with respect to its associated objective $Q_k(s, a)$. For example, given a batch of state-action pairs, at the extremes, an $\alpha_k$ of zero would give all probability to the state-action pair with the maximum $Q$-value, while an $\alpha_k$ of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for $\alpha_k$, we plug the exponential transformation into (30), which leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big)\, ds. \tag{31}$$
If we instead enforce the bound on KL divergence as a hard constraint (which we do in practice), that leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds \tag{32}$$
$$\text{s.t.}\;\; \int \mu(s)\, \mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big)\, ds < \epsilon_k,$$
where, as described in the main paper, the parameter $\epsilon_k$ defines preferences over objectives. If we now assume that $q_k(a|s)$ is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state $s$, by setting
$$q_k(a|s) \propto \pi_{\theta'}(a|s) \exp\left(\frac{Q_k(s, a)}{\eta_k}\right), \tag{33}$$
where $\eta_k$ is computed by solving the following convex dual function:
$$\eta_k = \arg\min_{\eta}\; \eta\, \epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s) \exp\left(\frac{Q_k(s, a)}{\eta}\right) da\, ds. \tag{34}$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
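With a batch of $M$ actions sampled from $\pi_{\theta'}(\cdot|s)$ per state, Equation (33) reduces to a per-objective softmax over the sampled actions' $Q$-values. A minimal sketch, with our own naming:

```python
import numpy as np

def nonparametric_q(q_values, eta):
    """Weights of the improved sample-based distribution q_k(a|s):
    softmax of Q_k(s, a_j) / eta_k over actions a_j ~ pi_old(.|s)."""
    z = q_values / eta
    z -= np.max(z)            # stabilize the exponent
    w = np.exp(z)
    return w / np.sum(w)
```

A small $\eta_k$ concentrates the weights on the highest-$Q$ actions; a large $\eta_k$ spreads them toward uniform, mirroring the role of $\alpha_k$ above.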
E.2. M-Step
After obtaining the variational distributions $q_k(a|s)$ for each objective $k$, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ when $\theta = \theta'$. We can now obtain the parameters $\theta$ of the updated policy $\pi_\theta(a|s)$ by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve $\theta$, we obtain
$$\theta^{*} = \arg\max_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta).$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2). For the prior $p(\theta)$ on the parameters $\theta$, we enforce that the new policy $\pi_\theta(a|s)$ is close to the target policy $\pi_{\theta'}(a|s)$. We therefore optimize for
$$\theta^{*} = \arg\max_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.}\;\; \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s)\,\|\,\pi_\theta(a|s)\big)\, ds < \beta.$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
Figure 7. Task reward (left) and wrist force penalty (right) learning curves for the Sawyer peg-in-hole task. The policy trained with MO-MPO quickly learns to optimize both objectives, whereas the other two do not. Each line represents a single trained policy.
in mind, for MO-MPO we set εtask = 0.1 and εforce = 0.05, and for scalarized MPO we try [wtask, wforce] = [0.95, 0.05] and [0.8, 0.2]. We find that policies trained with scalarized MPO focus all learning on a single objective at the beginning of training; we also observed this in the touch task, where scalarized MPO policies quickly learn to either maximize task reward or minimize pain penalty, but not both (Fig. 6, bottom left). In contrast, the policy trained with MO-MPO simultaneously optimizes for both objectives throughout training. In fact, throughout training, the MO-MPO policy does just as well with respect to task reward as the scalarized MPO policy that cares more about task reward, and similarly for wrist force penalty (Fig. 7).
7. Conclusions and Future Work

In this paper we presented a new distributional perspective on multi-objective reinforcement learning, derived from the RL-as-inference perspective. This view leads to two novel multi-objective RL algorithms, namely MO-MPO and MO-V-MPO. We showed that these algorithms enable practitioners to encode preferences in a scale-invariant way. Although MO-(V-)MPO is a single-policy approach to MORL, it is capable of producing a variety of Pareto-dominant policies. A limitation of this work is that we produced this set of policies by iterating over a relatively large number of ε's, which is computationally expensive. In future work, we plan to extend MO-(V-)MPO into a true multiple-policy MORL approach, either by conditioning the policy on settings of ε's, or by developing a way to strategically select the ε's to train policies for, analogous to what prior work has done for weights (e.g., Roijers et al. (2014)).
Acknowledgments

The authors would like to thank Guillaume Desjardins, Csaba Szepesvari, Jost Tobias Springenberg, Steven Bohez, Philemon Brakel, Brendan Tracey, Jonas Degrave, Jonas Buchli, Leslie Fritz, Chloe Rosenberg, and many others at DeepMind for their support and feedback on this paper.
References

Abdolmaleki, A., Springenberg, J. T., Degrave, J., Bohez, S., Tassa, Y., Belov, D., Heess, N., and Riedmiller, M. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018b.

Abels, A., Roijers, D. M., Lenaerts, T., Nowe, A., and Steckelmacher, D. Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22-31, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mane, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andersen, M. S., Dahl, J., and Vandenberghe, L. CVXOPT: A Python package for convex optimization, version 1.2. https://cvxopt.org, 2020.

Anonymous. CoMic: Co-Training and Mimicry for Reusable Skills. 2020.

Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 41-47, 2008.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.

Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., and Hadsell, R. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623, 2019.

Chen, X., Ghadirzadeh, A., Bjorkman, M., and Jensfelt, P. Meta-learning for multi-objective reinforcement learning. arXiv preprint arXiv:1811.03376, 2019.

Chentanez, N., Muller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. Physics-based motion capture imitation with deep reinforcement learning. In International Conference on Motion, Interaction and Games (MIG). ACM, 2018.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092-8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977-2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagne, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171-2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gabor, Z., Kalmar, Z., and Szepesvari, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197-205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861-1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796-3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279-294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105-116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703-3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1-40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385-398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551-570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowe, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663-3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054-1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601-608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323-2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowe, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67-113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297-1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656-663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026-5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51-80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287-4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowe, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191-199, 2013.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398-5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418-3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610-14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Puschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619-3650, 2016.
A. Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., $\pi_\theta(a|s) = \mathcal{N}(\mu, \Sigma)$. The policy is parametrized by a neural network, which outputs the mean $\mu = \mu(s)$ and diagonal Cholesky factors $A = A(s)$, such that $\Sigma = AA^{T}$. The diagonal factor $A$ has positive diagonal elements enforced by the softplus transform $A_{ii} \leftarrow \log(1 + \exp(A_{ii}))$, to ensure positive definiteness of the diagonal covariance matrix.
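The covariance construction can be sketched numerically as follows (a minimal illustration with our own naming, not the paper's code):

```python
import numpy as np

def gaussian_policy_params(mean, pre_diag):
    """Turn raw network outputs into (mu, Sigma): a softplus keeps the
    diagonal Cholesky factor A positive, so Sigma = A A^T is positive definite."""
    diag = np.log1p(np.exp(pre_diag))  # softplus(A_ii) = log(1 + exp(A_ii))
    A = np.diag(diag)
    return mean, A @ A.T
```

Because the softplus output is strictly positive, every diagonal entry of $\Sigma$ is positive, so the resulting Gaussian is always well defined.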
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that, for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of $\epsilon_k$ settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate $\epsilon_k$ to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of $\epsilon_k$ for each objective $k$. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via $\epsilon_k$ is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set $\epsilon_k$ for different desired preferences. Recall that each $\epsilon_k$ controls the influence of objective $k$ on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter: Default

policy network
  layer sizes: (300, 200)
  take tanh of action mean: yes
  minimum variance: 10^-12
  maximum variance: unbounded

critic network(s)
  layer sizes: (400, 400, 300)
  take tanh of action: yes
  Retrace sequence size: 8
  discount factor γ: 0.99

both policy and critic networks
  layer norm on first layer: yes
  tanh on output of layer norm: yes
  activation (after each hidden layer): ELU

MPO
  actions sampled per state: 20
  default ε: 0.1
  KL-constraint on policy mean, βμ: 10^-3
  KL-constraint on policy covariance, βΣ: 10^-5
  initial temperature η: 1

training
  batch size: 512
  replay buffer size: 10^6
  Adam learning rate: 3×10^-4
  Adam ε: 10^-3
  target network update period: 200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose $\epsilon_k$ in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all $\epsilon_k$ to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all $\epsilon_k$ to the same value, what should this value be? The larger $\epsilon_k$ is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting $\epsilon_k$ too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if $\epsilon_k$ is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes: (400, 400, 300)
    take tanh of action mean: no
    minimum variance: 10^-8
  critic network(s)
    layer sizes: (500, 500, 400)
    take tanh of action: no
  MPO
    KL-constraint on policy mean, βμ: 5×10^-3
    KL-constraint on policy covariance, βΣ: 10^-6
  training
    Adam learning rate: 2×10^-4
    Adam ε: 10^-8

Shadow Hand
  MPO
    actions sampled per state: 30
    default ε: 0.01

Sawyer
  MPO
    actions sampled per state: 15
  training
    target network update period: 100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as $\epsilon_k$ is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of the $\epsilon_k$ is what matters. The larger the relative scale of $\epsilon_k$ is compared to $\epsilon_l$, the more influence objective $k$ has over the policy update, compared to objective $l$. In the extreme case, when $\epsilon_l$ is near-zero for objective $l$, then objective $l$ will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) rather than doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition: Settings

Humanoid run (one seed per setting)
  scalarized MPO: wtask = 1 - wpenalty, wpenalty ∈ linspace(0, 0.15, 100)
  MO-MPO: εtask = 0.1, εpenalty ∈ linspace(10^-6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: wtask = 1 - wpenalty, wpenalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: εtask = 0.1, εpenalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: wtask = 1 - wpenalty, wpenalty ∈ linspace(0, 0.9, 10)
  MO-MPO: εtask = 0.01, εpenalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: wtouch = wheight = worientation = 1/3
  MO-MPO: εtouch = εheight = εorientation = 0.01

Table 4. Settings for εk and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of $\epsilon_k$ is too high or too low, then the same issues arise as discussed for equal preferences. If all $\epsilon_k$ increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether εtime is 0.01, 0.02, or 0.05, the relationship between the relative scale of εtreasure and εtime and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]^T). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
A Distributional View on Multi-Objective Policy Optimization
Humanoid Mocap

Scalarized VMPO
  KL constraint on policy mean, β_μ: 0.1
  KL constraint on policy covariance, β_Σ: 10⁻⁵
  default ε: 0.1
  initial temperature η: 1

Training
  Adam learning rate: 10⁻⁴
  batch size: 128
  unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment, from Vamplew et al. (2011) with weights from Yang et al. (2019). Treasures are labeled with their respective values. The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives, time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
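The transition and reward-vector logic above can be sketched as a minimal grid world. The 3×3 layout, treasure positions, and values below are made up for illustration (the paper's actual 11×10 layout and values are shown in Fig. 8); only the movement rules and the [time, treasure] reward vector follow the text:

```python
class DeepSeaTreasure:
    """Minimal DST-style grid world with a two-dimensional reward vector."""
    ACTIONS = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}

    def __init__(self):
        # Hypothetical 3x3 layout; farther treasures are worth more.
        self.treasures = {(0, 2): 1.0, (2, 2): 5.0}   # (x, y) -> value
        self.blocked = {(1, 2)}                        # sea-floor cells
        self.width, self.height, self.max_steps = 3, 3, 200
        self.reset()

    def reset(self):
        self.pos, self.t = (0, 0), 0
        return self.pos

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        # Moves into the sea floor or out of bounds have no effect.
        if 0 <= x < self.width and 0 <= y < self.height and (x, y) not in self.blocked:
            self.pos = (x, y)
        self.t += 1
        treasure = self.treasures.get(self.pos, 0.0)
        done = treasure > 0 or self.t >= self.max_steps
        return self.pos, [-1.0, treasure], done   # reward vector: [time, treasure]
```

Every step pays the −1 time penalty; the treasure component is nonzero only on the terminating step that collects a treasure.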
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
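The sensor simplification described above amounts to a sum over the two spatial axes; a one-line sketch (the function name is ours):

```python
import numpy as np

def simplify_touch(sensor):
    """sensor: (4, 4, 3) array of normal and x/y tangential fingertip forces."""
    return sensor.sum(axis=(0, 1))   # -> (3,) observation per fingertip
```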
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the peg is included in the agent's observation, encoded as the xyz positions of four of its corners (based on how Levine et al. (2016) encodes end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:

r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] m(s, s′),    (6)

where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use

a_λ(m) = sigmoid(λ₁(−m + λ₂)),    (7)

with λ = [2, 2]^⊤. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
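Eqs. (6) and (7) can be computed directly; a small sketch with the paper's λ = [2, 2] (function names are ours):

```python
import math

LAMBDA1, LAMBDA2 = 2.0, 2.0   # lambda = [2, 2] as in the paper

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def acceptability(m):
    # a_lambda(m) in [0, 1], monotonically decreasing in the impact force m.
    return sigmoid(LAMBDA1 * (-m + LAMBDA2))

def pain_penalty(m):
    # r_pain = -[1 - a_lambda(m)] * m  (Eq. 6)
    return -(1.0 - acceptability(m)) * m
```

Small impacts are mostly "acceptable" and incur a near-zero penalty, while large impacts are penalized by approximately their full magnitude, matching Fig. 10.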
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation is shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion as

min_{q ∈ Q_target} 1 − tanh(2 d(q, q_peg)),

where d(·, ·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly to evaluate a given action in a state-independent way during policy optimization.
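Both objectives are simple enough to compute directly; a sketch (function names are ours):

```python
import numpy as np

def task_reward(horizontal_speed):
    # Shaped run reward: min(h / 10, 1) for horizontal speed h in m/s.
    return min(horizontal_speed / 10.0, 1.0)

def action_penalty(action):
    # Energy penalty: negative l2-norm of the action vector,
    # computable without a learned Q-function.
    return -float(np.linalg.norm(action))
```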
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to that of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10 ‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

r_vel = exp(−0.1 ‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

r_app = exp(−40 ‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp(−2 ‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes quaternion differences, and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) (‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁),

where the parenthesized sum is denoted ε, b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.

¹⁰This database is available at mocap.cs.cmu.edu.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
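A sketch of the five components and the Eq. (8) scalarization, assuming the exponents act on squared norms (as in Peng et al. (2018)) and that quat_diff already holds the per-joint quaternion-difference vector; all names and array shapes are illustrative:

```python
import numpy as np

def tracking_rewards(p_com, p_com_ref, q_vel, q_vel_ref,
                     p_app, p_app_ref, quat_diff,
                     b_pos, b_pos_ref, q_pos, q_pos_ref):
    r_com = np.exp(-10.0 * np.sum((p_com - p_com_ref) ** 2))
    r_vel = np.exp(-0.1 * np.sum((q_vel - q_vel_ref) ** 2))
    r_app = np.exp(-40.0 * np.sum((p_app - p_app_ref) ** 2))
    r_quat = np.exp(-2.0 * np.sum(quat_diff ** 2))
    # eps is the l1 mismatch used both in r_trunc and for early termination.
    eps = np.abs(b_pos - b_pos_ref).sum() + np.abs(q_pos - q_pos_ref).sum()
    r_trunc = 1.0 - eps / 0.3          # in [0, 1] as long as eps <= 0.3
    rewards = dict(com=r_com, vel=r_vel, app=r_app, quat=r_quat, trunc=r_trunc)
    return rewards, eps > 0.3          # (reward components, terminate-early flag)

def vmpo_scalarized_reward(r, lam):
    # Eq. (8): r = 1/2 r_trunc + 1/2 (0.1 r_com + lam r_vel + 0.15 r_app + 0.65 r_quat)
    return 0.5 * r["trunc"] + 0.5 * (0.1 * r["com"] + lam * r["vel"]
                                     + 0.15 * r["app"] + 0.65 * r["quat"])
```

MO-VMPO would consume the five components as separate objectives, while VMPO consumes only the scalarized sum.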
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis), or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as

r_insertion = max(0.2 r_approach, r_inserted),    (9)
r_approach = s(p^peg, p^opening),    (10)
r_inserted = r_aligned · s(p_z^peg, p_z^bottom),    (11)
r_aligned = 1[ ‖p_xy^peg − p_xy^opening‖ < d/2 ],    (12)

where p^peg is the peg position, p^opening the position of the hole opening, p^bottom the bottom position of the hole, and d the diameter of the hole.
s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²( (atanh(√l) / ε) ‖p1 − p2‖₂ ),    (13)
which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p^peg = p^opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
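The shaping operator of Eq. (13) is easy to verify numerically; a sketch (by construction, s returns exactly 1 − l when the points are at distance ε):

```python
import numpy as np

def shaping(p1, p2, l, eps):
    # s(p1, p2) = 1 - tanh^2( (atanh(sqrt(l)) / eps) * ||p1 - p2||_2 )
    d = np.linalg.norm(np.asarray(p1, dtype=np.float64)
                       - np.asarray(p2, dtype=np.float64))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2
```

With the paper's constants, shaping(·, ·, 0.95, 0.01) is the approach term and shaping(·, ·, 0.5, 0.04) is the insertion term.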
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,    (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated to objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q̂_{φ_k}(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
min_{φ_k} E_{τ∼D} [ (Q_k^ret(s_t, a_t) − Q̂_{φ_k}(s_t, a_t))² ],    (15)

with

Q_k^ret(s_t, a_t) = Q̂_{φ′_k}(s_t, a_t) + Σ_{j=t}^T γ^{j−t} (Π_{z=t+1}^j c_z) δ_j,
δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q̂_{φ′_k}(s_j, a_j),
V(s_{j+1}) = E_{π_old(a|s_{j+1})} [ Q̂_{φ′_k}(s_{j+1}, a) ].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks, such that π_θ′ = π_θ and Q̂_{φ′_k} = Q̂_{φ_k}
3: repeat
4:   repeat
5:     Collect dataset {s^i, a^{ij}, Q^{ijk}}_{i,j,k}^{L,M,N}, where
6:       s^i ∼ D, a^{ij} ∼ π_θ′(a|s^i), and Q^{ijk} = Q̂_{φ′_k}(s^i, a^{ij})
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_{η_k} ← ∇_{η_k} [ η_k ε_k + η_k Σ_i^L (1/L) log ( Σ_j^M (1/M) exp(Q^{ijk} / η_k) ) ]
11:      Update η_k based on δ_{η_k}
12:      q^{ijk} ∝ exp(Q^{ijk} / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ [ Σ_i^L Σ_j^M Σ_k^N q^{ijk} log π_θ(a^{ij}|s^i)
18:           + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s^i), π_θ(a|s^i)) ) ]
19:    δ_ν ← ∇_ν [ ν ( β − Σ_i^L KL(π_θ′(a|s^i), π_{sg(θ)}(a|s^i)) ) ]
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_{φ_k} ← ∇_{φ_k} Σ_{(s_t,a_t)∈τ∼D} (Q̂_{φ_k}(s_t, a_t) − Q_k^ret)²,
26:        with Q_k^ret as in Eq. 15
27:      Update φ_k based on δ_{φ_k}
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q̂_{φ′_k} = Q̂_{φ_k}
33: until convergence
The importance weights c_z are defined as

c_z = min(1, π_old(a_z|s_z) / b(a_z|s_z)),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set (Π_{z=t+1}^j c_z) = 1.
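A sketch of computing the Retrace target of Eq. (15) for one objective over a snippet, given precomputed bootstrap values and truncated importance weights (the array-based interface is our assumption):

```python
import numpy as np

def retrace_targets(rewards, q_target, v_next, c, gamma):
    """rewards[j] = r_k(s_j, a_j); q_target[j] = Q_target(s_j, a_j);
    v_next[j] = E_{pi_old}[Q_target(s_{j+1}, .)]; c[j] = min(1, pi_old/b)."""
    T = len(rewards)
    deltas = rewards + gamma * v_next - q_target   # delta_j
    q_ret = np.array(q_target, dtype=np.float64)
    for t in range(T):
        acc, coef = 0.0, 1.0                       # empty product of c at j = t
        for j in range(t, T):
            acc += coef * deltas[j]                # gamma^{j-t} * prod(c) * delta_j
            if j + 1 < T:
                coef *= gamma * c[j + 1]
        q_ret[t] += acc
    return q_ret
```

With all importance weights equal to 1 and consistent bootstrap values, this reduces to the usual n-step TD correction.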
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

π_new = argmax_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
s.t. ∫_s μ(s) KL(π_old(a|s), π_θ(a|s)) ds < β.    (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds + ν ( β − ∫_s μ(s) KL(π_old(a|s), π_θ(a|s)) ds ),    (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π_θ^μ(a|s) = N(a; μ_θ(s), Σ_{θ_old}(s)),    (18)
π_θ^Σ(a|s) = N(a; μ_{θ_old}(s), Σ_θ(s)).    (19)

Policy π_θ^μ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π_θ^Σ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

max_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) ( log π_θ^μ(a|s) π_θ^Σ(a|s) ) da ds
s.t. ∫_s μ(s) KL(π_old(a|s), π_θ^μ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(π_old(a|s), π_θ^Σ(a|s)) ds < β_Σ.    (20)

As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, to keep the exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
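For diagonal Gaussians, the two decoupled KL terms reduce to closed forms; a sketch (the diagonal-covariance restriction and function names are our simplifications):

```python
import numpy as np

def kl_mean(mu_old, mu_new, sigma_old):
    # KL(N(mu_old, S_old) || N(mu_new, S_old)): only the mean term survives.
    return 0.5 * np.sum((mu_new - mu_old) ** 2 / sigma_old ** 2)

def kl_cov(sigma_old, sigma_new):
    # KL(N(mu, S_old) || N(mu, S_new)): only the covariance terms survive.
    r = sigma_old ** 2 / sigma_new ** 2
    return 0.5 * np.sum(r - 1.0 - np.log(r))
```

Bounding these two quantities separately (β_μ for the first, β_Σ for the second) is what allows a much tighter trust region on the covariance than on the mean.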
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D.1. Multi-Objective Policy Evaluation in the On-Policy Setting
In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:

min_{φ_{1:N}} Σ_{k=1}^N E_τ [ (G_k^{(T)}(s_t, a_t) − V_{φ_k}^{π_old}(s_t))² ].    (21)

Here G_k^{(T)}(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G_k^{(T)}(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V_{φ_k}^{π_old}(s_T).

The advantages are then estimated as A_k^{π_old}(s_t, a_t) = G_k^{(T)}(s_t, a_t) − V_{φ_k}^{π_old}(s_t).
D.2. Multi-Objective Policy Improvement in the On-Policy Setting

Given the previous policy π_old(a|s) and the estimated advantages {A_k^{π_old}(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).

D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds    (22)
s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.

As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more that objective k is preferred. On the other hand, if ε_k is zero, then objective k will have no contribution to the change of the policy, and will effectively be ignored.
Equation (22) can be solved in closed form:

q_k(s, a) ∝ p_old(s, a) exp( A_k(s, a) / η_k ),    (23)

where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

η_k = argmin_η [ η ε_k + η log ∫_{s,a} p_old(s, a) exp( A_k(s, a) / η ) da ds ].    (24)

We can perform the optimization along with the loss, by taking a gradient descent step on η_k, and we initialize with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
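The dual can be minimized as described, by gradient steps with a positivity projection; a dependency-free sketch that uses a Monte Carlo estimate of the integral over a batch of sampled advantages and a finite-difference gradient (the learning rate and step count are arbitrary choices):

```python
import numpy as np

def dual(eta, adv, epsilon):
    # g(eta) = eta*epsilon + eta*log E[exp(A/eta)], estimated over a batch,
    # with log-sum-exp stabilization.
    z = adv / eta
    m = z.max()
    return eta * epsilon + eta * (m + np.log(np.mean(np.exp(z - m))))

def solve_temperature(advantages, epsilon, eta=1.0, lr=0.1, steps=2000):
    adv = np.asarray(advantages, dtype=np.float64)
    for _ in range(steps):
        grad = (dual(eta + 1e-5, adv, epsilon)
                - dual(eta - 1e-5, adv, epsilon)) / 2e-5
        eta = max(eta - lr * grad, 1e-6)     # project back to eta > 0
    return eta
```

At the minimizer, the KL between the reweighted distribution exp(A/η) (normalized over the batch) and the sampling distribution equals ε, which gives a simple sanity check.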
Top-k advantages: As in Song et al. (2020), in practice we used the samples corresponding to the top 50% of advantages in each batch of data.
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives, according to the preferences specified by ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:

π_new = argmax_θ Σ_{k=1}^N ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
s.t. ∫_s KL(π_old(a|s), π_θ(a|s)) ds < β,    (25)

where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in appendix) to the multi-objective case.
We assume there are observable binary improvement events {R_k}_{k=1}^N for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^N:

max_θ p_θ(R_1 = 1, R_2 = 1, ..., R_N = 1) p(θ),    (26)

where the probability of the improvement events depends on θ. Assuming independence between improvement events R_k, and using log probabilities, leads to the following objective:

max_θ Σ_{k=1}^N log p_θ(R_k = 1) + log p(θ).    (27)

The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^N log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^N log p_θ(R_k = 1) as follows:

Σ_{k=1}^N log p_θ(R_k = 1)    (28)
= Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(s, a | R_k = 1)) − Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a)).
The second term of this decomposition is expanded as

−Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a))    (29)
= Σ_{k=1}^N E_{q_k(s,a)} [ log ( p_θ(R_k = 1, s, a) / q_k(s, a) ) ]
= Σ_{k=1}^N E_{q_k(s,a)} [ log ( p(R_k = 1 | s, a) π_θ(a|s) μ(s) / q_k(s, a) ) ],

where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^N log p_θ(R_k = 1). π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then, in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^N such that the lower bound on Σ_{k=1}^N log p_θ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:

q_k(a|s) = argmin_q E_{μ(s)} [ KL(q(a|s) ‖ p_{θ′}(s, a | R_k = 1)) ]
         = argmin_q E_{μ(s)} [ KL(q(a|s) ‖ π_{θ′}(a|s)) − E_{q(a|s)} log p(R_k = 1 | s, a) ].    (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_{θ′}(a|s) p(R_k = 1 | s, a) / ( ∫ π_{θ′}(a|s) p(R_k = 1 | s, a) da ).

This solution weighs the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp( Q_k(s, a) / α_k ),

where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to
q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds.    (31)

If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds    (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in Abdolmaleki et al. (2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_{θ′}(a|s) exp( Q_k(s, a) / η_k ),    (33)

where η_k is computed by solving the following convex dual function:

η_k = argmin_η [ η ε_k + η ∫ μ(s) log ∫ π_{θ′}(a|s) exp( Q_k(s, a) / η ) da ds ].    (34)

Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in Abdolmaleki et al. (2018b) for details on the derivation of the dual function.
E.2. M-Step

After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^N log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_{θ′}(a|s). We therefore optimize for

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds
s.t. ∫ μ(s) KL(π_{θ′}(a|s), π_θ(a|s)) ds < β.

Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See Abdolmaleki et al. (2018b) for more details.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 8092–8101, 2018.

Daneshmand, M., Tale Masouleh, M., Saadatzi, M. H., Ozcinar, C., and Anbarjafari, G. A robust proportion-preserving composite objective function for scale-invariant multi-objective optimization. Scientia Iranica, 24(6):2977–2991, 2017.

Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., and Gagné, C. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 13:2171–2175, July 2012.

Friedman, E. and Fontaine, F. Generalizing across multi-objective reward functions in deep reinforcement learning. arXiv preprint arXiv:1809.06364, 2018.

Gábor, Z., Kálmár, Z., and Szepesvári, C. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 197–205, 1998.

Grodzevich, O. and Romanko, O. Normalization and other topics in multi-objective optimization. In Proceedings of the Fields-MITACS Industrial Problems Workshop, 2006.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861–1870, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.

Huang, S. H., Zambelli, M., Kay, J., Martins, M. F., Tassa, Y., Pilarski, P. M., and Hadsell, R. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.

Ishibuchi, H., Doi, K., and Nojima, Y. On the effect of normalization in MOEA/D for multi-objective and many-objective optimization. Complex & Intelligent Systems, 3(4):279–294, 2017.

Kim, I. and de Weck, O. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations (ICLR), 2015.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3703–3712, 2019.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Liu, C., Xu, X., and Hu, D. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.

Marler, R. T. and Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, 37(6):551–570, 2005.

Martens, J. New perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., Heess, N., and Wayne, G. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations (ICLR), 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moffaert, K. V. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014.

Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 1054–1062, 2016.

Natarajan, S. and Tadepalli, P. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 601–608, 2005.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, July 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.

van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., π_θ(a|s) = N(µ, Σ). The policy is parametrized by a neural network, which outputs the mean µ = µ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
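A minimal sketch of this output parametrization (the function name and shapes are our own):

```python
import numpy as np

def gaussian_head(net_out, action_dim):
    """Split a flat network output into a mean and a diagonal covariance.

    The softplus transform enforces positive diagonal entries of A,
    so Sigma = A A^T is a positive-definite diagonal covariance.
    """
    mu = net_out[:action_dim]
    raw_diag = net_out[action_dim:]
    diag = np.log1p(np.exp(raw_diag))   # softplus: log(1 + exp(x))
    A = np.diag(diag)
    Sigma = A @ A.T
    return mu, Sigma
```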
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of ε_k settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate ε_k to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives to numerical choices of ε_k for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via ε_k is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set ε_k for different desired preferences. Recall that each ε_k controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter                              Default

policy network
  layer sizes                               (300, 200)
  take tanh of action mean                  yes
  minimum variance                          10^-12
  maximum variance                          unbounded

critic network(s)
  layer sizes                               (400, 400, 300)
  take tanh of action                       yes
  Retrace sequence size                     8
  discount factor γ                         0.99

both policy and critic networks
  layer norm on first layer                 yes
  tanh on output of layer norm              yes
  activation (after each hidden layer)      ELU

MPO
  actions sampled per state                 20
  default ε                                 0.1
  KL-constraint on policy mean, β_µ         10^-3
  KL-constraint on policy covariance, β_Σ   10^-5
  initial temperature η                     1

training
  batch size                                512
  replay buffer size                        10^6
  Adam learning rate                        3×10^-4
  Adam ε                                    10^-3
  target network update period              200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose ε_k in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all ε_k to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all ε_k to the same value, what should this value be? The larger ε_k is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting ε_k too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if ε_k is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes                             (400, 400, 300)
    take tanh of action mean                no
    minimum variance                        10^-8
  critic network(s)
    layer sizes                             (500, 500, 400)
    take tanh of action                     no
  MPO
    KL-constraint on policy mean, β_µ       5×10^-3
    KL-constraint on policy covariance, β_Σ 10^-6
  training
    Adam learning rate                      2×10^-4
    Adam ε                                  10^-8

Shadow Hand
  MPO
    actions sampled per state               30
    default ε                               0.01

Sawyer
  MPO
    actions sampled per state               15
  training
    target network update period            100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually, though, the learning will converge to more or less the same policy, as long as ε_k is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of the ε_k is what matters. The larger the scale of ε_k is compared to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near-zero for objective l, then objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) rather than doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition            Settings

Humanoid run (one seed per setting)
  scalarized MPO     w_task = 1 − w_penalty,
                     w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO             ε_task = 0.1,
                     ε_penalty ∈ linspace(10^-6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO     w_task = 1 − w_penalty,
                     w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO             ε_task = 0.1,
                     ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO     w_task = 1 − w_penalty,
                     w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO             ε_task = 0.01,
                     ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO     w_touch = w_height = w_orientation = 1/3
  MO-MPO             ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case: if the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right): in the Deep Sea Treasure domain, regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
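The one-hot critic input described above can be sketched as follows (the helper name is ours, not from the paper):

```python
import numpy as np

def critic_input(state, action_idx, num_actions=4):
    """Concatenate the state with a one-hot encoding of the chosen action,
    e.g. action 0 ("up") becomes [1, 0, 0, 0]."""
    one_hot = np.zeros(num_actions)
    one_hot[action_idx] = 1.0
    return np.concatenate([state, one_hot])
```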
Humanoid Mocap

Scalarized VMPO
  KL-constraint on policy mean, β_µ         0.1
  KL-constraint on policy covariance, β_Σ   10^-5
  default ε                                 0.1
  initial temperature η                     1

Training
  Adam learning rate                        10^-4
  batch size                                128
  unroll length (for n-step return, Sec. D.1)  32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations with a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, and 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
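The two-objective reward vector can be written out directly (a trivial sketch; the function name is ours):

```python
def dst_reward(treasure_value=None):
    """Reward vector [time penalty, treasure value] for one DST step.

    Every step costs -1 on the time objective; the treasure objective is
    the value of the collected treasure, or 0 if none was collected.
    """
    treasure = float(treasure_value) if treasure_value is not None else 0.0
    return [-1.0, treasure]
```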
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the peg is included in the agent's observation, encoded as the xyz positions of four corners of the peg (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here (pain penalty versus impact force). For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN

In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or for turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.

The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] · m(s, s′),    (6)

where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use

a_λ(m) = sigmoid(λ₁(−m + λ₂)),    (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
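Equations (6) and (7) can be sketched directly, using λ = [2, 2]ᵀ as above (function names are ours):

```python
import math

def acceptability(m, lam1=2.0, lam2=2.0):
    """Eq. (7): sigmoid acceptability of an impact force m."""
    return 1.0 / (1.0 + math.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact_force):
    """Eq. (6): negative impact force, scaled by how unacceptable it is."""
    return -(1.0 - acceptability(impact_force)) * impact_force
```

For small impacts the acceptability is close to 1, so the penalty is near-zero; for large impacts it approaches the negative of the full impact force, matching Fig. 10.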
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here (reward versus z_peg for the height objective, and reward versus d(θ, θ_peg) for the orientation objective).
B.2.2. ALIGNED OBJECTIVES

In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.

The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 · |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
min_{q ∈ Q_target} 1 − tanh(2 · d(q, q_peg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
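These two shaped rewards can be sketched as follows, assuming unit quaternions and taking d(·) as the rotation angle of the relative rotation (i.e., the norm of the axis-angle difference); function names and the closed-form angle computation are our assumptions:

```python
import numpy as np

def quat_angle(q1, q2):
    """Angle (radians) of the relative rotation between two unit quaternions,
    i.e. the l2-norm of the axis-angle equivalent of their difference."""
    dot = abs(float(np.dot(q1, q2)))   # abs() handles the q / -q ambiguity
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def orientation_reward(q_peg, target_quats):
    """Shaped orientation reward w.r.t. the closest target quaternion."""
    return max(1.0 - np.tanh(2.0 * quat_angle(q_peg, q)) for q in target_quats)

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 |z_target - z_peg|)."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))
```

Both rewards peak at 1 when the peg exactly matches the target and decay smoothly with distance, as in Fig. 11.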
B.3. Humanoid

We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67, and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap

The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips; for this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10 ‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

r_vel = exp(−0.1 ‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

r_app = exp(−40 ‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp(−2 ‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes the quaternion difference, and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

¹⁰This database is available at mocap.cs.cmu.edu.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) · ε,  where  ε = ‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁,

and b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
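The scalarization in Eq. (8) is a one-liner (the function name is ours):

```python
def mocap_reward(r_trunc, r_com, r_vel, r_app, r_quat, lam):
    """Eq. (8): scalarized tracking reward used in the VMPO baseline."""
    return 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel
                                  + 0.15 * r_app + 0.65 * r_quat)
```

Note that with λ = 0.1, the inner weights sum to 1, so a perfect match on all components yields a total reward of 1.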
B.5. Sawyer

The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis), or 5 N on the vertical axis (z axis), the episode is terminated early.

There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as

r_insertion = max(0.2 · r_approach, r_inserted),    (9)
r_approach = s(p^peg, p^opening),    (10)
r_inserted = r_aligned · s(p_z^peg, p_z^bottom),    (11)
r_aligned = ‖p_xy^peg − p_xy^opening‖ < d/2,    (12)

where p^peg is the peg position, p^opening the position of the hole opening, p^bottom the bottom position of the hole, and d the diameter of the hole.

s(p₁, p₂) is a shaping operator:

s(p₁, p₂) = 1 − tanh²( (arctanh(√l) / ε) · ‖p₁ − p₂‖₂ ),    (13)

which gives a reward of 1 − l if p₁ and p₂ are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p^peg = p^opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
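The shaping operator can be sketched numerically. The following is a minimal illustration of Eq. (13); the function and argument names are ours, not the authors' code:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """Distance-based shaping reward s(p1, p2) of Eq. (13).

    By construction the reward equals exactly 1 - l when p1 and p2 are at
    Euclidean distance eps, approaches 1 at distance 0, and decays towards
    0 far away.
    """
    dist = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * dist) ** 2

# With the approach constants l = 0.95, eps = 0.01, a peg exactly 1 cm
# from the opening receives a shaped reward of 1 - 0.95 = 0.05.
r_at_eps = shaping([0.01, 0.0, 0.0], [0.0, 0.0, 0.0], l=0.95, eps=0.01)
```

Because tanh(atanh(√l)) = √l, the reward at distance ε is exactly 1 − l, which is what makes the (l, ε) parametrization convenient for tuning.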
The reward for the secondary objective, minimizing wrist force measurements, is defined as

    r_{\text{force}} = -\|F\|_1    (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φ_k and φ′_k respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′ respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_{φ_k}(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:

    \min_{\phi_k}\; \mathbb{E}_{\tau \sim D}\left[ \left( Q^{\text{ret}}_k(s_t, a_t) - Q_{\phi_k}(s_t, a_t) \right)^2 \right]    (15)

with

    Q^{\text{ret}}_k(s_t, a_t) = Q_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big( \prod_{z=t+1}^{j} c_z \Big)\, \delta_j,
    \delta_j = r_k(s_j, a_j) + \gamma\, V(s_{j+1}) - Q_{\phi'_k}(s_j, a_j),
    V(s_{j+1}) = \mathbb{E}_{\pi_{\text{old}}(a|s_{j+1})}\left[ Q_{\phi'_k}(s_{j+1}, a) \right].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks such that π_θ′ = π_θ and Q_{φ′_k} = Q_{φ_k}
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:       s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_{φ′_k}(s_i, a_ij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_{η_k} ← ∇_{η_k} [ η_k ε_k + η_k Σ_i^L (1/L) log( Σ_j^M (1/M) exp(Q_ijk / η_k) ) ]
11:      Update η_k based on δ_{η_k}
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q_ijk log π_θ(a_ij | s_i)
18:           + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)) )
19:    δ_ν ← ∇_ν ν ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_{sg(θ)}(a|s_i)) )
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_{φ_k} ← ∇_{φ_k} Σ_{(s_t, a_t) ∈ τ ∼ D} ( Q_{φ_k}(s_t, a_t) − Q^{ret}_k )²
26:        with Q^{ret}_k as in Eq. (15)
27:      Update φ_k based on δ_{φ_k}
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_{φ′_k} = Q_{φ_k}
33: until convergence
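The temperature update and per-objective weighting (lines 9–13 of Algorithm 3) can be sketched as follows. This is an illustrative NumPy version, with a finite-difference gradient standing in for the autodiff and optimizer used in practice; all names are ours:

```python
import numpy as np

def dual(eta, q_values, eps_k):
    """Dual function of line 10 of Algorithm 3 for one objective k:
    eta*eps_k + eta * (1/L) sum_i log((1/M) sum_j exp(Q_ij / eta))."""
    logmeanexp = np.log(np.mean(np.exp(q_values / eta), axis=1))
    return eta * eps_k + eta * np.mean(logmeanexp)

def temperature_step_and_weights(q_values, eta, eps_k, lr=1e-2):
    """One gradient step on eta_k, then weights q_ijk ∝ exp(Q_ijk / eta_k)."""
    h = 1e-5  # finite-difference gradient (stand-in for autodiff)
    grad = (dual(eta + h, q_values, eps_k) - dual(eta - h, q_values, eps_k)) / (2 * h)
    eta = max(eta - lr * grad, 1e-8)   # project back so that eta_k > 0
    w = np.exp(q_values / eta)
    w /= w.sum(axis=1, keepdims=True)  # normalize over the M actions per state
    return eta, w

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))            # L = 4 states, M = 8 sampled actions
eta_k, weights = temperature_step_and_weights(Q, eta=1.0, eps_k=0.1)
```

The resulting weights form a softmax over the M sampled actions per state, with η_k controlling how sharply the best actions are favored.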
The importance weights c_z are defined as

    c_z = \min\left(1,\; \frac{\pi_{\text{old}}(a_z|s_z)}{b(a_z|s_z)}\right),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set \prod_{z=t+1}^{j} c_z = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
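The Retrace target of Eq. (15) can be sketched as a direct NumPy computation over one trajectory snippet. This is a hedged illustration for a single objective; the array names and layout are our own, not the paper's implementation:

```python
import numpy as np

def retrace_targets(q_old, v_old, rewards, is_ratio, gamma=0.99):
    """Retrace targets Q_k^ret (Eq. 15) for one objective on a snippet.

    q_old[t]    -- Q_{phi'_k}(s_t, a_t) from the target network, length T
    v_old[t]    -- E_{pi_old}[Q_{phi'_k}(s_t, .)] (e.g. sample estimate), length T + 1
    rewards[t]  -- r_k(s_t, a_t), length T
    is_ratio[t] -- pi_old(a_t|s_t) / b(a_t|s_t), length T
    """
    T = len(rewards)
    c = np.minimum(1.0, is_ratio)                # truncated importance weights c_z
    delta = rewards + gamma * v_old[1:] - q_old  # TD errors delta_j
    targets = np.empty(T)
    for t in range(T):
        acc, coef = 0.0, 1.0                     # empty product (j = t) equals 1
        for j in range(t, T):
            if j > t:
                coef *= c[j]                     # extend the product of c_z to z = j
            acc += gamma ** (j - t) * coef * delta[j]
        targets[t] = q_old[t] + acc
    return targets

# One-step snippet: the target reduces to r + gamma * V(s') = 1 + 0.9 * 1 = 1.9.
t = retrace_targets(np.array([0.5]), np.array([0.0, 1.0]),
                    np.array([1.0]), np.array([1.0]), gamma=0.9)
```

The truncation min(1, ·) is what keeps the correction terms low-variance even when the behavior policy differs substantially from π_old.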
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

    \pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds    (16)
    \text{s.t.}\; \int_s \mu(s)\, \text{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta.
We first write the generalized Lagrangian equation, i.e.

    L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu \left( \beta - \int_s \mu(s)\, \text{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds \right)    (17)

where ν is the Lagrange multiplier. Now we solve the following primal problem:

    \max_\theta \min_{\nu > 0} L(\theta, \nu).

To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This procedure can be applied for any policy output distribution.
For Gaussian policies we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

    \pi^\mu_\theta(a|s) = \mathcal{N}\left(a;\, \mu_\theta(s),\, \Sigma_{\theta_{\text{old}}}(s)\right),    (18)
    \pi^\Sigma_\theta(a|s) = \mathcal{N}\left(a;\, \mu_{\theta_{\text{old}}}(s),\, \Sigma_\theta(s)\right).    (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

    \max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \left( \log \pi^\mu_\theta(a|s)\, \pi^\Sigma_\theta(a|s) \right) da\, ds    (20)
    \text{s.t.}\; \int_s \mu(s)\, \text{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^\mu_\theta(a|s))\, ds < \beta_\mu,
    \int_s \mu(s)\, \text{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^\Sigma_\theta(a|s))\, ds < \beta_\Sigma.

As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
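The effect of the decoupling can be seen directly in the Gaussian KL divergence: evaluated against π^μ_θ it depends only on the change of the mean, and against π^Σ_θ only on the change of the covariance. A small numerical sketch (our own helper, not the paper's code):

```python
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL(N(mu1, cov1) || N(mu2, cov2)) for full-covariance Gaussians."""
    d = len(mu1)
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(cov2_inv @ cov1) + diff @ cov2_inv @ diff - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

mu_old, cov_old = np.zeros(2), np.eye(2)
mu_new, cov_new = np.array([0.3, -0.1]), 0.5 * np.eye(2)

# KL(pi_old || pi^mu_theta): covariance held at the old value, so only the
# mean term 0.5 * diff^T cov_old^{-1} diff remains.
kl_mean_only = gaussian_kl(mu_old, cov_old, mu_new, cov_old)

# KL(pi_old || pi^Sigma_theta): mean held at the old value, so only the
# covariance terms remain.
kl_cov_only = gaussian_kl(mu_old, cov_old, mu_old, cov_new)
```

This is why separate bounds β_μ and β_Σ constrain the mean and covariance movements independently.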
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D.1. Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
    \min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_\tau\left[ \left( G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t) \right)^2 \right]    (21)

Here G^{(T)}_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

    G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t}\, r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_{\text{old}}}_{\phi_k}(s_T).

The advantages are then estimated as A^{\pi_{\text{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t).
D.2. Multi-objective policy improvement in the on-policy setting

Given the previous policy π_old(a|s) and the estimated advantages {A^{π_old}_k(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. Obtaining improved variational distributions per objective (step 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective's advantage A_k:

    \max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds    (22)
    \text{s.t.}\; \text{KL}(q_k(s, a) \,\|\, p_{\text{old}}(s, a)) < \epsilon_k,

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.

As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more that objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no contribution to the change of the policy and will effectively be ignored.

Equation (22) can be solved in closed form:

    q_k(s, a) \propto p_{\text{old}}(s, a) \exp\left( \frac{A_k(s, a)}{\eta_k} \right),    (23)
where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

    \eta_k = \arg\min_{\eta_k} \left[ \eta_k \epsilon_k + \eta_k \log \int_{s,a} p_{\text{old}}(s, a) \exp\left( \frac{A_k(s, a)}{\eta_k} \right) da\, ds \right]    (24)

We can perform this optimization along with the loss, by taking a gradient descent step on η_k, and we initialize it with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.

Top-k advantages: As in Song et al. (2020), in practice we used the samples corresponding to the top 50% of advantages in each batch of data.
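The sample weighting implied by Eq. (23), combined with the top-50% advantage filtering, can be sketched as follows (names and shapes are our own, not the paper's implementation):

```python
import numpy as np

def variational_weights(advantages, eta, top_frac=0.5):
    """Weights proportional to exp(A_k / eta_k) over a batch of samples,
    keeping only the top fraction of advantages (as in V-MPO)."""
    adv = np.asarray(advantages, dtype=float)
    k = max(1, int(top_frac * adv.size))
    keep = np.argsort(adv)[-k:]          # indices of the k largest advantages
    logits = adv[keep] / eta
    logits -= logits.max()               # stabilize the exponentials
    w = np.zeros_like(adv)
    w[keep] = np.exp(logits)
    w[keep] /= w[keep].sum()             # normalize over the kept samples
    return w

w = variational_weights([1.0, 2.0, 3.0, 4.0], eta=1.0)
# Only the top half of the batch (advantages 3 and 4) receives weight.
```

Samples below the advantage cutoff get zero weight, so they contribute nothing to the subsequent policy-fitting step.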
D.2.2. Fitting a new parametric policy (step 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this we solve a supervised learning problem that fits a parametric policy as follows:

    \pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds    (25)
    \text{s.t.}\; \int_s \text{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta,

where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper we motivated the multi-objective policy update rules from an intuitive perspective. In this section we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO, given in the appendix of Abdolmaleki et al. (2018a), to the multi-objective case.
We assume there are observable binary improvement events {R_k}_{k=1}^N, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e. {R_k = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^N:

    \max_\theta\; p_\theta(R_1 = 1, R_2 = 1, \ldots, R_N = 1)\, p(\theta),    (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k, and using log probabilities, leads to the following objective:

    \max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta).    (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).

We use the standard expectation-maximization (EM) algorithm to efficiently maximize \sum_{k=1}^N \log p_\theta(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use it to decompose the log-evidence \sum_{k=1}^N \log p_\theta(R_k = 1) as follows:

    \sum_{k=1}^{N} \log p_\theta(R_k = 1) = \sum_{k=1}^{N} \text{KL}(q_k(s, a) \,\|\, p_\theta(s, a \,|\, R_k = 1)) - \sum_{k=1}^{N} \text{KL}(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)).    (28)
The second term of this decomposition is expanded as

    -\sum_{k=1}^{N} \text{KL}(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)) = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\left[ \log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)} \right]    (29)
    = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\left[ \log \frac{p(R_k = 1 \,|\, s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)} \right],

where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence \sum_{k=1}^N \log p_\theta(R_k = 1). Both π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^N such that the lower bound on \sum_{k=1}^N \log p_\theta(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize each variational distribution q_k independently:

    q_k(a|s) = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\left[ \text{KL}(q(a|s) \,\|\, p_{\theta'}(s, a \,|\, R_k = 1)) \right]    (30)
    = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\left[ \text{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s)) - \mathbb{E}_{q(a|s)}\left[ \log p(R_k = 1 \,|\, s, a) \right] \right].

We can solve this optimization problem in closed form, which gives us

    q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)\, da}.
This solution weighs the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

    p(R_k = 1 \,|\, s, a) \propto \exp\left( \frac{Q_k(s, a)}{\alpha_k} \right),

where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to

    q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \text{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s))\, ds.    (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to

    q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds    (32)
    \text{s.t.}\; \int \mu(s)\, \text{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s))\, ds < \epsilon_k,

where, as described in the main paper, the parameter ε_k defines the preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

    q_k(a|s) \propto \pi_{\theta'}(a|s) \exp\left( \frac{Q_k(s, a)}{\eta_k} \right),    (33)

where η_k is computed by solving the following convex dual function:

    \eta_k = \arg\min_{\eta} \left[ \eta\, \epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s) \exp\left( \frac{Q_k(s, a)}{\eta} \right) da\, ds \right].    (34)

Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
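To build intuition for the temperature in Eq. (33): for a single state with a set of sampled Q-values, the non-parametric distribution interpolates between the greedy argmax (η → 0) and the old policy's own sample weights (η → ∞). A toy illustration, assuming a uniform π_old over the samples (our own naming):

```python
import numpy as np

def nonparametric_q(q_values, eta):
    """q_k(a|s) ∝ pi_old(a|s) exp(Q_k(s, a)/eta), with uniform pi_old
    over a set of sampled actions for one state."""
    z = np.asarray(q_values, dtype=float) / eta
    z -= z.max()                  # numerical stabilization
    w = np.exp(z)
    return w / w.sum()

q = np.array([1.0, 2.0, 3.0])
greedy = nonparametric_q(q, eta=1e-3)   # nearly all mass on the best action
uniform = nonparametric_q(q, eta=1e6)   # nearly uniform over the samples
```

The dual (34) selects the η that realizes exactly the allowed KL budget ε_k, so ε_k directly controls where on this greedy-to-uniform spectrum each objective's distribution sits.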
E.2. M-Step

After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for \sum_{k=1}^N \log p_\theta(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

    \theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta).

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_θ′(a|s). We therefore optimize

    \theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds
    \text{s.t.}\; \int \mu(s)\, \text{KL}(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta.

Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
Nottingham, K., Balakrishnan, A., Deshmukh, J., Christopherson, C., and Wingate, D. Using logical specifications of objectives in multi-objective reinforcement learning. arXiv preprint arXiv:1910.01723, 2019.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., and Restelli, M. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4), 2018.

Pirotta, M., Parisi, S., and Restelli, M. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.

Reymond, M. and Nowé, A. Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the Adaptive and Learning Agents Workshop at the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, 2013.

Roijers, D. M., Whiteson, S., and Oliehoek, F. A. Linear support for multi-objective coordination graphs. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1297–1304, 2014.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 656–663, 2003.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shadow Robot Company. Shadow dexterous hand. https://www.shadowrobot.com/products/dexterous-hand/, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations, 2020.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., and Dekker, E. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1):51–80, Jul 2011.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.

Van Moffaert, K., Drugan, M. M., and Nowé, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.

van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.

In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e. π_θ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AA^T. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
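This parametrization can be sketched as a minimal NumPy stand-in for the network's output head (in the experiments this is part of the neural network; the function name is ours):

```python
import numpy as np

def gaussian_policy_head(mean_out, raw_diag):
    """Map raw network outputs to (mu, Sigma): the diagonal Cholesky entries
    A_ii go through a softplus so that Sigma = A A^T is positive definite."""
    diag = np.log1p(np.exp(np.asarray(raw_diag, dtype=float)))  # softplus > 0
    cov = np.diag(diag ** 2)                                    # Sigma = A A^T
    return np.asarray(mean_out, dtype=float), cov

mu, cov = gaussian_policy_head([0.2, -0.1], [0.0, -3.0])
# The diagonal covariance entries are strictly positive even for very
# negative raw outputs, which is what the softplus guarantees.
```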
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that, for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.

In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of ε_k settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4 respectively.

In the following subsections, we first give suggestions for choosing appropriate ε_k to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε

Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of ε_k for each objective k. At first glance this may seem daunting. However, in practice we have found that encoding preferences via ε_k is often more intuitive than doing so via scalarization. In this subsection we seek to give an intuition on how to set ε_k for different desired preferences. Recall that each ε_k controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution and the current policy (Sec. 4.1). We generally choose ε_k in the range of 0.001 to 0.1.

Table 2: Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).

Hyperparameter                              Default

policy network
  layer sizes                               (300, 200)
  take tanh of action mean                  yes
  minimum variance                          10^-12
  maximum variance                          unbounded

critic network(s)
  layer sizes                               (400, 400, 300)
  take tanh of action                       yes
  Retrace sequence size                     8
  discount factor γ                         0.99

both policy and critic networks
  layer norm on first layer                 yes
  tanh on output of layer norm              yes
  activation (after each hidden layer)      ELU

MPO
  actions sampled per state                 20
  default ε                                 0.1
  KL-constraint on policy mean β_μ          10^-3
  KL-constraint on policy covariance β_Σ    10^-5
  initial temperature η                     1

training
  batch size                                512
  replay buffer size                        10^6
  Adam learning rate                        3×10^-4
  Adam ε                                    10^-3
  target network update period              200

Equal Preference: When all objectives are equally important, the general rule is to set all ε_k to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.

When setting all ε_k to the same value, what should this value be? The larger ε_k is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting ε_k too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if ε_k is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Table 3: Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.

Humanoid run

policy network
  layer sizes                               (400, 400, 300)
  take tanh of action mean                  no
  minimum variance                          10^-8

critic network(s)
  layer sizes                               (500, 500, 400)
  take tanh of action                       no

MPO
  KL-constraint on policy mean β_μ          5×10^-3
  KL-constraint on policy covariance β_Σ    10^-6

training
  Adam learning rate                        2×10^-4
  Adam ε                                    10^-8

Shadow Hand

MPO
  actions sampled per state                 30
  default ε                                 0.01

Sawyer

MPO
  actions sampled per state                 15

training
  target network update period              100
Eventually, though, the learning will converge to more or less the same policy, as long as ε_k is not set too high.

Unequal Preference: When there is a difference in preferences across objectives, the relative scale of ε_k is what matters. The larger the relative scale of ε_k is compared to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near zero for objective l, objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.

One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g. an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.

The scale of ε_k has a similar effect as in the equal preference case. If the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.

Table 4: Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).

Condition            Settings

Humanoid run (one seed per setting)
  scalarized MPO     w_task = 1 − w_penalty; w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO             ε_task = 0.1; ε_penalty ∈ linspace(10^-6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO     w_task = 1 − w_penalty; w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO             ε_task = 0.1; ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO     w_task = 1 − w_penalty; w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO             ε_task = 0.01; ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO     w_touch = w_height = w_orientation = 1/3
  MO-MPO             ε_touch = ε_height = ε_orientation = 0.01
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
A Distributional View on Multi-Objective Policy Optimization
Humanoid Mocap

Scalarized VMPO:
  KL-constraint on policy mean, βμ: 0.1
  KL-constraint on policy covariance, βΣ: 10⁻⁵
  default ε: 0.1
  initial temperature η: 1

Training:
  Adam learning rate: 10⁻⁴
  batch size: 128
  unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap

For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
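As a shape-level illustration, the policy described above can be sketched in numpy. The split of the 1021-dimensional observation into reference and proprioceptive parts, and the latent size, are assumptions made here for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: the text gives 1021 total obs dims, 56 DoF, and
# 1024 hidden units; the obs split (600/421) and latent size (64) are guesses.
REF_DIM, PROPRIO_DIM = 600, 421
LATENT_DIM, ACT_DIM, HIDDEN = 64, 56, 1024

def linear(dim_in, dim_out):
    """Randomly initialized linear layer (weights and bias)."""
    return rng.normal(0, 0.01, (dim_in, dim_out)), np.zeros(dim_out)

def mlp(x, layers):
    """Apply a stack of tanh layers."""
    for W, b in layers:
        x = np.tanh(x @ W + b)
    return x

encoder = [linear(REF_DIM + PROPRIO_DIM, HIDDEN), linear(HIDDEN, HIDDEN)]
to_latent_mu = linear(HIDDEN, LATENT_DIM)
to_latent_logstd = linear(HIDDEN, LATENT_DIM)
decoder = [linear(LATENT_DIM + PROPRIO_DIM, HIDDEN), linear(HIDDEN, HIDDEN)]
to_action = linear(HIDDEN, 2 * ACT_DIM)  # mean and log-std of action Gaussian

def policy_forward(ref_obs, proprio_obs):
    # Reference encoder -> stochastic latent
    h = mlp(np.concatenate([ref_obs, proprio_obs]), encoder)
    mu = h @ to_latent_mu[0] + to_latent_mu[1]
    logstd = h @ to_latent_logstd[0] + to_latent_logstd[1]
    z = mu + np.exp(logstd) * rng.normal(size=LATENT_DIM)  # sample latent
    # Latent + proprioception -> action distribution parameters
    h2 = mlp(np.concatenate([z, proprio_obs]), decoder)
    out = h2 @ to_action[0] + to_action[1]
    return out[:ACT_DIM], out[ACT_DIM:]  # action mean, action log-std
```

This is a sketch of the information flow only, not the actual training code.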
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
[Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, and 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).]
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
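The two-objective reward vector can be sketched as follows; the treasure positions used here are illustrative placeholders, not the actual Fig. 8 layout:

```python
import numpy as np

# Placeholder (x, y) -> treasure value map; the real layout is in Fig. 8.
TREASURES = {(0, 1): 0.7, (1, 2): 8.2, (2, 3): 11.5}

def reward_vector(agent_pos):
    """Two-objective reward: [time penalty, treasure value]."""
    time_penalty = -1.0  # applied at every step
    treasure = TREASURES.get(agent_pos, 0.0)
    return np.array([time_penalty, treasure])
```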
B.2. Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
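The spatial summing described above amounts to a single reduction per fingertip, e.g.:

```python
import numpy as np

def simplify_touch(fingertip_sensor):
    """Sum a 4x4x3 spatial touch reading over its spatial dims -> 1x1x3."""
    assert fingertip_sensor.shape == (4, 4, 3)
    return fingertip_sensor.sum(axis=(0, 1), keepdims=True)
```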
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the peg is included in the agent's observation, encoded as the xyz positions of four corners of the peg (based on how Levine et al. (2016) encodes end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
[Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here (pain penalty vs. impact force). For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.]
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5N of force or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:

r_pain(s, a, s′) = −[1 − a_λ(m(s, s′))] m(s, s′)   (6)
where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use

a_λ(m) = sigmoid(λ1(−m + λ2))   (7)

with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
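Equations (6) and (7) can be implemented directly; a minimal sketch with λ = [2, 2]ᵀ:

```python
import numpy as np

LAMBDA = np.array([2.0, 2.0])  # λ from Eq. (7)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acceptability(m):
    """a_λ(m) in Eq. (7): monotonically decreasing in impact force m."""
    return sigmoid(LAMBDA[0] * (-m + LAMBDA[1]))

def pain_penalty(m):
    """r_pain in Eq. (6), as a function of the impact force m(s, s')."""
    return -(1.0 - acceptability(m)) * m
```

For low impact forces the penalty is near zero; for large forces it approaches the negative of the force itself, as in Fig. 10.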
[Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here (reward vs. z_peg for the height objective, and reward vs. d(θ, θ_peg) for the orientation objective).]
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50|z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as

min_{q∈Q_target} 1 − tanh(2 · d(q, q_peg)),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
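A sketch of the two shaped rewards, assuming the per-target axis-angle distances d(q, q_peg) are already computed; following the text, the orientation reward is taken with respect to the closest target quaternion (i.e., the smallest distance):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 * |z_target - z_peg|)."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def orientation_reward(axis_angle_dists):
    """Shaped orientation reward, using the closest of the 8 valid
    target orientations (axis_angle_dists: one d(q, q_peg) per target)."""
    return 1.0 - np.tanh(2.0 * min(axis_angle_dists))
```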
B.3. Humanoid
We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.

⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function; instead, we compute this penalty directly to evaluate a given action, in a state-independent way, during policy optimization.
B.4. Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10‖p_com − p_com^ref‖²),

where p_com and p_com^ref are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:
r_vel = exp(−0.1‖q_vel − q_vel^ref‖²),

where q_vel and q_vel^ref are the joint angle velocities of the simulated humanoid and the mocap reference, respectively.

¹⁰This database is available at mocap.cs.cmu.edu.

The third reward component is based on the difference in end-effector positions:
r_app = exp(−40‖p_app − p_app^ref‖²),

where p_app and p_app^ref are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:
r_quat = exp(−2‖q_quat ⊖ q_quat^ref‖²),

where ⊖ denotes quaternion differences, and q_quat and q_quat^ref are the joint quaternions of the simulated humanoid and the mocap reference, respectively.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3)(‖b_pos − b_pos^ref‖₁ + ‖q_pos − q_pos^ref‖₁),

where the term in parentheses is denoted ε, b_pos and b_pos^ref correspond to the body positions of the simulated character and the mocap reference, and q_pos and q_pos^ref correspond to the joint angles.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2)(0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),   (8)

varying λ as described in the main paper (Sec. 6.3).
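A minimal numpy sketch of the tracking reward components and the scalarized VMPO reward of Eq. (8), assuming the pose quantities are given as flat arrays (the quaternion-difference component r_quat is omitted and passed in precomputed):

```python
import numpy as np

def r_com(p, p_ref):
    """Center-of-mass reward: exp(-10 * ||p - p_ref||^2)."""
    return np.exp(-10.0 * np.sum((p - p_ref) ** 2))

def r_vel(qv, qv_ref):
    """Joint-velocity reward: exp(-0.1 * ||qv - qv_ref||^2)."""
    return np.exp(-0.1 * np.sum((qv - qv_ref) ** 2))

def r_app(p, p_ref):
    """End-effector reward: exp(-40 * ||p - p_ref||^2)."""
    return np.exp(-40.0 * np.sum((p - p_ref) ** 2))

def vmpo_reward(rt, rc, rv, ra, rq, lam):
    """Scalarized VMPO reward of Eq. (8)."""
    return 0.5 * rt + 0.5 * (0.1 * rc + lam * rv + 0.15 * ra + 0.65 * rq)
```

Perfect tracking drives each component to 1.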
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.

There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

r_insertion = max(0.2 r_approach, r_inserted)   (9)
r_approach = s(p^peg, p^opening)   (10)
r_inserted = r_aligned · s(p_z^peg, p_z^bottom)   (11)
r_aligned = ‖p_xy^peg − p_xy^opening‖ < d/2   (12)

where p^peg is the peg position, p^opening the position of the hole opening, p^bottom the bottom position of the hole, and d the diameter of the hole.
s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²((atanh(√l)/ε) ‖p1 − p2‖₂)   (13)
which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p^peg = p^opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.

Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment, within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
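The shaping operator of Eq. (13) is easy to verify numerically: at a distance of exactly ε it returns 1 − l:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """Shaping operator s(p1, p2) of Eq. (13).

    Returns 1 at zero distance and 1 - l at Euclidean distance eps.
    """
    d = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2
```

For example, with l_approach = 0.95 and ε_approach = 0.01, the reward at 1 cm distance is 0.05.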
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,   (14)

where F = (Fx, Fy, Fz) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy πold.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {rk(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, . . . , T do
7:     at ∼ πθ(·|st)
8:     Execute action and determine rewards:
9:     r = [r1(st, at), . . . , rN(st, at)]
10:    τ ← τ ∪ {(st, at, r, πθ(at|st))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Qφk(s, a) for each objective k, parametrized by φk. These Q-functions are with respect to the policy πold. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer D containing trajectory snippets τ = {(s0, a0, r0, s1), . . . , (sT, aT, rT, sT+1)}, where rt denotes a reward vector {rk(st, at)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
min_{φk} E_{τ∼D}[(Q_k^ret(st, at) − Qφk(st, at))²],   (15)

with

Q_k^ret(st, at) = Qφ′k(st, at) + Σ_{j=t}^{T} γ^{j−t} (Π_{z=t+1}^{j} cz) δj,
δj = rk(sj, aj) + γV(sj+1) − Qφ′k(sj, aj),
V(sj+1) = E_{πold(a|sj+1)}[Qφ′k(sj+1, a)].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {εk}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {ηk}_{k=1}^N and ν, target networks, and online networks, such that πθ′ = πθ and Qφ′k = Qφk
3: repeat
4:   repeat
5:     Collect dataset {si, aij, Qijk}_{i,j,k}^{L,M,N}, where
6:       si ∼ D, aij ∼ πθ′(a|si), and Qijk = Qφ′k(si, aij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, . . . , N do
10:      δηk ← ∇ηk [ηk εk + ηk Σ_i^L (1/L) log(Σ_j^M (1/M) exp(Qijk/ηk))]
11:      Update ηk based on δηk
12:      qijk ∝ exp(Qijk/ηk)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δπ ← −∇θ Σ_i^L Σ_j^M Σ_k^N qijk log πθ(aij|si)
18:          + sg(ν)(β − Σ_i^L KL(πθ′(a|si) ‖ πθ(a|si)))
19:    δν ← ∇ν ν(β − Σ_i^L KL(πθ′(a|si) ‖ π_sg(θ)(a|si)))
20:    Update πθ based on δπ
21:    Update ν based on δν
22:
23:    // Update Q-functions
24:    for k = 1, . . . , N do
25:      δφk ← ∇φk Σ_{(st,at)∈τ∼D} (Qφk(st, at) − Q_k^ret)², with Q_k^ret as in Eq. 15
26:      Update φk based on δφk
27:    end for
28:  until fixed number of steps
29:  Update target networks: πθ′ = πθ, Qφ′k = Qφk
30: until convergence
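Lines 10-12 of Algorithm 3 can be sketched in numpy. This evaluates the per-objective temperature dual and the resulting nonparametric action weights; the log-sum-exp stabilization is an implementation detail, not part of the pseudocode:

```python
import numpy as np

def eta_dual_loss(eta, eps_k, Q):
    """Dual loss for temperature η_k (Algorithm 3, line 10).

    Q: array of shape [L, M] of Q-values for objective k
    (L states, M sampled actions per state).
    """
    z = Q / eta
    m = z.max(axis=1, keepdims=True)
    # log of the mean over the M actions, computed stably
    log_mean = m[:, 0] + np.log(np.mean(np.exp(z - m), axis=1))
    return eta * eps_k + eta * np.mean(log_mean)

def action_weights(eta, Q):
    """Per-objective weights q_ijk ∝ exp(Q_ijk / η_k) (line 12),
    normalized over the M sampled actions for each state."""
    w = np.exp((Q - Q.max(axis=1, keepdims=True)) / eta)
    return w / w.sum(axis=1, keepdims=True)
```

A small η_k concentrates the weights on the highest-Q action; a large η_k makes them near-uniform, which is how ε_k ends up encoding the preference for objective k.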
The importance weights cz are defined as

cz = min(1, πold(az|sz) / b(az|sz)),

where b(az|sz) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set (Π_{z=t+1}^{j} cz) = 1.
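A reference implementation of the Retrace target from Eq. (15), written naively (O(T²)) for clarity rather than efficiency:

```python
import numpy as np

def retrace_targets(q, v_next, r, c, gamma):
    """Compute Q^ret for each step of a length-T trajectory snippet.

    q[t]      = Q_target(s_t, a_t)
    v_next[t] = E_{π_old}[Q_target(s_{t+1}, ·)], i.e. V(s_{t+1})
    r[t]      = reward r_k(s_t, a_t) for one objective k
    c[t]      = truncated importance weight c_t = min(1, π_old/b)
    """
    T = len(r)
    delta = r + gamma * v_next - q  # TD errors δ_j
    q_ret = np.empty(T)
    for t in range(T):
        acc, coef = 0.0, 1.0  # j = t term: γ^0 times the empty product
        for j in range(t, T):
            if j > t:
                coef *= gamma * c[j]  # γ^{j-t} Π_{z=t+1}^{j} c_z
            acc += coef * delta[j]
        q_ret[t] = q[t] + acc
    return q_ret
```

With a single transition this reduces to the one-step target r + γV(s′); a zero importance weight cuts off all corrections beyond that step.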
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′k, which we copy from the online network φk after a fixed number of gradient steps on the Retrace objective (15).
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

πnew = argmax_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a qk(a|s) log πθ(a|s) da ds
s.t. ∫_s μ(s) KL(πold(a|s) ‖ πθ(a|s)) ds < β   (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^{N} ∫_s μ(s) ∫_a qk(a|s) log πθ(a|s) da ds + ν(β − ∫_s μ(s) KL(πold(a|s) ‖ πθ(a|s)) ds),   (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.

For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (βμ) and covariance (βΣ), which results in more stable learning. To do this, we first separate out the following two policies, for mean and covariance:
π_θ^μ(a|s) = N(a; μθ(s), Σθold(s))   (18)
π_θ^Σ(a|s) = N(a; μθold(s), Σθ(s))   (19)
Policy π_θ^μ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π_θ^Σ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

max_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a qk(a|s) (log π_θ^μ(a|s) π_θ^Σ(a|s)) da ds
s.t. ∫_s μ(s) KL(πold(a|s) ‖ π_θ^μ(a|s)) ds < βμ,
     ∫_s μ(s) KL(πold(a|s) ‖ π_θ^Σ(a|s)) ds < βΣ   (20)
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
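The decoupled constraints rely on the fact that KL(πold ‖ π^μ_θ) depends only on the change in mean, while KL(πold ‖ π^Σ_θ) depends only on the change in covariance. For diagonal Gaussians (an assumption made here for brevity; the paper uses full covariances) the two terms are:

```python
import numpy as np

def kl_mean_part(mu_old, mu_new, var_old):
    """KL(π_old ‖ π^μ_θ): Gaussians sharing Σ_old, differing only in mean."""
    return 0.5 * np.sum((mu_new - mu_old) ** 2 / var_old)

def kl_cov_part(var_old, var_new):
    """KL(π_old ‖ π^Σ_θ): Gaussians sharing μ_old, differing only in variance."""
    return 0.5 * np.sum(var_old / var_new - 1.0 + np.log(var_new / var_old))
```

Bounding each term separately (βμ for the mean, βΣ for the covariance) is what allows a tight covariance constraint without restricting how far the mean can move.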
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions qk(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature ηk, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy πold, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s0, a0, r0), . . . , (sT, aT, rT)}, where rt denotes a reward vector {rk(st, at)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φk by optimizing the following objective:
min_{φ1,...,φN} Σ_{k=1}^{N} E_τ[(G_k^(T)(st, at) − V_φk^πold(st))²]   (21)
Here G_k^(T)(st, at) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: G_k^(T)(st, at) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} rk(sℓ, aℓ) + γ^{T−t} V_φk^πold(sT). The advantages are then estimated as A_k^πold(st, at) = G_k^(T)(st, at) − V_φk^πold(st).
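The T-step target G_k^(T) can be sketched as:

```python
import numpy as np

def t_step_target(rewards, v_boot, t, gamma):
    """T-step target: discounted rewards from step t to T-1,
    bootstrapped with gamma^(T-t) * V(s_T)."""
    T = len(rewards)
    disc = gamma ** np.arange(T - t)  # gamma^{l-t} for l = t..T-1
    return np.sum(disc * np.asarray(rewards[t:])) + gamma ** (T - t) * v_boot
```

The advantage estimate is then simply this target minus the value prediction at s_t.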
D.2. Multi-objective policy improvement in the on-policy setting

Given the previous policy πold(a|s) and estimated advantages {A_k^πold(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution qk(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy πnew(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution qk(s, a) rather than local policies qk(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions qk(s, a), we optimize the RL objective for each objective's advantages Ak:

max_{qk} ∫_{s,a} qk(s, a) Ak(s, a) da ds   (22)
s.t. KL(qk(s, a) ‖ pold(s, a)) < εk,

where the KL divergence is computed over all (s, a), εk denotes the allowed expected KL divergence, and pold(s, a) = μ(s) πold(a|s) is the state-action distribution associated with πold.
As in MO-MPO, we use these εk to define the preferences over objectives. More concretely, εk defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular εk is with respect to the others, the more that objective k is preferred. On the other hand, if εk is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

qk(s, a) ∝ pold(s, a) exp(Ak(s, a) / ηk),   (23)
where the temperature ηk is computed based on the constraint εk, by solving the following convex dual problem:

ηk = argmin_η [η εk + η log ∫_{s,a} pold(s, a) exp(Ak(s, a)/η) da ds]   (24)
We can perform this optimization along with the other losses, by taking a gradient descent step on ηk, and we initialize ηk with the solution found in the previous policy iteration step. Since ηk must be positive, we use a projection operator after each gradient step to maintain ηk > 0.
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
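The top-50% filtering can be sketched as:

```python
import numpy as np

def top_half(advantages):
    """Return the indices of the top 50% of advantages in a batch
    (the V-MPO-style filtering described above)."""
    adv = np.asarray(advantages)
    k = len(adv) // 2
    return np.argsort(adv)[-k:]  # indices of the k largest advantages
```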
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy πnew(a|s) that favors all of the objectives, according to the preferences specified by εk. For this, we solve a supervised learning problem that fits a parametric policy as follows:
πnew = argmax_θ Σ_{k=1}^{N} ∫_{s,a} qk(s, a) log πθ(a|s) da ds
s.t. ∫_s KL(πold(a|s) ‖ πθ(a|s)) ds < β,   (25)

where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy πold, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation of the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events {Rk}_{k=1}^N, one for each objective. Rk = 1 indicates that our policy has improved for objective k, while Rk = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {Rk = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {Rk = 1}_{k=1}^N:

max_θ pθ(R1 = 1, R2 = 1, . . . , RN = 1) p(θ),   (26)
where the probability of the improvement events depends on θ. Assuming independence between the improvement events Rk, and using log probabilities, leads to the following objective:

max_θ Σ_{k=1}^{N} log pθ(Rk = 1) + log p(θ)   (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that πθ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^{N} log pθ(Rk = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution qk(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^{N} log pθ(Rk = 1) as follows:
Σ_{k=1}^{N} log pθ(Rk = 1)   (28)
= Σ_{k=1}^{N} KL(qk(s, a) ‖ pθ(s, a|Rk = 1)) − Σ_{k=1}^{N} KL(qk(s, a) ‖ pθ(Rk = 1, s, a))
The second term of this decomposition expands as

Σ_{k=1}^{N} KL(qk(s, a) ‖ pθ(Rk = 1, s, a))   (29)
= −Σ_{k=1}^{N} E_{qk(s,a)}[log (pθ(Rk = 1, s, a) / qk(s, a))]
= −Σ_{k=1}^{N} E_{qk(s,a)}[log (p(Rk = 1|s, a) πθ(a|s) μ(s) / qk(s, a))],

where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(Rk = 1|s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^{N} log pθ(Rk = 1). πθ(a|s) and qk(a|s) = qk(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate qk(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {qk(a|s)}_{k=1}^N such that the lower bound on Σ_{k=1}^{N} log pθ(Rk = 1) is as tight as
possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose qk to minimize this term. We can optimize for each variational distribution qk independently:
qk(a|s) = argmin_q E_{μ(s)}[KL(q(a|s) ‖ pθ′(s, a|Rk = 1))]
= argmin_q E_{μ(s)}[KL(q(a|s) ‖ πθ′(a|s)) − E_{q(a|s)}[log p(Rk = 1|s, a)]]   (30)
We can solve this optimization problem in closed form, which gives us

qk(a|s) = πθ′(a|s) p(Rk = 1|s, a) / ∫ πθ′(a|s) p(Rk = 1|s, a) da.
This solution weighs the actions based on their relative improvement likelihood p(Rk = 1|s, a) for each objective. We define the likelihood of an improvement event Rk as

p(Rk = 1|s, a) ∝ exp(Qk(s, a) / αk),
where αk is an objective-dependent temperature parameter that controls how greedy the solution qk(a|s) is with respect to its associated objective Qk(s, a). For example, given a batch of state-action pairs, at the extremes, an αk of zero would give all probability to the state-action pair with the maximum Q-value, while an αk of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for αk, we plug the exponential transformation into (30), which leads to

qk(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Qk(s, a) da ds − αk ∫ μ(s) KL(q(a|s) ‖ πθ′(a|s)) ds   (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to

qk(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Qk(s, a) da ds   (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ πθ′(a|s)) ds < εk,

where, as described in the main paper, the parameter εk defines preferences over objectives. If we now assume that qk(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting
q(a|s) ∝ πθ′(a|s) exp(Qk(s, a) / ηk),   (33)
where ηk is computed by solving the following convex dual function:

ηk = argmin_η [η εk + η ∫ μ(s) log ∫ πθ′(a|s) exp(Qk(s, a)/η) da ds]   (34)
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
E.2. M-Step
After obtaining the variational distributions qk(a|s) foreach objective k we have found a tight lower bound forsumNk=1 log pθ(Rk = 1) when θ = θprime We can now obtain
the parameters θ of the updated policy πθ(a|s) by optimiz-ing this lower bound After rearranging and dropping theterms in (27) that do not involve θ we obtain
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta)$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).

For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_θ′(a|s). We therefore optimize for
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), pp. 5398–5408, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Wray, K. H., Zilberstein, S., and Mouaddib, A.-I. Multi-objective MDPs with conditional lexicographic reward preferences. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pp. 3418–3424, 2015.

Wulfmeier, M., Abdolmaleki, A., Hafner, R., Springenberg, J. T., Neunert, M., Hertweck, T., Lampe, T., Siegel, N., Heess, N., and Riedmiller, M. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pp. 14610–14621, 2019.

Zeng, A., Song, S., Lee, J., Rodriguez, A., and Funkhouser, T. TossingBot: Learning to throw arbitrary objects with residual physics. In Proceedings of Robotics: Science and Systems (RSS), 2019.

Zhan, H. and Cao, Y. Relationship explainable multi-objective optimization via vector value function based reinforcement learning. arXiv preprint arXiv:1910.01919, 2019.

Zuluaga, M., Krause, A., and Püschel, M. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(1):3619–3650, 2016.
A. Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., π_θ(a|s) = N(μ, Σ). The policy is parametrized by a neural network, which outputs the mean μ = μ(s) and diagonal Cholesky factors A = A(s), such that Σ = AAᵀ. The diagonal factor A has positive diagonal elements, enforced by the softplus transform A_ii ← log(1 + exp(A_ii)), to ensure positive definiteness of the diagonal covariance matrix.
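For illustration, the mapping from raw network outputs to a valid diagonal Gaussian can be sketched as follows; this is a minimal NumPy stand-in for the final layer of the policy network, with names of our own choosing.

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def gaussian_policy_params(raw_mean, raw_diag):
    """Map raw network outputs to (mu, Sigma).

    raw_mean, raw_diag: arrays of shape [action_dim], standing in for the
    raw outputs of the policy network's final linear layer.
    """
    mu = raw_mean
    a_diag = softplus(raw_diag)       # positive diagonal Cholesky factors
    sigma = np.diag(a_diag ** 2)      # Sigma = A A^T for diagonal A
    return mu, sigma

mu, sigma = gaussian_policy_params(np.array([0.5, -0.5]),
                                   np.array([0.0, -20.0]))
```

The softplus keeps every diagonal entry strictly positive even for very negative raw outputs, so Σ is always positive definite.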
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of ε_k settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate ε_k to encode a particular preference (Sec. A.1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A.2), and finally we describe implementation details for humanoid mocap (Sec. A.3).
A.1. Suggestions for Setting ε

Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of ε_k for each objective k. At first glance, this may seem daunting. However, in practice we have found that encoding preferences via ε_k is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set ε_k for different desired preferences. Recall that each ε_k controls the influence of objective k on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter | Default

policy network:
  layer sizes: (300, 200)
  take tanh of action mean: yes
  minimum variance: 10^−12
  maximum variance: unbounded

critic network(s):
  layer sizes: (400, 400, 300)
  take tanh of action: yes
  Retrace sequence size: 8
  discount factor γ: 0.99

both policy and critic networks:
  layer norm on first layer: yes
  tanh on output of layer norm: yes
  activation (after each hidden layer): ELU

MPO:
  actions sampled per state: 20
  default ε: 0.1
  KL-constraint on policy mean, β_μ: 10^−3
  KL-constraint on policy covariance, β_Σ: 10^−5
  initial temperature η: 1

training:
  batch size: 512
  replay buffer size: 10^6
  Adam learning rate: 3×10^−4
  Adam ε: 10^−3
  target network update period: 200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C.3).
and the current policy (Sec. 4.1). We generally choose ε_k in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all ε_k to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to 1/K (where K is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all ε_k to the same value, what should this value be? The larger ε_k is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting ε_k too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if ε_k is set too low, then it slows down learning, because the per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run:
  policy network layer sizes: (400, 400, 300)
  take tanh of action mean: no
  minimum variance: 10^−8
  critic network layer sizes: (500, 500, 400)
  take tanh of action: no
  KL-constraint on policy mean, β_μ: 5×10^−3
  KL-constraint on policy covariance, β_Σ: 10^−6
  Adam learning rate: 2×10^−4
  Adam ε: 10^−8

Shadow Hand:
  actions sampled per state: 30
  default ε: 0.01

Sawyer:
  actions sampled per state: 15
  target network update period: 100

Table 3. Hyperparameters for humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually, the learning will converge to more or less the same policy, though, as long as ε_k is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of the ε_k is what matters. The larger ε_k is compared to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near-zero for objective l, objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of ε_k has a similar effect as in the equal preference
Condition | Settings

Humanoid run (one seed per setting):
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1, ε_penalty ∈ linspace(10^−6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting):
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting):
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting):
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right), in the Deep Sea Treasure domain: regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log-probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
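The two discrete-action adjustments, a categorical policy head and one-hot action inputs to the critic, can be sketched as follows (helper names are ours):

```python
import numpy as np

NUM_ACTIONS = 4  # up, right, down, left

def one_hot(action_index):
    """One-hot encoding of the chosen action, e.g. up -> [1, 0, 0, 0];
    this vector is concatenated with the state as the critic input."""
    v = np.zeros(NUM_ACTIONS)
    v[action_index] = 1.0
    return v

def categorical_probs(logits):
    """Categorical policy from the |A| unnormalized log-probabilities
    output by the policy network (numerically stable softmax)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()
```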
Humanoid Mocap

Scalarized V-MPO:
  KL-constraint on policy mean, β_μ: 0.1
  KL-constraint on policy covariance, β_Σ: 10^−5
  default ε: 0.1
  initial temperature η: 1

Training:
  Adam learning rate: 10^−4
  batch size: 128
  unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations with a two-layer MLP, with 1024 hidden units per layer. This reference encoder is followed by linear layers, to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP, with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer, per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure

Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
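This two-objective reward structure can be sketched directly (the grid layout dictionary is a hypothetical stand-in; the actual positions and values come from Fig. 8):

```python
import numpy as np

# hypothetical sparse layout: grid position -> treasure value
TREASURES = {(0, 1): 0.7, (9, 10): 23.7}

def dst_reward(state):
    """Two-objective reward vector [time penalty, treasure value].

    Every step costs -1 on the time objective; the treasure objective
    pays the value of the treasure at the agent's position, if any."""
    return np.array([-1.0, TREASURES.get(state, 0.0)])
```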
B.2. Shadow Hand

Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020), in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.

In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here (pain penalty vs. impact force). For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN

In the touch and turn tasks, there are two objectives: "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.

The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
$$r_{\text{pain}}(s, a, s') = -\big[1 - a_\lambda(m(s, s'))\big]\, m(s, s'), \quad (6)$$
where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use
$$a_\lambda(m) = \mathrm{sigmoid}\big(\lambda_1(-m + \lambda_2)\big), \quad (7)$$
with λ = [2, 2]ᵀ. The relationship between pain penalty and impact force is plotted in Fig. 10.
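Eqs. (6) and (7) translate directly into code; a minimal sketch, where `impact_force` stands for m(s, s′):

```python
import math

def acceptability(m, lam1=2.0, lam2=2.0):
    # a_lambda(m) = sigmoid(lam1 * (-m + lam2)), Eq. (7)
    return 1.0 / (1.0 + math.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact_force):
    # r_pain = -(1 - a_lambda(m)) * m, Eq. (6)
    return -(1.0 - acceptability(impact_force)) * impact_force
```

With λ = [2, 2]ᵀ, small impacts (m ≈ 1) are mostly "acceptable" and barely penalized, while large impacts (m ≈ 6) are penalized by nearly their full magnitude, matching the shape in Fig. 10.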
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here (left: reward vs. z_peg for the height objective; right: reward vs. d(θ, θ_peg) for the orientation objective).
B.2.2. ALIGNED OBJECTIVES

In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.

The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
$$\min_{q \in Q_{\text{target}}} 1 - \tanh\big(2\, d(q, q_{\text{peg}})\big),$$
where d(·, ·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
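The two shaped rewards can be sketched as follows; the function names are ours, and `dists_to_targets` is assumed to hold the eight axis-angle distances d(q, q_peg), in radians, already computed from the quaternions:

```python
import math

def height_reward(z_peg, z_target=0.07):
    # shaped height reward: 1 - tanh(50 * |z_target - z_peg|)
    return 1.0 - math.tanh(50.0 * abs(z_target - z_peg))

def orientation_reward(dists_to_targets):
    """Shaped orientation reward against the closest of the eight valid
    target orientations: max over q of 1 - tanh(2 * d(q, q_peg))."""
    return max(1.0 - math.tanh(2.0 * d) for d in dists_to_targets)
```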
B.3. Humanoid

We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67, and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is in meters per second. For this objective, we learn a Q-function.

⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the negative ℓ2-norm of the action vector, i.e., r_penalty(a) = −‖a‖₂. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap

The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020), and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:
$$r_{\text{com}} = \exp\big(-10\, \|p_{\text{com}} - p^{\text{ref}}_{\text{com}}\|^2\big),$$

where p_com and p^ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

$$r_{\text{vel}} = \exp\big(-0.1\, \|q_{\text{vel}} - q^{\text{ref}}_{\text{vel}}\|^2\big),$$

where q_vel and q^ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

$$r_{\text{app}} = \exp\big(-40\, \|p_{\text{app}} - p^{\text{ref}}_{\text{app}}\|^2\big),$$

where p_app and p^ref_app are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

$$r_{\text{quat}} = \exp\big(-2\, \|q_{\text{quat}} \ominus q^{\text{ref}}_{\text{quat}}\|^2\big),$$

where ⊖ denotes quaternion differences, and q_quat and q^ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

¹⁰This database is available at mocap.cs.cmu.edu.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

$$r_{\text{trunc}} = 1 - \frac{1}{0.3}\underbrace{\big(\|b_{\text{pos}} - b^{\text{ref}}_{\text{pos}}\|_1 + \|q_{\text{pos}} - q^{\text{ref}}_{\text{pos}}\|_1\big)}_{= \varepsilon},$$

where b_pos and b^ref_pos correspond to the body positions of the simulated character and the mocap reference, and q_pos and q^ref_pos correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

$$r = \frac{1}{2}\, r_{\text{trunc}} + \frac{1}{2}\big(0.1\, r_{\text{com}} + \lambda\, r_{\text{vel}} + 0.15\, r_{\text{app}} + 0.65\, r_{\text{quat}}\big), \quad (8)$$

varying λ as described in the main paper (Sec. 6.3).
B.5. Sawyer

The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller, which maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis), or 5 N on the vertical axis (z axis), the episode is terminated early.

There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

$$r_{\text{insertion}} = \max\big(0.2\, r_{\text{approach}},\; r_{\text{inserted}}\big) \quad (9)$$
$$r_{\text{approach}} = s(p^{\text{peg}}, p^{\text{opening}}) \quad (10)$$
$$r_{\text{inserted}} = r_{\text{aligned}}\; s(p^{\text{peg}}_z, p^{\text{bottom}}_z) \quad (11)$$
$$r_{\text{aligned}} = \|p^{\text{peg}}_{xy} - p^{\text{opening}}_{xy}\| < d/2 \quad (12)$$
where p^peg is the peg position, p^opening the position of the hole opening, p^bottom the bottom position of the hole, and d the diameter of the hole.
s(p₁, p₂) is a shaping operator:

$$s(p_1, p_2) = 1 - \tanh^2\Big(\frac{\operatorname{atanh}(\sqrt{l})}{\epsilon}\, \|p_1 - p_2\|_2\Big), \quad (13)$$
which gives a reward of 1 − l if p₁ and p₂ are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p^peg = p^opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
The reward for the secondary objective, minimizing wrist force measurements, is defined as

$$r_{\text{force}} = -\|F\|_1, \quad (14)$$

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q̂_φk(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer D containing trajectory snippets τ = {(s₀, a₀, r₀, s₁), ..., (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:
$$\min_{\phi_k} \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\big(Q^{\text{ret}}_k(s_t, a_t) - \hat{Q}_{\phi_k}(s_t, a_t)\big)^2\Big], \quad (15)$$

with

$$Q^{\text{ret}}_k(s_t, a_t) = \hat{Q}_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big) \delta_j,$$
$$\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - \hat{Q}_{\phi'_k}(s_j, a_j),$$
$$V(s_{j+1}) = \mathbb{E}_{\pi_{\text{old}}(a|s_{j+1})}\big[\hat{Q}_{\phi'_k}(s_{j+1}, a)\big].$$
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks, such that π_θ′ = π_θ and Q_φ′_k = Q_φk
3: repeat
4:   repeat
5:     Collect dataset {s^i, a^{ij}, Q^{ij}_k}^{L,M,N}_{i,j,k}, where
6:       s^i ∼ D, a^{ij} ∼ π_θ′(a|s^i), and Q^{ij}_k = Q_φ′_k(s^i, a^{ij})
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_ηk ← ∇_ηk [ η_k ε_k + η_k Σ_i^L (1/L) log( Σ_j^M (1/M) exp(Q^{ij}_k / η_k) ) ]
11:      Update η_k based on δ_ηk
12:      q^{ij}_k ∝ exp(Q^{ij}_k / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q^{ij}_k log π_θ(a^{ij}|s^i)
18:         + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s^i) ∥ π_θ(a|s^i)) )
19:    δ_ν ← ∇_ν [ ν ( β − Σ_i^L KL(π_θ′(a|s^i) ∥ π_sg(θ)(a|s^i)) ) ]
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_φk ← ∇_φk Σ_{(s_t, a_t) ∈ τ ∼ D} ( Q_φk(s_t, a_t) − Q^{ret}_k )²,
26:        with Q^{ret}_k as in Eq. 15
27:      Update φ_k based on δ_φk
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_φ′_k = Q_φk
33: until convergence
The importance weights c_z are defined as

$$c_z = \min\Big(1,\; \frac{\pi_{\text{old}}(a_z|s_z)}{b(a_z|s_z)}\Big),$$

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set $\big(\prod_{z=t+1}^{j} c_z\big) = 1$.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
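For a single objective k, the Retrace target over one trajectory snippet can be computed with a backward recursion; this is our own vectorized formulation, equivalent to the nested sums in the definition of Q^ret above:

```python
import numpy as np

def retrace_targets(q_target, rewards, values, behavior_probs, pi_probs,
                    gamma=0.99):
    """Compute Q^ret_k(s_t, a_t) for every t in a trajectory (sketch).

    q_target: \\hat{Q}_{phi'_k}(s_t, a_t) for t = 0..T
    values:   E_{pi_old}[ \\hat{Q}_{phi'_k}(s_{t+1}, .) ] for t = 0..T
    behavior_probs, pi_probs: b(a_t|s_t) and pi_old(a_t|s_t)
    """
    T = len(rewards)
    c = np.minimum(1.0, pi_probs / behavior_probs)  # truncated IS weights
    delta = rewards + gamma * values - q_target     # TD errors delta_j
    targets = np.zeros(T)
    acc = 0.0
    # backward recursion: acc_t = delta_t + gamma * c_{t+1} * acc_{t+1},
    # which unrolls to the discounted, c-weighted sum of TD errors
    for t in reversed(range(T)):
        acc = delta[t] + gamma * (c[t + 1] if t + 1 < T else 0.0) * acc
        targets[t] = q_target[t] + acc
    return targets
```

As a quick check: with on-policy data (c_z = 1), zero Q-estimates, and γ = 0.5, rewards of 1 at both steps give targets δ₀ + γδ₁ = 1.5 and δ₁ = 1.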
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta \quad (16)$$
We first write the generalized Lagrangian equation, i.e.,

$$L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu\Big(\beta - \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds\Big), \quad (17)$$
where ν is the Lagrangian multiplier. Now we solve the following primal problem:

$$\max_\theta \min_{\nu > 0} L(\theta, \nu).$$
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies, for mean and covariance:

$$\pi^\mu_\theta(a|s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \Sigma_{\theta_{\text{old}}}(s)\big), \quad (18)$$
$$\pi^\Sigma_\theta(a|s) = \mathcal{N}\big(a;\, \mu_{\theta_{\text{old}}}(s),\, \Sigma_\theta(s)\big). \quad (19)$$
Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

$$\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \big(\log \pi^\mu_\theta(a|s)\, \pi^\Sigma_\theta(a|s)\big)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi^\mu_\theta(a|s)\big)\, ds < \beta_\mu,$$
$$\int_s \mu(s)\, \mathrm{KL}\big(\pi_{\text{old}}(a|s) \,\|\, \pi^\Sigma_\theta(a|s)\big)\, ds < \beta_\Sigma \quad (20)$$
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D.1. Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s₀, a₀, r₀), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
$$\min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_{\tau}\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)\big)^{2}\Big]. \tag{21}$$
Here $G^{(T)}_k(s_t, a_t)$ is the $T$-step target for value function $k$, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:
$$G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t}\, r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_{\text{old}}}_{\phi_k}(s_{t+T}).$$
The advantages are then estimated as
$$A^{\pi_{\text{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t).$$
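As a concrete illustration, the $T$-step target and advantage above can be computed as in this minimal sketch (the function name, toy rewards, and placeholder values are ours, not from the paper):

```python
import numpy as np

def t_step_target(rewards, v_bootstrap, gamma):
    """T-step return for one objective: discounted sum of the actual
    rewards in the snippet, plus the discounted bootstrap value
    V(s_{t+T}) from the previous policy's value network."""
    T = len(rewards)
    discounts = gamma ** np.arange(T)
    return float(np.sum(discounts * rewards) + gamma ** T * v_bootstrap)

# Toy 3-step snippet for a single objective k
target = t_step_target(np.array([1.0, 0.0, 2.0]), v_bootstrap=5.0, gamma=0.9)
advantage = target - 4.0  # A_k = G_k - V(s_t), with V(s_t) = 4.0 as a placeholder
```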
D2 Multi-objective policy improvement in the on-policy setting
Given the previous policy $\pi_{\text{old}}(a|s)$ and estimated advantages $\{A^{\pi_{\text{old}}}_k(s,a)\}_{k=1}^{N}$ associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution $q_k(s,a)$ for each objective, and then combine and distill the variational distributions into a new parametric policy $\pi_{\text{new}}(a|s)$. Unlike in MO-MPO, for MO-V-MPO we use the joint distribution $q_k(s,a)$ rather than local policies $q_k(a|s)$, because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions $q_k(s,a)$, we optimize the RL objective for each objective $A_k$:
$$\max_{q_k} \int_{s,a} q_k(s,a)\, A_k(s,a)\, da\, ds \tag{22}$$
$$\text{s.t.}\quad \mathrm{KL}\big(q_k(s,a)\,\|\,p_{\text{old}}(s,a)\big) < \epsilon_k,$$
where the KL divergence is computed over all $(s,a)$, $\epsilon_k$ denotes the allowed expected KL divergence, and $p_{\text{old}}(s,a) = \mu(s)\,\pi_{\text{old}}(a|s)$ is the state-action distribution associated with $\pi_{\text{old}}$.
As in MO-MPO, we use these $\epsilon_k$ to define the preferences over objectives. More concretely, $\epsilon_k$ defines the allowed contribution of objective $k$ to the change of the policy. Therefore, the larger a particular $\epsilon_k$ is with respect to the others, the more that objective $k$ is preferred. On the other hand, if $\epsilon_k$ is zero, then objective $k$ will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:
$$q_k(s,a) \propto p_{\text{old}}(s,a)\, \exp\Big(\frac{A_k(s,a)}{\eta_k}\Big), \tag{23}$$
where the temperature $\eta_k$ is computed based on the constraint $\epsilon_k$ by solving the following convex dual problem:
$$\eta_k = \arg\min_{\eta}\; \Big[\eta\,\epsilon_k + \eta \log \int_{s,a} p_{\text{old}}(s,a)\, \exp\Big(\frac{A_k(s,a)}{\eta}\Big)\, da\, ds\Big]. \tag{24}$$
We can perform this optimization alongside the main loss by taking a gradient descent step on $\eta_k$, which we initialize with the solution found in the previous policy iteration step. Since $\eta_k$ must be positive, we apply a projection operator after each gradient step to maintain $\eta_k > 0$.
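A minimal sample-based sketch of this procedure might look as follows (the finite-difference gradient and learning rate are illustrative only; in practice the dual is differentiated with autodiff alongside the main loss):

```python
import numpy as np

def dual_loss(eta, advantages, epsilon):
    """Sample-based estimate of the convex dual:
    g(eta) = eta * epsilon + eta * log E_{p_old}[exp(A / eta)]."""
    return eta * epsilon + eta * np.log(np.mean(np.exp(advantages / eta)))

def eta_gradient_step(eta, advantages, epsilon, lr=0.01, eta_min=1e-8):
    # Finite-difference gradient for illustration only
    h = 1e-5
    grad = (dual_loss(eta + h, advantages, epsilon)
            - dual_loss(eta - h, advantages, epsilon)) / (2 * h)
    # Gradient step, then projection to keep eta strictly positive
    return max(eta - lr * grad, eta_min)
```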
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
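This top-50% filtering can be sketched as a simple mask over a batch (hypothetical helper, not the authors' code):

```python
import numpy as np

def top_half_mask(advantages):
    """Boolean mask selecting the samples whose advantage is in the
    top 50% of the batch (thresholding at the median)."""
    return advantages >= np.median(advantages)
```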
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy $\pi_{\text{new}}(a|s)$ that favors all of the objectives, according to the preferences specified by $\epsilon_k$. For this, we solve a supervised learning problem that fits a parametric policy as follows:
$$\pi_{\text{new}} = \arg\max_{\theta} \sum_{k=1}^{N} \int_{s,a} q_k(s,a)\, \log \pi_{\theta}(a|s)\, da\, ds \tag{25}$$
$$\text{s.t.}\quad \int_{s} \mathrm{KL}\big(\pi_{\text{old}}(a|s)\,\|\,\pi_{\theta}(a|s)\big)\, ds < \beta,$$
where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi_{\text{old}}$, and the KL constraint enforces a trust region of size $\beta$ that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO, given in the appendix of Abdolmaleki et al. (2018a), to the multi-objective case.
We assume there are observable binary improvement events $\{R_k\}_{k=1}^{N}$, one for each objective. $R_k = 1$ indicates that our policy has improved for objective $k$, while $R_k = 0$ indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^{N}$, what would the parameters $\theta$ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^{N}$:
$$\max_{\theta}\; p_{\theta}(R_1 = 1, R_2 = 1, \dots, R_N = 1)\, p(\theta), \tag{26}$$
where the probability of the improvement events depends on $\theta$. Assuming independence between the improvement events $R_k$ and using log probabilities leads to the following objective:
$$\max_{\theta} \sum_{k=1}^{N} \log p_{\theta}(R_k = 1) + \log p(\theta). \tag{27}$$
The prior distribution over parameters, $p(\theta)$, is fixed during the policy improvement step. We set the prior such that $\pi_{\theta}$ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_{\theta}(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of $\theta$ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution $q_k(s,a)$ per objective, and use it to decompose the log-evidence $\sum_{k=1}^{N} \log p_{\theta}(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_{\theta}(R_k = 1) \tag{28}$$
$$= \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s,a)\,\|\,p_{\theta}(s,a \mid R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s,a)\,\|\,p_{\theta}(R_k = 1, s, a)\big).$$
The second term of this decomposition can be expanded as
$$-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s,a)\,\|\,p_{\theta}(R_k = 1, s, a)\big) \tag{29}$$
$$= \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\bigg[\log \frac{p_{\theta}(R_k = 1, s, a)}{q_k(s,a)}\bigg] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\bigg[\log \frac{p(R_k = 1 \mid s, a)\,\pi_{\theta}(a|s)\,\mu(s)}{q_k(s,a)}\bigg],$$
where $\mu(s)$ is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, $\mu(s)$ is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 \mid s, a)$ is the likelihood of the improvement event for objective $k$ if our policy chose action $a$ in state $s$.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_{\theta}(R_k = 1)$. $\pi_{\theta}(a|s)$ and $q_k(a|s) = \frac{q_k(s,a)}{\mu(s)}$ are unknown, even though $\mu(s)$ is given. In the E-step, we estimate $q_k(a|s)$ for each objective $k$ by minimizing the first term, given $\theta = \theta'$ (the parameters of the target policy). Then, in the M-step, we find a new $\theta$ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^{N}$ such that the lower bound on $\sum_{k=1}^{N} \log p_{\theta}(R_k = 1)$ is as tight as possible when $\theta = \theta'$, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose $q_k$ to minimize this term. We can optimize for each variational distribution $q_k$ independently:
$$q_k(a|s) = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\,\|\,p_{\theta'}(s,a \mid R_k = 1)\big)\Big]$$
$$= \arg\min_{q}\; \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\log p(R_k = 1 \mid s, a)\Big]. \tag{30}$$
We can solve this optimization problem in closed form, which gives us
$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)\, da}.$$
This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 \mid s, a)$ for each objective. We define the likelihood of an improvement event $R_k$ as
$$p(R_k = 1 \mid s, a) \propto \exp\Big(\frac{Q_k(s,a)}{\alpha_k}\Big),$$
where $\alpha_k$ is an objective-dependent temperature parameter that controls how greedy the solution $q_k(a|s)$ is with respect to its associated objective $Q_k(s,a)$. For example, given a batch of state-action pairs, at the extremes, an $\alpha_k$ of zero would give all probability to the state-action pair with the maximum Q-value, while an $\alpha_k$ of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for $\alpha_k$, we plug the exponential transformation into (30), which leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s,a)\, da\, ds \;-\; \alpha_k \int \mu(s)\, \mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big)\, ds. \tag{31}$$
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s,a)\, da\, ds \tag{32}$$
$$\text{s.t.}\quad \int \mu(s)\, \mathrm{KL}\big(q(a|s)\,\|\,\pi_{\theta'}(a|s)\big)\, ds < \epsilon_k,$$
where, as described in the main paper, the parameter $\epsilon_k$ defines preferences over objectives. If we now assume that $q_k(a|s)$ is a non-parametric sample-based distribution, as in Abdolmaleki et al. (2018b), we can solve this constrained optimization problem in closed form for each sampled state $s$, by setting
$$q_k(a|s) \propto \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s,a)}{\eta_k}\Big), \tag{33}$$
where $\eta_k$ is computed by solving the following convex dual function:
$$\eta_k = \arg\min_{\eta}\; \eta\,\epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s,a)}{\eta}\Big)\, da\, ds. \tag{34}$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in Abdolmaleki et al. (2018b) for details on the derivation of the dual function.
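For actions sampled from the target policy, Equation (33) reduces to a per-state softmax over Q-values with temperature $\eta_k$. A sketch (our own helper, under the assumption that all candidate actions are drawn from $\pi_{\theta'}$):

```python
import numpy as np

def nonparametric_weights(q_values, eta):
    """Weights for actions sampled from the target policy at one state:
    w_i ∝ exp(Q_k(s, a_i) / eta_k), normalized over the samples."""
    logits = q_values / eta
    logits -= np.max(logits)  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / np.sum(w)
```

A smaller temperature concentrates the weights on the highest-valued actions; a larger one spreads them toward uniform, mirroring the role of $\alpha_k$ discussed above.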
E2 M-Step
After obtaining the variational distributions $q_k(a|s)$ for each objective $k$, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_{\theta}(R_k = 1)$ when $\theta = \theta'$. We can now obtain the parameters $\theta$ of the updated policy $\pi_{\theta}(a|s)$ by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve $\theta$, we obtain
$$\theta^{*} = \arg\max_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s)\, \log \pi_{\theta}(a|s)\, da\, ds + \log p(\theta).$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior $p(\theta)$ on the parameters $\theta$, we enforce that the new policy $\pi_{\theta}(a|s)$ is close to the target policy $\pi_{\theta'}(a|s)$. We therefore optimize
$$\theta^{*} = \arg\max_{\theta} \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s)\, \log \pi_{\theta}(a|s)\, da\, ds$$
$$\text{s.t.}\quad \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s)\,\|\,\pi_{\theta}(a|s)\big)\, ds < \beta.$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See Abdolmaleki et al. (2018b) for more details.
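The M-step objective is a weighted maximum-likelihood fit. Below is a minimal sketch of the (unconstrained) loss for a one-dimensional diagonal Gaussian policy; the function names are ours, and the KL trust region is omitted:

```python
import numpy as np

def gaussian_logpdf(a, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return -0.5 * np.sum(((a - mean) / std) ** 2
                         + 2.0 * np.log(std) + np.log(2.0 * np.pi), axis=-1)

def m_step_loss(actions, weights_per_objective, mean, std):
    """Negative weighted log-likelihood, summed over objectives:
    -sum_k sum_i q_k(a_i|s) log pi_theta(a_i|s).
    The KL constraint of the M-step is not modeled in this sketch."""
    logp = gaussian_logpdf(actions, mean, std)
    return -sum(np.sum(w * logp) for w in weights_per_objective)
```

Minimizing this loss pulls the policy mean toward the actions that the per-objective distributions weight highly.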
A Experiment Details

In this section, we describe implementation details and specify the hyperparameters used for our algorithm. In all our experiments, the policy and critic(s) are implemented with feed-forward neural networks. We use Adam (Kingma & Ba, 2015) for optimization.
In our continuous control tasks, the policy returns a Gaussian distribution with a diagonal covariance matrix, i.e., $\pi_{\theta}(a|s) = \mathcal{N}(\mu, \Sigma)$. The policy is parametrized by a neural network, which outputs the mean $\mu = \mu(s)$ and diagonal Cholesky factors $A = A(s)$, such that $\Sigma = AA^{T}$. The diagonal factor $A$ has positive diagonal elements, enforced by the softplus transform $A_{ii} \leftarrow \log(1 + \exp(A_{ii}))$, to ensure positive definiteness of the diagonal covariance matrix.
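The covariance construction described above can be sketched as follows (illustrative helper of ours; the real networks output the raw Cholesky entries as a function of the state):

```python
import numpy as np

def diagonal_covariance(raw_diag):
    """Build Sigma = A A^T from raw network outputs. The softplus
    transform A_ii <- log(1 + exp(A_ii)) keeps the diagonal Cholesky
    factors strictly positive, so Sigma is positive definite."""
    a = np.log1p(np.exp(raw_diag))  # softplus
    A = np.diag(a)
    return A @ A.T
```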
Table 2 shows the default hyperparameters that we use for MO-MPO and scalarized MPO, and Table 3 shows the hyperparameters that differ from these defaults for the humanoid, Shadow Hand, and Sawyer experiments. For all tasks that we train policies with MO-MPO for, a separate critic is trained for each objective, with a shared first layer. The only exception is the action norm penalty in humanoid run, which is computed exactly from the action. We found that, for both policy and critic networks, layer normalization of the first layer followed by a hyperbolic tangent (tanh) is important for the stability of the algorithm.
In our experiments comparing the Pareto front found by MO-MPO versus scalarized MPO, we ran MO-MPO with a range of $\epsilon_k$ settings and ran scalarized MPO with a range of weight settings, to obtain a set of policies for each algorithm. The settings for humanoid run and the Shadow Hand tasks are listed in Table 4. The settings for humanoid mocap and Sawyer are specified in the main paper, in Sec. 6.3 and 6.4, respectively.
In the following subsections, we first give suggestions for choosing appropriate $\epsilon_k$ to encode a particular preference (Sec. A1), then we describe the discrete MO-MPO used for Deep Sea Treasure (Sec. A2), and finally we describe implementation details for humanoid mocap (Sec. A3).
A1 Suggestions for Setting ε
Our proposed algorithm, MO-(V-)MPO, requires practitioners to translate a desired preference across objectives into numerical choices of $\epsilon_k$ for each objective $k$. At first glance this may seem daunting. However, in practice we have found that encoding preferences via $\epsilon_k$ is often more intuitive than doing so via scalarization. In this subsection, we seek to give an intuition on how to set $\epsilon_k$ for different desired preferences. Recall that each $\epsilon_k$ controls the influence of objective $k$ on the policy update, by constraining the KL-divergence between each objective-specific distribution
Hyperparameter: Default

policy network
  layer sizes: (300, 200)
  take tanh of action mean: yes
  minimum variance: 10^-12
  maximum variance: unbounded

critic network(s)
  layer sizes: (400, 400, 300)
  take tanh of action: yes
  Retrace sequence size: 8
  discount factor γ: 0.99

both policy and critic networks
  layer norm on first layer: yes
  tanh on output of layer norm: yes
  activation (after each hidden layer): ELU

MPO
  actions sampled per state: 20
  default ε: 0.1
  KL-constraint on policy mean, β_μ: 10^-3
  KL-constraint on policy covariance, β_Σ: 10^-5
  initial temperature η: 1

training
  batch size: 512
  replay buffer size: 10^6
  Adam learning rate: 3×10^-4
  Adam ε: 10^-3
  target network update period: 200

Table 2. Default hyperparameters for MO-MPO and scalarized MPO, with decoupled update on mean and covariance (Sec. C3).
and the current policy (Sec. 4.1). We generally choose $\epsilon_k$ in the range of 0.001 to 0.1.
Equal Preference. When all objectives are equally important, the general rule is to set all $\epsilon_k$ to the same value. We did this in our orient and humanoid mocap tasks, where objectives are aligned or mostly aligned. In contrast, it can be tricky to choose appropriate weights in linear scalarization to encode equal preferences: setting all weights equal to $1/K$ (where $K$ is the number of objectives) is only appropriate if the objectives' rewards are of similar scales. We explored this in Sec. 5.1 and Fig. 3.
When setting all $\epsilon_k$ to the same value, what should this value be? The larger $\epsilon_k$ is, the more influence the objectives will have on the policy update step. Since the per-objective critics are learned in parallel with the policy, setting $\epsilon_k$ too high tends to destabilize learning, because early on in training, when the critics produce unreliable Q-values, their influence on the policy will lead it in the wrong direction. On the other hand, if $\epsilon_k$ is set too low, then it slows down learning, because each per-objective action distribution is only allowed to deviate by a tiny amount from the current policy, and the updated policy is obtained via supervised learning on the combination of these action distributions.
Humanoid run
  policy network
    layer sizes: (400, 400, 300)
    take tanh of action mean: no
    minimum variance: 10^-8
  critic network(s)
    layer sizes: (500, 500, 400)
    take tanh of action: no
  MPO
    KL-constraint on policy mean, β_μ: 5×10^-3
    KL-constraint on policy covariance, β_Σ: 10^-6
  training
    Adam learning rate: 2×10^-4
    Adam ε: 10^-8

Shadow Hand
  MPO
    actions sampled per state: 30
    default ε: 0.01

Sawyer
  MPO
    actions sampled per state: 15
  training
    target network update period: 100

Table 3. Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually the learning will converge to more or less the same policy, though, as long as $\epsilon_k$ is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scales of the $\epsilon_k$ are what matters. The larger the scale of $\epsilon_k$ is compared to $\epsilon_l$, the more influence objective $k$ has over the policy update, compared to objective $l$. In the extreme case, when $\epsilon_l$ is near-zero for objective $l$, objective $l$ will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the $\epsilon$ for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the $\epsilon$ for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Condition: Settings

Humanoid run (one seed per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1, ε_penalty ∈ linspace(10^-6, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of $\epsilon_k$ is too high or too low, then the same issues arise as discussed for equal preferences. If all $\epsilon_k$ increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether $\epsilon_{\text{time}}$ is 0.01, 0.02, or 0.05, the relationship between the relative scale of $\epsilon_{\text{treasure}}$ and $\epsilon_{\text{time}}$ and the treasure that the policy converges to selecting is essentially the same.
A2 Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution rather than a Gaussian. The policy is parametrized by a neural network that outputs a vector of logits (i.e., unnormalized log probabilities) of size $|\mathcal{A}|$. The KL constraint on the change of this policy, $\beta$, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to $[1, 0, 0, 0]^{\top}$). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
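The critic input described above can be sketched as follows (hypothetical helper, not the authors' code):

```python
import numpy as np

ACTIONS = ["up", "right", "down", "left"]

def critic_input(state, action_index):
    """Concatenate the (x, y) state with a four-dimensional one-hot
    encoding of the chosen discrete action."""
    one_hot = np.zeros(len(ACTIONS))
    one_hot[action_index] = 1.0
    return np.concatenate([state, one_hot])
```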
Humanoid Mocap
  Scalarized V-MPO
    KL-constraint on policy mean, β_μ: 0.1
    KL-constraint on policy covariance, β_Σ: 10^-5
    default ε: 0.1
    initial temperature η: 1
  Training
    Adam learning rate: 10^-4
    batch size: 128
    unroll length (for n-step return, Sec. D1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A3 Humanoid Mocap
For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations with a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers that produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B1 Deep Sea Treasure
Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space $\mathcal{S}$ consists of the $x$ and $y$ position of the agent, and the action space $\mathcal{A}$ is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of $-1$ is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value $v$, the reward vector is $[-1, v]$; otherwise it is $[-1, 0]$.
B2 Shadow Hand
Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated; it is initialized to a random position between −30° and 30°, and the target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remain the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B21 BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: a "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing the block with greater than 5 N of force, or for turning the dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task, i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of $\gamma = 0.99$), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force $m(s, s')$, which is the increase in force from state $s$ to the next state $s'$. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
$$r_{\text{pain}}(s, a, s') = -\big[1 - a_{\lambda}\big(m(s, s')\big)\big]\, m(s, s'), \tag{6}$$
where $a_{\lambda}(\cdot) \in [0, 1]$ computes the acceptability of an impact force. $a_{\lambda}(\cdot)$ should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use
$$a_{\lambda}(m) = \operatorname{sigmoid}\big(\lambda_1(-m + \lambda_2)\big), \tag{7}$$
with $\lambda = [2, 2]^{\top}$. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
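Equations (6) and (7) can be written out directly as follows (function names are ours):

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)), Eq. (7),
    with lambda = [2, 2]^T as in the experiments."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact_force):
    """r_pain = -(1 - a_lambda(m)) * m, Eq. (6): near-zero for small
    impact forces, approximately -m for large ones."""
    return -(1.0 - acceptability(impact_force)) * impact_force
```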
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B22 ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y positions are not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is $1 - \tanh(50\,|z_{\text{target}} - z_{\text{peg}}|)$. For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations $Q_{\text{target}}$ and the peg's orientation $q_{\text{peg}}$ are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
$$\min_{q \in Q_{\text{target}}}\; 1 - \tanh\big(2\, d(q, q_{\text{peg}})\big),$$
where $d(\cdot)$ denotes the $\ell_2$-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
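The two shaped rewards can be sketched as follows (the quaternion distance computation is abstracted into a list of precomputed axis-angle distances; the names and this simplification are ours):

```python
import numpy as np

Z_TARGET = 0.07  # target height: 7 cm above the ground

def height_reward(z_peg):
    """1 - tanh(50 * |z_target - z_peg|), peaked at the target height."""
    return 1.0 - np.tanh(50.0 * abs(Z_TARGET - z_peg))

def orientation_reward(dists_to_targets):
    """Shaped reward using the axis-angle distance (radians) to the
    closest of the eight symmetry-equivalent target orientations."""
    return 1.0 - np.tanh(2.0 * min(dists_to_targets))
```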
B3 Humanoid
We make use of the humanoid run task from Tassa et al. (2018).9 The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:

• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to $\min(h/10, 1)$, where $h$ is the horizontal speed in meters per second. For this objective, we learn a Q-function.

9. This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the $\ell_2$-norm of the action vector, i.e., $r_{\text{penalty}}(a) = -\|a\|_2$. For this objective, we do not learn a Q-function; instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B4 Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips; for this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.10
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:
$$r_{\text{com}} = \exp\big(-10\,\|p_{\text{com}} - p^{\text{ref}}_{\text{com}}\|^2\big),$$
where $p_{\text{com}}$ and $p^{\text{ref}}_{\text{com}}$ are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:
$$r_{\text{vel}} = \exp\big(-0.1\,\|q_{\text{vel}} - q^{\text{ref}}_{\text{vel}}\|^2\big),$$
where $q_{\text{vel}}$ and $q^{\text{ref}}_{\text{vel}}$ are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:
$$r_{\text{app}} = \exp\big(-40\,\|p_{\text{app}} - p^{\text{ref}}_{\text{app}}\|^2\big),$$
where $p_{\text{app}}$ and $p^{\text{ref}}_{\text{app}}$ are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:
$$r_{\text{quat}} = \exp\big(-2\,\|q_{\text{quat}} \ominus q^{\text{ref}}_{\text{quat}}\|^2\big),$$
where $\ominus$ denotes quaternion differences, and $q_{\text{quat}}$ and $q^{\text{ref}}_{\text{quat}}$ are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:
$$r_{\text{trunc}} = 1 - \frac{1}{0.3}\big(\underbrace{\|b_{\text{pos}} - b^{\text{ref}}_{\text{pos}}\|_1 + \|q_{\text{pos}} - q^{\text{ref}}_{\text{pos}}\|_1}_{=\,\varepsilon}\big),$$
where $b_{\text{pos}}$ and $b^{\text{ref}}_{\text{pos}}$ correspond to the body positions of the simulated character and the mocap reference, and $q_{\text{pos}}$ and $q^{\text{ref}}_{\text{pos}}$ correspond to the joint angles.

10. This database is available at mocap.cs.cmu.edu.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if $\varepsilon > 0.3$, which ensures that $r_{\text{trunc}} \in [0, 1]$.
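Three of the exponential tracking components can be sketched as follows (the quaternion and truncation terms are omitted; the function name and the assumption of squared Euclidean errors follow Peng et al. (2018), but are ours):

```python
import numpy as np

def mocap_rewards(p_com, p_com_ref, q_vel, q_vel_ref, p_app, p_app_ref):
    """Center-of-mass, joint-velocity, and end-effector tracking terms.
    Each term is exp(-c * squared error), so it equals 1 at a perfect match
    and decays toward 0 as the pose deviates from the reference."""
    r_com = np.exp(-10.0 * np.linalg.norm(p_com - p_com_ref) ** 2)
    r_vel = np.exp(-0.1 * np.linalg.norm(q_vel - q_vel_ref) ** 2)
    r_app = np.exp(-40.0 * np.linalg.norm(p_app - p_app_ref) ** 2)
    return r_com, r_vel, r_app
```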
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form
$$r = \frac{1}{2}\, r_{\text{trunc}} + \frac{1}{2}\big(0.1\, r_{\text{com}} + \lambda\, r_{\text{vel}} + 0.15\, r_{\text{app}} + 0.65\, r_{\text{quat}}\big), \tag{8}$$
varying $\lambda$ as described in the main paper (Sec. 6.3).
B5 Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper; the agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques, the end-effector pose, the wrist force-torque measurements, and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x and y axes) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as
$$r_\text{insertion} = \max\big(0.2\, r_\text{approach},\; r_\text{inserted}\big) \qquad (9)$$
$$r_\text{approach} = s(p^\text{peg}, p^\text{opening}) \qquad (10)$$
$$r_\text{inserted} = r_\text{aligned} \cdot s(p^\text{peg}_z, p^\text{bottom}_z) \qquad (11)$$
$$r_\text{aligned} = \|p^\text{peg}_{xy} - p^\text{opening}_{xy}\| < d/2 \qquad (12)$$
where $p^\text{peg}$ is the peg position, $p^\text{opening}$ the position of the hole opening, $p^\text{bottom}$ the bottom position of the hole, and $d$ the diameter of the hole. $s(p_1, p_2)$ is a shaping operator:
$$s(p_1, p_2) = 1 - \tanh^2\!\Big(\frac{\operatorname{artanh}(\sqrt{l})}{\varepsilon}\,\|p_1 - p_2\|_2\Big) \qquad (13)$$
which gives a reward of $1 - l$ if $p_1$ and $p_2$ are at a Euclidean distance of $\varepsilon$. Here we chose $l_\text{approach} = 0.95$, $\varepsilon_\text{approach} = 0.01$, $l_\text{inserted} = 0.5$, and $\varepsilon_\text{inserted} = 0.04$. Intentionally, 0.04 corresponds to the length of the peg, such that $r_\text{inserted} = 0.5$ if $p^\text{peg} = p^\text{opening}$. As a result, if the peg is in the approach position, $r_\text{inserted}$ is dominant over $r_\text{approach}$ in $r_\text{insertion}$.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped-reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
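As a concrete check of the shaping operator in Eq. (13), a minimal NumPy sketch (the function and argument names are ours, not from the original implementation):

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """Shaping operator s(p1, p2) of Eq. (13): returns
    1 - tanh^2(artanh(sqrt(l)) / eps * ||p1 - p2||_2),
    i.e. exactly 1 - l when the two points are at distance eps."""
    d = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

# With l_approach = 0.95 and eps_approach = 0.01, two points exactly
# 1 cm apart receive a reward of 1 - 0.95 = 0.05.
r = shaping([0.0, 0.0, 0.01], [0.0, 0.0, 0.0], l=0.95, eps=0.01)
```

At zero distance the operator returns 1, and it decays smoothly toward 0 as the distance grows, which is what makes it useful as a dense approach reward.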
The reward for the secondary objective, minimizing wrist force measurements, is defined as
$$r_\text{force} = -\|F\|_1 \qquad (14)$$

where $F = (F_x, F_y, F_z)$ are the forces measured by the wrist sensor.
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:     τ = {}
6:     for t = 0, ..., T do
7:       a_t ∼ π_θ(·|s_t)
8:       Execute action and determine rewards:
9:         r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:      τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:    end for
12:  Send trajectory τ to replay buffer
13: until end of training
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function $Q_{\phi_k}(s, a)$ for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer $\mathcal{D}$ containing trajectory snippets $\tau = \{(s_0, a_0, r_0, s_1), \dots, (s_T, a_T, r_T, s_{T+1})\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of a scalar reward for each of the N objectives, the Retrace objective is as follows:
$$\min_{\phi_k} \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\big(Q^\text{ret}_k(s_t, a_t) - Q_{\phi_k}(s_t, a_t)\big)^2\Big] \qquad (15)$$

with

$$Q^\text{ret}_k(s_t, a_t) = Q_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big)\, \delta_j$$
$$\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - Q_{\phi'_k}(s_j, a_j)$$
$$V(s_{j+1}) = \mathbb{E}_{\pi_\text{old}(a|s_{j+1})}\big[Q_{\phi'_k}(s_{j+1}, a)\big]$$
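As an illustration of Eq. (15), a minimal NumPy sketch of the per-objective Retrace target for one trajectory snippet; the array shapes and function name are our assumptions, not the paper's implementation:

```python
import numpy as np

def retrace_targets(q_old, v_old, rewards, pi_old_probs, b_probs, gamma=0.99):
    """Per-objective Retrace targets Q_k^ret (Eq. 15).
    q_old:        [T]     target-network Q-values Q_phi'_k(s_j, a_j)
    v_old:        [T + 1] v_old[j] = E_{pi_old}[Q_phi'_k(s_j, a)]
    rewards:      [T]     r_k(s_j, a_j) for this objective
    pi_old_probs: [T]     pi_old(a_j | s_j)
    b_probs:      [T]     behavior policy b(a_j | s_j)"""
    T = len(rewards)
    c = np.minimum(1.0, pi_old_probs / b_probs)  # truncated importance weights c_z
    targets = np.zeros(T)
    for t in range(T):
        acc, prod_c = q_old[t], 1.0
        for j in range(t, T):
            if j > t:                 # the product is empty (= 1) when j = t
                prod_c *= c[j]
            delta = rewards[j] + gamma * v_old[j + 1] - q_old[j]
            acc += gamma ** (j - t) * prod_c * delta
        targets[t] = acc
    return targets
```

In practice each target would regress $Q_{\phi_k}$ via the squared loss of Eq. (15), once per objective.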
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks such that π_θ′ = π_θ and Q_φ′_k = Q_φ_k
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:       s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_φ′_k(s_i, a_ij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, ..., N do
10:      δ_η_k ← ∇_η_k [ η_k ε_k + η_k Σ_i^L (1/L) log ( Σ_j^M (1/M) exp(Q_ijk / η_k) ) ]
11:      Update η_k based on δ_η_k
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // (sg denotes a stop-gradient)
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q_ijk log π_θ(a_ij|s_i)
18:           + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)) )
19:    δ_ν ← ∇_ν [ ν ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_sg(θ)(a|s_i)) ) ]
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, ..., N do
25:      δ_φ_k ← ∇_φ_k Σ_{(s_t,a_t)∈τ∼D} (Q_φ_k(s_t, a_t) − Q_k^ret)²,
26:        with Q_k^ret as in Eq. 15
27:      Update φ_k based on δ_φ_k
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_φ′_k = Q_φ_k
33: until convergence
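Lines 9-13 of Algorithm 3 can be sketched in NumPy for a single objective as follows. The finite-difference gradient of the temperature dual stands in for autodiff, and the function names and learning rate are our illustrative assumptions:

```python
import numpy as np

def eta_dual(q, eta, eps_k):
    """Temperature dual of line 10: eta*eps_k + eta * mean_i log mean_j exp(Q_ij/eta),
    for q of shape [L, M] (L states, M sampled actions)."""
    z = q / eta
    m = z.max(axis=1, keepdims=True)                       # stabilize the log-mean-exp
    log_mean_exp = (m + np.log(np.mean(np.exp(z - m), axis=1, keepdims=True))).squeeze(1)
    return eta * eps_k + eta * np.mean(log_mean_exp)

def weights(q, eta):
    """Nonparametric per-objective weights q_ijk ∝ exp(Q_ijk / eta) of line 12,
    normalized over the M actions sampled for each state."""
    z = q / eta
    w = np.exp(z - z.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

q = np.array([[1.0, 2.0], [0.0, 0.0]])                     # toy [L=2, M=2] Q-values
grad = (eta_dual(q, 1.0 + 1e-5, 0.1) - eta_dual(q, 1.0 - 1e-5, 0.1)) / 2e-5
eta = max(1.0 - 0.01 * grad, 1e-6)                         # gradient step, projected to eta > 0
w = weights(q, eta)
```

A smaller ε_k forces a larger η_k at the dual's minimum, which makes the weights closer to uniform and hence makes objective k's contribution to the policy update more conservative.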
The importance weights c_z are defined as
$$c_z = \min\Big(1, \frac{\pi_\text{old}(a_z|s_z)}{b(a_z|s_z)}\Big)$$

where $b(a_z|s_z)$ denotes the behavior policy used to collect trajectories in the environment. When $j = t$, we set $\big(\prod_{z=t+1}^{j} c_z\big) = 1$.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
C.3. Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:
$$\pi_\text{new} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \text{KL}\big(\pi_\text{old}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta \qquad (16)$$
We first write the generalized Lagrangian, i.e.,
$$L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu\Big(\beta - \int_s \mu(s)\, \text{KL}\big(\pi_\text{old}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds\Big) \qquad (17)$$
where ν is the Lagrange multiplier. Now we solve the following primal problem:

$$\max_\theta \min_{\nu > 0} L(\theta, \nu)$$
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:
$$\pi^\mu_\theta(a|s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \Sigma_{\theta_\text{old}}(s)\big) \qquad (18)$$
$$\pi^\Sigma_\theta(a|s) = \mathcal{N}\big(a;\, \mu_{\theta_\text{old}}(s),\, \Sigma_\theta(s)\big) \qquad (19)$$
Policy $\pi^\mu_\theta(a|s)$ takes the mean from the online policy network and the covariance from the target policy network, and policy $\pi^\Sigma_\theta(a|s)$ takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
$$\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s)\, \log\big(\pi^\mu_\theta(a|s)\, \pi^\Sigma_\theta(a|s)\big)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \text{KL}\big(\pi_\text{old}(a|s) \,\|\, \pi^\mu_\theta(a|s)\big)\, ds < \beta_\mu$$
$$\int_s \mu(s)\, \text{KL}\big(\pi_\text{old}(a|s) \,\|\, \pi^\Sigma_\theta(a|s)\big)\, ds < \beta_\Sigma \qquad (20)$$
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
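To make the decoupled constraints of Eqs. (18)-(20) concrete, here is a small sketch of the two KL terms for diagonal Gaussian policies (diagonal covariances are our simplifying assumption; the paper does not restrict the covariance structure):

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL(N(mu1, diag(var1)) || N(mu2, diag(var2))) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

mu_old, var_old = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # target policy pi_old
mu_new, var_new = np.array([0.5, 0.0]), np.array([0.8, 1.0])   # online policy pi_theta

# The mean-KL uses the online mean with the *old* covariance (Eq. 18),
# the covariance-KL uses the old mean with the *online* covariance (Eq. 19).
kl_mean = kl_diag_gauss(mu_old, var_old, mu_new, var_old)  # bounded by beta_mu
kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_new)   # bounded by beta_Sigma
```

Bounding these two quantities separately (β_Σ ≪ β_μ, e.g. 10⁻⁶ vs. 5×10⁻³ in Table 3) keeps the covariance, and hence exploration, from collapsing faster than the mean moves.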
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO
In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy π_old we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets $\tau = \{(s_0, a_0, r_0), \dots, (s_T, a_T, r_T)\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
$$\min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_\tau\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_\text{old}}_{\phi_k}(s_t)\big)^2\Big] \qquad (21)$$

Here $G^{(T)}_k(s_t, a_t)$ is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: $G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t} r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_\text{old}}_{\phi_k}(s_T)$. The advantages are then estimated as $\hat{A}^{\pi_\text{old}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_\text{old}}_{\phi_k}(s_t)$.
D.2. Multi-objective policy improvement in the on-policy setting

Given the previous policy π_old(a|s) and the estimated advantages $\{\hat{A}^{\pi_\text{old}}_k(s, a)\}_{k=1,\dots,N}$ associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)
In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective's advantages A_k:
$$\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds \qquad (22)$$
$$\text{s.t.}\ \text{KL}\big(q_k(s, a) \,\|\, p_\text{old}(s, a)\big) < \epsilon_k$$

where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and $p_\text{old}(s, a) = \mu(s)\, \pi_\text{old}(a|s)$ is the state-action distribution associated with π_old.
As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

$$q_k(s, a) \propto p_\text{old}(s, a)\, \exp\Big(\frac{A_k(s, a)}{\eta_k}\Big) \qquad (23)$$
where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

$$\eta_k = \arg\min_{\eta}\ \Big[\eta\, \epsilon_k + \eta \log \int_{s,a} p_\text{old}(s, a)\, \exp\Big(\frac{A_k(s, a)}{\eta}\Big)\, da\, ds\Big] \qquad (24)$$
We can perform this optimization along with the loss by taking a gradient descent step on η_k, and we initialize it with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
Top-k advantages. As in Song et al. (2020), in practice we use the samples corresponding to the top 50% of advantages in each batch of data.
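Combining Eq. (23) with the top-50% filtering can be sketched as follows on a batch of sampled advantages (the function name and the exact masking scheme are our assumptions about the implementation):

```python
import numpy as np

def topk_advantage_weights(advantages, eta, frac=0.5):
    """Sample weights ∝ exp(A_k / eta) (Eq. 23), restricted to the top-frac
    advantages in the batch as in V-MPO; the rest get zero weight."""
    adv = np.asarray(advantages, dtype=float)
    k = max(1, int(len(adv) * frac))
    keep = np.argsort(adv)[-k:]          # indices of the top-frac advantages
    w = np.zeros_like(adv)
    z = adv[keep] / eta
    e = np.exp(z - z.max())              # numerically stable softmax over kept samples
    w[keep] = e / e.sum()
    return w

w = topk_advantage_weights([1.0, -1.0, 2.0, 0.0], eta=1.0)
```

The resulting weights play the role of q_k(s, a) in the distillation step that follows: samples with below-median advantage simply do not contribute to objective k's term.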
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:
$$\pi_\text{new} = \arg\max_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int_s \text{KL}\big(\pi_\text{old}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta \qquad (25)$$
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference
In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.
We assume there are observable binary improvement events $\{R_k\}_{k=1}^N$, one for each objective. $R_k = 1$ indicates that our policy has improved for objective k, while $R_k = 0$ indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^N$, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^N$:
$$\max_\theta\ p_\theta(R_1 = 1, R_2 = 1, \dots, R_N = 1)\, p(\theta) \qquad (26)$$
where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k, and using log probabilities, leads to the following objective:
$$\max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta) \qquad (27)$$
The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve the stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^N \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence $\sum_{k=1}^N \log p_\theta(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) \qquad (28)$$
$$= \sum_{k=1}^{N} \text{KL}\big(q_k(s, a) \,\|\, p_\theta(s, a \,|\, R_k = 1)\big) - \sum_{k=1}^{N} \text{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big)$$
The second term of this decomposition is expanded as:
$$-\sum_{k=1}^{N} \text{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) \qquad (29)$$
$$= \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\Big] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p(R_k = 1 \,|\, s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\Big]$$
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 | s, a)$ is the likelihood of the improvement event for objective k if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^N \log p_\theta(R_k = 1)$. Here, π_θ(a|s) and $q_k(a|s) = q_k(s, a)/\mu(s)$ are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^N$ such that the lower bound on $\sum_{k=1}^N \log p_\theta(R_k = 1)$ is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:
$$q_k(a|s) = \arg\min_{q}\ \mathbb{E}_{\mu(s)}\Big[\text{KL}\big(q(a|s) \,\|\, p_{\theta'}(s, a \,|\, R_k = 1)\big)\Big]$$
$$= \arg\min_{q}\ \mathbb{E}_{\mu(s)}\Big[\text{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\big[\log p(R_k = 1 \,|\, s, a)\big]\Big] \qquad (30)$$
We can solve this optimization problem in closed form, which gives us

$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)\, da}$$
This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 | s, a)$ for each objective. We define the likelihood of an improvement event R_k as

$$p(R_k = 1 \,|\, s, a) \propto \exp\Big(\frac{Q_k(s, a)}{\alpha_k}\Big)$$
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to:
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \text{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big)\, ds \qquad (31)$$
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to:
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds \qquad (32)$$
$$\text{s.t.} \int \mu(s)\, \text{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big)\, ds < \epsilon_k$$
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting
$$q_k(a|s) \propto \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s, a)}{\eta_k}\Big) \qquad (33)$$
where η_k is computed by solving the following convex dual function:

$$\eta_k = \arg\min_{\eta}\ \Big[\eta\, \epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s, a)}{\eta}\Big)\, da\, ds\Big] \qquad (34)$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
E.2. M-Step
After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for $\sum_{k=1}^N \log p_\theta(R_k = 1)$ when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain:
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta)$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).

For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_θ′(a|s). We therefore optimize for:
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int \mu(s)\, \text{KL}\big(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
Humanoid run
  policy network
    layer sizes: (400, 400, 300)
    take tanh of action mean: no
    minimum variance: 10⁻⁸
  critic network(s)
    layer sizes: (500, 500, 400)
    take tanh of action: no
  MPO
    KL-constraint on policy mean β_μ: 5×10⁻³
    KL-constraint on policy covariance β_Σ: 10⁻⁶
  training
    Adam learning rate: 2×10⁻⁴
    Adam ε: 10⁻⁸

Shadow Hand
  MPO
    actions sampled per state: 30
    default ε: 0.01

Sawyer
  MPO
    actions sampled per state: 15
  training
    target network update period: 100

Table 3. Hyperparameters for the humanoid, Shadow Hand, and Sawyer experiments that differ from the defaults in Table 2.
Eventually, though, the learning will converge to more or less the same policy, as long as ε_k is not set too high.
Unequal Preference. When there is a difference in preferences across objectives, the relative scale of ε_k is what matters. The larger the relative scale of ε_k is compared to ε_l, the more influence objective k has over the policy update, compared to objective l. In the extreme case, when ε_l is near-zero for objective l, then objective l will have no influence on the policy update and will effectively be ignored. We explored this briefly in Sec. 5.1 and Fig. 4.
One common example of unequal preferences is when we would like an agent to complete a task while minimizing other objectives, e.g., an action-norm penalty, "pain" penalty, or wrist force-torque penalty in our experiments. In this case, the ε for the task objective should be higher than that for the other objectives, to incentivize the agent to prioritize actually doing the task. If the ε for the penalties is too high, then the agent will care more about minimizing the penalty (which can typically be achieved by simply taking no actions) than about doing the task, which is not particularly useful.
The scale of εk has a similar effect as in the equal preference
Humanoid run (one seed per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.15, 100)
  MO-MPO: ε_task = 0.1, ε_penalty ∈ linspace(10⁻⁶, 0.15, 100)

Humanoid run, normal vs. scaled (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ {0.01, 0.05, 0.1}
  MO-MPO: ε_task = 0.1, ε_penalty ∈ {0.01, 0.05, 0.1}

Shadow Hand touch and turn (three seeds per setting)
  scalarized MPO: w_task = 1 − w_penalty, w_penalty ∈ linspace(0, 0.9, 10)
  MO-MPO: ε_task = 0.01, ε_penalty ∈ linspace(0.001, 0.015, 15)

Shadow Hand orient (ten seeds per setting)
  scalarized MPO: w_touch = w_height = w_orientation = 1/3
  MO-MPO: ε_touch = ε_height = ε_orientation = 0.01

Table 4. Settings for ε_k and weights (linspace(x, y, z) denotes a set of z evenly-spaced values between x and y).
case. If the scale of ε_k is too high or too low, then the same issues arise as discussed for equal preferences. If all ε_k increase or decrease in scale by the same (moderate) factor, and thus their relative scales remain the same, then typically they will converge to more or less the same policy. This can be seen in Fig. 5 (right) in the Deep Sea Treasure domain: regardless of whether ε_time is 0.01, 0.02, or 0.05, the relationship between the relative scale of ε_treasure and ε_time and the treasure that the policy converges to selecting is essentially the same.
A.2. Deep Sea Treasure
In order to handle discrete actions, we make several minor adjustments to scalarized MPO and MO-MPO. The policy returns a categorical distribution, rather than a Gaussian. The policy is parametrized by a neural network, which outputs a vector of logits (i.e., unnormalized log probabilities) of size |A|. The KL constraint on the change of this policy, β, is 0.001. The input to the critic network is the state concatenated with a four-dimensional one-hot vector denoting which action is chosen (e.g., the up action corresponds to [1, 0, 0, 0]ᵀ). Critics are trained with the one-step temporal-difference error, with a discount of 0.999. Other than these changes, the network architectures and the MPO and training hyperparameters are the same as in Table 2.
Humanoid Mocap
  Scalarized VMPO
    KL-constraint on policy mean β_μ: 0.1
    KL-constraint on policy covariance β_Σ: 10⁻⁵
    default ε: 0.1
    initial temperature η: 1
  Training
    Adam learning rate: 10⁻⁴
    batch size: 128
    unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap

For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure

Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If an action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (0.7, 8.2, 11.5, 14.0, 15.1, 16.1, 19.6, 20.3, 22.4, 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise it is [−1, 0].
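The per-step reward vector above can be sketched as follows (the function name and the terminal flag are our additions; only the reward-vector structure comes from the text):

```python
def dst_step_reward(treasure_value):
    """Two-objective reward for one Deep Sea Treasure step.
    treasure_value is 0.0 on empty squares, or the value of the treasure
    on the square the agent just entered. Returns ([time, treasure], done)."""
    done = treasure_value > 0.0   # the episode ends on treasure pickup
    return [-1.0, treasure_value], done
```

This makes the trade-off explicit: reaching the 23.7 treasure takes more steps, so it accumulates more −1 time-penalty rewards along the way than picking up the nearby 0.7 treasure.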
B.2. Shadow Hand

Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x- and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.

In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remain the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encode end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: a "pain" penalty and task completion. A sparse task-completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task; i.e., the agent gets a total reward of either 0 or 1 for the task-completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.
The "pain" penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
$$r_\text{pain}(s, a, s') = -\big[1 - a_\lambda(m(s, s'))\big]\, m(s, s') \qquad (6)$$
where $a_\lambda(\cdot) \in [0, 1]$ computes the acceptability of an impact force. $a_\lambda(\cdot)$ should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use
$$a_\lambda(m) = \text{sigmoid}\big(\lambda_1(-m + \lambda_2)\big) \qquad (7)$$

with $\lambda = [2, 2]^\top$. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
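Eqs. (6)-(7) with λ = [2, 2]ᵀ can be sketched directly (function names are ours):

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)) of Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(m):
    """r_pain of Eq. (6): the impact force, scaled by its unacceptability."""
    return -(1.0 - acceptability(m)) * m
```

With these parameters, impacts well below λ₂ = 2 N are almost fully acceptable and barely penalized, while an impact of 6 N is penalized by nearly its full magnitude, matching the curve in Fig. 10.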
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y positions are not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.
The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is $1 - \tanh(50\,|z_\text{target} - z_\text{peg}|)$. For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations $Q_\text{target}$ and the peg's orientation $q_\text{peg}$ are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion as

$$\min_{q \in Q_\text{target}} 1 - \tanh\big(2\, d(q, q_\text{peg})\big)$$

where $d(\cdot, \cdot)$ denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
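The shaped height reward above can be sketched as follows (we assume z positions are expressed in meters, with the 7 cm target from the text):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward 1 - tanh(50 * |z_target - z_peg|):
    equals 1 at the target height and decays quickly away from it."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))
```

The factor of 50 makes the reward drop to nearly zero within a couple of centimeters of error, so the shaping is sharply peaked around the target height.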
B.3. Humanoid

We make use of the humanoid run task from Tassa et al. (2018).⁹ The observation dimension is 67 and the action dimension is 21. Actions are joint accelerations, with minimum and maximum limits of −1 and 1, respectively. For this task, there are two objectives:
• The original task reward given by the environment. The goal is to achieve a horizontal speed of 10 meters per second, in any direction. This reward is shaped: it is equal to min(h/10, 1), where h is the horizontal speed in meters per second. For this objective, we learn a Q-function.
⁹This task is available at github.com/deepmind/dm_control.
• Limiting energy usage, by penalizing high-magnitude actions. The penalty is the ℓ2-norm of the action vector, i.e., $r_\text{penalty}(a) = -\|a\|_2$. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly, to evaluate a given action in a state-independent way during policy optimization.
B.4. Humanoid Mocap

The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.
This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰
At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.
The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

r_com = exp(−10 ‖p_com − p^ref_com‖²),

where p_com and p^ref_com are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:
r_vel = exp(−0.1 ‖q_vel − q^ref_vel‖²),

where q_vel and q^ref_vel are the joint angle velocities of the simulated humanoid and the mocap reference, respectively. The third reward component is based on the difference in end-effector positions:

r_app = exp(−40 ‖p_app − p^ref_app‖²),

where p_app and p^ref_app are the end effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

r_quat = exp(−2 ‖q_quat ⊖ q^ref_quat‖²),

where ⊖ denotes quaternion differences, and q_quat and q^ref_quat are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

¹⁰This database is available at mocap.cs.cmu.edu.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

r_trunc = 1 − (1/0.3) ε,   where ε = ‖b_pos − b^ref_pos‖₁ + ‖q_pos − q^ref_pos‖₁,

where b_pos and b^ref_pos correspond to the body positions of the simulated character and the mocap reference, and q_pos and q^ref_pos correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if ε > 0.3, which ensures that r_trunc ∈ [0, 1].
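The five components and the termination test can be sketched as follows; the pose arrays (`p_com`, `q_vel`, etc.) and the precomputed quaternion differences are hypothetical placeholders, not the authors' exact interface:

```python
import numpy as np

def tracking_rewards(p_com, p_com_ref, q_vel, q_vel_ref,
                     p_app, p_app_ref, quat_diff,
                     b_pos, b_pos_ref, q_pos, q_pos_ref):
    """Five mocap-tracking reward components; ||.||^2 is the squared L2 norm."""
    r_com = np.exp(-10 * np.sum((p_com - p_com_ref) ** 2))
    r_vel = np.exp(-0.1 * np.sum((q_vel - q_vel_ref) ** 2))
    r_app = np.exp(-40 * np.sum((p_app - p_app_ref) ** 2))
    r_quat = np.exp(-2 * np.sum(quat_diff ** 2))  # precomputed quaternion diffs
    eps = (np.abs(b_pos - b_pos_ref).sum()
           + np.abs(q_pos - q_pos_ref).sum())     # L1 terms of r_trunc
    r_trunc = 1.0 - eps / 0.3
    terminate = eps > 0.3                          # keeps r_trunc in [0, 1]
    return dict(com=r_com, vel=r_vel, app=r_app, quat=r_quat,
                trunc=r_trunc, terminate=terminate)
```

With a perfect pose match, every component evaluates to 1 and the episode continues.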
In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form

r = (1/2) r_trunc + (1/2) (0.1 r_com + λ r_vel + 0.15 r_app + 0.65 r_quat),    (8)

varying λ as described in the main paper (Sec. 6.3).
B.5. Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints, driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper. The agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations, as a history.
In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on any horizontal axis (x and y axis) or 5 N on the vertical axis (z axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg) and minimizing wrist force measurements.
The reward for the task objective is defined as

r_insertion = max(0.2 r_approach, r_inserted),    (9)
r_approach = s(p_peg, p_opening),    (10)
r_inserted = r_aligned · s(p^z_peg, p^z_bottom),    (11)
r_aligned = ‖p^xy_peg − p^xy_opening‖ < d/2,    (12)

where p_peg is the peg position, p_opening the position of the hole opening, p_bottom the bottom position of the hole, and d the diameter of the hole.

s(p1, p2) is a shaping operator:

s(p1, p2) = 1 − tanh²( (atanh(√l) / ε) ‖p1 − p2‖₂ ),    (13)

which gives a reward of 1 − l if p1 and p2 are at a Euclidean distance of ε. Here we chose l_approach = 0.95, ε_approach = 0.01, l_inserted = 0.5, and ε_inserted = 0.04. Intentionally, 0.04 corresponds to the length of the peg, such that r_inserted = 0.5 if p_peg = p_opening. As a result, if the peg is in the approach position, r_inserted is dominant over r_approach in r_insertion.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
The reward for the secondary objective, minimizing wrist force measurements, is defined as

r_force = −‖F‖₁,    (14)

where F = (F_x, F_y, F_z) are the forces measured by the wrist sensor.
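Eqs. (9)-(14) can be transcribed as a small sketch; the position arguments and hole geometry below are hypothetical, and this is not the authors' implementation:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """s(p1, p2) = 1 - tanh^2(atanh(sqrt(l)) / eps * ||p1 - p2||_2),
    i.e. reward 1 - l when the points are a distance eps apart."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

def insertion_reward(p_peg, p_opening, p_bottom_z, hole_diameter):
    """Eqs. 9-12 with the constants from the text."""
    r_approach = shaping(p_peg, p_opening, l=0.95, eps=0.01)
    aligned = np.linalg.norm(p_peg[:2] - p_opening[:2]) < hole_diameter / 2
    r_inserted = float(aligned) * shaping([p_peg[2]], [p_bottom_z],
                                          l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)

def force_reward(F):
    """Eq. 14: r_force = -||F||_1."""
    return -np.abs(F).sum()
```

With the peg exactly at the opening and 0.04 m (the peg length) above the hole bottom, `insertion_reward` evaluates to 0.5, matching the text.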
C. Algorithmic Details

C.1. General Algorithm Description

We maintain one online network and one target network for each Q-function associated to objective k, with parameters denoted by φ_k and φ′_k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy π_old.
We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, . . . , T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), . . . , r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
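Algorithm 2 amounts to a simple loop; the sketch below paraphrases it in Python, with `env`, `policy`, and `replay_buffer` as hypothetical stand-ins for the paper's components:

```python
def run_actor(env, policy, replay_buffer, reward_fns, T):
    """Paraphrase of Algorithm 2: collect T-step trajectories with a
    vector of per-objective rewards and send them to the replay buffer."""
    while not policy.training_done():
        policy.fetch_latest_parameters()           # from the learner
        trajectory = []
        s = env.reset()
        for _ in range(T):
            a, prob = policy.sample(s)             # a ~ pi_theta(.|s)
            s_next = env.step(a)
            r = [r_k(s, a) for r_k in reward_fns]  # one reward per objective
            trajectory.append((s, a, r, prob))
            s = s_next
        replay_buffer.add(trajectory)
```

Storing the behavior probability `prob` alongside each transition is what later enables the truncated importance weights used by Retrace.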
C.2. Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function Q_{φ_k}(s, a) for each objective k, parametrized by φ_k. These Q-functions are with respect to the policy π_old. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.
Given a replay buffer D containing trajectory snippets τ = {(s_0, a_0, r_0, s_1), . . . , (s_T, a_T, r_T, s_{T+1})}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of a scalar reward for each of N objectives, the Retrace objective is as follows:

min_{φ_k} E_{τ∼D} [ (Q^ret_k(s_t, a_t) − Q_{φ_k}(s_t, a_t))² ],    (15)

with

Q^ret_k(s_t, a_t) = Q_{φ′_k}(s_t, a_t) + Σ_{j=t}^T γ^{j−t} ( Π_{z=t+1}^j c_z ) δ_j,
δ_j = r_k(s_j, a_j) + γ V(s_{j+1}) − Q_{φ′_k}(s_j, a_j),
V(s_{j+1}) = E_{π_old(a|s_{j+1})} [ Q_{φ′_k}(s_{j+1}, a) ].
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks and online networks such that π_θ′ = π_θ and Q_{φ′_k} = Q_{φ_k}
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_{ij}, Q_{ijk}}^{L,M,N}_{i,j,k}, where
6:       s_i ∼ D, a_{ij} ∼ π_θ′(a|s_i), and Q_{ijk} = Q_{φ′_k}(s_i, a_{ij})
7:
8:     // Compute action distribution for each objective
9:     for k = 1, . . . , N do
10:      δ_{η_k} ← ∇_{η_k} [ η_k ε_k + η_k Σ_i^L (1/L) log ( Σ_j^M (1/M) exp(Q_{ijk} / η_k) ) ]
11:      Update η_k based on δ_{η_k}
12:      q_{ijk} ∝ exp(Q_{ijk} / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q_{ijk} log π_θ(a_{ij}|s_i)
18:          + sg(ν) ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)) )
19:    δ_ν ← ∇_ν ν ( β − Σ_i^L KL(π_θ′(a|s_i) ‖ π_{sg(θ)}(a|s_i)) )
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, . . . , N do
25:      δ_{φ_k} ← ∇_{φ_k} Σ_{(s_t,a_t)∈τ∼D} ( Q_{φ_k}(s_t, a_t) − Q^ret_k )²,
26:        with Q^ret_k as in Eq. 15
27:      Update φ_k based on δ_{φ_k}
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_{φ′_k} = Q_{φ_k}
33: until convergence
The importance weights c_z are defined as

c_z = min ( 1, π_old(a_z|s_z) / b(a_z|s_z) ),

where b(a_z|s_z) denotes the behavior policy used to collect trajectories in the environment. When j = t, we set ( Π_{z=t+1}^j c_z ) = 1.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by φ′_k, which we copy from the online network φ_k after a fixed number of gradient steps on the Retrace objective (15).
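The per-objective temperature update and action weighting of Algorithm 3 (lines 9-13) can be sketched as follows; the finite-difference gradient and fixed learning rate here are illustrative simplifications, not the authors' exact optimizer:

```python
import numpy as np

def dual(eta, Q, eps):
    """g(eta) = eta*eps + eta * mean_i log( mean_j exp(Q_ij / eta) ),
    computed stably by shifting by the max Q-value."""
    m = Q.max()
    inner = np.log(np.mean(np.exp((Q - m) / eta), axis=1)) + m / eta
    return eta * eps + eta * np.mean(inner)

def temperature_step(eta, Q, eps, lr=1e-3, delta=1e-6):
    """One numerical gradient step on eta_k, then project to eta_k > 0."""
    grad = (dual(eta + delta, Q, eps) - dual(eta - delta, Q, eps)) / (2 * delta)
    return max(eta - lr * grad, 1e-8)

def action_weights(eta, Q):
    """q_ijk proportional to exp(Q_ijk / eta), normalized per state."""
    w = np.exp((Q - Q.max(axis=1, keepdims=True)) / eta)
    return w / w.sum(axis=1, keepdims=True)
```

`Q` is an (L, M) array of target-network Q-values for one objective; a smaller `eta` makes the weights greedier with respect to that objective.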
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

π_new = argmax_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds < β.    (16)
We first write the generalized Lagrangian equation, i.e.,

L(θ, ν) = Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) log π_θ(a|s) da ds
          + ν ( β − ∫_s μ(s) KL(π_old(a|s) ‖ π_θ(a|s)) ds ),    (17)

where ν is the Lagrangian multiplier. Now we solve the following primal problem:

max_θ min_{ν>0} L(θ, ν).
To obtain the parameters θ for the updated policy, we solve for ν and θ by alternating between 1) fixing ν to its current value and optimizing for θ, and 2) fixing θ to its current value and optimizing for ν. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π^μ_θ(a|s) = N(a; μ_θ(s), Σ_{θ_old}(s)),    (18)
π^Σ_θ(a|s) = N(a; μ_{θ_old}(s), Σ_θ(s)).    (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
max_θ Σ_{k=1}^N ∫_s μ(s) ∫_a q_k(a|s) ( log π^μ_θ(a|s) π^Σ_θ(a|s) ) da ds
s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π^μ_θ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(π_old(a|s) ‖ π^Σ_θ(a|s)) ds < β_Σ.    (20)
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
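The effect of the decoupling can be illustrated for diagonal Gaussians: each auxiliary policy mixes online and target parameters so that its KL term penalizes only the mean change or only the covariance change. The numbers below are hypothetical:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N(mu0, var0) || N(mu1, var1)) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

mu_old, var_old = np.zeros(2), np.ones(2)                     # target network
mu_new, var_new = np.array([0.5, 0.0]), np.array([0.5, 1.0])  # online network

# pi^mu: online mean, target covariance -> KL depends only on the mean change
kl_mean = kl_diag_gauss(mu_old, var_old, mu_new, var_old)
# pi^Sigma: target mean, online covariance -> KL depends only on the covariance
kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_new)
```

Bounding `kl_mean` by β_μ and `kl_cov` by β_Σ then constrains the two kinds of policy change independently.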
C.4. Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D.1. Multi-Objective Policy Evaluation in the On-Policy Setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective, by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), . . . , (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^N that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:

min_{φ_{1:N}} Σ_{k=1}^N E_τ [ (G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t))² ].    (21)

Here G^(T)_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G^(T)_k(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V^{π_old}_{φ_k}(s_T).

The advantages are then estimated as A^{π_old}_k(s_t, a_t) = G^(T)_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t).
D.2. Multi-Objective Policy Improvement in the On-Policy Setting

Given the previous policy π_old(a|s) and estimated advantages {A^{π_old}_k(s, a)}_{k=1,...,N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s), because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. Obtaining Improved Variational Distributions per Objective (Step 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds    (22)
s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,

where the KL-divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.

As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more that objective k is preferred. On the other hand, if ε_k = 0, then objective k will have no contribution to the change of the policy and will effectively be ignored.

Equation (22) can be solved in closed form:

q_k(s, a) ∝ p_old(s, a) exp ( A_k(s, a) / η_k ),    (23)

where the temperature η_k is computed based on the constraint ε_k, by solving the following convex dual problem:

η_k = argmin_η [ η ε_k + η log ∫_{s,a} p_old(s, a) exp ( A_k(s, a) / η ) da ds ].    (24)
We can perform this optimization along with the loss, by taking a gradient descent step on η_k, and we initialize η_k with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
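A sketch of the resulting sample weights with the top-50% filtering; the exponential weighting follows Eq. (23), and treating the batch as individual (s, a) samples is a simplification:

```python
import numpy as np

def topk_sample_weights(advantages, eta, frac=0.5):
    """Normalized weights proportional to exp(A / eta), computed only on
    the top `frac` fraction of advantages; the rest get weight zero."""
    n_keep = max(1, int(len(advantages) * frac))
    keep = np.argsort(advantages)[-n_keep:]          # indices of top advantages
    w = np.zeros(len(advantages))
    shifted = advantages[keep] - advantages[keep].max()  # numerical stability
    w[keep] = np.exp(shifted / eta)
    return w / w.sum()
```

Samples below the advantage cutoff contribute nothing to the subsequent policy-fitting step.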
D.2.2. Fitting a New Parametric Policy (Step 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this we solve a supervised learning problem that fits a parametric policy as follows:

π_new = argmax_θ Σ_{k=1}^N ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
s.t. ∫_s KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,    (25)

where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in appendix) to the multi-objective case.
We assume there are observable binary improvement events {R_k}_{k=1}^N for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^N, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^N:

max_θ p_θ(R_1 = 1, R_2 = 1, . . . , R_N = 1) p(θ),    (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k, and using log probabilities, leads to the following objective:

max_θ Σ_{k=1}^N log p_θ(R_k = 1) + log p(θ).    (27)
The prior distribution over parameters, p(θ), is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^N log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^N log p_θ(R_k = 1) as follows:

Σ_{k=1}^N log p_θ(R_k = 1)    (28)
= Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(s, a|R_k = 1)) − Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a)).
The second term of this decomposition can be expanded as

−Σ_{k=1}^N KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a))    (29)
= Σ_{k=1}^N E_{q_k(s,a)} [ log ( p_θ(R_k = 1, s, a) / q_k(s, a) ) ]
= Σ_{k=1}^N E_{q_k(s,a)} [ log ( p(R_k = 1|s, a) π_θ(a|s) μ(s) / q_k(s, a) ) ],

where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1|s, a) is the likelihood of the improvement event for objective k, if our policy chose action a in state s.
The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence Σ_{k=1}^N log p_θ(R_k = 1). π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step, we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^N such that the lower bound on Σ_{k=1}^N log p_θ(R_k = 1) is as tight as possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:

q_k(a|s) = argmin_q E_{μ(s)} [ KL(q(a|s) ‖ p_θ′(s, a|R_k = 1)) ]
         = argmin_q E_{μ(s)} [ KL(q(a|s) ‖ π_θ′(a|s)) − E_{q(a|s)} [ log p(R_k = 1|s, a) ] ].    (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_θ′(a|s) p(R_k = 1|s, a) / ∫ π_θ′(a|s) p(R_k = 1|s, a) da.

This solution weighs the actions based on their relative improvement likelihood p(R_k = 1|s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1|s, a) ∝ exp ( Q_k(s, a) / α_k ),
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL(q(a|s) ‖ π_θ′(a|s)) ds.    (31)
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), that leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds    (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ π_θ′(a|s)) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_θ′(a|s) exp ( Q_k(s, a) / η_k ),    (33)

where η_k is computed by solving the following convex dual function:

η_k = argmin_η [ η ε_k + η ∫ μ(s) log ∫ π_θ′(a|s) exp ( Q_k(s, a) / η ) da ds ].    (34)
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
E.2. M-Step

After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^N log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_θ′(a|s). We therefore optimize for

θ* = argmax_θ Σ_{k=1}^N ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds
s.t. ∫ μ(s) KL(π_θ′(a|s) ‖ π_θ(a|s)) ds < β.
Because of the constraint on the change of the parametricpolicy we do not greedily optimize the M-step objectiveSee (Abdolmaleki et al 2018b) for more details
Humanoid Mocap

Scalarized VMPO
  KL-constraint on policy mean β_μ: 0.1
  KL-constraint on policy covariance β_Σ: 10⁻⁵
  default ε: 0.1
  initial temperature η: 1

Training
  Adam learning rate: 10⁻⁴
  batch size: 128
  unroll length (for n-step return, Sec. D.1): 32

Table 5. Hyperparameters for the humanoid mocap experiments.
A.3. Humanoid Mocap

For the humanoid mocap experiments, we used the following architecture for both MO-VMPO and VMPO. For the policy, we process a concatenation of the mocap reference observations and the proprioceptive observations by a two-layer MLP with 1024 hidden units per layer. This reference encoder is followed by linear layers to produce the mean and log standard deviation of a stochastic latent variable. These latent variables are then concatenated with the proprioceptive observations and processed by another two-layer MLP with 1024 hidden units per layer, to produce the action distribution. For VMPO, we use a three-layer MLP with 1024 hidden units per layer as the critic. For MO-VMPO, we use a shared two-layer MLP with 1024 hidden units per layer, followed by a one-layer MLP with 1024 hidden units per layer per objective. In both cases, we use k-step returns to train the critic, with a discount factor of 0.95. Table 5 shows additional hyperparameters used in our experiments.
B. Experimental Domains

We evaluated our approach on one discrete domain (Deep Sea Treasure), three simulated continuous control domains (humanoid, Shadow Hand, and humanoid mocap), and one real-world continuous control domain (Sawyer robot). Here we provide more detail about these domains and the objectives used in each task.
B.1. Deep Sea Treasure

Deep Sea Treasure (DST) is an 11×10 grid-world domain; the state space S consists of the x and y position of the agent, and the action space A is {up, right, down, left}. The layout of the environment and the values of the treasures are shown in Fig. 8. If the action would cause the agent to collide with the sea floor or go out of bounds, it has no effect. Farther-away treasures have higher values. The episode terminates when the agent collects a treasure, or after 200 timesteps.
Figure 8. Deep Sea Treasure environment from Vamplew et al. (2011), with weights from Yang et al. (2019). Treasures are labeled with their respective values (ranging from 0.7 to 23.7). The agent can move around freely in the white squares, but cannot enter the black squares (i.e., the ocean floor).
There are two objectives: time penalty and treasure value. A time penalty of −1 is given at each time step. The agent receives the value of the treasure it picks up as the reward for the treasure objective. In other words, when the agent picks up a treasure of value v, the reward vector is [−1, v]; otherwise, it is [−1, 0].
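The dynamics and reward structure can be sketched as a toy step function; the treasure layout below contains only two illustrative cells, not the full map from Fig. 8:

```python
# Illustrative DST step: moves that would leave the grid or hit the sea
# floor have no effect; collecting a treasure ends the episode.
TREASURES = {(0, 1): 0.7, (2, 3): 20.3}   # (x, y) -> value (illustrative)

def dst_step(pos, action, width=11, height=10, blocked=frozenset()):
    dx, dy = {"up": (0, -1), "right": (1, 0),
              "down": (0, 1), "left": (-1, 0)}[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in blocked:
        nxt = pos                          # collision: action has no effect
    value = TREASURES.get(nxt)
    reward = [-1, value if value is not None else 0]
    return nxt, reward, value is not None  # done when a treasure is collected
```

The two entries of `reward` are exactly the two objectives: the time penalty and the treasure value.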
B.2. Shadow Hand

Our robot platform is a simulated Shadow Dexterous Hand (Shadow Robot Company, 2020) in the MuJoCo physics engine (Todorov et al., 2012). The Shadow Hand has five fingers and 24 degrees of freedom, actuated by 20 motors. The observation consists of joint angles, joint velocities, and touch sensors, for a total of 63 dimensions. Each fingertip of our Shadow Hand has a 4×4 spatial touch sensor. This sensor has three channels, measuring the normal force and the x and y-axis tangential forces, for a sensor dimension of 4×4×3 per fingertip. We simplify this sensor by summing across the spatial dimensions, resulting in a 1×1×3 observation per fingertip.
In the touch task, there is a block in the environment that is always fixed in the same pose. In the turn task, there is a dial in the environment that can be rotated, and it is initialized to a random position between −30° and 30°. The target angle of the dial is 0°. The angle of the dial is included in the agent's observation. In the orient task, the robot interacts with a rectangular peg in the environment; the initial and target pose of the peg remains the same across episodes. The pose of the block is included in the agent's observation, encoded as the xyz positions of four corners of the block (based on how Levine et al. (2016) encodes end-effector pose).
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).
Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero. For high impact forces, the pain penalty is equal to the negative of the impact force.
B.2.1. Balancing Task Completion and Pain

In the touch and turn tasks, there are two objectives: a “pain” penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task; i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of γ = 0.99), to capture how long it takes agents to complete the task.

The “pain” penalty is computed as in (Huang et al., 2019). It is based on the impact force m(s, s′), which is the increase in force from state s to the next state s′. In our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
r_pain(s, a, s′) = − [ 1 − a_λ(m(s, s′)) ] m(s, s′),    (6)

where a_λ(·) ∈ [0, 1] computes the acceptability of an impact force. a_λ(·) should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in (Huang et al., 2019), we use

a_λ(m) = sigmoid(λ_1 (−m + λ_2)),    (7)

with λ = [2, 2]⊤. The relationship between pain penalty and impact force is plotted in Fig. 10.
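Eqs. (6)-(7) translate directly into code; this sketch uses λ = [2, 2] as in the text:

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)), in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact):
    """r_pain = -(1 - a_lambda(m)) * m."""
    return -(1.0 - acceptability(impact)) * impact
```

Small impacts are mostly "acceptable" and barely penalized, while large impacts are penalized by nearly their full magnitude, matching the shape described for Fig. 10.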
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here.
B.2.2. Aligned Objectives

In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation is shown by the gray transparent peg (Fig. 9, right-most); note that the x and y position is not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.

The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is 1 − tanh(50 |z_target − z_peg|). For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations Q_target and the peg's orientation q_peg are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
1 − tanh ( 2 · min_{q ∈ Q_target} d(q, q_peg) ),

where d(·) denotes the ℓ2-norm of the axis-angle equivalent (in radians) of the distance between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
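A hedged sketch of the two shaped rewards; the quaternion distance here is realized as the angle of the relative rotation, which is one way to compute the axis-angle norm the text describes, and the target set is a hypothetical placeholder:

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """1 - tanh(50 * |z_target - z_peg|), maximal at the target height."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def quat_angle(q1, q2):
    """Angle (radians) of the rotation taking q1 to q2; uses the fact that
    |<q1, q2>| = cos(theta / 2) for unit quaternions."""
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def orientation_reward(q_peg, targets):
    """1 - tanh(2 * d) for the closest of the acceptable target orientations."""
    d_min = min(quat_angle(q, q_peg) for q in targets)
    return 1.0 - np.tanh(2.0 * d_min)
```

Taking the minimum distance over the eight symmetric target orientations means any of them yields the maximum reward.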
B3 Humanoid
We make use of the humanoid run task from Tassa et al(2018)9 The observation dimension is 67 and the actiondimension is 21 Actions are joint accelerations with min-imum and maximum limits of minus1 and 1 respectively Forthis task there are two objectives
bull The original task reward given by the environment Thegoal is is to achieve a horizontal speed of 10 meters persecond in any direction This reward is shaped it isequal to min(h10 1) where h is in meters per secondFor this objective we learn a Q-function
9This task is available at githubcomdeepminddm control
A Distributional View on Multi-Objective Policy Optimization
bull Limiting energy usage by penalizing taking high-magnitude actions The penalty is the `2-norm of theaction vector ie rpenalty(a) = minusa2 For this ob-jective we do not learn a Q-function Instead wecompute this penalty directly to evaluate a given actionin a state-independent way during policy optimization
B4 Humanoid Mocap
The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips; for this paper, we used roughly 40 minutes of locomotion clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.

The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity between the pose of the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:
$$r_{\text{com}} = \exp\!\left(-10\,\big\lVert p_{\text{com}} - p^{\text{ref}}_{\text{com}}\big\rVert^2\right)$$
where $p_{\text{com}}$ and $p^{\text{ref}}_{\text{com}}$ are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:
$$r_{\text{vel}} = \exp\!\left(-0.1\,\big\lVert q_{\text{vel}} - q^{\text{ref}}_{\text{vel}}\big\rVert^2\right)$$
where $q_{\text{vel}}$ and $q^{\text{ref}}_{\text{vel}}$ are the joint angle velocities of the simulated humanoid and the mocap reference, respectively.
¹⁰This database is available at mocap.cs.cmu.edu.
The third reward component is based on the difference in end-effector positions:
$$r_{\text{app}} = \exp\!\left(-40\,\big\lVert p_{\text{app}} - p^{\text{ref}}_{\text{app}}\big\rVert^2\right)$$
where $p_{\text{app}}$ and $p^{\text{ref}}_{\text{app}}$ are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:
$$r_{\text{quat}} = \exp\!\left(-2\,\big\lVert q_{\text{quat}} \ominus q^{\text{ref}}_{\text{quat}}\big\rVert^2\right)$$
where $\ominus$ denotes quaternion difference, and $q_{\text{quat}}$ and $q^{\text{ref}}_{\text{quat}}$ are the joint quaternions of the simulated humanoid and the mocap reference, respectively.
Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:
$$r_{\text{trunc}} = 1 - \frac{1}{0.3}\overbrace{\left(\big\lVert b_{\text{pos}} - b^{\text{ref}}_{\text{pos}}\big\rVert_1 + \big\lVert q_{\text{pos}} - q^{\text{ref}}_{\text{pos}}\big\rVert_1\right)}^{=\,\varepsilon}$$
where $b_{\text{pos}}$ and $b^{\text{ref}}_{\text{pos}}$ correspond to the body positions of the simulated character and the mocap reference, and $q_{\text{pos}}$ and $q^{\text{ref}}_{\text{pos}}$ correspond to the joint angles.

We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if $\varepsilon > 0.3$, which ensures that $r_{\text{trunc}} \in [0, 1]$.

In our MO-VMPO experiments, we treat each reward component as a separate objective. In our VMPO experiments, we use a reward of the form
$$r = \frac{1}{2} r_{\text{trunc}} + \frac{1}{2}\left(0.1\, r_{\text{com}} + \lambda\, r_{\text{vel}} + 0.15\, r_{\text{app}} + 0.65\, r_{\text{quat}}\right) \quad (8)$$
varying $\lambda$ as described in the main paper (Sec. 6.3).
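The components above can be sketched as follows (a minimal sketch with hypothetical helper names; the pose quantities are assumed to be NumPy arrays, and the quaternion-difference norm and the two $\ell_1$ errors entering $r_{\text{trunc}}$ are passed in precomputed):

```python
import numpy as np

def mocap_rewards(p_com, p_com_ref, qvel, qvel_ref, p_app, p_app_ref,
                  quat_dist, b_err_l1, q_err_l1):
    """Per-timestep tracking reward components (Sec. B4).

    quat_dist: norm of the quaternion differences ||q_quat (-) q_quat_ref||.
    b_err_l1 / q_err_l1: l1 errors in body positions / joint angles.
    """
    r_com = np.exp(-10.0 * np.sum((p_com - p_com_ref) ** 2))
    r_vel = np.exp(-0.1 * np.sum((qvel - qvel_ref) ** 2))
    r_app = np.exp(-40.0 * np.sum((p_app - p_app_ref) ** 2))
    r_quat = np.exp(-2.0 * quat_dist ** 2)
    eps = b_err_l1 + q_err_l1
    r_trunc = 1.0 - eps / 0.3          # episode terminates early if eps > 0.3
    return r_com, r_vel, r_app, r_quat, r_trunc

def scalarized_reward(r_com, r_vel, r_app, r_quat, r_trunc, lam):
    """Eq. (8): the single scalar reward used in the VMPO baseline."""
    return 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel
                                  + 0.15 * r_app + 0.65 * r_quat)
```

With zero tracking error, every component evaluates to 1 and the scalarized reward is 1 for $\lambda = 0.1$ (so that the inner weights sum to one).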
B5 Sawyer
The Sawyer peg-in-hole setup consists of a Rethink Robotics Sawyer robot arm. The robot has 7 joints driven by series elastic actuators. On the wrist, a Robotiq FT 300 force-torque sensor is mounted, followed by a Robotiq 2F85 parallel gripper. We implement a 3D end-effector Cartesian velocity controller that maintains a fixed orientation of the gripper; the agent controls the Cartesian velocity setpoints. The gripper is not actuated in this setup. The observations provided are the joint positions, velocities, and torques; the end-effector pose; the wrist force-torque measurements; and the joint velocity command output of the Cartesian controller. We augment each observation with the two previous observations as a history.

In each episode, the peg is initialized randomly within an 8×8×8 cm workspace (with the peg outside of the hole).
The environment is then run at 20 Hz for 600 steps. The action limits are ±0.05 m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis (x and y) or 5 N on the vertical axis (z), the episode is terminated early.

There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as
$$r_{\text{insertion}} = \max\big(0.2\, r_{\text{approach}},\; r_{\text{inserted}}\big) \quad (9)$$
$$r_{\text{approach}} = s(p^{\text{peg}}, p^{\text{opening}}) \quad (10)$$
$$r_{\text{inserted}} = r_{\text{aligned}} \cdot s(p^{\text{peg}}_z, p^{\text{bottom}}_z) \quad (11)$$
$$r_{\text{aligned}} = \mathbb{1}\big[\lVert p^{\text{peg}}_{xy} - p^{\text{opening}}_{xy}\rVert < d/2\big] \quad (12)$$
where $p^{\text{peg}}$ is the peg position, $p^{\text{opening}}$ the position of the hole opening, $p^{\text{bottom}}$ the bottom position of the hole, and $d$ the diameter of the hole.

$s(p_1, p_2)$ is a shaping operator:
$$s(p_1, p_2) = 1 - \tanh^2\!\left(\frac{\operatorname{atanh}(\sqrt{l})}{\epsilon}\, \lVert p_1 - p_2\rVert_2\right) \quad (13)$$
which gives a reward of $1 - l$ if $p_1$ and $p_2$ are at a Euclidean distance of $\epsilon$. Here we chose $l_{\text{approach}} = 0.95$, $\epsilon_{\text{approach}} = 0.01$, $l_{\text{inserted}} = 0.5$, and $\epsilon_{\text{inserted}} = 0.04$. Intentionally, $\epsilon_{\text{inserted}} = 0.04$ corresponds to the length of the peg, such that $r_{\text{inserted}} = 0.5$ if $p^{\text{peg}} = p^{\text{opening}}$. As a result, if the peg is in the approach position, $r_{\text{inserted}}$ dominates $r_{\text{approach}}$ in $r_{\text{insertion}}$.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
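Under the constants above, the task-objective reward can be sketched as follows (a sketch with hypothetical helper names; positions are in meters):

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """s(p1, p2) = 1 - tanh^2(atanh(sqrt(l)) / eps * ||p1 - p2||_2), Eq. (13).

    Gives reward 1 at zero distance and 1 - l at distance eps.
    """
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

def insertion_reward(p_peg, p_opening, p_bottom_z, hole_diameter):
    """r_insertion = max(0.2 * r_approach, r_inserted), Eqs. (9)-(12)."""
    r_approach = shaping(p_peg, p_opening, l=0.95, eps=0.01)
    aligned = np.linalg.norm(np.asarray(p_peg)[:2]
                             - np.asarray(p_opening)[:2]) < hole_diameter / 2
    r_inserted = float(aligned) * shaping([p_peg[2]], [p_bottom_z], l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)
```

For a peg exactly at the hole opening (0.04 m above the hole bottom), this returns 0.5, as the text above describes.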
The reward for the secondary objective, minimizing wrist force measurements, is defined as
$$r_{\text{force}} = -\lVert F \rVert_1 \quad (14)$$
where $F = (F_x, F_y, F_z)$ are the forces measured by the wrist sensor.
C Algorithmic Details

C1 General Algorithm Description

We maintain one online network and one target network for each Q-function associated with objective $k$, with parameters denoted by $\phi_k$ and $\phi'_k$, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by $\theta$ and $\theta'$, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper, we refer to the target policy network as the old policy $\pi_{\text{old}}$.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer; we refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor
1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, ..., T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), ..., r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
C2 Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function $Q_{\phi_k}(s, a)$ for each objective $k$, parametrized by $\phi_k$. These Q-functions are with respect to the policy $\pi_{\text{old}}$. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions; we choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer $\mathcal{D}$ containing trajectory snippets $\tau = \{(s_0, a_0, \mathbf{r}_0, s_1), \dots, (s_T, a_T, \mathbf{r}_T, s_{T+1})\}$, where $\mathbf{r}_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^{N}$ that consists of a scalar reward for each of the $N$ objectives, the Retrace objective is as follows:
$$\min_{\phi_k} \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\big(Q^{\text{ret}}_k(s_t, a_t) - Q_{\phi_k}(s_t, a_t)\big)^2\Big] \quad (15)$$
with
$$Q^{\text{ret}}_k(s_t, a_t) = Q_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big)\, \delta_j$$
$$\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - Q_{\phi'_k}(s_j, a_j)$$
$$V(s_{j+1}) = \mathbb{E}_{\pi_{\text{old}}(a|s_{j+1})}\big[Q_{\phi'_k}(s_{j+1}, a)\big]$$
Algorithm 3 MO-MPO: Asynchronous Learner
1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks such that π_θ′ = π_θ and Q_φ′_k = Q_φ_k
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:     s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_φ′_k(s_i, a_ij)
7:
8:     Compute action distribution for each objective:
9:     for k = 1, ..., N do
10:      δ_η_k ← ∇_η_k [η_k ε_k + η_k Σ_{i=1}^L (1/L) log(Σ_{j=1}^M (1/M) exp(Q_ijk / η_k))]
11:      Update η_k based on δ_η_k
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    Update parametric policy with trust region
16:    (sg denotes a stop-gradient):
17:    δ_π ← −∇_θ [Σ_{i=1}^L Σ_{j=1}^M Σ_{k=1}^N q_ijk log π_θ(a_ij|s_i)
18:          + sg(ν)(β − Σ_{i=1}^L KL(π_θ′(a|s_i) ‖ π_θ(a|s_i)))]
19:    δ_ν ← ∇_ν [ν(β − Σ_{i=1}^L KL(π_θ′(a|s_i) ‖ π_sg(θ)(a|s_i)))]
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    Update Q-functions:
24:    for k = 1, ..., N do
25:      δ_φ_k ← ∇_φ_k Σ_{(s_t,a_t)∈τ∼D} (Q_φ_k(s_t, a_t) − Q_k^ret)²,
26:      with Q_k^ret as in Eq. (15)
27:      Update φ_k based on δ_φ_k
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:  π_θ′ = π_θ, Q_φ′_k = Q_φ_k
33: until convergence
The importance weights $c_z$ are defined as
$$c_z = \min\left(1,\ \frac{\pi_{\text{old}}(a_z|s_z)}{b(a_z|s_z)}\right)$$
where $b(a_z|s_z)$ denotes the behavior policy used to collect trajectories in the environment. When $j = t$, we set $\prod_{z=t+1}^{j} c_z = 1$.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by $\phi'_k$, which we copy from the online network $\phi_k$ after a fixed number of gradient steps on the Retrace objective (15).
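A minimal sketch of the Retrace target computation for one objective and one trajectory snippet (the arrays are assumed precomputed; in particular, `v_next` holds the expectation of the target Q-function under $\pi_{\text{old}}$, which in practice is estimated by sampling actions):

```python
import numpy as np

def retrace_targets(q, v_next, rewards, c, gamma):
    """Compute Q^ret for every index t in a snippet (cf. Eq. 15).

    q:       Q_target(s_j, a_j) for j = 0..T
    v_next:  E_{pi_old}[Q_target(s_{j+1}, .)] for j = 0..T
    rewards: r_k(s_j, a_j) for j = 0..T
    c:       truncated importance weights min(1, pi_old / b), j = 0..T
    """
    T = len(rewards)
    delta = rewards + gamma * v_next - q           # TD errors delta_j
    q_ret = np.zeros(T)
    for t in range(T):
        acc, corr = 0.0, 1.0                       # corr = gamma^{j-t} * prod c_z
        for j in range(t, T):
            if j > t:
                corr *= gamma * c[j]               # c_z enters only for z >= t+1
            acc += corr * delta[j]
        q_ret[t] = q[t] + acc
    return q_ret
```

The quadratic loop is for clarity; a production implementation would compute the targets with a single backward recursion.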
C3 Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:
$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds \quad \text{s.t.} \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta \quad (16)$$
We first write the generalized Lagrangian, i.e.,
$$L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu\left(\beta - \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds\right) \quad (17)$$
where $\nu$ is the Lagrangian multiplier. Now we solve the following primal problem:
$$\max_\theta \min_{\nu > 0} L(\theta, \nu)$$
To obtain the parameters $\theta$ of the updated policy, we solve for $\nu$ and $\theta$ by alternating between 1) fixing $\nu$ to its current value and optimizing for $\theta$, and 2) fixing $\theta$ to its current value and optimizing for $\nu$. This procedure can be applied for any policy output distribution.

For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in Abdolmaleki et al. (2018b). This allows setting different KL bounds for the mean ($\beta_\mu$) and covariance ($\beta_\Sigma$), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:
$$\pi^{\mu}_\theta(a|s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \Sigma_{\theta_{\text{old}}}(s)\big) \quad (18)$$
$$\pi^{\Sigma}_\theta(a|s) = \mathcal{N}\big(a;\, \mu_{\theta_{\text{old}}}(s),\, \Sigma_\theta(s)\big) \quad (19)$$
Policy $\pi^{\mu}_\theta(a|s)$ takes the mean from the online policy network and the covariance from the target policy network, while policy $\pi^{\Sigma}_\theta(a|s)$ takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
$$\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log\big(\pi^{\mu}_\theta(a|s)\, \pi^{\Sigma}_\theta(a|s)\big)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^{\mu}_\theta(a|s))\, ds < \beta_\mu, \qquad \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^{\Sigma}_\theta(a|s))\, ds < \beta_\Sigma \quad (20)$$
As in Abdolmaleki et al. (2018b), we set a much smaller bound for the covariance than for the mean, to maintain exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
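For diagonal Gaussian policies, the two decoupled KL terms have closed forms; a sketch (our helper names; diagonal covariances only, with `sig` denoting per-dimension standard deviations):

```python
import numpy as np

def gauss_kl_diag(mu0, sig0, mu1, sig1):
    """KL( N(mu0, diag(sig0^2)) || N(mu1, diag(sig1^2)) ), closed form."""
    return float(np.sum(np.log(sig1 / sig0)
                        + (sig0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * sig1 ** 2)
                        - 0.5))

def decoupled_kls(mu_old, sig_old, mu_new, sig_new):
    """KL terms for the mean-only and covariance-only policies of Eqs. (18)-(20).

    pi_mu uses (mu_new, sig_old); pi_Sigma uses (mu_old, sig_new); both are
    compared against pi_old = (mu_old, sig_old) and bounded by beta_mu / beta_Sigma.
    """
    kl_mean = gauss_kl_diag(mu_old, sig_old, mu_new, sig_old)   # KL(pi_old || pi_mu)
    kl_cov = gauss_kl_diag(mu_old, sig_old, mu_old, sig_new)    # KL(pi_old || pi_Sigma)
    return kl_mean, kl_cov
```

Because the mean term reduces to a squared Mahalanobis distance and the covariance term is independent of the means, the two constraints can indeed be bounded separately.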
C4 Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions $q_k(a|s)$ in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature $\eta_k$ for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).

D MO-V-MPO: Multi-Objective On-Policy MPO

In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function $V(s)$ is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D1 Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy $\pi_{\text{old}}$, we use advantages $A(s, a)$ estimated from a learned state-value function $V(s)$, instead of a state-action value function $Q(s, a)$ as in the main text. We train a separate $V$-function for each objective, by regressing to the standard $n$-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets $\tau = \{(s_0, a_0, \mathbf{r}_0), \dots, (s_T, a_T, \mathbf{r}_T)\}$, where $\mathbf{r}_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^{N}$ that consists of rewards for all $N$ objectives, we find value function parameters $\phi_k$ by optimizing the following objective:
$$\min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_{\tau}\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)\big)^2\Big] \quad (21)$$
Here $G^{(T)}_k(s_t, a_t)$ is the $T$-step target for value function $k$, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: $G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t} r_k(s_\ell, a_\ell) + \gamma^{T-t} V^{\pi_{\text{old}}}_{\phi_k}(s_T)$. The advantages are then estimated as $A^{\pi_{\text{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)$.
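A sketch of the value targets and advantages for one objective (hypothetical helper name; each state's return is accumulated to the end of the snippet and bootstrapped from the final value estimate):

```python
import numpy as np

def nstep_targets_and_advantages(rewards, values, bootstrap, gamma):
    """Discounted return targets and advantages for one objective (cf. Eq. 21).

    rewards:   r_k(s_l, a_l) for the steps in the snippet
    values:    V(s_l) estimates for the same steps
    bootstrap: V(s_T), the value estimate at the end of the snippet
    """
    T = len(rewards)
    g = float(bootstrap)
    targets = np.zeros(T)
    for l in reversed(range(T)):      # accumulate the discounted return backward
        g = rewards[l] + gamma * g
        targets[l] = g
    advantages = targets - np.asarray(values)
    return targets, advantages
```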
D2 Multi-objective policy improvement in the on-policy setting

Given the previous policy $\pi_{\text{old}}(a|s)$ and the estimated advantages $\{A^{\pi_{\text{old}}}_k(s, a)\}_{k=1,\dots,N}$ associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution $q_k(s, a)$ for each objective, and then combine and distill the variational distributions into a new parametric policy $\pi_{\text{new}}(a|s)$. Unlike in MO-MPO, for MO-V-MPO we use the joint distribution $q_k(s, a)$ rather than local policies $q_k(a|s)$, because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions $q_k(s, a)$, we optimize the RL objective for each objective's advantages $A_k$:
$$\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds \quad \text{s.t. } \mathrm{KL}(q_k(s, a) \,\|\, p_{\text{old}}(s, a)) < \epsilon_k \quad (22)$$
where the KL divergence is computed over all $(s, a)$, $\epsilon_k$ denotes the allowed expected KL divergence, and $p_{\text{old}}(s, a) = \mu(s)\, \pi_{\text{old}}(a|s)$ is the state-action distribution associated with $\pi_{\text{old}}$.
As in MO-MPO, we use these $\epsilon_k$ to define the preferences over objectives. More concretely, $\epsilon_k$ defines the allowed contribution of objective $k$ to the change of the policy. Therefore, the larger a particular $\epsilon_k$ is with respect to the others, the more objective $k$ is preferred. On the other hand, if $\epsilon_k = 0$, then objective $k$ will have no contribution to the change of the policy and will effectively be ignored.

Equation (22) can be solved in closed form:
$$q_k(s, a) \propto p_{\text{old}}(s, a) \exp\!\left(\frac{A_k(s, a)}{\eta_k}\right) \quad (23)$$
where the temperature $\eta_k$ is computed based on the constraint $\epsilon_k$, by solving the following convex dual problem:
$$\eta_k = \arg\min_{\eta}\ \eta\,\epsilon_k + \eta \log \int_{s,a} p_{\text{old}}(s, a) \exp\!\left(\frac{A_k(s, a)}{\eta}\right) da\, ds \quad (24)$$
We perform this optimization alongside the main loss, by taking a gradient descent step on $\eta_k$, and we initialize with the solution found in the previous policy iteration step. Since $\eta_k$ must be positive, we apply a projection operator after each gradient step to maintain $\eta_k > 0$.
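This temperature optimization can be sketched on a batch of sampled advantages (a sketch, not the paper's implementation: $p_{\text{old}}$ is taken to be uniform over the batch, and a finite-difference gradient stands in for autodiff):

```python
import numpy as np

def eta_dual(eta, advantages, eps):
    """Sample-based dual g(eta) = eta*eps + eta*log mean_i exp(A_i / eta), cf. Eq. (24)."""
    a = np.asarray(advantages) / eta
    a_max = a.max()                        # log-sum-exp trick for numerical stability
    return eta * eps + eta * (a_max + np.log(np.mean(np.exp(a - a_max))))

def optimize_eta(advantages, eps, eta0=1.0, lr=0.01, steps=2000, eta_min=1e-6):
    """Gradient descent on the dual, projecting after each step to keep eta > 0."""
    eta = eta0
    for _ in range(steps):
        h = min(1e-5, 0.5 * eta)           # central finite-difference step
        grad = (eta_dual(eta + h, advantages, eps)
                - eta_dual(eta - h, advantages, eps)) / (2 * h)
        eta = max(eta - lr * grad, eta_min)  # projection step: eta stays positive
    return eta
```

At the minimizing temperature, the induced weights $q_i \propto \exp(A_i/\eta)$ satisfy the KL constraint against the (here uniform) sampling distribution approximately with equality.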
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.

D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy $\pi_{\text{new}}(a|s)$ that favors all of the objectives according to the preferences specified by $\epsilon_k$. For this, we solve a supervised learning problem that fits a parametric policy as follows:
$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds \quad \text{s.t.} \int_s \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta \quad (25)$$
where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi_{\text{old}}$, and the KL constraint enforces a trust region of size $\beta$ that limits the overall change in the parametric policy, to improve the stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.

In order to optimize Equation (25), we employ Lagrangian relaxation, similar to that employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in the appendix) to the multi-objective case.

We assume there are observable binary improvement events $\{R_k\}_{k=1}^{N}$, one for each objective. $R_k = 1$ indicates that our policy has improved for objective $k$, while $R_k = 0$ indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^{N}$, what would the parameters $\theta$ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^{N}$:
$$\max_\theta\ p_\theta(R_1 = 1, R_2 = 1, \dots, R_N = 1)\, p(\theta) \quad (26)$$
where the probability of the improvement events depends on $\theta$. Assuming independence between the improvement events $R_k$, and using log probabilities, leads to the following objective:
$$\max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta) \quad (27)$$
The prior distribution over parameters, $p(\theta)$, is fixed during the policy improvement step. We set the prior such that $\pi_\theta$ stays close to the target policy during each policy improvement step, to improve the stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of $\theta$ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution $q_k(s, a)$ per objective, and use it to decompose the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ as follows:
$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) = \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(s, a \mid R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) \quad (28)$$
The second term of this decomposition can be expanded as
$$\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\!\left[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\right] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\!\left[\log \frac{p(R_k = 1 \mid s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\right] \quad (29)$$
where $\mu(s)$ is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, $\mu(s)$ is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 \mid s, a)$ is the likelihood of the improvement event for objective $k$ if our policy chose action $a$ in state $s$.

The first term of the decomposition in Equation (28) is always positive, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. $\pi_\theta(a|s)$ and $q_k(a|s) = q_k(s, a)/\mu(s)$ are unknown, even though $\mu(s)$ is given.

In the E-step, we estimate $q_k(a|s)$ for each objective $k$ by minimizing the first term, given $\theta = \theta'$ (the parameters of the target policy). Then in the M-step, we find a new $\theta$ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^{N}$ such that the lower bound on $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ is as tight as possible when $\theta = \theta'$, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose $q_k$ to minimize this term. We can optimize for each variational distribution $q_k$ independently:
$$q_k(a|s) = \arg\min_{q}\ \mathbb{E}_{\mu(s)}\big[\mathrm{KL}\big(q(a|s) \,\|\, p_{\theta'}(s, a \mid R_k = 1)\big)\big] = \arg\min_{q}\ \mathbb{E}_{\mu(s)}\big[\mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\big[\log p(R_k = 1 \mid s, a)\big]\big] \quad (30)$$
We can solve this optimization problem in closed form, which gives us
$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \mid s, a)\, da}$$
This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 \mid s, a)$ for each objective. We define the likelihood of an improvement event $R_k$ as
$$p(R_k = 1 \mid s, a) \propto \exp\!\left(\frac{Q_k(s, a)}{\alpha_k}\right)$$
where $\alpha_k$ is an objective-dependent temperature parameter that controls how greedy the solution $q_k(a|s)$ is with respect to its associated objective $Q_k(s, a)$. For example, given a batch of state-action pairs, at the extremes, an $\alpha_k$ of zero would give all probability to the state-action pair with the maximum Q-value, while an $\alpha_k$ of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for $\alpha_k$, we plug the exponential transformation into (30), which leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big)\, ds \quad (31)$$
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds \quad \text{s.t.} \int \mu(s)\, \mathrm{KL}\big(q(a|s) \,\|\, \pi_{\theta'}(a|s)\big)\, ds < \epsilon_k \quad (32)$$
where, as described in the main paper, the parameter $\epsilon_k$ defines the preferences over objectives. If we now assume that $q_k(a|s)$ is a non-parametric sample-based distribution, as in Abdolmaleki et al. (2018b), we can solve this constrained optimization problem in closed form for each sampled state $s$, by setting
$$q_k(a|s) \propto \pi_{\theta'}(a|s) \exp\!\left(\frac{Q_k(s, a)}{\eta_k}\right) \quad (33)$$
where $\eta_k$ is computed by solving the following convex dual function:
$$\eta_k = \arg\min_{\eta}\ \eta\,\epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s) \exp\!\left(\frac{Q_k(s, a)}{\eta}\right) da\, ds \quad (34)$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in Abdolmaleki et al. (2018b) for details on the derivation of the dual function.
E2 M-Step
After obtaining the variational distributions $q_k(a|s)$ for each objective $k$, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ when $\theta = \theta'$. We can now obtain the parameters $\theta$ of the updated policy $\pi_\theta(a|s)$ by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve $\theta$, we obtain
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta)$$
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).

For the prior $p(\theta)$ on the parameters $\theta$, we enforce that the new policy $\pi_\theta(a|s)$ is close to the target policy $\pi_{\theta'}(a|s)$. We therefore optimize
$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds \quad \text{s.t.} \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s)\big)\, ds < \beta$$
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See Abdolmaleki et al. (2018b) for more details.
Figure 9. These images show what task completion looks like for the touch, turn, and orient Shadow Hand tasks (from left to right).

Figure 10. In the touch and turn Shadow Hand tasks, the pain penalty is computed based on the impact force, as plotted here. For low impact forces, the pain penalty is near-zero; for high impact forces, the pain penalty is equal to the negative of the impact force.
B21 BALANCING TASK COMPLETION AND PAIN
In the touch and turn tasks, there are two objectives: a "pain" penalty and task completion. A sparse task completion reward of 1 is given for pressing a block with greater than 5 N of force, or for turning a dial to a fixed target angle, respectively. In both tasks, the episode terminates when the agent completes the task; i.e., the agent gets a total reward of either 0 or 1 for the task completion objective per episode. The Pareto plots for these two tasks in the main paper (Fig. 6) show the discounted task reward (with a discount factor of $\gamma = 0.99$), to capture how long it takes agents to complete the task.

The "pain" penalty is computed as in Huang et al. (2019). It is based on the impact force $m(s, s')$, which is the increase in force from state $s$ to the next state $s'$; in our tasks, this is measured by a touch sensor on the block or dial. The pain penalty is equal to the negative of the impact force, scaled by how unacceptable it is:
$$r_{\text{pain}}(s, a, s') = -\big[1 - a_\lambda(m(s, s'))\big]\, m(s, s') \quad (6)$$
where $a_\lambda(\cdot) \in [0, 1]$ computes the acceptability of an impact force. $a_\lambda(\cdot)$ should be a monotonically decreasing function that captures how resilient the robot and the environment are to impacts. As in Huang et al. (2019), we use
$$a_\lambda(m) = \mathrm{sigmoid}\big(\lambda_1(-m + \lambda_2)\big) \quad (7)$$
with $\lambda = [2, 2]^\top$. The relationship between the pain penalty and the impact force is plotted in Fig. 10.
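Equations (6) and (7) can be sketched directly (hypothetical function names):

```python
import numpy as np

def acceptability(m, lam1=2.0, lam2=2.0):
    """a_lambda(m) = sigmoid(lam1 * (-m + lam2)), Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-lam1 * (-m + lam2)))

def pain_penalty(impact_force):
    """r_pain = -(1 - a_lambda(m)) * m, Eq. (6)."""
    m = impact_force
    return -(1.0 - acceptability(m)) * m
```

For small impacts the acceptability is near 1, so the penalty is near-zero; for large impacts it approaches the negative of the impact force, matching Fig. 10.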
Figure 11. In the orient Shadow Hand task, the height and orientation objectives have shaped rewards, as plotted here (reward vs. $z_{\text{peg}}$ for the height objective, and reward vs. $d(\theta, \theta_{\text{peg}})$ for the orientation objective).
B22 ALIGNED OBJECTIVES
In the orient task, there are three non-competing objectives: touching the peg, lifting the peg to a target height, and orienting the peg to be perpendicular to the ground. The target z position and orientation are shown by the gray transparent peg (Fig. 9, right-most); note that the x and y positions are not specified, so the robot in the figure gets the maximum reward with respect to all three objectives.

The touch objective has a sparse reward of 1 for activating the peg's touch sensor, and zero otherwise. For the height objective, the target z position is 7 cm above the ground; the peg's z position is computed with respect to its center of mass. The shaped reward for this objective is $1 - \tanh(50\,|z_{\text{target}} - z_{\text{peg}}|)$. For the orientation objective, since the peg is symmetrical, there are eight possible orientations that are equally valid. The acceptable target orientations $Q_{\text{target}}$ and the peg's orientation $q_{\text{peg}}$ are denoted as quaternions. The shaped reward is computed with respect to the closest target quaternion, as
$$1 - \tanh\!\Big(2 \min_{q \in Q_{\text{target}}} d(q, q_{\text{peg}})\Big)$$
where $d(\cdot)$ denotes the $\ell_2$-norm of the axis-angle equivalent (in radians) of the difference between the two quaternions. Fig. 11 shows the shaping of the height and orientation rewards.
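A sketch of the two shaped rewards (our helper names; `quat_angle` treats $d(\cdot)$ as the relative rotation angle between unit quaternions):

```python
import numpy as np

def height_reward(z_peg, z_target=0.07):
    """Shaped height reward: 1 - tanh(50 * |z_target - z_peg|)."""
    return 1.0 - np.tanh(50.0 * abs(z_target - z_peg))

def quat_angle(q1, q2):
    """Rotation angle (radians) between two unit quaternions."""
    d = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(d, 0.0, 1.0))

def orientation_reward(q_peg, q_targets):
    """Shaped orientation reward w.r.t. the closest of the valid targets."""
    d_min = min(quat_angle(q, q_peg) for q in q_targets)
    return 1.0 - np.tanh(2.0 * d_min)
```

Taking the closest of the eight symmetry-equivalent target orientations means the agent is never penalized for reaching a pose that is physically indistinguishable from the target.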
bull The original task reward given by the environment Thegoal is is to achieve a horizontal speed of 10 meters persecond in any direction This reward is shaped it isequal to min(h10 1) where h is in meters per secondFor this objective we learn a Q-function
9This task is available at githubcomdeepminddm control
A Distributional View on Multi-Objective Policy Optimization
bull Limiting energy usage by penalizing taking high-magnitude actions The penalty is the `2-norm of theaction vector ie rpenalty(a) = minusa2 For this ob-jective we do not learn a Q-function Instead wecompute this penalty directly to evaluate a given actionin a state-independent way during policy optimization
B4 Humanoid Mocap
The motion capture tracking task used in this paper wasdeveloped as part of a concurrent submission (Anonymous2020) and does not constitute a contribution of this paperWe use a simulated humanoid adapted from the ldquoCMUhumanoidrdquo available at dm controllocomotion (Merel et al2019) We adjusted various body and actuator parametersto be more comparable to that of an average human Thishumanoid has 56 degrees of freedom and the observation is1021-dimensional
This motion capture tracking task is broadly similar to previ-ous work on motion capture tracking (Chentanez et al 2018Peng et al 2018 Merel et al 2019) The task is associatedwith an underlying set of motion capture clips For thispaper we used roughly 40 minutes of locomotion motioncapture clips from the CMU motion capture database10
At the beginning of a new episode we select a frame uni-formly at random from all frames in the underlying mocapdata (excluding the last 10 frames of each clip) The simu-lated humanoid is then initialized to the pose of the selectedframe
The observations in this environment are various propriocep-tive observations about the current state of the body as wellas the relative target positions and orientations of 13 differ-ent body parts in the humanoidrsquos local frame We providethe agent with the target poses for the next 5 timesteps
The choice of reward function is crucial for motion capturetracking tasks Here we consider five different reward com-ponents each capturing a different aspect of the similarityof the pose between the simulated humanoid and the mocaptarget Four of our reward components were initially pro-posed by Peng et al (2018) The first reward component isbased on the difference in the center of mass
rcom = exp(minus10pcom minus pref
com2)
where pcom and prefcom are the positions of the center of mass
of the simulated humanoid and the mocap reference re-spectively The second reward component is based on thedifference in joint angle velocities
rvel = exp(minus01qvel minus qref
vel2)
where qvel and qrefvel are the joint angle velocities of the simu-
lated humanoid and the mocap reference respectively The
10This database is available at mocapcscmuedu
third reward component is based on the difference in end-effector positions
rapp = exp(minus40papp minus pref
app2)
where papp and prefapp are the end effector positions of the
simulated humanoid and the mocap reference respectivelyThe fourth reward component is based on the difference injoint orientations
rquat = exp(minus2qquat qref
quat2)
where denotes quaternion differences and qquat and qrefquat
are the joint quaternions of the simulated humanoid and themocap reference respectively
Finally the last reward component is based on difference inthe joint angles and the Euclidean positions of a set of 13body parts
rtrunc = 1minus 1
03(
=ε︷ ︸︸ ︷bpos minus bref
pos1 + qpos minus qrefpos1)
where bpos and brefpos correspond to the body positions of the
simulated character and the mocap reference and qpos andqref
pos correspond to the joint angles
We include an early termination condition in our task that islinked to this last reward term We terminate the episode ifε gt 03 which ensures that rtrunc isin [0 1]
In our MO-VMPO experiments we treat each reward com-ponent as a separate objective In our VMPO experimentswe use a reward of the form
r =1
2rtrunc +
1
2(01rcom + λrvel + 015rapp + 065rquat)
(8)
varying λ as described in the main paper (Sec 63)
B5 Sawyer
The Sawyer peg-in-hole setup consists of a Rethink RoboticsSawyer robot arm The robot has 7 joints driven by serieselastic actuators On the wrist a Robotiq FT 300 force-torque sensor is mounted followed by a Robotiq 2F85 par-allel gripper We implement a 3D end-effector Cartesianvelocity controller that maintains a fixed orientation of thegripper The agent controls the Cartesian velocity setpointsThe gripper is not actuated in this setup The observationsprovided are the joint positions velocities and torques theend-effector pose the wrist force-torque measurements andthe joint velocity command output of the Cartesian con-troller We augment each observation with the two previousobservations as a history
In each episode the peg is initalized randomly within an8times8times8 cm workspace (with the peg outside of the hole)
A Distributional View on Multi-Objective Policy Optimization
The environment is then run at 20 Hz for 600 steps Theaction limits are plusmn005 ms on each axis If the magnitudeof the wrist force-torque measurements exceeds 15 N onany horizontal axis (x and y axis) or 5 N on the vertical axis(z axis) the episode is terminated early
There are two objectives the task itself (inserting the peg)and minimizing wrist force measurements
The reward for the task objective is defined as
rinsertion = max(
02rapproach rinserted
)(9)
rapproach = s(ppeg popening) (10)
rinserted = raligned s(ppegz pbottom
z ) (11)
raligned = ppegxy minus popening
xy lt d2 (12)
where ppeg is the peg position popening the position of thehole opening pbottom the bottom position of the hole and dthe diameter of the hole
s(p1 p2) is a shaping operator
s(p1 p2) = 1minus tanh2(atanh(
radicl)
εp1 minus p22) (13)
which gives a reward of 1minus l if p1 and p2 are at a Euclideandistance of ε Here we chose lapproach = 095 εapproach =001 linserted = 05 and εinserted = 004 Intentionally 004corresponds to the length of the peg such that rinserted = 05if ppeg = popening As a result if the peg is in the approachposition rinserted is dominant over rapproach in rinsertion
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
The reward for the secondary objective, minimizing wrist force measurements, is defined as

$$r_{\text{force}} = -\|F\|_1 \qquad (14)$$

where $F = (F_x, F_y, F_z)$ are the forces measured by the wrist sensor.
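As a concrete illustration, the two Sawyer reward objectives above (Eqs. 9–14) can be sketched in Python. The function names and the hole-diameter value are ours, not from the paper; this is a sketch of the stated formulas, not the authors' implementation:

```python
import numpy as np

def shaping(p1, p2, l, eps):
    """Shaping operator s(p1, p2) of Eq. 13: equals 1 - l when the two
    points are at Euclidean distance eps."""
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return 1.0 - np.tanh(np.arctanh(np.sqrt(l)) / eps * d) ** 2

def insertion_reward(p_peg, p_opening, p_bottom_z, hole_diameter=0.02):
    """Task reward r_insertion = max(0.2 r_approach, r_inserted), Eqs. 9-12.
    `hole_diameter` is a hypothetical value chosen for illustration."""
    p_peg = np.asarray(p_peg, dtype=float)
    p_opening = np.asarray(p_opening, dtype=float)
    r_approach = shaping(p_peg, p_opening, l=0.95, eps=0.01)
    # r_aligned: indicator that the peg is horizontally within d/2 of the opening
    aligned = float(np.linalg.norm(p_peg[:2] - p_opening[:2]) < hole_diameter / 2)
    r_inserted = aligned * shaping(p_peg[2:], [p_bottom_z], l=0.5, eps=0.04)
    return max(0.2 * r_approach, r_inserted)

def force_reward(f_xyz):
    """Secondary objective, Eq. 14: r_force = -||F||_1."""
    return -float(np.sum(np.abs(f_xyz)))
```

With the peg exactly at the opening and the hole bottom 0.04 m below (the peg length), `insertion_reward` returns 0.5, matching the text's observation that $r_{\text{inserted}} = 0.5$ when $p^{\text{peg}} = p^{\text{opening}}$.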
C. Algorithmic Details

C.1. General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective k, with parameters denoted by φk and φ′k, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by θ and θ′, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy πold.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor

1: given N reward functions {r_k(s, a)}_{k=1}^N, T steps per episode
2: repeat
3:   Fetch current policy parameters θ from learner
4:   Collect trajectory from environment:
5:   τ = {}
6:   for t = 0, …, T do
7:     a_t ∼ π_θ(·|s_t)
8:     Execute action and determine rewards:
9:     r = [r_1(s_t, a_t), …, r_N(s_t, a_t)]
10:    τ ← τ ∪ {(s_t, a_t, r, π_θ(a_t|s_t))}
11:  end for
12:  Send trajectory τ to replay buffer
13: until end of training
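The actor loop of Algorithm 2 can be sketched as follows. `ToyEnv`, `run_actor`, and the dummy policy are ours (the real setup uses the robotics environments of Appendix B and a learned Gaussian policy); the sketch only shows the structure: act, record a reward *vector* with one entry per objective, and store the behavior log-probability with each transition:

```python
import numpy as np

class ToyEnv:
    """Minimal stand-in environment with two reward objectives (hypothetical)."""
    def __init__(self):
        self.reward_fns = [lambda s, a: -abs(float(a)),   # e.g. an action penalty
                           lambda s, a: -abs(float(s))]   # e.g. a task reward
        self.s = 0.0
    def reset(self):
        self.s = 1.0
        return self.s
    def step(self, a):
        self.s += float(a)
        return self.s

def run_actor(env, sample_action, T):
    """One episode of the MO-MPO asynchronous actor loop (Algorithm 2)."""
    s = env.reset()
    trajectory = []
    for t in range(T):
        a, logp = sample_action(s)                           # a_t ~ π_θ(·|s_t)
        r = np.array([r_k(s, a) for r_k in env.reward_fns])  # one entry per objective
        trajectory.append((s, a, r, logp))                   # (s_t, a_t, r, log π_θ(a_t|s_t))
        s = env.step(a)
    return trajectory  # sent to the replay buffer in the real asynchronous setup

rng = np.random.default_rng(0)
policy = lambda s: (rng.normal(scale=0.1), 0.0)  # dummy behavior policy for the sketch
tau = run_actor(ToyEnv(), policy, T=5)
```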
C.2. Retrace for Multi-Objective Policy Evaluation

Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function $Q_{\phi_k}(s, a)$ for each objective $k$, parametrized by $\phi_k$. These Q-functions are with respect to the policy $\pi_{\text{old}}$. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer $D$ containing trajectory snippets $\tau = \{(s_0, a_0, r_0, s_1), \dots, (s_T, a_T, r_T, s_{T+1})\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of a scalar reward for each of $N$ objectives, the Retrace objective is as follows:

$$\min_{\phi_k} \mathbb{E}_{\tau \sim D}\Big[\big(Q^{\text{ret}}_k(s_t, a_t) - Q_{\phi_k}(s_t, a_t)\big)^2\Big] \qquad (15)$$

with

$$Q^{\text{ret}}_k(s_t, a_t) = Q_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big) \delta_j$$
$$\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - Q_{\phi'_k}(s_j, a_j)$$
$$V(s_{j+1}) = \mathbb{E}_{\pi_{\text{old}}(a|s_{j+1})}\big[Q_{\phi'_k}(s_{j+1}, a)\big]$$
Algorithm 3 MO-MPO: Asynchronous Learner

1: given batch size L, number of actions M, N Q-functions, N preferences {ε_k}_{k=1}^N, policy networks, and replay buffer D
2: initialize Lagrangians {η_k}_{k=1}^N and ν, target networks, and online networks such that π_θ′ = π_θ and Q_φ′_k = Q_φ_k
3: repeat
4:   repeat
5:     Collect dataset {s_i, a_ij, Q_ijk}_{i,j,k}^{L,M,N}, where
6:       s_i ∼ D, a_ij ∼ π_θ′(a|s_i), and Q_ijk = Q_φ′_k(s_i, a_ij)
7:
8:     // Compute action distribution for each objective
9:     for k = 1, …, N do
10:      δ_η_k ← ∇_η_k [η_k ε_k + η_k Σ_i^L (1/L) log(Σ_j^M (1/M) exp(Q_ijk / η_k))]
11:      Update η_k based on δ_η_k
12:      q_ijk ∝ exp(Q_ijk / η_k)
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    δ_π ← −∇_θ Σ_i^L Σ_j^M Σ_k^N q_ijk log π_θ(a_ij | s_i)
18:          + sg(ν) (β − Σ_i^L KL(π_θ′(a|s_i) ∥ π_θ(a|s_i)))
19:    δ_ν ← ∇_ν ν (β − Σ_i^L KL(π_θ′(a|s_i) ∥ π_sg(θ)(a|s_i)))
20:    Update π_θ based on δ_π
21:    Update ν based on δ_ν
22:
23:    // Update Q-functions
24:    for k = 1, …, N do
25:      δ_φ_k ← ∇_φ_k Σ_{(s_t, a_t) ∈ τ ∼ D} (Q_φ_k(s_t, a_t) − Q^ret_k)²,
26:        with Q^ret_k as in Eq. 15
27:      Update φ_k based on δ_φ_k
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:    π_θ′ = π_θ, Q_φ′_k = Q_φ_k
33: until convergence
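The per-objective temperature step of Algorithm 3 (lines 9–12) can be sketched numerically. This is our illustration, not the authors' code: the gradient is estimated by finite differences for readability (an autodiff framework would be used in practice), and the log-mean-exp is stabilized by subtracting the per-state maximum:

```python
import numpy as np

def temperature_dual_and_weights(Q, eta, eps):
    """Sketch of Algorithm 3, lines 9-12, for one objective. Q has shape [L, M]
    (M sampled actions for each of L states). Returns the dual value
    g(eta) = eta*eps + eta * mean_i log mean_j exp(Q_ij / eta), a gradient
    estimate for the step on eta, and the weights q_ij ∝ exp(Q_ij / eta)."""
    def dual(e):
        z = Q / e
        zmax = z.max(axis=1, keepdims=True)           # stabilize the log-mean-exp
        lme = zmax[:, 0] + np.log(np.exp(z - zmax).mean(axis=1))
        return e * eps + e * lme.mean()
    h = 1e-5
    grad = (dual(eta + h) - dual(eta - h)) / (2 * h)  # δ_η for the gradient step
    w = np.exp((Q - Q.max(axis=1, keepdims=True)) / eta)
    q = w / w.sum(axis=1, keepdims=True)              # normalized per state
    return dual(eta), grad, q

Q = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
val, grad, q = temperature_dual_and_weights(Q, eta=1.0, eps=0.1)
```

For the first state the weights increase with the Q-value; for the second state (all Q-values equal) they are uniform, as expected for a softmax over Q/η.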
The importance weights $c_z$ are defined as

$$c_z = \min\Big(1,\; \frac{\pi_{\text{old}}(a_z|s_z)}{b(a_z|s_z)}\Big)$$

where $b(a_z|s_z)$ denotes the behavior policy used to collect trajectories in the environment. When $j = t$, we set $\big(\prod_{z=t+1}^{j} c_z\big) = 1$.

We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by $\phi'_k$, which we copy from the online network $\phi_k$ after a fixed number of gradient steps on the Retrace objective (15).
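Putting the pieces of the Retrace target together (Eq. 15, the TD errors $\delta_j$, and the truncated importance weights $c_z$), a per-objective sketch in Python might look as follows. The function signature and the use of log-probabilities for the importance ratio are our choices for illustration:

```python
import numpy as np

def retrace_targets(q_b, q_next_v, rewards, logp_old, logp_behavior, gamma=0.99):
    """Sketch of the Retrace target Q^ret_k for one objective (Eq. 15).
    q_b[j]      : Q_{φ'_k}(s_j, a_j) from the target network
    q_next_v[j] : V(s_{j+1}) = E_{π_old}[Q_{φ'_k}(s_{j+1}, ·)]
    rewards[j]  : r_k(s_j, a_j)
    c_z = min(1, π_old(a_z|s_z) / b(a_z|s_z)) are truncated importance weights."""
    T = len(rewards)
    c = np.minimum(1.0, np.exp(logp_old - logp_behavior))
    delta = rewards + gamma * q_next_v - q_b          # TD errors δ_j
    targets = np.empty(T)
    for t in range(T):
        acc, cprod = 0.0, 1.0                         # empty product when j = t
        for j in range(t, T):
            if j > t:
                cprod *= c[j]
            acc += gamma ** (j - t) * cprod * delta[j]
        targets[t] = q_b[t] + acc
    return targets
```

For a single on-policy step ($c_z = 1$), the target reduces to the one-step bootstrap $r + \gamma V(s')$, as it should.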
C.3. Policy Fitting

Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:

$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta \qquad (16)$$

We first write the generalized Lagrangian equation, i.e.,

$$L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu\Big(\beta - \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds\Big) \qquad (17)$$

where $\nu$ is the Lagrangian multiplier. Now we solve the following primal problem:

$$\max_\theta \min_{\nu > 0} L(\theta, \nu)$$

To obtain the parameters $\theta$ for the updated policy, we solve for $\nu$ and $\theta$ by alternating between 1) fixing $\nu$ to its current value and optimizing for $\theta$, and 2) fixing $\theta$ to its current value and optimizing for $\nu$. This can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean ($\beta_\mu$) and covariance ($\beta_\Sigma$), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

$$\pi^\mu_\theta(a|s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \Sigma_{\theta_{\text{old}}}(s)\big) \qquad (18)$$
$$\pi^\Sigma_\theta(a|s) = \mathcal{N}\big(a;\, \mu_{\theta_{\text{old}}}(s),\, \Sigma_\theta(s)\big) \qquad (19)$$

Policy $\pi^\mu_\theta(a|s)$ takes the mean from the online policy network and the covariance from the target policy network, and policy $\pi^\Sigma_\theta(a|s)$ takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

$$\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s)\, \big(\log \pi^\mu_\theta(a|s)\, \pi^\Sigma_\theta(a|s)\big)\, da\, ds$$
$$\text{s.t.} \int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^\mu_\theta(a|s))\, ds < \beta_\mu,$$
$$\int_s \mu(s)\, \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi^\Sigma_\theta(a|s))\, ds < \beta_\Sigma \qquad (20)$$

As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to preserve exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
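The decoupling in Eqs. 18–20 is easy to see for diagonal Gaussians: the mean trust region only ever sees the online mean, and the covariance trust region only the online covariance. A small sketch (our notation; the paper's policies are full neural-network Gaussians):

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def decoupled_kls(mu_online, var_online, mu_old, var_old):
    """Decoupled trust regions of Eqs. 18-20 (diagonal-covariance sketch):
    KL(π_old || π^μ_θ) depends only on the online mean, and
    KL(π_old || π^Σ_θ) only on the online covariance. Each term is then
    bounded separately by β_μ and β_Σ."""
    kl_mean = kl_diag_gauss(mu_old, var_old, mu_online, var_old)  # π^μ_θ = N(μ_θ, Σ_old)
    kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_online)   # π^Σ_θ = N(μ_old, Σ_θ)
    return kl_mean, kl_cov
```

Shifting only the mean leaves the covariance KL at zero (and vice versa), which is exactly what allows different bounds $\beta_\mu \gg \beta_\Sigma$ to act independently.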
C.4. Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions $q_k(a|s)$ in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature $\eta_k$ for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D. MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function $V(s)$ is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D.1. Multi-objective policy evaluation in the on-policy setting
In the on-policy setting, to evaluate the previous policy $\pi_{\text{old}}$, we use advantages $A(s, a)$ estimated from a learned state-value function $V(s)$, instead of a state-action value function $Q(s, a)$ as in the main text. We train a separate $V$-function for each objective by regressing to the standard $n$-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets $\tau = \{(s_0, a_0, r_0), \dots, (s_T, a_T, r_T)\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of rewards for all $N$ objectives, we find value function parameters $\phi_k$ by optimizing the following objective:

$$\min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_\tau\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)\big)^2\Big] \qquad (21)$$

Here $G^{(T)}_k(s_t, a_t)$ is the $T$-step target for value function $k$, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest: $G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell - t}\, r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_{\text{old}}}_{\phi_k}(s_T)$. The advantages are then estimated as $A^{\pi_{\text{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\text{old}}}_{\phi_k}(s_t)$.
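The $n$-step target and advantage described above can be sketched per objective as follows. This is our illustration of the stated formula; the truncation behavior at the end of a snippet (no bootstrap when the window runs past the data) is an assumption for the sketch:

```python
import numpy as np

def nstep_targets(rewards, values, gamma, n):
    """Sketch of the n-step targets G^(n)_k for one objective (Eq. 21):
    sum n discounted rewards, then bootstrap from the current value estimate.
    rewards[t] is r_k(s_t, a_t); values[t] is V^{π_old}_{φ_k}(s_t).
    Advantages are A_k(s_t, a_t) = G^(n)_k(s_t, a_t) - values[t]."""
    T = len(rewards)
    G = np.empty(T)
    for t in range(T):
        end = min(t + n, T)
        G[t] = sum(gamma ** (l - t) * rewards[l] for l in range(t, end))
        if t + n < T:                  # bootstrap only if the window fits the snippet
            G[t] += gamma ** n * values[t + n]
    return G
```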
D.2. Multi-objective policy improvement in the on-policy setting

Given the previous policy $\pi_{\text{old}}(a|s)$ and estimated advantages $\{A^{\pi_{\text{old}}}_k(s, a)\}_{k=1,\dots,N}$ associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution $q_k(s, a)$ for each objective, and then combine and distill the variational distributions into a new parametric policy $\pi_{\text{new}}(a|s)$. Unlike in MO-MPO, for MO-V-MPO we use the joint distribution $q_k(s, a)$ rather than local policies $q_k(a|s)$, because without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D.2.1. OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions $q_k(s, a)$, we optimize the RL objective for each objective $A_k$:

$$\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds \qquad (22)$$
$$\text{s.t.}\; \mathrm{KL}(q_k(s, a) \,\|\, p_{\text{old}}(s, a)) < \epsilon_k$$

where the KL-divergence is computed over all $(s, a)$, $\epsilon_k$ denotes the allowed expected KL divergence, and $p_{\text{old}}(s, a) = \mu(s)\, \pi_{\text{old}}(a|s)$ is the state-action distribution associated with $\pi_{\text{old}}$.
As in MO-MPO, we use these $\epsilon_k$ to define the preferences over objectives. More concretely, $\epsilon_k$ defines the allowed contribution of objective $k$ to the change of the policy. Therefore, the larger a particular $\epsilon_k$ is with respect to the others, the more that objective $k$ is preferred. On the other hand, if $\epsilon_k$ is zero, then objective $k$ will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

$$q_k(s, a) \propto p_{\text{old}}(s, a)\, \exp\Big(\frac{A_k(s, a)}{\eta_k}\Big) \qquad (23)$$

where the temperature $\eta_k$ is computed based on the constraint $\epsilon_k$ by solving the following convex dual problem:

$$\eta_k = \arg\min_{\eta} \Big[\eta\, \epsilon_k + \eta \log \int_{s,a} p_{\text{old}}(s, a)\, \exp\Big(\frac{A_k(s, a)}{\eta}\Big)\, da\, ds\Big] \qquad (24)$$

We can perform this optimization along with the loss, by taking a gradient descent step on $\eta_k$; we initialize it with the solution found in the previous policy iteration step. Since $\eta_k$ must be positive, we use a projection operator after each gradient step to maintain $\eta_k > 0$.
Top-k advantages. As in Song et al. (2020), in practice we used the samples corresponding to the top 50% of advantages in each batch of data.
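Combining Eq. 23 with the top-50% filtering just described, the per-objective sample weights can be sketched as below. The function is ours, assumes a uniform $p_{\text{old}}$ over the batch samples, and takes $\eta$ as given rather than solving the dual of Eq. 24:

```python
import numpy as np

def vmpo_weights(advantages, eta, top_frac=0.5):
    """Sketch of the per-objective variational weights in MO-V-MPO:
    keep only the top `top_frac` of advantages in the batch (as in V-MPO),
    then weight the surviving samples by softmax(A / eta) per Eq. 23."""
    A = np.asarray(advantages, dtype=float)
    k = max(1, int(len(A) * top_frac))
    keep = np.argsort(A)[-k:]              # indices of the top-k advantages
    w = np.zeros_like(A)
    z = A[keep] / eta
    w[keep] = np.exp(z - z.max())          # stabilized exponentiation
    return w / w.sum()                     # normalized weights, zeros elsewhere

w = vmpo_weights([3.0, -1.0, 2.0, 0.5], eta=1.0)
```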
D.2.2. FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy $\pi_{\text{new}}(a|s)$ that favors all of the objectives according to the preferences specified by $\epsilon_k$. For this we solve a supervised learning problem that fits a parametric policy as follows:
$$\pi_{\text{new}} = \arg\max_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int_s \mathrm{KL}(\pi_{\text{old}}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta \qquad (25)$$

where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi_{\text{old}}$, and the KL constraint enforces a trust region of size $\beta$ that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.

In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E. Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO in Abdolmaleki et al. (2018a) (in appendix) to the multi-objective case.

We assume there are observable binary improvement events $\{R_k\}_{k=1}^N$ for each objective. $R_k = 1$ indicates that our policy has improved for objective $k$, while $R_k = 0$ indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^N$, what would the parameters $\theta$ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^N$:

$$\max_\theta\; p_\theta(R_1 = 1, R_2 = 1, \dots, R_N = 1)\, p(\theta) \qquad (26)$$

where the probability of improvement events depends on $\theta$. Assuming independence between improvement events $R_k$, and using log probabilities, leads to the following objective:

$$\max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta) \qquad (27)$$

The prior distribution over parameters, $p(\theta)$, is fixed during the policy improvement step. We set the prior such that $\pi_\theta$ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^N \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of $\theta$ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution $q_k(s, a)$ per objective, and use this to decompose the log-evidence $\sum_{k=1}^N \log p_\theta(R_k = 1)$ as follows:

$$\sum_{k=1}^{N} \log p_\theta(R_k = 1) = \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(s, a \,|\, R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) \qquad (28)$$
The second term of this decomposition is expanded as:

$$-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a) \,\|\, p_\theta(R_k = 1, s, a)\big) = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\Big] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p(R_k = 1 \,|\, s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\Big] \qquad (29)$$

where $\mu(s)$ is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, $\mu(s)$ is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 \,|\, s, a)$ is the likelihood of the improvement event for objective $k$ if our policy chose action $a$ in state $s$.
The first term of the decomposition in Equation (28) is always non-negative, so the second term is a lower bound on the log-evidence $\sum_{k=1}^N \log p_\theta(R_k = 1)$. $\pi_\theta(a|s)$ and $q_k(a|s) = \frac{q_k(s, a)}{\mu(s)}$ are unknown, even though $\mu(s)$ is given. In the E-step, we estimate $q_k(a|s)$ for each objective $k$ by minimizing the first term, given $\theta = \theta'$ (the parameters of the target policy). Then, in the M-step, we find a new $\theta$ by fitting the parametric policy to the distributions from the first step.
E.1. E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^N$ such that the lower bound on $\sum_{k=1}^N \log p_\theta(R_k = 1)$ is as tight as possible when $\theta = \theta'$, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose $q_k$ to minimize this term. We can optimize each variational distribution $q_k$ independently:
$$q_k(a|s) = \arg\min_{q}\; \mathbb{E}_{\mu(s)}\big[\mathrm{KL}(q(a|s) \,\|\, p_{\theta'}(s, a \,|\, R_k = 1))\big]$$
$$= \arg\min_{q}\; \mathbb{E}_{\mu(s)}\big[\mathrm{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s)) - \mathbb{E}_{q(a|s)} \log p(R_k = 1 \,|\, s, a)\big] \qquad (30)$$
We can solve this optimization problem in closed form, which gives us

$$q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 \,|\, s, a)\, da}$$

This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 \,|\, s, a)$ for each objective. We define the likelihood of an improvement event $R_k$ as

$$p(R_k = 1 \,|\, s, a) \propto \exp\Big(\frac{Q_k(s, a)}{\alpha_k}\Big)$$
where $\alpha_k$ is an objective-dependent temperature parameter that controls how greedy the solution $q_k(a|s)$ is with respect to its associated objective $Q_k(s, a)$. For example, given a batch of state-action pairs, at the extremes, an $\alpha_k$ of zero would give all probability to the state-action pair with the maximum Q-value, while an $\alpha_k$ of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for $\alpha_k$, we plug the exponential transformation into (30), which leads to:
$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \mathrm{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s))\, ds \qquad (31)$$
If we instead enforce the bound on KL divergence as a hard constraint (which we do in practice), that leads to:

$$q_k(a|s) = \arg\max_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds \qquad (32)$$
$$\text{s.t.} \int \mu(s)\, \mathrm{KL}(q(a|s) \,\|\, \pi_{\theta'}(a|s))\, ds < \epsilon_k$$
where, as described in the main paper, the parameter $\epsilon_k$ defines preferences over objectives. If we now assume that $q_k(a|s)$ is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state $s$, by setting

$$q(a|s) \propto \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s, a)}{\eta_k}\Big) \qquad (33)$$
where $\eta_k$ is computed by solving the following convex dual function:

$$\eta_k = \arg\min_{\eta}\; \eta\, \epsilon_k + \eta \int \mu(s) \log \int \pi_{\theta'}(a|s)\, \exp\Big(\frac{Q_k(s, a)}{\eta}\Big)\, da\, ds \qquad (34)$$
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
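The behavior of the non-parametric solution $q(a|s) \propto \pi_{\theta'}(a|s) \exp(Q_k(s,a)/\eta)$ at the two temperature extremes discussed above can be demonstrated numerically. This sketch (our code, on a toy batch of four sampled actions) shows that a tiny temperature concentrates all mass on the best action, while a huge temperature recovers the old policy:

```python
import numpy as np

def improved_distribution(prior, Q, eta):
    """Non-parametric E-step solution (Eq. 33), evaluated on sampled actions:
    q(a|s) ∝ π_old(a|s) exp(Q(s,a)/η), computed in a stabilized way."""
    logits = np.log(prior) + Q / eta
    w = np.exp(logits - logits.max())
    return w / w.sum()

prior = np.full(4, 0.25)                 # uniform "old policy" over 4 sampled actions
Q = np.array([1.0, 2.0, 3.0, 4.0])
greedy = improved_distribution(prior, Q, eta=1e-3)  # η→0: mass on the best action
broad = improved_distribution(prior, Q, eta=1e3)    # η→∞: recovers π_old
```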
E.2. M-Step

After obtaining the variational distributions $q_k(a|s)$ for each objective $k$, we have found a tight lower bound for $\sum_{k=1}^N \log p_\theta(R_k = 1)$ when $\theta = \theta'$. We can now obtain the parameters $\theta$ of the updated policy $\pi_\theta(a|s)$ by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve $\theta$, we obtain:

$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta)$$

This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior $p(\theta)$ on the parameters $\theta$, we enforce that the new policy $\pi_\theta(a|s)$ is close to the target policy $\pi_{\theta'}(a|s)$. We therefore optimize for:

$$\theta^* = \arg\max_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds$$
$$\text{s.t.} \int \mu(s)\, \mathrm{KL}(\pi_{\theta'}(a|s) \,\|\, \pi_\theta(a|s))\, ds < \beta$$

Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
• Limiting energy usage, by penalizing taking high-magnitude actions. The penalty is the $\ell_2$-norm of the action vector, i.e., $r_{\text{penalty}}(a) = -\|a\|_2$. For this objective, we do not learn a Q-function. Instead, we compute this penalty directly to evaluate a given action, in a state-independent way, during policy optimization.
B.4. Humanoid Mocap

The motion capture tracking task used in this paper was developed as part of a concurrent submission (Anonymous, 2020) and does not constitute a contribution of this paper. We use a simulated humanoid adapted from the "CMU humanoid" available at dm_control/locomotion (Merel et al., 2019). We adjusted various body and actuator parameters to be more comparable to those of an average human. This humanoid has 56 degrees of freedom, and the observation is 1021-dimensional.

This motion capture tracking task is broadly similar to previous work on motion capture tracking (Chentanez et al., 2018; Peng et al., 2018; Merel et al., 2019). The task is associated with an underlying set of motion capture clips. For this paper, we used roughly 40 minutes of locomotion motion capture clips from the CMU motion capture database.¹⁰

At the beginning of a new episode, we select a frame uniformly at random from all frames in the underlying mocap data (excluding the last 10 frames of each clip). The simulated humanoid is then initialized to the pose of the selected frame.

The observations in this environment are various proprioceptive observations about the current state of the body, as well as the relative target positions and orientations of 13 different body parts in the humanoid's local frame. We provide the agent with the target poses for the next 5 timesteps.
The choice of reward function is crucial for motion capture tracking tasks. Here we consider five different reward components, each capturing a different aspect of the similarity of the pose between the simulated humanoid and the mocap target. Four of our reward components were initially proposed by Peng et al. (2018). The first reward component is based on the difference in the center of mass:

$$r_{\text{com}} = \exp\big(-10 \|p_{\text{com}} - p^{\text{ref}}_{\text{com}}\|^2\big)$$

where $p_{\text{com}}$ and $p^{\text{ref}}_{\text{com}}$ are the positions of the center of mass of the simulated humanoid and the mocap reference, respectively. The second reward component is based on the difference in joint angle velocities:

$$r_{\text{vel}} = \exp\big(-0.1 \|q_{\text{vel}} - q^{\text{ref}}_{\text{vel}}\|^2\big)$$

where $q_{\text{vel}}$ and $q^{\text{ref}}_{\text{vel}}$ are the joint angle velocities of the simulated humanoid and the mocap reference, respectively.
¹⁰This database is available at mocap.cs.cmu.edu
The third reward component is based on the difference in end-effector positions:

$$r_{\text{app}} = \exp\big(-40 \|p_{\text{app}} - p^{\text{ref}}_{\text{app}}\|^2\big)$$

where $p_{\text{app}}$ and $p^{\text{ref}}_{\text{app}}$ are the end-effector positions of the simulated humanoid and the mocap reference, respectively. The fourth reward component is based on the difference in joint orientations:

$$r_{\text{quat}} = \exp\big(-2 \|q_{\text{quat}} \ominus q^{\text{ref}}_{\text{quat}}\|^2\big)$$

where $\ominus$ denotes quaternion differences, and $q_{\text{quat}}$ and $q^{\text{ref}}_{\text{quat}}$ are the joint quaternions of the simulated humanoid and the mocap reference, respectively.

Finally, the last reward component is based on the difference in the joint angles and the Euclidean positions of a set of 13 body parts:

$$r_{\text{trunc}} = 1 - \frac{1}{0.3} \underbrace{\big(\|b_{\text{pos}} - b^{\text{ref}}_{\text{pos}}\|_1 + \|q_{\text{pos}} - q^{\text{ref}}_{\text{pos}}\|_1\big)}_{=\,\epsilon}$$

where $b_{\text{pos}}$ and $b^{\text{ref}}_{\text{pos}}$ correspond to the body positions of the simulated character and the mocap reference, and $q_{\text{pos}}$ and $q^{\text{ref}}_{\text{pos}}$ correspond to the joint angles.
We include an early termination condition in our task that is linked to this last reward term: we terminate the episode if $\epsilon > 0.3$, which ensures that $r_{\text{trunc}} \in [0, 1]$.
In our MO-V-MPO experiments, we treat each reward component as a separate objective. In our V-MPO experiments, we use a reward of the form

$$r = \frac{1}{2} r_{\text{trunc}} + \frac{1}{2}\big(0.1\, r_{\text{com}} + \lambda\, r_{\text{vel}} + 0.15\, r_{\text{app}} + 0.65\, r_{\text{quat}}\big) \qquad (8)$$

varying $\lambda$ as described in the main paper (Sec. 6.3).
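The five tracking reward components and the scalarization of Eq. 8 can be sketched together as follows. The dictionary field names are ours (hypothetical), and the quaternion difference is assumed to be precomputed; this illustrates the stated formulas rather than the authors' environment code:

```python
import numpy as np

def tracking_rewards(pose, ref, lam=0.1):
    """Sketch of the five mocap tracking reward components and the scalarized
    V-MPO reward of Eq. 8. `pose` and `ref` are dicts of arrays (hypothetical
    field names) for the simulated humanoid and the mocap reference."""
    sq = lambda x: float(np.sum(np.asarray(x) ** 2))   # squared Euclidean norm
    l1 = lambda x: float(np.sum(np.abs(np.asarray(x))))
    r_com = np.exp(-10.0 * sq(pose["com"] - ref["com"]))
    r_vel = np.exp(-0.1 * sq(pose["qvel"] - ref["qvel"]))
    r_app = np.exp(-40.0 * sq(pose["end_eff"] - ref["end_eff"]))
    r_quat = np.exp(-2.0 * sq(pose["quat_diff"]))      # quaternion diffs, precomputed
    eps = l1(pose["body_pos"] - ref["body_pos"]) + l1(pose["qpos"] - ref["qpos"])
    r_trunc = 1.0 - eps / 0.3                          # episode ends when eps > 0.3
    r = 0.5 * r_trunc + 0.5 * (0.1 * r_com + lam * r_vel + 0.15 * r_app + 0.65 * r_quat)
    return r, dict(com=r_com, vel=r_vel, app=r_app, quat=r_quat, trunc=r_trunc)
```

With perfect tracking every component equals 1 and, for $\lambda = 0.1$, the inner weights sum to 1, so the total reward is exactly 1.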
Top-k advantages As in Song et al (2020) in practicewe used the samples corresponding to the top 50 of ad-vantages in each batch of data
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)
We next want to combine and distill the state-action distribu-tions obtained in the previous step into a single parametricpolicy πnew(a|s) that favors all of the objectives accordingto the preferences specified by εk For this we solve a su-pervised learning problem that fits a parametric policy asfollows
A Distributional View on Multi-Objective Policy Optimization
πnew = argmaxθ
Nsumk=1
intsa
qk(s a) log πθ(a|s) da ds
stints
KL(πold(a|s) πθ(a|s)) ds lt β (25)
where θ are the parameters of our function approximator(a neural network) which we initialize from the weights ofthe previous policy πold and the KL constraint enforces atrust region of size β that limits the overall change in theparametric policy to improve stability of learning As inMPO the KL constraint in this step has a regularizationeffect that prevents the policy from overfitting to the localpolicies and therefore avoids premature convergence
In order to optimize Equation (25) we employ Lagrangianrelaxation similar to the one employed for ordinary V-MPO (Song et al 2020)
E Multi-Objective Policy Improvement asInference
In the main paper we motivated the multi-objective policyupdate rules from an intuitive perspective In this section weshow that our multi-objective policy improvement algorithmcan also be derived from the RL-as-inference perspective Inthe policy improvement stage we assume that a Q-functionfor each objective is given and we would like to improveour policy with respect to these Q-functions The deriva-tion here extends the derivation for the policy improvementalgorithm in (single-objective) MPO in Abdolmaleki et al(2018a) (in appendix) to the multi-objective case
We assume there are observable binary improvement eventsRkNk=1 for each objective Rk = 1 indicates that ourpolicy has improved for objective k whileRk = 0 indicatesthat it has not Now we ask if the policy has improved withrespect to all objectives ie Rk = 1Nk=1 what would theparameters θ of that improved policy be More concretelythis is equivalent to the maximum a posteriori estimate forRk = 1Nk=1
maxθ
pθ(R1 = 1 R2 = 1 RN = 1) p(θ) (26)
where the probability of improvement events depends on θAssuming independence between improvement events Rkand using log probabilities leads to the following objective
maxθ
Nsumk=1
log pθ(Rk = 1) + log p(θ) (27)
The prior distribution over parameters p(θ) is fixed duringthe policy improvement step We set the prior such that πθ
stays close to the target policy during each policy improve-ment step to improve stability of learning (Sec E2)
We use the standard expectation-maximization (EM) algo-rithm to efficiently maximize
sumNk=1 log pθ(Rk = 1) The
EM algorithm repeatedly constructs a tight lower bound inthe E-step given the previous estimate of θ (which corre-sponds to the target policy in our case) and then optimizesthat lower bound in the M-step We introduce a variationaldistribution qk(s a) per objective and use this to decom-pose the log-evidence
sumNk=1 log pθ(Rk = 1) as follows
Nsumk=1
log pθ(Rk = 1) (28)
=
Nsumk=1
KL(qk(s a) pθ(s a|Rk = 1))minus
Nsumk=1
KL(qk(s a) pθ(Rk = 1 s a))
The second term of this decomposition is expanded as
Nsumk=1
KL(qk(s a) pθ(Rk = 1 s a)) (29)
=
Nsumk=1
Eqk(sa)
[log
pθ(Rk = 1 s a)
qk(s a)
]=
Nsumk=1
Eqk(sa)
[log
p(Rk = 1|s a)πθ(a|s)micro(s)
qk(s a)
]
where micro(s) is the stationary state distribution which is as-sumed to be given in each policy improvement step Inpractice micro(s) is approximated by the distribution of thestates in the replay buffer p(Rk = 1|s a) is the likelihoodof the improvement event for objective k if our policy choseaction a in state s
The first term of the decomposition in Equation (28) isalways positive so the second term is a lower bound onthe log-evidence
sumNk=1 log pθ(Rk = 1) πθ(a|s) and
qk(a|s) = qk(sa)micro(s) are unknown even though micro(s) is given
In the E-step we estimate qk(a|s) for each objective k byminimizing the first term given θ = θprime (the parameters ofthe target policy) Then in the M-step we find a new θ byfitting the parametric policy to the distributions from firststep
E1 E-Step
The E-step corresponds to the first step of policy improve-ment in the main paper (Sec 421) In the E-step wechoose the variational distributions qk(a|s)Nk=1 such thatthe lower bound on
sumNk=1 log pθ(Rk = 1) is as tight as
A Distributional View on Multi-Objective Policy Optimization
possible when θ = θprime the parameters of the target policyThe lower bound is tight when the first term of the decompo-sition in Equation (28) is zero so we choose qk to minimizethis term We can optimize for each variational distributionqk independently
qk(a|s) = argminq
Emicro(s)
[KL(q(a|s) pθprime(s a|Rk = 1))
]= argmin
qEmicro(s)
[KL(q(a|s) πθprime(a|s))minus
Eq(a|s) log p(Rk = 1|s a))]
(30)
We can solve this optimization problem in closed formwhich gives us
qk(a|s) =πθprime(a|s) p(Rk = 1|s a)intπθprime(a|s) p(Rk = 1|s a) da
This solution weighs the actions based on their relativeimprovement likelihood p(Rk = 1|s a) for each objectiveWe define the likelihood of an improvement event Rk as
p(Rk = 1|s a) prop exp(Qk(s a)
αk
)
where αk is an objective-dependent temperature parameterthat controls how greedy the solution qk(a|s) is with respectto its associated objective Qk(s a) For example given abatch of state-action pairs at the extremes an αk of zerowould give all probability to the state-action pair with themaximum Q-value while an αk of positive infinity wouldgive the same probability to all state-action pairs In orderto automatically optimize for αk we plug the exponentialtransformation into (30) which leads to
qk(a|s) = argmaxq
intmicro(s)
intq(a|s)Qk(s a) dadsminus
αk
intmicro(s) KL(q(a|s) πθprime(a|s)) ds (31)
If we instead enforce the bound on KL divergence as a hardconstraint (which we do in practice) that leads to
qk(a|s) = argmaxq
intmicro(s)
intq(a|s)Qk(s a) da ds (32)
stintmicro(s) KL(q(a|s) πθprime(a|s)) ds lt εk
where as described in the main paper the parameter εkdefines preferences over objectives If we now assume thatqk(a|s) is a non-parametric sample-based distribution as in(Abdolmaleki et al 2018b) we can solve this constrained
optimization problem in closed form for each sampled states by setting
q(a|s) prop πθprime(a|s) exp(Qk(s a)
ηk
) (33)
where ηk is computed by solving the following convex dualfunction
ηk = argminη
ηεk+ (34)
η
intmicro(s) log
intπθprime(a|s) exp
(Qk(s a)
η
)da ds
Equations (32) (33) and (34) are equivalent to those givenfor the first step of multi-objective policy improvement inthe main paper (Sec 421) Please see refer to AppendixD2 in (Abdolmaleki et al 2018b) for details on derivationof the dual function
E2 M-Step
After obtaining the variational distributions qk(a|s) foreach objective k we have found a tight lower bound forsumNk=1 log pθ(Rk = 1) when θ = θprime We can now obtain
the parameters θ of the updated policy πθ(a|s) by optimiz-ing this lower bound After rearranging and dropping theterms in (27) that do not involve θ we obtain
θlowast = argmaxθ
Nsumk=1
intmicro(s)
intqk(a|s) log πθ(a|s) da ds
+ log p(θ)
This corresponds to the supervised learning step of policyimprovement in the main paper (Sec 422)
For the prior p(θ) on the parameters θ we enforce that thenew policy πθ(a|s) is close to the target policy πθprime(a|s)We therefore optimize for
θlowast = argmaxθ
Nsumk=1
intmicro(s)
intqk(a|s) log πθ(a|s) dads
stintmicro(s) KL(πθprime(a|s) πθ(a|s)) ds lt β
Because of the constraint on the change of the parametricpolicy we do not greedily optimize the M-step objectiveSee (Abdolmaleki et al 2018b) for more details
A Distributional View on Multi-Objective Policy Optimization
The environment is then run at 20 Hz for 600 steps. The action limits are $\pm 0.05$ m/s on each axis. If the magnitude of the wrist force-torque measurements exceeds 15 N on either horizontal axis ($x$ and $y$ axes) or 5 N on the vertical axis ($z$ axis), the episode is terminated early.
There are two objectives: the task itself (inserting the peg), and minimizing wrist force measurements.
The reward for the task objective is defined as
\[
r_{\mathrm{insertion}} = \max\big(0.2\, r_{\mathrm{approach}},\ r_{\mathrm{inserted}}\big), \tag{9}
\]
\[
r_{\mathrm{approach}} = s(p^{\mathrm{peg}}, p^{\mathrm{opening}}), \tag{10}
\]
\[
r_{\mathrm{inserted}} = r_{\mathrm{aligned}} \cdot s(p^{\mathrm{peg}}_z, p^{\mathrm{bottom}}_z), \tag{11}
\]
\[
r_{\mathrm{aligned}} = \big\|p^{\mathrm{peg}}_{xy} - p^{\mathrm{opening}}_{xy}\big\| < d/2, \tag{12}
\]
where $p^{\mathrm{peg}}$ is the peg position, $p^{\mathrm{opening}}$ the position of the hole opening, $p^{\mathrm{bottom}}$ the bottom position of the hole, and $d$ the diameter of the hole.
$s(p_1, p_2)$ is a shaping operator
\[
s(p_1, p_2) = 1 - \tanh^2\Big(\frac{\operatorname{arctanh}(\sqrt{l})}{\epsilon}\, \|p_1 - p_2\|_2\Big), \tag{13}
\]
which gives a reward of $1 - l$ if $p_1$ and $p_2$ are at a Euclidean distance of $\epsilon$. Here we chose $l_{\mathrm{approach}} = 0.95$, $\epsilon_{\mathrm{approach}} = 0.01$, $l_{\mathrm{inserted}} = 0.5$, and $\epsilon_{\mathrm{inserted}} = 0.04$. Intentionally, $0.04$ corresponds to the length of the peg, such that $r_{\mathrm{inserted}} = 0.5$ if $p^{\mathrm{peg}} = p^{\mathrm{opening}}$. As a result, if the peg is in the approach position, $r_{\mathrm{inserted}}$ is dominant over $r_{\mathrm{approach}}$ in $r_{\mathrm{insertion}}$.
Intuitively, there is a shaped reward for aligning the peg, with a magnitude of 0.2. After accurate alignment within a horizontal tolerance, the overall reward is dominated by a vertical shaped reward component for insertion. The horizontal tolerance threshold ensures that the agent is not encouraged to try to insert the peg from the side.
The reward for the secondary objective, minimizing wrist force measurements, is defined as
\[
r_{\mathrm{force}} = -\|F\|_1, \tag{14}
\]
where $F = (F_x, F_y, F_z)$ are the forces measured by the wrist sensor.
C Algorithmic Details

C1 General Algorithm Description
We maintain one online network and one target network for each Q-function associated with objective $k$, with parameters denoted by $\phi_k$ and $\phi'_k$, respectively. We also maintain one online network and one target network for the policy, with parameters denoted by $\theta$ and $\theta'$, respectively. Target networks are updated every fixed number of steps by copying parameters from the online network. Online networks are updated using gradient descent in each learning iteration. Note that in the main paper we refer to the target policy network as the old policy $\pi_{\mathrm{old}}$.

We use an asynchronous actor-learner setup, in which actors regularly fetch policy parameters from the learner and act in the environment, writing these transitions to the replay buffer. We refer to this policy as the behavior policy throughout the paper. The learner uses the transitions in the replay buffer to update the (online) Q-functions and the policy. Please see Algorithms 2 and 3 for more details on the actor and learner processes, respectively.
Algorithm 2 MO-MPO: Asynchronous Actor
1: given $N$ reward functions $\{r_k(s, a)\}_{k=1}^N$, $T$ steps per episode
2: repeat
3:   Fetch current policy parameters $\theta$ from learner
4:   Collect trajectory from environment:
5:   $\tau = \{\}$
6:   for $t = 0, \ldots, T$ do
7:     $a_t \sim \pi_\theta(\cdot|s_t)$
8:     Execute action and determine rewards:
9:     $r = [r_1(s_t, a_t), \ldots, r_N(s_t, a_t)]$
10:    $\tau \leftarrow \tau \cup \{(s_t, a_t, r, \pi_\theta(a_t|s_t))\}$
11:  end for
12:  Send trajectory $\tau$ to replay buffer
13: until end of training
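A minimal sketch of the actor loop in Algorithm 2, using a stub environment and a fixed Gaussian policy in place of the learner-provided network; all names here are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_vector(s, a):
    # Two hypothetical objectives, e.g. task progress and an action-cost penalty.
    return np.array([-np.linalg.norm(s), -np.linalg.norm(a)])

def policy(s, mean=0.0, std=0.1, dim=2):
    """Stand-in for pi_theta: sample an action and return its log-probability."""
    a = rng.normal(mean, std, size=dim)
    logp = np.sum(-0.5 * ((a - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi)))
    return a, logp

def run_episode(T=10, state_dim=3):
    s = np.zeros(state_dim)                          # stub env reset
    tau = []
    for t in range(T):
        a, logp = policy(s)                          # a_t ~ pi_theta(.|s_t)
        r = reward_vector(s, a)                      # one scalar reward per objective
        tau.append((s, a, r, logp))                  # (s_t, a_t, r, pi_theta(a_t|s_t))
        s = s + rng.normal(0, 0.01, size=state_dim)  # stub env step
    return tau                                       # would be sent to the replay buffer
```

The stored per-step policy probability is what the learner later uses as the behavior policy $b(a|s)$ in the Retrace importance weights.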
C2 Retrace for Multi-Objective Policy Evaluation
Recall that the goal of the policy evaluation stage (Sec. 4.1) is to learn the Q-function $\hat{Q}_{\phi_k}(s, a)$ for each objective $k$, parametrized by $\phi_k$. These Q-functions are with respect to the policy $\pi_{\mathrm{old}}$. We emphasize that it is valid to use any off-the-shelf Q-learning algorithm, such as TD(0) (Sutton, 1988), to learn these Q-functions. We choose to use Retrace (Munos et al., 2016), described here.

Given a replay buffer $D$ containing trajectory snippets $\tau = \{(s_0, a_0, r_0, s_1), \ldots, (s_T, a_T, r_T, s_{T+1})\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of a scalar reward for each of $N$ objectives, the Retrace objective is as follows:
\[
\min_{\phi_k} \mathbb{E}_{\tau \sim D}\Big[\big(Q^{\mathrm{ret}}_k(s_t, a_t) - \hat{Q}_{\phi_k}(s_t, a_t)\big)^2\Big], \tag{15}
\]
with
\[
Q^{\mathrm{ret}}_k(s_t, a_t) = \hat{Q}_{\phi'_k}(s_t, a_t) + \sum_{j=t}^{T} \gamma^{j-t} \Big(\prod_{z=t+1}^{j} c_z\Big) \delta_j,
\]
\[
\delta_j = r_k(s_j, a_j) + \gamma V(s_{j+1}) - \hat{Q}_{\phi'_k}(s_j, a_j),
\]
\[
V(s_{j+1}) = \mathbb{E}_{\pi_{\mathrm{old}}(a|s_{j+1})}\big[\hat{Q}_{\phi'_k}(s_{j+1}, a)\big].
\]
Algorithm 3 MO-MPO: Asynchronous Learner
1: given batch size $L$, number of actions $M$, $N$ Q-functions, $N$ preferences $\{\epsilon_k\}_{k=1}^N$, policy networks, and replay buffer $D$
2: initialize Lagrangians $\{\eta_k\}_{k=1}^N$ and $\nu$, target networks and online networks such that $\pi_{\theta'} = \pi_\theta$ and $Q_{\phi'_k} = Q_{\phi_k}$
3: repeat
4:   repeat
5:     Collect dataset $\{s^i, a^{ij}, Q^{ij}_k\}_{i,j,k}^{L,M,N}$, where
6:       $s^i \sim D$, $a^{ij} \sim \pi_{\theta'}(a|s^i)$, and $Q^{ij}_k = Q_{\phi'_k}(s^i, a^{ij})$
7:
8:     // Compute action distribution for each objective
9:     for $k = 1, \ldots, N$ do
10:      $\delta_{\eta_k} \leftarrow \nabla_{\eta_k} \Big[\eta_k \epsilon_k + \eta_k \sum_i^L \tfrac{1}{L} \log\Big(\sum_j^M \tfrac{1}{M} \exp\big(\tfrac{Q^{ij}_k}{\eta_k}\big)\Big)\Big]$
11:      Update $\eta_k$ based on $\delta_{\eta_k}$
12:      $q^{ij}_k \propto \exp\big(\tfrac{Q^{ij}_k}{\eta_k}\big)$
13:    end for
14:
15:    // Update parametric policy with trust region
16:    // sg denotes a stop-gradient
17:    $\delta_\pi \leftarrow -\nabla_\theta \sum_i^L \sum_j^M \sum_k^N q^{ij}_k \log \pi_\theta(a^{ij}|s^i)$
18:      $+\ \mathrm{sg}(\nu)\Big(\beta - \sum_i^L \mathrm{KL}\big(\pi_{\theta'}(a|s^i)\, \|\, \pi_\theta(a|s^i)\big)\Big)$
19:    $\delta_\nu \leftarrow \nabla_\nu\, \nu \Big(\beta - \sum_i^L \mathrm{KL}\big(\pi_{\theta'}(a|s^i)\, \|\, \pi_{\mathrm{sg}(\theta)}(a|s^i)\big)\Big)$
20:    Update $\pi_\theta$ based on $\delta_\pi$
21:    Update $\nu$ based on $\delta_\nu$
22:
23:    // Update Q-functions
24:    for $k = 1, \ldots, N$ do
25:      $\delta_{\phi_k} \leftarrow \nabla_{\phi_k} \sum_{(s_t, a_t) \in \tau \sim D} \big(Q_{\phi_k}(s_t, a_t) - Q^{\mathrm{ret}}_k\big)^2$
26:        with $Q^{\mathrm{ret}}_k$ as in Eq. 15
27:      Update $\phi_k$ based on $\delta_{\phi_k}$
28:    end for
29:
30:  until fixed number of steps
31:  Update target networks:
32:  $\pi_{\theta'} = \pi_\theta$, $Q_{\phi'_k} = Q_{\phi_k}$
33: until convergence
The importance weights $c_z$ are defined as
\[
c_z = \min\Big(1,\ \frac{\pi_{\mathrm{old}}(a_z|s_z)}{b(a_z|s_z)}\Big),
\]
where $b(a_z|s_z)$ denotes the behavior policy used to collect trajectories in the environment. When $j = t$, we set $\big(\prod_{z=t+1}^{j} c_z\big) = 1$.
We also make use of a target network for each Q-function (Mnih et al., 2015), parametrized by $\phi'_k$, which we copy from the online network $\phi_k$ after a fixed number of gradient steps on the Retrace objective (15).
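The Retrace target in Equation (15) can be computed for a short snippet as below; this is a plain-numpy sketch with hypothetical inputs, where `v_next[j]` stands in for the expectation $V(s_{j+1})$ under $\pi_{\mathrm{old}}$.

```python
import numpy as np

def retrace_targets(r, q_old, v_next, c, gamma):
    """Q_ret[t] = q_old[t] + sum_{j>=t} gamma^(j-t) * (prod_{z=t+1..j} c[z]) * delta[j],
    with delta[j] = r[j] + gamma * v_next[j] - q_old[j].
    r, q_old, c: per-step arrays; v_next[j] approximates E_{pi_old}[Q(s_{j+1}, .)]."""
    r, q_old, v_next = map(np.asarray, (r, q_old, v_next))
    delta = r + gamma * v_next - q_old
    T = len(r)
    targets = np.empty(T)
    for t in range(T):
        total, cprod = 0.0, 1.0          # product over an empty range is 1 (j = t)
        for j in range(t, T):
            if j > t:
                cprod *= c[j]
            total += gamma ** (j - t) * cprod * delta[j]
        targets[t] = q_old[t] + total
    return targets
```

As a sanity check, setting all $c_z = 0$ collapses the target to the one-step TD target $r_t + \gamma V(s_{t+1})$.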
C3 Policy Fitting
Recall that fitting the policy in the policy improvement stage (Sec. 4.2.2) requires solving the following constrained optimization:
\[
\pi_{\mathrm{new}} = \operatorname*{argmax}_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds
\]
\[
\text{s.t. } \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a|s)\, \|\, \pi_\theta(a|s)\big)\, ds < \beta. \tag{16}
\]
We first write the generalized Lagrangian, i.e.,
\[
L(\theta, \nu) = \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \nu\Big(\beta - \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a|s)\, \|\, \pi_\theta(a|s)\big)\, ds\Big), \tag{17}
\]
where $\nu$ is the Lagrangian multiplier. Now we solve the following primal problem:
\[
\max_\theta \min_{\nu > 0} L(\theta, \nu).
\]
To obtain the parameters $\theta$ for the updated policy, we solve for $\nu$ and $\theta$ by alternating between 1) fixing $\nu$ to its current value and optimizing for $\theta$, and 2) fixing $\theta$ to its current value and optimizing for $\nu$. This procedure can be applied for any policy output distribution.
For Gaussian policies, we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean ($\beta_\mu$) and covariance ($\beta_\Sigma$), which results in more stable learning. To do this, we first separate out the following two policies, for mean and covariance:
\[
\pi^{\mu}_{\theta}(a|s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \Sigma_{\theta_{\mathrm{old}}}(s)\big), \tag{18}
\]
\[
\pi^{\Sigma}_{\theta}(a|s) = \mathcal{N}\big(a;\ \mu_{\theta_{\mathrm{old}}}(s),\ \Sigma_\theta(s)\big). \tag{19}
\]
Policy $\pi^{\mu}_{\theta}(a|s)$ takes the mean from the online policy network and the covariance from the target policy network, and policy $\pi^{\Sigma}_{\theta}(a|s)$ takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:
\[
\max_\theta \sum_{k=1}^{N} \int_s \mu(s) \int_a q_k(a|s) \log\big(\pi^{\mu}_{\theta}(a|s)\, \pi^{\Sigma}_{\theta}(a|s)\big)\, da\, ds
\]
\[
\text{s.t. } \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a|s)\, \|\, \pi^{\mu}_{\theta}(a|s)\big)\, ds < \beta_\mu, \qquad \int_s \mu(s)\, \mathrm{KL}\big(\pi_{\mathrm{old}}(a|s)\, \|\, \pi^{\Sigma}_{\theta}(a|s)\big)\, ds < \beta_\Sigma. \tag{20}
\]
As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
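The effect of the decoupling in Equations (18)-(20) is easy to see with diagonal Gaussians: the KL to $\pi^{\mu}_{\theta}$ depends only on the mean change, and the KL to $\pi^{\Sigma}_{\theta}$ only on the covariance change. A small numeric sketch, with arbitrary illustrative values:

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

mu_old, var_old = np.zeros(2), np.ones(2)
mu_new, var_new = np.array([0.5, -0.5]), np.array([0.5, 2.0])

# pi^mu: new mean, old covariance -> KL reduces to 0.5 * sum((mu diff)^2 / var_old)
kl_mean = kl_diag_gauss(mu_old, var_old, mu_new, var_old)
# pi^Sigma: old mean, new covariance -> KL depends only on the variance ratio
kl_cov = kl_diag_gauss(mu_old, var_old, mu_old, var_new)
```

Constraining `kl_mean` with $\beta_\mu$ and `kl_cov` with a much smaller $\beta_\Sigma$ lets the mean move while the exploration noise shrinks only slowly.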
C4 Derivation of Dual Function for η
Recall that obtaining the per-objective improved action distributions $q_k(a|s)$ in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature $\eta_k$, for each objective. For the derivation of this dual function, please refer to Appendix D.2 of the original MPO paper (Abdolmaleki et al., 2018b).
D MO-V-MPO: Multi-Objective On-Policy MPO
In this section, we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function $V(s)$ is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.
D1 Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy $\pi_{\mathrm{old}}$, we use advantages $A(s, a)$ estimated from a learned state-value function $V(s)$, instead of a state-action value function $Q(s, a)$ as in the main text. We train a separate V-function for each objective by regressing to the standard $n$-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets $\tau = \{(s_0, a_0, r_0), \ldots, (s_T, a_T, r_T)\}$, where $r_t$ denotes a reward vector $\{r_k(s_t, a_t)\}_{k=1}^N$ that consists of rewards for all $N$ objectives, we find value function parameters $\phi_k$ by optimizing the following objective:
\[
\min_{\phi_{1:N}} \sum_{k=1}^{N} \mathbb{E}_{\tau}\Big[\big(G^{(T)}_k(s_t, a_t) - V^{\pi_{\mathrm{old}}}_{\phi_k}(s_t)\big)^2\Big]. \tag{21}
\]
Here $G^{(T)}_k(s_t, a_t)$ is the $T$-step target for value function $k$, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:
\[
G^{(T)}_k(s_t, a_t) = \sum_{\ell=t}^{T-1} \gamma^{\ell-t} r_k(s_\ell, a_\ell) + \gamma^{T-t}\, V^{\pi_{\mathrm{old}}}_{\phi_k}(s_T).
\]
The advantages are then estimated as
\[
A^{\pi_{\mathrm{old}}}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) - V^{\pi_{\mathrm{old}}}_{\phi_k}(s_t).
\]
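The $T$-step target above reduces to a simple backward recursion, $G_t = r_t + \gamma G_{t+1}$ with $G_T$ set to the bootstrap value. A numpy sketch with hypothetical inputs, assuming the bootstrap state is the last state of the snippet:

```python
import numpy as np

def tstep_targets(rewards, v_boot, gamma):
    """G[t] = sum_{l=t}^{T-1} gamma^(l-t) * rewards[l] + gamma^(T-t) * v_boot."""
    G = v_boot                      # bootstrap from the current value function
    out = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # backward accumulation of the return
        out[t] = G
    return out

def advantages(rewards, values, v_boot, gamma):
    """A[t] = G[t] - V(s_t), with values[t] = V(s_t) from the learned critic."""
    return tstep_targets(rewards, v_boot, gamma) - np.asarray(values)
```

In MO-V-MPO one such target and advantage array would be computed per objective $k$, each from its own reward channel and critic.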
D2 Multi-objective policy improvement in the on-policy setting

Given the previous policy $\pi_{\mathrm{old}}(a|s)$ and estimated advantages $\{A^{\pi_{\mathrm{old}}}_k(s, a)\}_{k=1,\ldots,N}$ associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution $q_k(s, a)$ for each objective, and then combine and distill the variational distributions into a new parametric policy $\pi_{\mathrm{new}}(a|s)$. Unlike in MO-MPO, for MO-V-MPO we use the joint distribution $q_k(s, a)$ rather than local policies $q_k(a|s)$ because, without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D21 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions $q_k(s, a)$, we optimize the RL objective for each objective $A_k$:
\[
\max_{q_k} \int_{s,a} q_k(s, a)\, A_k(s, a)\, da\, ds \tag{22}
\]
\[
\text{s.t. } \mathrm{KL}\big(q_k(s, a)\, \|\, p_{\mathrm{old}}(s, a)\big) < \epsilon_k,
\]
where the KL divergence is computed over all $(s, a)$, $\epsilon_k$ denotes the allowed expected KL divergence, and $p_{\mathrm{old}}(s, a) = \mu(s)\, \pi_{\mathrm{old}}(a|s)$ is the state-action distribution associated with $\pi_{\mathrm{old}}$.
As in MO-MPO, we use these $\epsilon_k$ to define the preferences over objectives. More concretely, $\epsilon_k$ defines the allowed contribution of objective $k$ to the change of the policy. Therefore, the larger a particular $\epsilon_k$ is with respect to the others, the more objective $k$ is preferred. On the other hand, if $\epsilon_k = 0$, then objective $k$ will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:
\[
q_k(s, a) \propto p_{\mathrm{old}}(s, a) \exp\Big(\frac{A_k(s, a)}{\eta_k}\Big), \tag{23}
\]
where the temperature $\eta_k$ is computed based on the constraint $\epsilon_k$ by solving the following convex dual problem:
\[
\eta_k = \operatorname*{argmin}_{\eta} \Big[\eta\, \epsilon_k + \eta \log \int_{s,a} p_{\mathrm{old}}(s, a) \exp\Big(\frac{A_k(s, a)}{\eta}\Big)\, da\, ds\Big]. \tag{24}
\]
We can perform this optimization along with the loss, by taking a gradient descent step on $\eta_k$, and we initialize $\eta_k$ with the solution found in the previous policy iteration step. Since $\eta_k$ must be positive, we use a projection operator after each gradient step to maintain $\eta_k > 0$.
Top-k advantages: As in Song et al. (2020), in practice we use the samples corresponding to the top 50% of advantages in each batch of data.
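On a sample batch, the weighting of Equation (23) together with this top-50% filtering reduces to a softmax over the retained advantages; a sketch (the array names and helper are ours, not the paper's):

```python
import numpy as np

def vmpo_sample_weights(advantages, eta, top_frac=0.5):
    """Nonparametric q over a batch: keep the top fraction by advantage,
    weight kept samples by exp(A_i / eta), and normalize; the rest get zero."""
    adv = np.asarray(advantages, dtype=float)
    k = max(1, int(len(adv) * top_frac))
    keep = np.argsort(adv)[-k:]                              # indices of top-k advantages
    w = np.zeros_like(adv)
    w[keep] = np.exp((adv[keep] - adv[keep].max()) / eta)    # stable softmax on kept set
    return w / w.sum()
```

Discarded samples then contribute nothing to the distillation step that follows.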
D22 FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy $\pi_{\mathrm{new}}(a|s)$ that favors all of the objectives, according to the preferences specified by $\epsilon_k$. For this, we solve a supervised learning problem that fits a parametric policy as follows:
\[
\pi_{\mathrm{new}} = \operatorname*{argmax}_\theta \sum_{k=1}^{N} \int_{s,a} q_k(s, a) \log \pi_\theta(a|s)\, da\, ds
\]
\[
\text{s.t. } \int_s \mathrm{KL}\big(\pi_{\mathrm{old}}(a|s)\, \|\, \pi_\theta(a|s)\big)\, ds < \beta, \tag{25}
\]
where $\theta$ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy $\pi_{\mathrm{old}}$. The KL constraint enforces a trust region of size $\beta$ that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.
In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference

In the main paper, we motivated the multi-objective policy update rules from an intuitive perspective. In this section, we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO, in the appendix of Abdolmaleki et al. (2018a), to the multi-objective case.
We assume there are observable binary improvement events $\{R_k\}_{k=1}^N$, one for each objective. $R_k = 1$ indicates that our policy has improved for objective $k$, while $R_k = 0$ indicates that it has not. Now we ask, if the policy has improved with respect to all objectives, i.e., $\{R_k = 1\}_{k=1}^N$, what would the parameters $\theta$ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for $\{R_k = 1\}_{k=1}^N$:
\[
\max_\theta\ p_\theta(R_1 = 1, R_2 = 1, \ldots, R_N = 1)\, p(\theta), \tag{26}
\]
where the probability of the improvement events depends on $\theta$. Assuming independence between the improvement events $R_k$, and using log probabilities, leads to the following objective:
\[
\max_\theta \sum_{k=1}^{N} \log p_\theta(R_k = 1) + \log p(\theta). \tag{27}
\]
The prior distribution over parameters, $p(\theta)$, is fixed during the policy improvement step. We set the prior such that $\pi_\theta$ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E.2).
We use the standard expectation-maximization (EM) algorithm to efficiently maximize $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of $\theta$ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution $q_k(s, a)$ per objective, and use this to decompose the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ as follows:
\[
\sum_{k=1}^{N} \log p_\theta(R_k = 1) \tag{28}
\]
\[
= \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\, \|\, p_\theta(s, a \,|\, R_k = 1)\big) - \sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\, \|\, p_\theta(R_k = 1, s, a)\big).
\]
The second term of this decomposition expands as
\[
-\sum_{k=1}^{N} \mathrm{KL}\big(q_k(s, a)\, \|\, p_\theta(R_k = 1, s, a)\big) \tag{29}
\]
\[
= \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p_\theta(R_k = 1, s, a)}{q_k(s, a)}\Big] = \sum_{k=1}^{N} \mathbb{E}_{q_k(s,a)}\Big[\log \frac{p(R_k = 1 | s, a)\, \pi_\theta(a|s)\, \mu(s)}{q_k(s, a)}\Big],
\]
where $\mu(s)$ is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, $\mu(s)$ is approximated by the distribution of the states in the replay buffer. $p(R_k = 1 | s, a)$ is the likelihood of the improvement event for objective $k$, if our policy chose action $a$ in state $s$.
The first term of the decomposition in Equation (28) is always non-negative, so the second term is a lower bound on the log-evidence $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$. Here $\pi_\theta(a|s)$ and $q_k(a|s) = \frac{q_k(s, a)}{\mu(s)}$ are unknown, even though $\mu(s)$ is given.
In the E-step, we estimate $q_k(a|s)$ for each objective $k$ by minimizing the first term, given $\theta = \theta'$ (the parameters of the target policy). Then in the M-step, we find a new $\theta$ by fitting the parametric policy to the distributions from the first step.
E1 E-Step
The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions $\{q_k(a|s)\}_{k=1}^N$ such that the lower bound on $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ is as tight as possible when $\theta = \theta'$, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose $q_k$ to minimize this term. We can optimize for each variational distribution $q_k$ independently:
\[
q_k(a|s) = \operatorname*{argmin}_{q}\ \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\, \|\, p_{\theta'}(s, a \,|\, R_k = 1)\big)\Big]
\]
\[
= \operatorname*{argmin}_{q}\ \mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(q(a|s)\, \|\, \pi_{\theta'}(a|s)\big) - \mathbb{E}_{q(a|s)}\big[\log p(R_k = 1 | s, a)\big]\Big]. \tag{30}
\]
We can solve this optimization problem in closed form, which gives us
\[
q_k(a|s) = \frac{\pi_{\theta'}(a|s)\, p(R_k = 1 | s, a)}{\int \pi_{\theta'}(a|s)\, p(R_k = 1 | s, a)\, da}.
\]
This solution weighs the actions based on their relative improvement likelihood $p(R_k = 1 | s, a)$ for each objective. We define the likelihood of an improvement event $R_k$ as
\[
p(R_k = 1 | s, a) \propto \exp\Big(\frac{Q_k(s, a)}{\alpha_k}\Big),
\]
where $\alpha_k$ is an objective-dependent temperature parameter that controls how greedy the solution $q_k(a|s)$ is with respect to its associated objective $Q_k(s, a)$. For example, given a batch of state-action pairs, at the extremes, an $\alpha_k$ of zero would give all probability to the state-action pair with the maximum Q-value, while an $\alpha_k$ of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for $\alpha_k$, we plug the exponential transformation into (30), which leads to
\[
q_k(a|s) = \operatorname*{argmax}_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds - \alpha_k \int \mu(s)\, \mathrm{KL}\big(q(a|s)\, \|\, \pi_{\theta'}(a|s)\big)\, ds. \tag{31}
\]
If we instead enforce the bound on the KL divergence as a hard constraint (which we do in practice), this leads to
\[
q_k(a|s) = \operatorname*{argmax}_{q} \int \mu(s) \int q(a|s)\, Q_k(s, a)\, da\, ds \tag{32}
\]
\[
\text{s.t. } \int \mu(s)\, \mathrm{KL}\big(q(a|s)\, \|\, \pi_{\theta'}(a|s)\big)\, ds < \epsilon_k,
\]
where, as described in the main paper, the parameter $\epsilon_k$ defines preferences over objectives. If we now assume that $q_k(a|s)$ is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state $s$, by setting
\[
q(a|s) \propto \pi_{\theta'}(a|s) \exp\Big(\frac{Q_k(s, a)}{\eta_k}\Big), \tag{33}
\]
where ηk is computed by solving the following convex dualfunction
ηk = argminη
ηεk+ (34)
η
intmicro(s) log
intπθprime(a|s) exp
(Qk(s a)
η
)da ds
Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D.2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
E2 M-Step
After obtaining the variational distributions $q_k(a|s)$ for each objective $k$, we have found a tight lower bound for $\sum_{k=1}^{N} \log p_\theta(R_k = 1)$ when $\theta = \theta'$. We can now obtain the parameters $\theta$ of the updated policy $\pi_\theta(a|s)$ by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve $\theta$, we obtain
\[
\theta^{*} = \operatorname*{argmax}_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds + \log p(\theta).
\]
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
For the prior $p(\theta)$ on the parameters $\theta$, we enforce that the new policy $\pi_\theta(a|s)$ is close to the target policy $\pi_{\theta'}(a|s)$. We therefore optimize:
\[
\theta^{*} = \operatorname*{argmax}_\theta \sum_{k=1}^{N} \int \mu(s) \int q_k(a|s) \log \pi_\theta(a|s)\, da\, ds
\]
\[
\text{s.t. } \int \mu(s)\, \mathrm{KL}\big(\pi_{\theta'}(a|s)\, \|\, \pi_\theta(a|s)\big)\, ds < \beta.
\]
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.
A Distributional View on Multi-Objective Policy Optimization
Algorithm 3 MO-MPO Asynchronous Learner1 given batch size (L) number of actions (M) (N) Q-functions
(N) preferences εkNk=1 policy networks and replay bufferD
2 initialize Lagrangians ηkNk=1 and ν target networks andonline networks such that πθprime = πθ and Qφprime
k= Qφk
3 repeat4 repeat5 Collect dataset si aij Qijk
LMNijk where
6 si sim D aij sim πθprime(a|si) and Qijk = Qφkprime(si aij)
78 Compute action distribution for each objective9 for k = 1 N do
10 δηk larr nablaηkηkεk + ηksumLi
1Llog
(sumMj
1M
exp(Q
ijkηk
))11 Update ηk based on δηk12 qijk prop exp(
Qijkηk
)
13 end for1415 Update parametric policy with trust region16 sg denotes a stop-gradient17 δπ larr minusnablaθ
sumLi
sumMj
sumNk q
ijk log πθ(a
ij |si)18 +sg(ν)
(β minus
sumLi KL(πθprime(a|si) πθ(a|si)) ds
)19 δν larr nablaνν
(βminussumLi KL(πθprime(a|si) πsg(θ)(a|si)) ds
)20 Update πθ based on δπ21 Update ν based on δν2223 Update Q-functions24 for k = 1 N do25 δφk larr nablaφk
sum(stat)isinτsimD
(Qφk (st at)minusQ
retk
)226 with Qret
k as in Eq 1527 Update φk based on δφk
28 end for2930 until fixed number of steps31 Update target networks32 πθprime = πθ Qφk
prime = Qφk
33 until convergence
The importance weights ck are defined as
cz = min
(1πold(az|sz)b(az|sz)
)
where b(az|sz) denotes the behavior policy used to col-lect trajectories in the environment When j = t we set(prodj
z=t+1 cz
)= 1
We also make use of a target network for each Q-function(Mnih et al 2015) parametrized by φprimek which we copyfrom the online network φk after a fixed number of gradientsteps on the Retrace objective (15)
C3 Policy Fitting
Recall that fitting the policy in the policy improvementstage (Sec 422) requires solving the following constrained
optimization
πnew = argmaxθ
Nsumk=1
ints
micro(s)
inta
qk(a|s) log πθ(a|s) da ds
stints
micro(s) KL(πold(a|s) πθ(a|s)) ds lt β (16)
We first write the generalized Lagrangian equation ie
L(θ ν) =
Nsumk=1
ints
micro(s)
inta
qk(a|s) log πθ(a|s) da ds (17)
+ ν(β minus
ints
micro(s) KL(πold(a|s) πθ(a|s)) ds)
where ν is the Lagrangian multiplier Now we solve thefollowing primal problem
maxθ
minνgt0
L(θ ν)
To obtain the parameters θ for the updated policy we solvefor ν and θ by alternating between 1) fixing ν to its currentvalue and optimizing for θ and 2) fixing θ to its currentvalue and optimizing for ν This can be applied for anypolicy output distribution
For Gaussian policies we decouple the update rules to optimize for mean and covariance independently, as in (Abdolmaleki et al., 2018b). This allows for setting different KL bounds for the mean (β_μ) and covariance (β_Σ), which results in more stable learning. To do this, we first separate out the following two policies for mean and covariance:

π^μ_θ(a|s) = N(a; μ_θ(s), Σ_{θ_old}(s)),    (18)
π^Σ_θ(a|s) = N(a; μ_{θ_old}(s), Σ_θ(s)).    (19)

Policy π^μ_θ(a|s) takes the mean from the online policy network and the covariance from the target policy network, and policy π^Σ_θ(a|s) takes the mean from the target policy network and the covariance from the online policy network. Now our optimization problem has the following form:

max_θ Σ_{k=1}^{N} ∫_s μ(s) ∫_a q_k(a|s) log(π^μ_θ(a|s) π^Σ_θ(a|s)) da ds
s.t. ∫_s μ(s) KL(π_old(a|s) ‖ π^μ_θ(a|s)) ds < β_μ,
     ∫_s μ(s) KL(π_old(a|s) ‖ π^Σ_θ(a|s)) ds < β_Σ.    (20)

As in (Abdolmaleki et al., 2018b), we set a much smaller bound for the covariance than for the mean, to keep exploration and avoid premature convergence. We can solve this optimization problem using the same general procedure as described above.
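The two decoupled KL terms have simple closed forms for Gaussians: with a shared covariance only a Mahalanobis term in the means survives, and with a shared mean only trace and log-determinant terms survive. A minimal numpy sketch (our own helper names):

```python
import numpy as np

def kl_mean_part(mu_old, mu_new, sigma_old):
    """KL(N(mu_old, S_old) || N(mu_new, S_old)): only the means differ."""
    d = np.asarray(mu_old) - np.asarray(mu_new)
    return 0.5 * float(d @ np.linalg.solve(sigma_old, d))

def kl_cov_part(mu_old, sigma_old, sigma_new):
    """KL(N(mu_old, S_old) || N(mu_old, S_new)): only the covariances differ."""
    k = sigma_old.shape[0]
    tr = np.trace(np.linalg.solve(sigma_new, sigma_old))
    _, logdet_new = np.linalg.slogdet(sigma_new)
    _, logdet_old = np.linalg.slogdet(sigma_old)
    return 0.5 * float(tr - k + logdet_new - logdet_old)
```

Evaluating the two parts separately is what makes it possible to bound them with different thresholds β_μ and β_Σ.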
A Distributional View on Multi-Objective Policy Optimization
C4 Derivation of Dual Function for η

Recall that obtaining the per-objective improved action distributions q_k(a|s) in the policy improvement stage (Sec. 4.2.1) requires solving a convex dual function for the temperature η_k for each objective. For the derivation of this dual function, please refer to Appendix D2 of the original MPO paper (Abdolmaleki et al., 2018b).
D MO-V-MPO: Multi-Objective On-Policy MPO

In this section we describe how MO-MPO can be adapted to the on-policy case, in which a state-value function V(s) is learned and used to estimate the advantages for each objective. We call this approach MO-V-MPO.

D1 Multi-objective policy evaluation in the on-policy setting

In the on-policy setting, to evaluate the previous policy π_old, we use advantages A(s, a) estimated from a learned state-value function V(s), instead of a state-action value function Q(s, a) as in the main text. We train a separate V-function for each objective by regressing to the standard n-step return (Sutton, 1988) associated with each objective. More concretely, given trajectory snippets τ = {(s_0, a_0, r_0), ..., (s_T, a_T, r_T)}, where r_t denotes a reward vector {r_k(s_t, a_t)}_{k=1}^{N} that consists of rewards for all N objectives, we find value function parameters φ_k by optimizing the following objective:
min_{φ_1,...,φ_N} Σ_{k=1}^{N} E_τ[(G^{(T)}_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t))²].    (21)

Here G^{(T)}_k(s_t, a_t) is the T-step target for value function k, which uses the actual rewards in the trajectory and bootstraps from the current value function for the rest:

G^{(T)}_k(s_t, a_t) = Σ_{ℓ=t}^{T−1} γ^{ℓ−t} r_k(s_ℓ, a_ℓ) + γ^{T−t} V^{π_old}_{φ_k}(s_T).

The advantages are then estimated as

A^{π_old}_k(s_t, a_t) = G^{(T)}_k(s_t, a_t) − V^{π_old}_{φ_k}(s_t).
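The T-step targets and advantages above can be computed with a single backward pass over the snippet; a minimal numpy sketch for one objective (names are ours):

```python
import numpy as np

def t_step_targets(rewards, values, bootstrap_value, gamma):
    """T-step return targets and advantages for one objective.

    rewards: r_k(s_t, a_t) for t = 0..T-1
    values:  V(s_t) for t = 0..T-1 (current value estimates)
    bootstrap_value: V(s_T), used to bootstrap the tail
    Returns (targets, advantages) with
      G_t = sum_{l=t}^{T-1} gamma^{l-t} r_l + gamma^{T-t} V(s_T)
      A_t = G_t - V(s_t)
    """
    T = len(rewards)
    targets = np.empty(T)
    g = bootstrap_value
    # Accumulate the discounted return backwards from the bootstrap state.
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        targets[t] = g
    return targets, targets - np.asarray(values)
```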
D2 Multi-objective policy improvement in the on-policy setting

Given the previous policy π_old(a|s) and estimated advantages {A^{π_old}_k(s, a)}_{k=1}^{N} associated with this policy for each objective, our goal is to improve the previous policy. To this end, we first learn an improved variational distribution q_k(s, a) for each objective, and then combine and distill the variational distributions into a new parametric policy π_new(a|s). Unlike in MO-MPO, for MO-V-MPO we use the joint distribution q_k(s, a) rather than local policies q_k(a|s) because, without a learned Q-function, only one action per state is available for learning. This is a multi-objective variant of the two-step policy improvement procedure employed by V-MPO (Song et al., 2020).
D2.1 OBTAINING IMPROVED VARIATIONAL DISTRIBUTIONS PER OBJECTIVE (STEP 1)

In order to obtain the improved variational distributions q_k(s, a), we optimize the RL objective for each objective A_k:

max_{q_k} ∫_{s,a} q_k(s, a) A_k(s, a) da ds    (22)
s.t. KL(q_k(s, a) ‖ p_old(s, a)) < ε_k,
where the KL divergence is computed over all (s, a), ε_k denotes the allowed expected KL divergence, and p_old(s, a) = μ(s) π_old(a|s) is the state-action distribution associated with π_old.

As in MO-MPO, we use these ε_k to define the preferences over objectives. More concretely, ε_k defines the allowed contribution of objective k to the change of the policy. Therefore, the larger a particular ε_k is with respect to the others, the more objective k is preferred. On the other hand, if ε_k is zero, then objective k will have no contribution to the change of the policy and will effectively be ignored.
Equation (22) can be solved in closed form:

q_k(s, a) ∝ p_old(s, a) exp(A_k(s, a) / η_k),    (23)

where the temperature η_k is computed based on the constraint ε_k by solving the following convex dual problem:

η_k = argmin_η [η ε_k + η log ∫_{s,a} p_old(s, a) exp(A_k(s, a) / η) da ds].    (24)
We can perform this optimization along with the main loss by taking a gradient descent step on η_k, and we initialize it with the solution found in the previous policy iteration step. Since η_k must be positive, we use a projection operator after each gradient step to maintain η_k > 0.
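A sample-based version of the dual in Eq. (24) and one projected descent step can be sketched as follows. This is a minimal illustration with our own names; for brevity it uses a finite-difference gradient, whereas in practice the analytic gradient of the dual would be used:

```python
import numpy as np

def temperature_dual(eta, advantages, epsilon):
    """Sample-based dual g(eta) = eta*eps + eta*log mean_i exp(A_i/eta),
    where the samples (s, a) ~ p_old approximate the integral in Eq. (24)."""
    a = np.asarray(advantages)
    m = a.max()  # log-mean-exp, shifted by the max for numerical stability
    lme = m / eta + np.log(np.mean(np.exp((a - m) / eta)))
    return eta * epsilon + eta * lme

def update_eta(eta, advantages, epsilon, lr=1e-2, min_eta=1e-8):
    """One finite-difference gradient step on the dual, then project eta > 0."""
    h = 1e-5
    grad = (temperature_dual(eta + h, advantages, epsilon)
            - temperature_dual(eta - h, advantages, epsilon)) / (2 * h)
    return max(min_eta, eta - lr * grad)
```

Since the dual is convex in η, repeated steps of this form converge to the temperature that makes the KL constraint tight.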
Top-k advantages. As in Song et al. (2020), in practice we use only the samples corresponding to the top 50% of advantages in each batch of data.
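Selecting the top half of a batch by advantage is a one-liner; a small sketch (helper name is ours):

```python
import numpy as np

def top_half_indices(advantages):
    """Indices of the samples whose advantages are in the top 50% of the batch."""
    adv = np.asarray(advantages)
    k = len(adv) // 2
    return np.argsort(adv)[::-1][:k]  # sort descending, keep the first half
```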
D2.2 FITTING A NEW PARAMETRIC POLICY (STEP 2)

We next want to combine and distill the state-action distributions obtained in the previous step into a single parametric policy π_new(a|s) that favors all of the objectives according to the preferences specified by ε_k. For this, we solve a supervised learning problem that fits a parametric policy as follows:
π_new = argmax_θ Σ_{k=1}^{N} ∫_{s,a} q_k(s, a) log π_θ(a|s) da ds
s.t. ∫_s KL(π_old(a|s) ‖ π_θ(a|s)) ds < β,    (25)
where θ are the parameters of our function approximator (a neural network), which we initialize from the weights of the previous policy π_old, and the KL constraint enforces a trust region of size β that limits the overall change in the parametric policy, to improve stability of learning. As in MPO, the KL constraint in this step has a regularization effect that prevents the policy from overfitting to the local policies, and therefore avoids premature convergence.

In order to optimize Equation (25), we employ Lagrangian relaxation, similar to the one employed for ordinary V-MPO (Song et al., 2020).
E Multi-Objective Policy Improvement as Inference

In the main paper we motivated the multi-objective policy update rules from an intuitive perspective. In this section we show that our multi-objective policy improvement algorithm can also be derived from the RL-as-inference perspective. In the policy improvement stage, we assume that a Q-function for each objective is given, and we would like to improve our policy with respect to these Q-functions. The derivation here extends the derivation for the policy improvement algorithm in (single-objective) MPO (Abdolmaleki et al., 2018a, appendix) to the multi-objective case.

We assume there are observable binary improvement events {R_k}_{k=1}^{N}, one for each objective. R_k = 1 indicates that our policy has improved for objective k, while R_k = 0 indicates that it has not. Now we ask: if the policy has improved with respect to all objectives, i.e., {R_k = 1}_{k=1}^{N}, what would the parameters θ of that improved policy be? More concretely, this is equivalent to the maximum a posteriori estimate for {R_k = 1}_{k=1}^{N}:
max_θ p_θ(R_1 = 1, R_2 = 1, ..., R_N = 1) p(θ),    (26)

where the probability of the improvement events depends on θ. Assuming independence between the improvement events R_k and using log probabilities leads to the following objective:

max_θ Σ_{k=1}^{N} log p_θ(R_k = 1) + log p(θ).    (27)
The prior distribution over parameters p(θ) is fixed during the policy improvement step. We set the prior such that π_θ stays close to the target policy during each policy improvement step, to improve stability of learning (Sec. E2).

We use the standard expectation-maximization (EM) algorithm to efficiently maximize Σ_{k=1}^{N} log p_θ(R_k = 1). The EM algorithm repeatedly constructs a tight lower bound in the E-step, given the previous estimate of θ (which corresponds to the target policy in our case), and then optimizes that lower bound in the M-step. We introduce a variational distribution q_k(s, a) per objective, and use this to decompose the log-evidence Σ_{k=1}^{N} log p_θ(R_k = 1) as follows:
Σ_{k=1}^{N} log p_θ(R_k = 1)    (28)
= Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(s, a | R_k = 1)) − Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a)).

The second term of this decomposition, including its sign, expands as

−Σ_{k=1}^{N} KL(q_k(s, a) ‖ p_θ(R_k = 1, s, a))    (29)
= Σ_{k=1}^{N} E_{q_k(s,a)}[log (p_θ(R_k = 1, s, a) / q_k(s, a))]
= Σ_{k=1}^{N} E_{q_k(s,a)}[log (p(R_k = 1 | s, a) π_θ(a|s) μ(s) / q_k(s, a))],
where μ(s) is the stationary state distribution, which is assumed to be given in each policy improvement step. In practice, μ(s) is approximated by the distribution of the states in the replay buffer. p(R_k = 1 | s, a) is the likelihood of the improvement event for objective k if our policy chose action a in state s.

The first term of the decomposition in Equation (28) is always non-negative, so the second term is a lower bound on the log-evidence Σ_{k=1}^{N} log p_θ(R_k = 1). π_θ(a|s) and q_k(a|s) = q_k(s, a)/μ(s) are unknown, even though μ(s) is given. In the E-step we estimate q_k(a|s) for each objective k by minimizing the first term, given θ = θ′ (the parameters of the target policy). Then, in the M-step, we find a new θ by fitting the parametric policy to the distributions from the first step.
E1 E-Step

The E-step corresponds to the first step of policy improvement in the main paper (Sec. 4.2.1). In the E-step, we choose the variational distributions {q_k(a|s)}_{k=1}^{N} such that the lower bound on Σ_{k=1}^{N} log p_θ(R_k = 1) is as tight as
possible when θ = θ′, the parameters of the target policy. The lower bound is tight when the first term of the decomposition in Equation (28) is zero, so we choose q_k to minimize this term. We can optimize for each variational distribution q_k independently:
q_k(a|s) = argmin_q E_{μ(s)}[KL(q(a|s) ‖ p_{θ′}(a|s, R_k = 1))]
= argmin_q E_{μ(s)}[KL(q(a|s) ‖ π_{θ′}(a|s)) − E_{q(a|s)}[log p(R_k = 1 | s, a)]].    (30)
We can solve this optimization problem in closed form, which gives us

q_k(a|s) = π_{θ′}(a|s) p(R_k = 1 | s, a) / ∫ π_{θ′}(a|s) p(R_k = 1 | s, a) da.

This solution weights the actions based on their relative improvement likelihood p(R_k = 1 | s, a) for each objective. We define the likelihood of an improvement event R_k as

p(R_k = 1 | s, a) ∝ exp(Q_k(s, a) / α_k),
where α_k is an objective-dependent temperature parameter that controls how greedy the solution q_k(a|s) is with respect to its associated objective Q_k(s, a). For example, given a batch of state-action pairs, at the extremes, an α_k of zero would give all probability to the state-action pair with the maximum Q-value, while an α_k of positive infinity would give the same probability to all state-action pairs. In order to automatically optimize for α_k, we plug the exponential transformation into (30), which leads to
q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds − α_k ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds.    (31)
If we instead enforce the bound on KL divergence as a hard constraint (which we do in practice), that leads to

q_k(a|s) = argmax_q ∫ μ(s) ∫ q(a|s) Q_k(s, a) da ds    (32)
s.t. ∫ μ(s) KL(q(a|s) ‖ π_{θ′}(a|s)) ds < ε_k,
where, as described in the main paper, the parameter ε_k defines preferences over objectives. If we now assume that q_k(a|s) is a non-parametric sample-based distribution, as in (Abdolmaleki et al., 2018b), we can solve this constrained optimization problem in closed form for each sampled state s, by setting

q_k(a|s) ∝ π_{θ′}(a|s) exp(Q_k(s, a) / η_k),    (33)
where η_k is computed by solving the following convex dual function:

η_k = argmin_η [η ε_k + η ∫ μ(s) log ∫ π_{θ′}(a|s) exp(Q_k(s, a) / η) da ds].    (34)

Equations (32), (33), and (34) are equivalent to those given for the first step of multi-objective policy improvement in the main paper (Sec. 4.2.1). Please refer to Appendix D2 in (Abdolmaleki et al., 2018b) for details on the derivation of the dual function.
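For actions sampled from π_{θ′} at a given state, the weights in Eq. (33) reduce to a softmax over Q-values with temperature η_k (the π_{θ′} factor cancels), and the greediness behavior described above falls out directly. A minimal sketch (helper name is ours):

```python
import numpy as np

def action_weights(q_values, temperature):
    """Weights proportional to exp(Q / temperature) over actions sampled
    from pi_old. Small temperature -> greedy (mass on the argmax action);
    large temperature -> close to uniform."""
    q = np.asarray(q_values)
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()
```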
E2 M-Step

After obtaining the variational distributions q_k(a|s) for each objective k, we have found a tight lower bound for Σ_{k=1}^{N} log p_θ(R_k = 1) when θ = θ′. We can now obtain the parameters θ of the updated policy π_θ(a|s) by optimizing this lower bound. After rearranging and dropping the terms in (27) that do not involve θ, we obtain

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds + log p(θ).
This corresponds to the supervised learning step of policy improvement in the main paper (Sec. 4.2.2).
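The sample-based form of this supervised objective is a weighted negative log-likelihood, summed over objectives. The paper uses Gaussian policies; purely as an illustration, the sketch below evaluates the loss for a categorical policy on a batch of sampled actions (names and shapes are our own assumptions):

```python
import numpy as np

def mstep_loss(logits, weights_per_objective, actions):
    """Weighted NLL: -sum_k sum_i w_k(a_i|s_i) log pi_theta(a_i|s_i),
    the (unconstrained) sample-based M-step objective.

    logits: [batch, num_actions] policy logits at the sampled states
    weights_per_objective: [N, batch] nonparametric weights q_k for the
      sampled actions (one row per objective)
    actions: [batch] indices of the sampled actions
    """
    # Log-probabilities under the categorical policy.
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    logp_a = logp[np.arange(len(actions)), actions]
    return -float(np.sum(weights_per_objective * logp_a[None, :]))
```

Minimizing this loss (subject to the KL trust region below) pulls the parametric policy toward all N weighted action distributions at once.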
For the prior p(θ) on the parameters θ, we enforce that the new policy π_θ(a|s) is close to the target policy π_{θ′}(a|s). We therefore optimize

θ* = argmax_θ Σ_{k=1}^{N} ∫ μ(s) ∫ q_k(a|s) log π_θ(a|s) da ds
s.t. ∫ μ(s) KL(π_{θ′}(a|s) ‖ π_θ(a|s)) ds < β.
Because of the constraint on the change of the parametric policy, we do not greedily optimize the M-step objective. See (Abdolmaleki et al., 2018b) for more details.