
Of Moments and Matching: Trade-offs and Treatments in Imitation Learning

Gokul Swamy 1 Sanjiban Choudhury 2 Zhiwei Steven Wu 3 J. Andrew Bagnell 1 2

Abstract

We provide a unifying view of a large family of previous imitation learning algorithms through the lens of moment matching. At its core, our classification scheme is based on whether the learner attempts to match (1) reward or (2) action-value moments of the expert's behavior, with each option leading to differing algorithmic approaches. By considering adversarially chosen divergences between learner and expert behavior, we are able to derive bounds on policy performance that apply for all algorithms in each of these classes, the first to our knowledge. We also introduce the notion of recoverability, implicit in many previous analyses of imitation learning, which allows us to cleanly delineate how well each algorithmic family is able to mitigate compounding errors. We derive two novel algorithm templates, AdVIL and AdRIL, with strong guarantees, simple implementation, and competitive empirical performance.

1. Introduction

When formulated as a statistical learning problem, imitation learning is fundamentally concerned with finding policies that minimize some notion of divergence between learner behavior and that of an expert demonstrator. Existing work has explored various types of divergences including KL (Pomerleau 1989, Bojarski et al. 2016), Jensen-Shannon (Rhinehart et al. 2018), Reverse KL (Kostrikov et al. 2019), f (Ke et al. 2019), and Wasserstein (Dadashi et al. 2020).

At heart, though, we care about the performance of the learned policy under an objective function that is not known to the learner. As argued by (Abbeel and Ng 2004) and (Ziebart et al. 2008), this goal is most cleanly formulated as a problem of moment matching or, equivalently, of optimizing Integral Probability Metrics (IPMs) (Sun et al. 2019).

1 Robotics Institute, Carnegie Mellon University. 2 Aurora Innovation. 3 Institute for Software Research, Carnegie Mellon University. Correspondence to: Gokul Swamy <[email protected]>.


Figure 1. We consider three classes of imitation learning algorithms. (a) On-policy reward moment-matching algorithms require access to the environment to generate learner trajectories. (b) Off-policy Q-value moment-matching algorithms run completely offline but can produce policies with quadratically compounding errors. (c) On-policy Q-value moment-matching algorithms require access to the environment and a queryable expert but can reduce the exploration burden of the learner in recoverable MDPs.

This is because a learner that in expectation matches the expert on all the basis functions of a class that includes the expert's objective function, or matches moments, must achieve the same performance and will thus be indistinguishable in terms of quality. Additionally, in sharp contrast to recently proposed approaches (Ke et al. 2019, Kostrikov et al. 2019, Jarrett et al. 2020, Rhinehart et al. 2018), moments, due to their simple forms as expectations of basis functions, can be effectively estimated via demonstrator samples, and the uncertainty in these estimates can often be quantified to regularize the matching objective (Dudik et al. 2004). In short: these moment-matching procedures are simple, effective, and provide the strongest policy performance guarantees we are aware of for imitation learning.

As illustrated in Fig. 1, there are three classes of moments a learner can focus on matching: (a) on-policy reward moments, (b) off-policy Q-value moments, and (c) on-policy Q-value moments, each of which has different requirements on the environment and on the expert. We abbreviate them as reward, off-Q, and on-Q moments, respectively.



Our key insight is that reward moments have more discriminative power because they can pick up on differences in induced state visitation distributions rather than just action conditionals. Thus, reward moment matching is a harder problem with stronger guarantees than off-Q and on-Q moment matching.

Our work makes the following three contributions:

1. We present a unifying framework for moment matching in imitation learning. Our framework captures a wide range of prior approaches and allows us to construct, to our knowledge, the first formal lower bounds demonstrating that the choice between matching “reward”, “off-Q”, or “on-Q” moments is fundamental to the problem of imitation learning rather than an artifact of a particular algorithm or analysis.

2. We clarify the dependence of imitation learning bounds on problem structure. We introduce a joint property of an expert policy and moment class, recoverability, that helps us characterize the problems for which compounding errors are likely to occur, regardless of the kind of feedback the learner is exposed to.

3. We provide two novel algorithms with strong performance guarantees. We derive idealized algorithms that match each class of moments. We also provide practical instantiations of these ideas, AdVIL and AdRIL. These algorithms have significantly different practical performance as well as theoretical guarantees in terms of compounding of errors over time steps of the problem.

2. Related Work

Imitation Learning. Imitation learning has been shown to be an effective method of solving a variety of problems, from getting cars to drive themselves (Pomerleau 1989), to achieving superhuman performance in games like Go (Silver et al. 2016, Sun et al. 2018), to sample-efficient learning of control policies for high-DoF robots (Levine and Koltun 2013), to allowing human operators to effectively supervise and teach robot fleets (Swamy et al. 2020). The issue of compounding errors in imitation learning was first formalized by (Ross et al. 2011), with the authors proving that an interactive expert that can suggest actions in states generated via learner policy rollouts will be able to teach the learner to recover from mistakes.

Adversarial Imitation Learning. Starting with the seminal work of (Ho and Ermon 2016), numerous proposed approaches have framed imitation learning as a game between a learner's policy and another network that attempts to discriminate between learner rollouts and expert demonstrations (Fu et al. 2018, Song et al. 2018). We build upon this work by elucidating the properties that result from the kind of feedback the learner is exposed to – whether they are able to see the consequences of their own actions via rollouts or if they are only able to propose actions in states from expert trajectories. Our proposed approaches also have stronger guarantees and less brittle performance than the popular GAIL (Ho and Ermon 2016).

Mathematical Tools. Our algorithmic approach combines two tools that have enjoyed success in imitation learning: functional gradients (Ratliff et al. 2009) and the Integral Probability Metric (Sun et al. 2019). We define two algorithms, AdRIL and AdVIL, that are based on optimizing the value-directed IPM, with AdRIL having the discriminative player perform updates via functional gradient descent. The IPM is linear in the discriminative function, unlike other proposed metrics like the Donsker-Varadhan bound on KL divergence. Specifically, the Donsker-Varadhan bound includes an expectation of the exponentiated discriminative function, which makes estimation difficult with a few samples (McAllester and Stratos 2020). Our analysis makes repeated use of the Performance Difference Lemma (Kakade and Langford 2002, Bagnell et al. 2003), or PDL, which allows us to bound the suboptimality of the learner's policy.

Our proposed algorithms bear resemblance to two previously proposed methods, with AdRIL resembling SQIL (Reddy et al. 2019) and AdVIL resembling ValueDICE (Kostrikov et al. 2019). We note that AdVIL, while cleanly derived from the PDL, can also be derived from an IPM by using a telescoping substitution similar to the ValueDICE derivation. Notably, because AdVIL is linear in the discriminator, it does not suffer from ValueDICE's difficulty in estimating the expectation of an exponential. This difficulty might help explain why ValueDICE can underperform the behavioral cloning baseline on several benchmark tasks (Jarrett et al. 2020). Similarly, AdRIL can avoid the sharp degradation in policy performance that SQIL demonstrates (Barde et al. 2020). This is because SQIL hard-codes the discriminator while AdRIL adaptively updates the discriminator to account for changes in the policy's trajectory distribution.

Recent work by Jarrett et al. attempts to circumvent the limits of off-Q moment matching by framing the problem of imitation learning as one of learning a joint energy-based model (Jarrett et al. 2020). However, we find several fundamental mathematical misconceptions in their proposed approach and detail them in Appendix D.

3. Moment Matching Imitation Learning

We begin by formalizing our setup and objective.


3.1. Problem Definition

Let ∆(X) denote the space of all probability distributions over a set X. Consider an MDP parameterized by ⟨S, A, T, r, T, P0⟩,¹ where S is the state space, A is the action space, T : S × A → ∆(S) is the transition operator, r : S × A → [−1, 1] is a reward function, T is the horizon, and P0 is the initial state distribution. Let π denote the learner's policy and let πE denote the demonstrator's policy. A trajectory τ ∼ π = {s_t, a_t}_{t=1...T} refers to a sequence of state-action pairs generated by first sampling a state s1 from P0 and then repeatedly sampling actions a_t and next states s_{t+1} from π and T for T − 1 time-steps. We also define our value and Q-value functions as
$$V^\pi_t(s) = \mathbb{E}_{\tau \sim \pi | s_t = s}\Big[\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big], \qquad Q^\pi_t(s, a) = \mathbb{E}_{\tau \sim \pi | s_t = s, a_t = a}\Big[\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big].$$
We also define the advantage function as A^π_t(s, a) = Q^π_t(s, a) − V^π_t(s). Lastly, let performance be J(π) = E_{τ∼π}[Σ_{t=1}^T r(s_t, a_t)]. Then,

Definition 1. We define the imitation gap as:
$$J(\pi_E) - J(\pi) \tag{1}$$

The goal of the learner is to minimize (1) and close the imitation gap. The unique challenge in imitation learning is that the reward function r is unknown and the learner must rely on demonstrations from the expert πE to minimize the gap. A natural way to solve this problem is to empirically match all moments of J(πE). If all the moments are perfectly matched, regardless of the unknown reward function, the imitation gap must go to zero. We now delve into the various types of moments we can match.

3.2. Moment Taxonomy

Broadly speaking, a learner can focus on matching per-timestep reward or over-the-horizon Q-value moments of expert behavior. We use Fr : S × A → [−1, 1] to denote a class of reward functions, FQ : S × A → [−T, T] to denote the set of Q-functions induced by sampling actions from some π ∈ Π, and FQE : S × A → [−Q, Q] to denote the set of Q-functions induced by sampling actions from πE. We assume all three function classes are closed under negation. Lastly, we refer to H ∈ [0, 2Q] as the recoverability constant of the problem and define it as follows:

Definition 2. A pair (πE, FQE) of an expert policy and set of expert Q-functions is said to be H-recoverable if ∀s ∈ S, ∀a ∈ A, ∀f ∈ FQE, |f(s, a) − E_{a′∼πE(s)}[f(s, a′)]| < H.

H is an upper bound on all possible advantages that could be obtained by the expert under a Q-function in FQE. Intuitively, H tells us how many time-steps it takes the expert to recover from an arbitrary mistake. We defer a more in-depth discussion of the implications of this concept to Section 4.5.

¹ We ignore the discount factor for simplicity.

MOMENT        CLASS              ENV. ACCESS    πE QUERIES
REWARD        Fr : [−1, 1]       Yes            No
OFF-POLICY Q  FQ : [−T, T]       No             No
ON-POLICY Q   FQE : [−Q, Q]      Yes            Yes

Table 1. An overview of the requirements for the three classes of moment matching.

Reward. Matching reward moments entails minimizing the following expansion of the imitation gap:
$$\begin{aligned}
J(\pi_E) - J(\pi) &= \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big] \\
&= \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} -r(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} -r(s_t, a_t)\Big] \\
&\leq \sup_{f \in \mathcal{F}_r} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big]
\end{aligned}$$
In the last step, we use the fact that −r(s, a) ∈ Fr. Crucially, reward moment matching demands on-policy rollouts τ ∼ π for the learner to calculate per-timestep divergences.

Instead of matching moments of the reward function, we can consider matching moments of the action-value function. We can apply the Performance Difference Lemma (PDL) to expand the imitation gap (1) into either on-policy or off-policy expressions.
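For the reader's convenience, we restate the finite-horizon, undiscounted form of the PDL used below and in our proofs (the standard statement from (Kakade and Langford 2002); this restatement is ours): for any two policies π and π′,
$$J(\pi) - J(\pi') = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} A^{\pi'}_t(s_t, a_t)\Big], \qquad A^{\pi'}_t(s, a) = Q^{\pi'}_t(s, a) - V^{\pi'}_t(s).$$
Applying it with (π, π′) = (πE, π) yields the off-policy expansion below; applying it with the roles reversed and negating yields the on-policy one.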

Off-Policy Q. Starting from the PDL:
$$\begin{aligned}
J(\pi_E) - J(\pi) &= \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} Q^{\pi}_t(s_t, a_t) - \mathbb{E}_{a \sim \pi(s_t)}[Q^{\pi}_t(s_t, a)]\Big] \\
&\leq \sup_{f \in \mathcal{F}_Q} \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi(s_t)}[f(s_t, a)] - f(s_t, a_t)\Big]
\end{aligned}$$
In the last step, we use the fact that Q^π_t(s, a) ∈ FQ for all π ∈ Π and r ∈ Fr. The above expression is off-policy – it only requires a collected dataset of expert trajectories to be evaluated and minimized. In general though, FQ can be a far more complex class than Fr because it has to capture both the dynamics of the MDP and the choices of any policy.

On-Policy Q. Expanding in the reverse direction:
$$\begin{aligned}
J(\pi_E) - J(\pi) &= -\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} Q^{\pi_E}_t(s_t, a_t) - \mathbb{E}_{a \sim \pi_E(s_t)}[Q^{\pi_E}_t(s_t, a)]\Big] \\
&\leq \sup_{f \in \mathcal{F}_{Q_E}} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi_E(s_t)}[f(s_t, a)] - f(s_t, a_t)\Big]
\end{aligned}$$
In the last step, we use the fact that Q^{πE}_t(s, a) ∈ FQE for all r ∈ Fr. In the realizable setting, πE ∈ Π, FQE ⊆ FQ. While FQE is a smaller class, to actually evaluate this expression, we require an interactive expert that can tell us what action they would take in any state visited by the learner, as well as on-policy samples from the learner's current policy τ ∼ π.

With this taxonomy in mind, we now turn our attention to deriving policy performance bounds.

3.3. Moment Matching Games

A unifying perspective on the three moment matching variants can be achieved by viewing the learner as solving a game. More specifically, we consider variants of a two-player minimax game between a learner and a discriminator. The learner selects a policy π ∈ Π, where Π ≜ {π : S → ∆(A)}. We assume Π is convex, compact, and that πE ∈ Π.² The discriminator (adversarially) selects a function f ∈ F, where F ≜ {f : S × A → R}. We assume that F is convex, compact, closed under negation, and finite dimensional.³ Depending on the class of moments being matched, we assume that F is spanned by convex combinations of the elements of Fr/2, FQ/2T, or FQE/2T. Lastly, we set the learner as the minimization player and the discriminator as the maximization player.

Definition 3. The on-policy reward, off-policy Q, and on-policy Q payoff functions are, respectively:
$$U_1(\pi, f) = \frac{1}{T}\Big( \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] \Big)$$
$$U_2(\pi, f) = \frac{1}{T}\Big( \mathbb{E}_{\substack{\tau \sim \pi_E \\ a \sim \pi(s_t)}}\Big[\sum_{t=1}^{T} f(s_t, a)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] \Big)$$
$$U_3(\pi, f) = \frac{1}{T}\Big( \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] - \mathbb{E}_{\substack{\tau \sim \pi \\ a \sim \pi_E(s_t)}}\Big[\sum_{t=1}^{T} f(s_t, a)\Big] \Big)$$
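To make the data requirements of Table 1 concrete, the following is a minimal sketch (ours, not from the paper's implementation) of single-sample estimators of the three payoffs, given trajectories stored as lists of (s, a) pairs, a discriminator f, a sampler for the learner's policy, and, for U3, a queryable expert; all names here are illustrative.

```python
import numpy as np

def estimate_U1(learner_trajs, expert_trajs, f, T):
    """On-policy reward payoff: needs learner rollouts and expert demonstrations."""
    learner_term = np.mean([sum(f(s, a) for s, a in traj) for traj in learner_trajs])
    expert_term = np.mean([sum(f(s, a) for s, a in traj) for traj in expert_trajs])
    return (learner_term - expert_term) / T

def estimate_U2(expert_trajs, policy_sample, f, T):
    """Off-policy Q payoff: needs only expert demonstrations plus samples from pi."""
    gaps = [sum(f(s, policy_sample(s)) - f(s, a) for s, a in traj) for traj in expert_trajs]
    return np.mean(gaps) / T

def estimate_U3(learner_trajs, expert_sample, f, T):
    """On-policy Q payoff: needs learner rollouts plus a queryable expert."""
    gaps = [sum(f(s, a) - f(s, expert_sample(s)) for s, a in traj) for traj in learner_trajs]
    return np.mean(gaps) / T
```

Note that U2 never touches the environment, while U1 and U3 both require rolling out π, matching the access requirements summarized in Table 1.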

When optimizing over the policy class Π, which contains πE, we have a minimax value of 0: for j ∈ {1, 2, 3},
$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} U_j(\pi, f) = 0$$

Furthermore, for certain representations of the policy,⁴ strong duality holds: for j ∈ {1, 2, 3},
$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} U_j(\pi, f) = \max_{f \in \mathcal{F}} \min_{\pi \in \Pi} U_j(\pi, f)$$

We now study the properties that result from achieving an approximate equilibrium for each imitation game.

² The full policy class satisfies all these assumptions.
³ Our results extend to infinite-dimensional Reproducing Kernel Hilbert Spaces.
⁴ One option is a mixture distribution over class Π′ where optimization is now performed over the mixture weights. This distribution can then be collapsed to a single policy (Syed et al. 2008). A second option is to optimize over the causal polytope P(A^T || S^T), as defined in (Ziebart et al. 2010).

MOMENT MATCHED   UPPER BOUND   LOWER BOUND
REWARD           O(εT)         Ω(εT)
OFF-POLICY Q     O(εT²)        Ω(εT²)
ON-POLICY Q      O(εHT)        Ω(εT)

Table 2. An overview of the difference in bounds between the three types of moment matching. All bounds are on the imitation gap (1).

4. From Approximate Equilibria to Bounded Regret

A learner computing an equilibrium policy for any of the moment matching games will be imperfect due to many sources, including a restricted policy class, optimization error, or imperfect estimation of expert moments. More formally, in a game with payoff Uj, a pair (π̂ ∈ Π, f̂ ∈ F) is a δ-approximate equilibrium solution if the following holds:
$$\sup_{f \in \mathcal{F}} U_j(\hat{\pi}, f) - \frac{\delta}{2} \leq U_j(\hat{\pi}, \hat{f}) \leq \inf_{\pi \in \Pi} U_j(\pi, \hat{f}) + \frac{\delta}{2}$$

We assume access to an algorithmic primitive capable of finding such strategies:

Definition 4. An imitation game δ-oracle Ψδ(·) takes a payoff function U : Π × F → [−k, k] and returns a (kδ)-approximate equilibrium strategy for the policy player.

We now bound the imitation gap of solutions returned by such an oracle.

4.1. Example MDPs

For use in our analysis, we first introduce two MDPs, LOOP and CLIFF. As seen in Fig. 2, LOOP is an MDP where a learner can enter a state where it has seen no expert demonstrations (s2) and make errors for the rest of the horizon. CLIFF is an MDP where a single mistake can result in the learner being stuck in an absorbing state.


Figure 2. Left: Borrowed from (Ross et al. 2011), the goal of LOOP is to spend time in s1. Right: a “folklore” MDP CLIFF, where the goal is to not “fall off the cliff” and end up in sx.

4.2. Reward Moment Performance Bounds

Let us first consider the reward moment-matching game.


Lemma 1. Reward Upper Bound: If Fr spans F, then for all MDPs, πE, and π ← Ψε(U1), J(πE) − J(π) ≤ O(εT).

Proof. We start by expanding the imitation gap:
$$\begin{aligned}
J(\pi_E) - J(\pi) &\leq \sup_{f \in \mathcal{F}_r} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} f(s_t, a_t)\Big] \\
&\leq \sup_{f \in \mathcal{F}} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} 2 f(s_t, a_t)\Big] - \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} 2 f(s_t, a_t)\Big] \\
&= 2T \sup_{f \in \mathcal{F}} U_1(\pi, f) \leq 2T\varepsilon
\end{aligned}$$
The first line follows from the closure of Fr under negation. The last line follows from the definition of an ε-approximate equilibrium.

In words, this bound means that in the worst case, we have an imitation gap that is O(εT) rather than an imitation gap that compounds quadratically in time.

Lemma 2. Reward Lower Bound: There exists an MDP, πE, and π ← Ψε(U1) such that J(πE) − J(π) ≥ Ω(εT).

Proof. Consider CLIFF with a reward function composed of two indicators, r(s, a) = −1_{sx} − 1_{a2}, and a perfect expert that never takes a2. If with probability ε the learner's policy takes action a2 only in s0, the optimal discriminator would not only be able to penalize the learner for taking a2 but also for the next T − 1 timesteps for being in sx. Together, this would lead to an average cost of ε per timestep. Under r, this would make the learner εT worse than the expert, giving us J(πE) − J(π) = εT ≥ Ω(εT).
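As a quick numeric check of this construction (our own illustration, not part of the original argument), the gap for a learner that errs only in s0 with probability ε can be computed in closed form:

```python
def cliff_imitation_gap(eps: float, T: int) -> float:
    """Exact J(pi_E) - J(pi) in CLIFF with r(s, a) = -1[s = s_x] - 1[a = a_2].

    The expert never takes a_2, so J(pi_E) = 0. A learner that takes a_2 in s_0
    with probability eps pays -1 for that action and then -1 per step for the
    remaining T - 1 steps spent in the absorbing state s_x.
    """
    learner_return = eps * (-1.0 - (T - 1))
    return 0.0 - learner_return

assert abs(cliff_imitation_gap(0.1, 20) - 0.1 * 20) < 1e-9  # gap = eps * T
```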

Notably, both of these bounds are purely a function of the game, not the policy search algorithm, and therefore apply for all algorithms that can be written in the form of a reward moment-matching imitation game. Our bounds do not depend on the size of the state space and therefore apply to continuous spaces, unlike those presented in (Rajaraman et al. 2020). Several recently proposed algorithms (Ho and Ermon 2016, Brantley et al. 2020, Spencer et al. 2021), including GAIL and SQIL, can be understood as also solving this or a related game.

4.3. Off-Q Moment Performance Bounds

We contrast the preceding guarantees with those based on matching off-Q moments.

Lemma 3. Off-Q Upper Bound: If FQ spans F, then for all MDPs, πE, and π ← Ψε(U2), J(πE) − J(π) ≤ O(εT²).

Proof. Starting from the PDL:
$$\begin{aligned}
J(\pi_E) - J(\pi) &\leq \sup_{f \in \mathcal{F}_Q} \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi(s_t)}[f(s_t, a)] - f(s_t, a_t)\Big] \\
&\leq \sup_{f \in \mathcal{F}} \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi(s_t)}[2T f(s_t, a)] - 2T f(s_t, a_t)\Big] \\
&= 2T^2 \sup_{f \in \mathcal{F}} U_2(\pi, f) \leq 2\varepsilon T^2
\end{aligned}$$
The T in the second-to-last line comes from the fact that FQ/2T ⊆ F. Thus, a policy π returned by Ψε(U2) must satisfy J(πE) − J(π) ≤ O(εT²) – that is, it can do up to O(εT²) worse than the expert.

Lemma 4. Off-Q Lower Bound: There exists an MDP, πE, and π ← Ψε(U2) such that J(πE) − J(π) ≥ Ω(εT²).

Proof. Once again, consider CLIFF. If the learner policy instead takes a2 with probability εT in s0, the optimal discriminator would be able to penalize the learner up to εT for that timestep and ε on average. However, on rollouts, the learner would have an εT chance of paying a cost of 1 for the rest of the horizon, leading to a lower bound of J(πE) − J(π) = εT² ≥ Ω(εT²).

These bounds apply for all algorithms that can be written in the form of an off-Q imitation game, including behavioral cloning (Pomerleau 1989) and ValueDICE (Kostrikov et al. 2019).

4.4. On-Q Moment Performance Bounds

We now derive performance bounds for on-Q algorithms with interactive experts.

Lemma 5. On-Q Upper Bound: If FQE spans F, then for all MDPs with H-recoverable (FQE, πE), and π ← Ψε(U3), J(πE) − J(π) ≤ O(εHT).

Proof. Starting from the PDL:
$$\begin{aligned}
J(\pi_E) - J(\pi) &\leq \sup_{f \in \mathcal{F}_{Q_E}} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi_E(s_t)}[f(s_t, a)] - f(s_t, a_t)\Big] \\
&\leq \sup_{f \in \mathcal{F}} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi_E(s_t)}[2T f(s_t, a)] - 2T f(s_t, a_t)\Big] \\
&= 2T^2 \sup_{f \in \mathcal{F}} U_3(\pi, f) \leq \Big(\frac{\varepsilon H}{2T}\Big) 2T^2 = \varepsilon H T
\end{aligned}$$
As before, the T in the second-to-last line comes from the fact that FQE/2T ⊆ F. The H/2T factor comes from the scale of the payoff. Thus, a policy π returned by Ψε(U3) must satisfy J(πE) − J(π) ≤ O(εHT) – that is, it can do up to O(εHT) worse than the expert.

Lemma 6. On-Q Lower Bound: There exists an MDP, πE, and π ← Ψε(U3) such that J(πE) − J(π) ≥ Ω(εT).

Proof. The proof of the reward lower bound holds verbatim because every policy, including the previously considered πE, will be stuck in sx after it falls in.


As before, these bounds apply for all algorithms that can be written in the form of an on-Q imitation game, including DAgger (Ross et al. 2011) and AggreVaTe (Ross and Bagnell 2014). For example, in the bounds for AggreVaTe, Qmax is equivalent to the recoverability constant H.

4.5. Recoverability in Imitation Learning

The bounds we presented above raise the question of when on-Q moment matching has error properties similar to those of reward moment matching versus those of off-Q moment matching. Recoverability allows us to cleanly answer this question and others. We begin by providing more intuition for said concept.

Concretely, in Fig. 2, LOOP is 1-recoverable for the expert policy that always moves towards s1. CLIFF is not H-recoverable for any H < T if the expert never ends up in sx. A sufficient condition for H-recoverability is that the state occupancy distribution that results from taking an arbitrary action and then H − 1 actions according to πE is the same as that of taking H actions according to πE.

Bound Clarification. Our previously derived upper bound for on-Q moment matching (J(πE) − J(π) ≤ O(εHT)) tells us that for O(1)-recoverable MDPs, on-Q moment matching behaves like reward moment matching, while for O(T)-recoverable MDPs, it instead behaves like off-Q moment matching and has an O(εT²) upper bound. Thus, O(1)-recoverability is in a certain sense necessary for achieving O(εT) error.

Another perspective on recoverability is that it helps us delineate problems where compounding errors are hard to avoid for both on-Q and reward moment matching. Let
$$l(s) = \sum_{a' \in \mathcal{A}} \big| \mathbb{E}_{a \sim \pi_E(s)}[\mathbb{1}_{a'}(a)] - \mathbb{E}_{a \sim \pi(s)}[\mathbb{1}_{a'}(a)] \big|$$
be the classification error of a state s. We prove the following lemma in Supplementary Material A.1:

Lemma 7. Let ε > 0. There exists a (πE, r, −r) pair that is not H-recoverable for any H < T such that a classification error of l(s) = ε on any state leads to J(πE) − J(π) ≥ Ω(εT²).

Because some states might not appear on expert rollouts, evaluating l(s) can require an interactive expert. However, even with this strong form of feedback, a classification error of ε on a single state can lead to an Ω(εT²) imitation gap for O(T)-recoverable MDPs. This lemma also implies that in such an MDP, achieving an O(εT) imitation gap via on-policy moment matching would require the learner to satisfy sup_f U1(π, f) ≤ ε/T or sup_f U3(π, f) ≤ ε/T, or to make vanishingly rare errors as we increase the horizon. We note that this does not conflict with our previously stated bounds but reveals that the supposedly time-independent ε might need to scale inversely with time to achieve linear error for O(T)-recoverable MDPs. Thus, neither on-Q nor reward moment matching is a silver bullet for getting O(εT) error.


Figure 3. ROAD is an MDP where exploration on the bottom row is hard, so there is a meaningful separation between reward and on-Q moment matching.

Exploration in O(1)-Recoverable MDPs. One might wonder if the on-Q setup provides any benefits over the reward-matching setup in O(1)-recoverable MDPs. We provide an example of such an MDP in Fig. 3. In ROAD, an expert might only produce trajectories along the top lane of si's. If the learner takes a2 in s0 and shifts to the bottom lane, they would need to make k − 1 correct choices, while the discriminator could penalize either action because there are no expert samples along the bottom si. While this MDP is k-recoverable, learning to satisfy sup_{f∈F} U1(π, f) ≤ ε is a challenging exploration problem for k that is O(1) but large. In contrast, an interactive expert could show the learner that they should always take a1 along the bottom lane, enabling them to get back to the top lane. Thus, in O(1)-recoverable MDPs where the upper bounds of reward and on-Q moment matching are harder to separate, an interactive expert can reduce the exploration burden of the learner.

5. Finding Equilibria: Idealized Algorithms

We now provide reduction-based methods for computing (approximate) equilibria for these moment matching games, which can be seen as blueprints for constructing our previously described oracle. We study, in particular, finite state problems and a complete policy class. We analyze an approach to equilibria finding where an outer player follows a no-regret strategy and the inner player follows a (modified) best response strategy, by which we can efficiently find policies with strong performance guarantees.

5.1. Preliminaries

An efficient no-regret algorithm over a class X produces iterates x1, . . . , xH ∈ X that satisfy the following property for any sequence of loss functions l1, . . . , lH:
$$\text{Regret}(H) = \sum_{t=1}^{H} l_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{H} l_t(x) \leq \beta_{\mathcal{X}}(H)$$
where βX(H)/H ≤ ε holds for H that are O(poly(1/ε)).
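As a concrete instance of such a primitive (our illustration; the idealized procedures below do not commit to any particular no-regret algorithm), the multiplicative-weights / Hedge update of (Freund and Schapire 1997) over a finite class is a standard choice:

```python
import numpy as np

class Hedge:
    """Multiplicative-weights no-regret learner over a finite class of n elements.

    With step size eta ~ sqrt(ln(n) / H), regret after H rounds is O(sqrt(H ln n)),
    so the average regret beta(H)/H vanishes as required above.
    """

    def __init__(self, n: int, eta: float):
        self.eta = eta
        self.weights = np.ones(n)

    def play(self) -> np.ndarray:
        # The iterate x_t: a distribution over the class.
        return self.weights / self.weights.sum()

    def update(self, losses: np.ndarray) -> None:
        # Exponentially down-weight elements that suffered high loss l_t.
        self.weights *= np.exp(-self.eta * losses)
```

In the Dual procedures below, such an iterate would parameterize the discriminator f; in the Primal ones, the policy representation.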


Algorithm 1 AdVIL
  Input: Expert demonstrations DE, Policy class Π, Discriminator class F, Performance threshold δ, Learning rates ηf > ηπ
  Output: Trained policy π
  Set π ∈ Π, f ∈ F, L(π, f) = 2δ
  while L(π, f) > δ do
    L(π, f) = E_{(s,a)∼DE}[E_{x∼π(s)}[f(s, x)] − f(s, a)]
    f ← f + ηf ∇f L(π, f)
    π ← π − ηπ ∇π L(π, f)
  end while

5.2. Theoretical Guarantees

We are interested in efficiently obtaining a policy that is a near-equilibrium solution to the game. We consider two general strategies to do so:

Primal. We execute a no-regret algorithm on the policy representation, while a maximization oracle over the space F computes the best response to those policies.

Dual. We execute a no-regret algorithm on the space F, while a minimization oracle over policies computes entropy-regularized best response policies.

The asymmetry in the above is driven by the need to recover the equilibrium strategy for the policy player and the fact that a dual approach on the original, unregularized objective Uj(·, f) will typically not converge to a single policy but rather shift rapidly as f changes.⁵

By choosing the policy representation to be a causally conditioned probability distribution over actions, $P(\mathbb{A}^T \,\|\, \mathbb{S}^T) = \prod_{t=1}^{T} P(A_t \mid S_{1:t}, A_{1:t-1})$, we find each of the imitation games is bilinear in both the policy and the discriminator f and strongly dual.⁶ Thus, we can efficiently compute a near-equilibrium assuming access to the optimization oracles in either Primal or Dual above:

Theorem 1. Given access to the no-regret and maximization oracles in either Primal or Dual above, for all three imitation games we are able to compute a δ-approximate equilibrium strategy for the policy player in poly(1/δ, T, ln |S|, ln |A|) iterations of the outer player optimization.

⁵ The average over iterations of the policies generated in an unregularized dual will also be near-equilibrium but can be inconvenient. Entropy regularization provides a convenient way to extract a single policy and meshes well with empirical practice.
⁶ Following (Ziebart et al. 2010), we can represent the policy as an element of the causally conditioned polytope and regularize with the causal Shannon entropy H(P(A^T || S^T)) of the policy. An equivalent result can be proved for optimizing over occupancy measures (Ho and Ermon 2016).

Algorithm 2 AdRIL
  Input: Expert demonstrations DE, Policy class Π, Dynamics T, Kernel K, Performance threshold δ
  Output: Trained policy π
  Set π ∈ Π, f = 0, Dπ = {}, D′ = {}, L(π, f) = 2δ
  while L(π, f) > δ do
    f ← E_{τ∼Dπ}[Σ_t K((s_t, a_t), ·)] − E_{τ∼DE}[Σ_t K((s_t, a_t), ·)]
    π, D′ ← MaxEntRL(T = T, r = −f)
    Dπ ← Dπ ∪ D′
    L(π, f) = E_{τ∼D′}[Σ_t f(s_t, a_t)] − E_{τ∼DE}[Σ_t f(s_t, a_t)]
  end while

This result relies on a general lemma that establishes we can recover the equilibrium inner player by appropriately entropy regularizing that inner player's decisions:

Lemma 8. Entropy Regularization Lemma: By optimizing Uj(π, f) − αH(π) to a δ-approximate equilibrium, one recovers at worst a $\big(Q_M\sqrt{\tfrac{2\delta}{\alpha}} + \alpha T(\ln|\mathcal{A}| + \ln|\mathcal{S}|)\big)$-approximate equilibrium strategy for the policy player on the original game.

We prove both in Supplementary Material A. By substituting δ = ε in Theorem 1 and combining with our previously derived bounds, we can cleanly show that these theorems can also be viewed as certificates of policy performance. Thus, our framework enables us to efficiently close the imitation gap.

6. Practical Moment Matching Algorithms

We now present two practical algorithms that match either reward moments (AdRIL) or off-Q moments (AdVIL). They are specific, implementable versions of our preceding abstract procedures. At their core, both AdRIL and AdVIL optimize an IPM. IPMs are a distance between two probability distributions (Muller et al. 1997):
$$\sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim P_1}[f(x)] - \mathbb{E}_{x \sim P_2}[f(x)]$$
Plugging in the learner and expert trajectory distributions, we end up with our IPM-based objective:
$$\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathbb{E}_{\tau \sim \pi}[f(s_t, a_t)] - \mathbb{E}_{\tau \sim \pi_E}[f(s_t, a_t)] \tag{2}$$
As noted previously, this objective is equivalent to and inherits the strong guarantees of moment matching and allows our analysis to easily transfer.

6.1. AdVIL: Adversarial Value-moment Imitation Learning

Figure 4. Our proposed methods (in orange) are able to match or out-perform the baselines across a variety of continuous control tasks. J(π)s are averaged across 10 evaluation episodes. Standard errors are across 5 runs of each algorithm.

We can transform our IPM-based objective (2) into an expression only over (s, a) pairs from expert data, to fit into the off-Q moment matching framework, by performing a series of substitutions. We refer interested readers to Supplementary Material B.1. We arrive at the following expression:
$$\sup_{v \in \mathcal{F}} \mathbb{E}_{\tau \sim \pi_E}\Big[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi(s_t)}[v(s_t, a)] - v(s_t, a_t)\Big] \tag{3}$$
Intuitively, by minimizing (3) over policies π ∈ Π, one is driving down the extra cost the learner has over the expert. We term this approach Adversarial Value-moment Imitation Learning (AdVIL) because, if we view π as optimizing cumulative reward, −v could be viewed as a value function. We set the learning rate for f to be greater than that for π, making AdVIL a primal algorithm.

Practically, AdVIL bears similarity to the Wasserstein GAN (WGAN) framework (Arjovsky et al. 2017), though we consider an IPM rather than the more restricted Wasserstein distances. However, the overlap is enough that techniques from the WGAN literature, including gradient penalties on the discriminator (Gulrajani et al. 2017) and orthogonal regularization of the policy player (Brock et al. 2018), appear to help with convergence.
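To make this concrete, below is a minimal PyTorch sketch of the inner loop of Algorithm 1 (our illustration, not the released implementation), assuming a continuous action space, a Gaussian policy with a reparameterized sample, and expert minibatches (expert_s, expert_a) drawn from DE; the network sizes and dimensions are placeholders, and the WGAN-style stabilizers mentioned above would be added on top of these losses.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 26, 6  # placeholder dimensions

# Discriminator f(s, a) and a Gaussian policy pi(a | s).
f_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
pi_mean = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))

opt_f = torch.optim.Adam(f_net.parameters(), lr=1e-3)                       # eta_f
opt_pi = torch.optim.Adam(list(pi_mean.parameters()) + [log_std], lr=1e-4)  # eta_pi < eta_f

def advil_loss(expert_s, expert_a):
    """L(pi, f) = E_{(s,a)~D_E}[ E_{x~pi(s)}[f(s, x)] - f(s, a) ]."""
    x = pi_mean(expert_s) + torch.randn_like(expert_a) * log_std.exp()  # reparameterized sample
    return (f_net(torch.cat([expert_s, x], dim=-1))
            - f_net(torch.cat([expert_s, expert_a], dim=-1))).mean()

def advil_step(expert_s, expert_a):
    # Ascent step on the discriminator f (the maximization player) ...
    opt_f.zero_grad()
    (-advil_loss(expert_s, expert_a)).backward()
    opt_f.step()
    # ... followed by a smaller descent step on the policy (the minimization player).
    opt_pi.zero_grad()
    advil_loss(expert_s, expert_a).backward()
    opt_pi.step()
```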

6.2. AdRIL: Adversarial Reward-moment Imitation Learning

We now present our dual reward moment matching algorithm and refer interested readers to Supplementary Material B.2 for the full derivation. In brief, we solve for the discriminator in closed form via functional gradient descent in a Reproducing Kernel Hilbert Space (RKHS) and have the policy player compute a best response via maximum entropy reinforcement learning. Let Dk denote the aggregated dataset of policy rollouts. Assuming a constant number of training steps at each iteration and averaging functional gradients over k iterations of the algorithm, we get the cost function for the policy:
$$C(\pi_k) = \sum_{t=1}^{T} \frac{1}{|\mathcal{D}_k|} \sum_{\tau}^{\mathcal{D}_k} K([s_t, a_t], \cdot) - \frac{1}{|\mathcal{D}_E|} \sum_{\tau}^{\mathcal{D}_E} K([s_t, a_t], \cdot)$$

For an indicator kernel and under the assumption we never see the same state twice, this is equivalent to maximizing a reward function that is 1 at each expert datapoint, ∝ −1/k along previous rollouts that do not perfectly match the expert, and 0 everywhere else. We term this approach Adversarial Reward-moment Imitation Learning (AdRIL).

MOMENT    PRIMAL               DUAL
OFF-Q     VDICE, ADVIL         ✗
REWARD    GAIL                 MMP, LEARCH, MAXENT IOC, SQIL, ADRIL, A+N
ON-Q      DAGGER, GPS, IFAIL   ✗

Table 3. A taxonomy of moment matching algorithms. Bold text refers to algorithms that are IPM-based.

We note that (5) resembles the objective in SQIL (Reddy et al. 2019). SQIL can be seen as a degenerate case of AdRIL that never updates the discriminator function. This oddity removes solution quality guarantees while introducing the need for early stopping (Arenz and Neumann 2020).
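To make the indicator-kernel special case concrete, the following is a minimal sketch (ours, not the released implementation) of the reward with which replay-buffer samples could be labeled at iteration k; hashing raw (s, a) pairs like this implicitly assumes discrete or de-duplicated data.

```python
def adril_reward(s, a, expert_set, learner_counts, k):
    """Indicator-kernel AdRIL reward at iteration k (up to a positive scale factor).

    expert_set: set of (s, a) pairs appearing in the demonstrations.
    learner_counts: dict counting how often (s, a) appeared in past learner rollouts.
    Expert data gets +1, previously visited non-expert data gets -1/k, everything else 0.
    """
    if (s, a) in expert_set:
        return 1.0
    if learner_counts.get((s, a), 0) > 0:
        return -1.0 / k
    return 0.0
```

SQIL, by contrast, keeps its reward labels fixed across all iterations, which is the degeneracy referred to above.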

7. A Unifying View of Moment Matching IL

In light of our previous discussion, we can embrace a cohesive perspective of moment matching in imitation learning. The six classes are succinctly captured in Table 3.

We note that reward moment-matching dual algorithms have been a repeated success story in imitation learning (MMP: (Ratliff et al. 2006), LEARCH: (Ratliff et al. 2009), MaxEnt IOC: (Ziebart et al. 2008), (Abbeel and Ng 2004)), but that there has been comparatively less work done in off-Q and on-Q dual algorithms (iFAIL: (Sun et al. 2019), GPS: (Levine and Koltun 2013)) than in the corresponding primal formulations.

8. Experiments

We test our algorithms against several baselines on several higher-dimensional continuous control tasks from the PyBullet suite (Coumans and Bai 2016–2019). We measure the performance of off-Q algorithms as a function of the amount of data provided with a fixed maximum computational budget, and of reward moment-matching algorithms as a function of the amount of environment interaction. We see from Fig. 4 that AdVIL can match the performance of ValueDICE and Behavioral Cloning across most tasks. AdRIL performs better than GAIL across all environments and does not exhibit the catastrophic collapse in performance SQIL does on the tested environments. On both environments, behavioral cloning is able to recover the optimal policy with enough data, indicating there is little covariate shift (Spencer et al. 2021). However, on HalfCheetah, we see AdRIL recover a strong policy with far less data than it takes Behavioral Cloning to, showcasing the potential benefit of the learner observing the consequences of their own actions. We refer readers to Supplementary Material C for a description of our hyperparameters and setup. Notably, AdVIL is able to converge reliably to a policy of the same quality as that found by ValueDICE with an order of magnitude less compute.

We release our code:

https://github.com/gkswamy98/pillbox

9. Discussion

In this work, we tease apart the differences in requirements and performance guarantees that come from matching reward, on-Q, and off-Q adversarially chosen moments. Reward moment matching has strong guarantees but requires access to an accurate simulator or the real world. Off-Q moment matching can be done purely on collected data but incurs an O(εT²) imitation gap.

We formalize a notion of recoverability that is both necessary and sufficient to understand recovering from errors in imitation learning. If a problem (due to the expert or the MDP itself) is O(T)-recoverable, no algorithm can escape an O(T²) compounding of errors; if it is O(1)-recoverable, we find on-policy algorithms prevent compounding. Together, these constitute a cohesive picture of moment matching in imitation learning.

We derive idealized no-regret procedures and practical IPM-based algorithms that are conceptually elegant and correct for difficulties encountered by prior methods. As it operates in value-space, AdVIL helps prevent headaches from small errors in policy-space compounding to large errors in value-space, which might make it preferable to behavioral cloning on some tasks. AdRIL is simple to implement and enjoys strong performance guarantees.

Acknowledgments

We thank Siddharth Reddy for helpful discussions. We also thank Allie Del Giorno, Anirudh Vemula, and the members of the Social AI Group for their comments on drafts of this work. ZSW was supported in part by the NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant. Lastly, GS would like to thank John Steinbeck for inspiring the title of this work:

“As happens sometimes, a moment settled and hovered and remained for much more than a moment.” – John Steinbeck, Of Mice and Men.

References

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1, 2004.

Oleg Arenz and Gerhard Neumann. Non-adversarial imitation learning and its connections to adversarial methods, 2020.

Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN, 2017.

James Bagnell, Sham M Kakade, Jeff Schneider, and Andrew Ng. Policy search by dynamic programming. Advances in neural information processing systems, 16:831–838, 2003.

Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, and Derek Nowrouzezahrai. Adversarial soft advantage fitting: Imitation learning without policy optimization, 2020.

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016. URL http://arxiv.org/abs/1604.07316.

Kiante Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In Eighth International Conference on Learning Representations (ICLR), April 2020.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Erwin Coumans and Yunfei Bai. PyBullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.

Robert Dadashi, Leonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal Wasserstein imitation learning. arXiv preprint arXiv:2006.04678, 2020.


Miroslav Dudik, Steven J Phillips, and Robert E Schapire. Performance guarantees for regularized maximum entropy density estimation. In International Conference on Computational Learning Theory, pages 472–486. Springer, 2004.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning, 2018.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in neural information processing systems, pages 5767–5777, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning, 2016.

Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Strictly batch imitation learning by energy-based distribution matching, 2020.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning, 2002.

Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha S. Srinivasa. Imitation learning as f-divergence minimization. CoRR, abs/1905.12888, 2019. URL http://arxiv.org/abs/1905.12888.

Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching, 2019.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.

David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information, 2020.

K-R Muller, Alexander J Smola, Gunnar Ratsch, Bernhard Scholkopf, Jens Kohlmorgen, and Vladimir Vapnik. Predicting time series with support vector machines. In International Conference on Artificial Neural Networks, pages 999–1004. Springer, 1997.

Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.

Antonin Raffin. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo, 2020.

Antonin Raffin, Ashley Hill, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, and Noah Dormann. Stable Baselines3. https://github.com/DLR-RM/stable-baselines3, 2019.

Nived Rajaraman, Lin F. Yang, Jiantao Jiao, and Kannan Ramachandran. Toward the fundamental limits of imitation learning, 2020.

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736, 2006.

Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

Siddharth Reddy, Anca D. Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards, 2019.

Nicholas Rhinehart, Kris M Kitani, and Paul Vernaza. R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.

Stephane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning, 2014.

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. CoRR, abs/1807.09936, 2018. URL http://arxiv.org/abs/1807.09936.


Jonathan Spencer, Sanjiban Choudhury, Arun Venkatraman, Brian Ziebart, and J. Andrew Bagnell. Feedback in imitation learning: The three regimes of covariate shift, 2021.

Wen Sun, Geoffrey J Gordon, Byron Boots, and J Bagnell. Dual policy iteration. Advances in Neural Information Processing Systems, 31:7059–7069, 2018.

Wen Sun, Anirudh Vemula, Byron Boots, and J. Andrew Bagnell. Provably efficient imitation learning from observation alone, 2019.

Gokul Swamy, Siddharth Reddy, Sergey Levine, and Anca D Dragan. Scaled autonomy: Enabling human operators to control robot fleets. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 5942–5948. IEEE, 2020.

Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pages 1032–1039, 2008.

Steven Wang, Sam Toyer, Adam Gleave, and Scott Emmons. The imitation library for imitation learning and inverse reinforcement learning. https://github.com/HumanCompatibleAI/imitation, 2020.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008.

Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. 2010.


Supplementary Material

A. Proofs

A.1. Proof of Lemma 7

Proof. Consider the following MDP:


Figure 5. In UNICYCLE, the goal is to stay in s1 forever.

As illustrated above, c(s, a) = 1_{s2} is the cost function for this MDP. Let the expert perfectly optimize this function by always taking a1 in s1. Thus, we are in the O(T)-recoverable setting. Then, for any ε > 0, if the learner takes a2 in s1 with probability ε,
$$J(\pi_E) - J(\pi) = \sum_{t=1}^{T} \varepsilon (1 - \varepsilon)^{t-1} (T - t) = \Omega(\varepsilon T^2).$$
There is only one action in s2, so it is not possible to have a nonzero classification error in this state.

A.2. Proof of Entropy Regularization Lemma

Proof. We optimize in the P(A^T || S^T) policy representation, where strong duality holds, and define the following:
$$\pi_R = \arg\min_{\pi \in \Pi} \Big( \max_{f \in \mathcal{F}} U_j(\pi, f) - \alpha H(\pi) \Big)$$
First, we derive a bound on the distance between π and πR. We define M as follows:
$$M(\pi) = \max_{f \in \mathcal{F}} U_j(\pi, f) - \alpha H(\pi) + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|)$$
M is an α-strongly convex function with respect to ‖·‖₁ because U is a max of linear functions, −H is 1-strongly convex, and the third term is a constant. This tells us that:
$$M(\pi_R) - M(\pi) \leq \nabla M(\pi_R)^\top (\pi_R - \pi) - \frac{\alpha}{2} \|\pi_R - \pi\|_1^2$$
We note that because πR minimizes M, the first term on the RHS is negative, allowing us to simplify this expression to:
$$\frac{\alpha}{2} \|\pi_R - \pi\|_1^2 \leq M(\pi) - M(\pi_R)$$
We now upper bound the RHS of this expression via the following series of substitutions:
$$\begin{aligned}
M(\pi) &= \max_{f \in \mathcal{F}} U_j(\pi, f) - \alpha H(\pi) + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) \\
&\leq U_j(\pi, f) - \alpha H(\pi) + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) + \delta \\
&\leq U_j(\pi_R, f) - \alpha H(\pi_R) + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) + \delta \\
&\leq \max_{f \in \mathcal{F}} U_j(\pi_R, f) - \alpha H(\pi_R) + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) + \delta \\
&= M(\pi_R) + \delta
\end{aligned}$$
Rearranging terms to get the desired bound on strategy distance:
$$M(\pi) - M(\pi_R) \leq \delta \;\Rightarrow\; \|\pi_R - \pi\|_1^2 \leq \frac{2\delta}{\alpha} \;\Rightarrow\; \|\pi_R - \pi\|_1 \leq \sqrt{\frac{2\delta}{\alpha}}$$
Next, we prove that πR is an αT(ln|A| + ln|S|)-approximate equilibrium strategy for the original, unregularized game. We note that H(π) ∈ [0, T(ln|A| + ln|S|)] and then proceed as follows:
$$\begin{aligned}
\max_{f \in \mathcal{F}} U_j(\pi_R, f) &= M(\pi_R) + \alpha H(\pi_R) - \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) \\
&\leq M(\pi_R) \\
&\leq \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|)
\end{aligned}$$
The last line comes from the fact that playing the optimal strategy in the original game on the regularized game could at worst lead to a payoff of αT(ln|A| + ln|S|). Therefore, the value of the regularized game can at most be this quantity. Recalling that the value of the original game is 0 and rearranging terms, we get:
$$\max_{f \in \mathcal{F}} U_j(\pi_R, f) - \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|) \leq 0 = \max_{f \in \mathcal{F}} \min_{\pi \in \Pi} U_j(\pi, f)$$
Thus, by definition, πR must be half of an αT(ln|A| + ln|S|)-approximate equilibrium strategy pair.

Next, let QM denote the absolute difference between the minimum and maximum Q-value. For a fixed f, the maximum amount the policy player could gain from switching to policies within an L1 ball of radius r centered at the original policy is rQM, by the bilinearity of the game and Holder's inequality. Because the supremum over k-Lipschitz functions is known to be k-Lipschitz, this implies the same is true for the payoff against the best response f. To complete the proof, we can set r = √(2δ/α) and combine this with the fact that πR achieves in the worst case a payoff of αT(ln|A| + ln|S|) to prove that π can at most achieve a payoff of QM√(2δ/α) + αT(ln|A| + ln|S|) on the original game, which establishes π as a (QM√(2δ/α) + αT(ln|A| + ln|S|))-approximate equilibrium solution.

A.3. Proof of Theorem 1

We proceed in cases.

Proof. We first consider the primal case. Our goal is to compute a policy π such that:
$$\max_{f \in \mathcal{F}} U_j(\pi, f) \leq \delta$$
We prove that such a policy can be found efficiently by executing the following procedure for a polynomially large number of iterations:

1. For t = 1 . . . N, do:
2.   No-regret algorithm computes π^t.
3.   Set f^t to the best response to π^t.
4. Return π = π^{t*}, t* = arg min_t Uj(π^t, f^t).

Recall that via our no-regret assumption we know that
$$\frac{1}{N} \sum_{t}^{N} U_j(\pi^t, f^t) - \frac{1}{N} \min_{\pi \in \Pi} \sum_{t}^{N} U_j(\pi, f^t) \leq \frac{\beta_\Pi(N)}{N} \leq \delta$$
for some N that is poly(1/δ). We can rearrange terms and use the fact that πE ∈ Π to upper bound the average payoff:
$$\frac{1}{N} \sum_{t}^{N} U_j(\pi^t, f^t) \leq \delta + \frac{1}{N} \min_{\pi \in \Pi} \sum_{t}^{N} U_j(\pi, f^t) \leq \delta$$
Using the property that there must be at least one element in an average that is at most the value of the average:
$$\min_{t} U_j(\pi^t, f^t) \leq \frac{1}{N} \sum_{t}^{N} U_j(\pi^t, f^t) \leq \delta$$
To complete the proof, we recall that f^t is chosen as the best response to π^t, giving us that:
$$\min_{t} \max_{f \in \mathcal{F}} U_j(\pi^t, f) \leq \delta$$
In words, this means that by setting π to the policy with the lowest loss out of the N computed, we are able to efficiently (within poly(1/δ) iterations) find a δ-approximate equilibrium strategy for the policy player. Note that this result holds without assuming a finite S and A and does not require regularization of the policy. However, it requires us to have a no-regret algorithm over Π, which can be a challenge for the reward moment-matching game.

We now consider the dual case. As before, we wish to find a policy π such that:
$$\max_{f \in \mathcal{F}} U_j(\pi, f) \leq \delta$$
We run the following procedure on Uj(π, f) − αH(π):

1. For t = 1 . . . N, do:
2.   No-regret algorithm computes f^t.
3.   Set π^t to the best response to f^t.
4. Return π̂ = arg min_{π∈Π} Uj(π, f̄) − αH(π).

By the classic result of (Freund and Schapire 1997), we know that the average of the N iterates produced by the above procedure (which we denote f̄ and π̄) is a δ′-approximate equilibrium strategy for some N that is poly(1/δ′). Applying our Entropy Regularization Lemma, we can upper bound the payoff of π̄ on the original game:
$$\sup_{f \in \mathcal{F}} U_j(\bar{\pi}, f) \leq Q_M \sqrt{\frac{2\delta'}{\alpha}} + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|)$$
We now proceed similarly to our proof of the Entropy Regularization Lemma, by first bounding the distance between π̂ and π̄ and then appealing to the QM-Lipschitzness of Uj. Let l(π) = Uj(π, f̄) − αH(π). Then, keeping in mind the fact that l is α-strongly convex:
$$\begin{aligned}
l(\hat{\pi}) - l(\bar{\pi}) &\leq \nabla l(\hat{\pi})^\top (\hat{\pi} - \bar{\pi}) - \frac{\alpha}{2} \|\hat{\pi} - \bar{\pi}\|_1^2 \\
\Rightarrow \frac{\alpha}{2} \|\hat{\pi} - \bar{\pi}\|_1^2 &\leq l(\bar{\pi}) - l(\hat{\pi}) + \nabla l(\hat{\pi})^\top (\hat{\pi} - \bar{\pi}) \\
\Rightarrow \frac{\alpha}{2} \|\hat{\pi} - \bar{\pi}\|_1^2 &\leq l(\bar{\pi}) - l(\hat{\pi}) \\
\Rightarrow \frac{\alpha}{2} \|\hat{\pi} - \bar{\pi}\|_1^2 &\leq \delta' \\
\Rightarrow \|\hat{\pi} - \bar{\pi}\|_1 &\leq \sqrt{\frac{2\delta'}{\alpha}}
\end{aligned}$$
As before, the second-to-last step follows from the definition of a δ′-approximate equilibrium. Now, by the bilinearity of the game, Holder's inequality, and the fact that the supremum over k-Lipschitz functions is known to be k-Lipschitz, we can state that:
$$\sup_{f \in \mathcal{F}} U_j(\hat{\pi}, f) \leq 2 Q_M \sqrt{\frac{2\delta'}{\alpha}} + \alpha T (\ln|\mathcal{A}| + \ln|\mathcal{S}|)$$
To ensure that the LHS of this expression is upper bounded by δ, it is sufficient to set
$$\alpha = \frac{\delta}{2T(\ln|\mathcal{A}| + \ln|\mathcal{S}|)}, \qquad \delta' = \frac{\delta^2 \alpha}{32 Q_M^2}.$$
Plugging in these terms, we arrive at:
$$\sup_{f \in \mathcal{F}} U_j(\hat{\pi}, f) \leq \frac{\delta}{2} + \frac{\delta}{2} \leq \delta$$
We note that in practice, α is a rather sensitive hyperparameter of maximum entropy reinforcement learning algorithms (Haarnoja et al. 2018) and hope that the above expression might provide some rough guidance for how to set α. To complete the proof, note that N is poly(1/δ′) and
$$\frac{1}{\delta'} = \frac{64 Q_M^2 T (\ln|\mathcal{A}| + \ln|\mathcal{S}|)}{\delta^3}.$$
Thus, N is poly(1/δ, T, ln|A|, ln|S|).
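As a small illustration of this guidance (our own example, with arbitrary problem sizes), the prescribed regularization weight and inner tolerance can be computed directly:

```python
import math

def maxent_alpha(delta: float, T: int, n_actions: int, n_states: int) -> float:
    """alpha = delta / (2 T (ln|A| + ln|S|)), as in the proof above."""
    return delta / (2 * T * (math.log(n_actions) + math.log(n_states)))

def inner_tolerance(delta: float, alpha: float, q_m: float) -> float:
    """delta' = delta^2 * alpha / (32 * Q_M^2)."""
    return delta ** 2 * alpha / (32 * q_m ** 2)

alpha = maxent_alpha(delta=0.1, T=100, n_actions=4, n_states=1000)
print(alpha, inner_tolerance(delta=0.1, alpha=alpha, q_m=100.0))
```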

B. Algorithm DerivationsB.1. AdVIL Derivation

We begin by performing the following substitution: f =v − Bπv, where

Bπv = Est+1∼T (·|st,at),at+1∼π(st+1)

[v]

is the expected Bellman operator under the learner’s currentpolicy. Our objective (2) then becomes:

supv∈F

T∑t=1

Eτ∼π

[v(st, at)− Bπv(st, at)]

− Eτ∼πE

[v(st, at)− Bπv(st, at)]

This expression telescopes over time, simplifying to:

supv∈F

Eτ∼π

[v(s0, a0)]−T∑t=1

Eτ∼πE

[v(st, at)− Bπv(st, at)]

We approximate Bπv via a single-sample estimate fromthe respective expert trajectory, yielding the following off-policy expression:

supv∈F

Eτ∼π

[v(s0, a0)]−T∑t=1

Eτ∼πE

[v(st, at)− Ea∼π(st+1)

[v(st+1, a)]]

This resembles the form of the objective in ValueDICE (Kostrikov et al. 2019) but without requiring us to take the expectation of the exponentiated discriminator. We can further simplify this objective by noticing that trajectories generated by $\pi_E$ and $\pi$ have the same starting state distribution:

$\sup_{v \in \mathcal{F}} \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} \mathbb{E}_{a \sim \pi(s_t)}[v(s_t, a)] - v(s_t, a_t)\right] \quad (4)$

We also note that this AdVIL objective can be derived straightforwardly via the Performance Difference Lemma.
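As a rough illustration, objective (4) can be estimated on a batch of expert transitions as in the following PyTorch-style sketch. The critic module v_net, the policy interface (returning a distribution with rsample()), and the replacement of the sum over timesteps by a batch average are assumptions of the sketch, not the released implementation.

import torch

def advil_objective(v_net, policy, expert_states, expert_actions):
    # Monte-Carlo estimate of objective (4) on a batch of expert (s, a) pairs.
    # v_net(s, a) -> per-sample scalar critic values; policy(s) -> a torch.distributions
    # object supporting rsample() (both assumed interfaces).
    sampled_actions = policy(expert_states).rsample()       # a ~ pi(s_t), single sample
    learner_term = v_net(expert_states, sampled_actions)    # E_{a ~ pi(s_t)}[v(s_t, a)]
    expert_term = v_net(expert_states, expert_actions)      # v(s_t, a_t)
    return (learner_term - expert_term).mean()

# The critic ascends this quantity while the policy descends it, e.g.
#   loss_v  = -advil_objective(v_net, policy, s_E, a_E)   # gradient ascent on v
#   loss_pi =  advil_objective(v_net, policy, s_E, a_E)   # gradient descent on pi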

B.2. AdRIL Derivation

Let $\mathcal{F}$ be an RKHS equipped with kernel $K : (\mathcal{S} \times \mathcal{A}) \times (\mathcal{S} \times \mathcal{A}) \to \mathbb{R}$. On iteration $k$ of the algorithm, consider a purely cosmetic variation of our IPM-based objective (2):

$\sup_{c \in \mathcal{F}} \sum_{t=1}^{T} \left(\mathbb{E}_{\tau \sim \pi_k}[c(s_t, a_t)] - \mathbb{E}_{\tau \sim \pi_E}[c(s_t, a_t)]\right) = \sup_{c \in \mathcal{F}} L_k(c)$

We evaluate the first expectation by collecting on-policy rollouts into a dataset $D_k$ and the second by sampling from a fixed set of expert demonstrations $D_E$. Assume that $|D_k|$ is constant across iterations. Let $E$ be the evaluation functional. Then, taking the functional gradient:

$\nabla_c L_k(c) = \sum_{t=1}^{T} \frac{1}{|D_k|}\sum_{\tau \in D_k} \nabla_c E[c; (s_t, a_t)] - \frac{1}{|D_E|}\sum_{\tau \in D_E} \nabla_c E[c; (s_t, a_t)]$

$= \sum_{t=1}^{T} \frac{1}{|D_k|}\sum_{\tau \in D_k} K([s_t, a_t], \cdot) - \frac{1}{|D_E|}\sum_{\tau \in D_E} K([s_t, a_t], \cdot)$

where $K$ could be a state-action indicator ($\mathbb{1}_{s,a}$) in discrete spaces and relaxed to a Gaussian in continuous spaces. Let $\mathcal{D}_k = \bigcup_{i=0}^{k} D_i$ be the aggregation of all previous $D_i$. Averaging functional gradients over iterations of the algorithm (which, other than a scale factor that does not affect the optimal policy, is equivalent to having a constant learning rate of 1), we get the cost function our policy tries to minimize:

$C(\pi_k) = \sum_{i=0}^{k} \nabla_c L_i(c) = \sum_{t=1}^{T} \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k} K([s_t, a_t], \cdot) - \frac{1}{|D_E|}\sum_{\tau \in D_E} K([s_t, a_t], \cdot) \quad (5)$
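To make (5) concrete, the sketch below evaluates the cost at a single state-action pair using a Gaussian kernel with an assumed bandwidth (a state-action indicator would take its place in discrete spaces); flattening the per-timestep sums into a single average over stored pairs is a simplification for illustration.

import numpy as np

def adril_cost(query_sa, learner_sa, expert_sa, bandwidth=1.0):
    # query_sa:   a single concatenated [s, a] vector at which to evaluate the cost
    # learner_sa: array of aggregated learner [s, a] pairs (the union of the D_i)
    # expert_sa:  array of expert [s, a] pairs from D_E
    def kernel(X, x):
        return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    return kernel(learner_sa, query_sa).mean() - kernel(expert_sa, query_sa).mean()

# The policy player then minimizes this cost, e.g. by running an off-policy RL
# algorithm (SAC in our experiments, see Appendix C.3) on the reward -adril_cost(...).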

C. Experimental Setup

C.1. Expert

We use the Stable Baselines 3 (Raffin et al. 2019) implementations of PPO (Schulman et al. 2017) and SAC (Haarnoja et al. 2018) to train experts for each environment, mostly using the already tuned hyperparameters from (Raffin 2020). Specifically, we use the modifications to the Stable Baselines defaults listed in Tables 4 and 5.
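As a sketch of how the entries of Tables 4 and 5 map onto the Stable Baselines 3 SAC constructor (the PyBullet environment id and the surrounding script are assumptions for illustration, not our exact training code):

import gym
import pybullet_envs  # registers the Bullet tasks with gym
from stable_baselines3 import SAC

env = gym.make("HalfCheetahBulletEnv-v0")
expert = SAC(
    "MlpPolicy", env,
    buffer_size=300_000, batch_size=256, gamma=0.98, tau=0.02,
    train_freq=64, gradient_steps=64, learning_rate=7.3e-4,
    policy_kwargs=dict(net_arch=[256, 256]),  # 256 x 2 architecture
    use_sde=True,                             # state-dependent exploration
)
expert.learn(total_timesteps=1_000_000)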

C.2. Baselines

For all learning algorithms, we perform 5 runs and use a common architecture of 256 x 2 with ReLU activations. For each datapoint, we average the cumulative reward of 10 trajectories. For offline algorithms, we train on 5, 10, 15, 20, and 25 expert trajectories with a maximum of 500k iterations of the optimization procedure. For online algorithms, we train on a fixed number of trajectories (5 for HalfCheetah and 20 for Ant) for 500k environment steps. For GAIL (Ho and Ermon 2016) and behavioral cloning (Pomerleau 1989), we use the implementation produced by (Wang et al. 2020), with the changes from the default values given in Tables 6 and 7 for all tasks.


PARAMETER                      VALUE
BUFFER SIZE                    300000
BATCH SIZE                     256
γ                              0.98
τ                              0.02
TRAINING FREQ.                 64
GRADIENT STEPS                 64
LEARNING RATE                  7.3E-4
POLICY ARCHITECTURE            256 X 2
STATE-DEPENDENT EXPLORATION    TRUE
TRAINING TIMESTEPS             1E6

Table 4. Expert hyperparameters for HalfCheetah Bullet Task.

PARAMETER                      VALUE
BUFFER SIZE                    300000
BATCH SIZE                     256
γ                              0.98
τ                              0.02
TRAINING FREQ.                 64
GRADIENT STEPS                 64
LEARNING RATE                  7.3E-4
POLICY ARCHITECTURE            256 X 2
STATE-DEPENDENT EXPLORATION    TRUE
TRAINING TIMESTEPS             1E6

Table 5. Expert hyperparameters for Ant Bullet Task.


For SQIL (Reddy et al. 2019), we build a custom implementation on top of Stable Baselines with feedback from the authors. As seen in Table 9, we use similar parameters for SAC to those we used for training the expert.

We modify the open-sourced code for ValueDICE (Kostrikov et al. 2019) to be actually off-policy with feedback from the authors. The publicly available version of the ValueDICE code uses on-policy samples to compute a regularization term, even when it is turned off in the flags. We release our version at https://github.com/gkswamy98/valuedice. We use the default hyperparameters for all experiments (and thus train for 500k steps).

ENV.           EXPERT    BC PERFORMANCE
HALFCHEETAH    2154      2083
ANT            2585      2526

Table 6. With enough data (25 trajectories) and 100k steps of gradient descent, behavioral cloning is able to solve all tasks considered, replicating the results of (Spencer et al. 2021). However, other approaches are able to perform better when there is less data available.

PARAMETER             VALUE
ENTROPY WEIGHT        0
L2 WEIGHT             0
TRAINING TIMESTEPS    5E5

Table 7. Learner hyperparameters for Behavioral Cloning.

PARAMETER            VALUE
NUM STEPS            1024
EXPERT BATCH SIZE    32

Table 8. Learner hyperparameters for GAIL.

C.3. Our Algorithms

In this section, we use magenta text to highlight sensitive hyperparameters. Similarly to SQIL, AdRIL is built on top of the Stable Baselines implementation of SAC. AdVIL is written in pure PyTorch. We use the same network architecture choices as for the baselines. For AdRIL, we use the hyperparameters in Table 10 across all experiments.

We note that AdRIL requires careful tuning of f Update Freq. for strong performance. To find the value specified, we ran trials with 1250, 2500, 5000, 12500, 25000, and 50000 and selected the one that achieved the most stable updates. In practice, we would recommend evaluating a trained policy on a validation set to set this parameter. We also note that because SAC is an off-policy algorithm, we are free to initialize the learner by adding all expert samples to the replay buffer at the start, as is done for SQIL.

We change one parameter between environments for AdRIL: for HalfCheetah, we perform standard sampling from the replay buffer, while for Ant we sample an expert trajectory with $p = \frac{1}{2}$ and a learner trajectory otherwise, similar to SQIL. We find that for certain environments, this modification can somewhat increase the stability of updates, while for other environments it can significantly hamper learner performance. We recommend trying both options if possible but defaulting to standard sampling.
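A minimal sketch of the balanced-sampling rule (applied here per batch rather than per trajectory for simplicity; buffer objects exposing a .sample(batch_size) method are assumed):

import numpy as np

def sample_batch(expert_buffer, learner_buffer, batch_size, balanced, rng=np.random):
    # balanced=True (Ant): draw the batch from the expert data with probability 1/2
    # and from the learner replay buffer otherwise. balanced=False (HalfCheetah):
    # sample the replay buffer as usual (it already contains the expert samples).
    if balanced and rng.random() < 0.5:
        return expert_buffer.sample(batch_size)
    return learner_buffer.sample(batch_size)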

For AdVIL, we use the hyperparameters in Table 11 across all tasks. Empirically, small learning rates, large batch sizes, and regularization of both players are critical to stable convergence. We find that AdVIL converges significantly more quickly than ValueDICE, requiring only 50k steps for HalfCheetah and 100k steps for Ant instead of 500k steps for both tasks. However, we also find that running AdVIL for longer than these prescribed amounts can lead to a collapse of policy performance. Fortunately, this can easily be caught by watching for sudden and large fluctuations in policy loss after a long period of steady decreases. One can perform this early-stopping check without access to the environment.
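One possible form of such an early-stopping check is sketched below; the window size and spike threshold are illustrative assumptions rather than tuned values.

def should_stop(policy_losses, window=1000, spike_factor=3.0):
    # Flag a sudden, large fluctuation in the recent policy losses relative to the
    # typical step-to-step change over the preceding window.
    if len(policy_losses) < 2 * window:
        return False
    prev = policy_losses[-2 * window:-window]
    recent = policy_losses[-window:]
    baseline = sum(abs(a - b) for a, b in zip(prev[1:], prev[:-1])) / (window - 1)
    spike = max(abs(a - b) for a, b in zip(recent[1:], recent[:-1]))
    return spike > spike_factor * max(baseline, 1e-8)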


PARAMETER         VALUE
γ                 0.98
τ                 0.02
TRAINING FREQ.    64
GRADIENT STEPS    64
LEARNING RATE     LINEAR SCHEDULE OF 7.3E-4

Table 9. Learner hyperparameters for SQIL.

PARAMETER         VALUE
γ                 0.98
τ                 0.02
TRAINING FREQ.    64
GRADIENT STEPS    64
LEARNING RATE     LINEAR SCHEDULE OF 7.3E-4
f UPDATE FREQ.    1250

Table 10. Learner hyperparameters for AdRIL.


D. Critique of Strictly Batch IL

We now focus on a recently proposed method for off-Q imitation learning: Energy-based Distribution Matching (EDM) (Jarrett et al. 2020). The authors propose that by fixing the policy to be a member of the exponential family:

$\pi_\theta(a|s) = \frac{e^{f_\theta(s)[a]}}{\sum_{a'} e^{f_\theta(s)[a']}}$

they are able to optimize the KL divergence from the demonstrator's state-action marginal distribution to that of the learner:

$J(\pi_\theta) = \mathbb{E}_{s,a \sim p_D}[-\log p_{\pi_\theta}(s, a)] = \mathbb{E}_{s,a \sim p_D}[-\log \pi_\theta(a|s) - \log p_\theta(s)]$

The first term under the expectation is the behavioral cloning objective, while the second is a product of marginalizing over time. To evaluate the latter on-policy term, the authors propose defining an energy-based model that captures the visitation distribution of states of the learner's policy:

$E_\theta(s) = \log \sum_a e^{f_\theta(s)[a]}, \qquad p_\theta(s) = \frac{e^{-E_\theta(s)}}{\sum_{s'} e^{-E_\theta(s')}}$

and sampling from it via Langevin MCMC. Notice that the transition probabilities of the MDP do not appear anywhere in this expression, which is a cause for concern. The authors proceed to argue that they can circumvent this fact because the exponential family is invariant to an additive shift applied to each $f_\theta(s)[\cdot]$ score, and one could use these $|\mathcal{S}|$ degrees of freedom to capture the state visitation distribution of the policy.

[Figure 6: plot of $J(\pi)$ versus environment steps on AntBulletEnv-v0 with 20 expert demos, comparing AdRIL, Balanced AdRIL, and $\pi_E$.]

Figure 6. A comparison of balanced vs. unbalanced sampling for AdRIL on the Ant environment. For certain tasks, balanced sampling can help with the stability of updates.

PARAMETER                            VALUE
ηπ                                   8E-6
ηf                                   8E-4
BATCH SIZE                           1024
f GRADIENT TARGET                    0.4
f GRADIENT PENALTY WEIGHT            10
π ORTHOGONAL REGULARIZATION          1E-4
π MSE REGULARIZATION WEIGHT          0.2
NORMALIZE STATES WITH EXPERT DATA    TRUE
NORMALIZE ACTIONS TO [-1, 1]         TRUE
GRADIENT NORM CLIPPING               [-40, 40]

Table 11. Learner hyperparameters for AdVIL.

While there does exist a state-dependent additive shift that captures the true state visitation distribution of the policy (because we have $|\mathcal{S}|$ degrees of freedom and could use one per state), there is no guarantee that the state distribution defined above is the same as that of the policy. The following policy parameterization has the same degrees of freedom as the one the authors assume:

$\pi_\theta(a|s) = \frac{e^{f_\theta(s)[a] + g(s)}}{\sum_{a'} e^{f_\theta(s)[a'] + g(s)}} \quad \text{s.t.} \quad \forall s, \; \sum_a f_\theta(s)[a] = 0$

and would have the following sampling distribution:

$p_\theta(s) = \frac{e^{-g(s)} \sum_a e^{f_\theta(s)[a]}}{\sum_{s'} \left(e^{-g(s')} \sum_a e^{f_\theta(s')[a]}\right)}$

Notice that because the softmax operator is invariant to an additive shift applied to all logits, such as $g(s)$, the policy's true state visitation distribution depends only on $f_\theta$. For the authors' claim that the model distribution $p_\theta(s)$ matches the policy's true state visitation distribution (Lemma 1 in (Jarrett et al. 2020)) to be true, we would need to be able to arbitrarily


set $g(s)$ and still preserve equality. Concretely, consider setting $g(s) = \log \sum_a e^{f_\theta(s)[a]}$. Then,

$p_\theta(s) = \frac{e^{-\log \sum_a e^{f_\theta(s)[a]}} \sum_a e^{f_\theta(s)[a]}}{\sum_{s'} \left(e^{-\log \sum_a e^{f_\theta(s')[a]}} \sum_a e^{f_\theta(s')[a]}\right)} = \frac{1}{\sum_{s'} 1}$

This is a uniform distribution over the state space, regardless of the policy. We can easily select policy parameters $\theta$ for any MDP without trivial transition dynamics such that the policy's true state visitation distribution is not $U(\mathcal{S})$, and this property is also extremely unlikely to be satisfied by most policies encountered during training. Thus, regardless of the efficacy of the authors' proposed objective, their optimization procedure is quite unlikely to actually be optimizing it.

We can make a stronger claim when we do not have enough degrees of freedom in our policy class to represent all policies and state visitation frequencies simultaneously, as is likely to be the case for continuous state spaces: the authors' objective can make it harder to optimize the BC objective. This directly contradicts the authors' claim that they are performing multitask learning.

For simplicity, consider an MDP with arbitrary dynamics and two states, $s_1$ and $s_2$, each of which has two actions, $a_1$ and $a_2$. Assume that both the learner and expert have the following policy class:

$\pi_\theta(a_2|s_1) = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad \pi_\theta(a_2|s_2) = \frac{e^{k\theta}}{1 + e^{k\theta}}$

Note that this means that the learner can exactly match expert behavior. Initialize $\theta = 0$ and set $\theta_E = 1$. Then, $\mathbb{E}_{(s,a) \sim D_E}[\nabla_\theta \log \pi_\theta] > 0$. Now, considering the second term in the authors' objective:

$\log p_\theta(s_2) = \log(1 + e^{k\theta}) - \log(1 + e^{\theta} + 1 + e^{k\theta})$

$\nabla_\theta \log p_\theta(s_2) = \frac{k e^{k\theta}}{1 + e^{k\theta}} - \frac{e^{\theta} + k e^{k\theta}}{2 + e^{\theta} + e^{k\theta}}$

We can plug in our initial condition of $\theta = 0$ to arrive at:

$\nabla_\theta \log p_\theta(s_2) = \frac{k}{2} - \frac{1 + k}{4} = \frac{k - 1}{4}$

Thus, by setting $k = \frac{1}{2} < 1$, we have $\nabla_\theta \log p_\theta(s_2) < 0$, which makes it more difficult to optimize for the BC task. Intuitively, for $0 < k < 1$, the expert prefers $a_2$ in both states, but prefers it more in $s_1$. However, without knowing anything about the transition dynamics of the MDP, one has no idea how this relative preference translates to the state visitation distribution. Perhaps even more concerningly, at $\theta = 1$ (with $k = \frac{1}{2}$), $\nabla_\theta \log p_\theta(s_2) \approx -0.245 \neq 0$, pushing us away from the optimal $\theta$. Following $\nabla_\theta \log p_\theta$ is therefore an arbitrary, and in both of these cases incorrect, choice.
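Both gradient values above can be checked numerically, for instance with the following sympy snippet (illustrative only):

import sympy as sp

theta, k = sp.symbols("theta k", real=True)
log_p_s2 = sp.log(1 + sp.exp(k * theta)) - sp.log(2 + sp.exp(theta) + sp.exp(k * theta))
grad = sp.diff(log_p_s2, theta)

print(sp.simplify(grad.subs(theta, 0)))                    # equals (k - 1)/4
print(sp.N(grad.subs({theta: 1, k: sp.Rational(1, 2)})))   # approximately -0.245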