A Kernel Loss for Solving the Bellman Equation

Yihao Feng 1 Lihong Li 2 Qiang Liu 1

Abstract

Value function learning plays a central role in many state-of-the-art reinforcement-learning algorithms. Many popular algorithms like Q-learning do not optimize any objective function, but are fixed-point iterations of some variant of the Bellman operator that is not necessarily a contraction. As a result, they may easily lose convergence guarantees, as can be observed in practice. In this paper, we propose a novel loss function that can be optimized using standard gradient-based methods without risking divergence. The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for the double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, on either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks.

1. Introduction

The goal of a reinforcement learning (RL) agent is to optimize its policy to maximize the long-term return through repeated interaction with an external environment. The interaction is often modeled as a Markov decision process, whose value functions are the unique fixed points of their corresponding Bellman operators. Many state-of-the-art algorithms, including TD(λ), Q-learning, and actor-critic, have value function learning as a key component (Sutton & Barto, 2018).

A fundamental property of the Bellman operator is that it is a contraction in the value function space under the ℓ∞-norm (Puterman, 1994). Therefore, starting from any bounded initial function, repeated application of the operator converges to the correct value function. A number of algorithms are inspired by this property, such as temporal difference (Sutton, 1988) and its many variants (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018). Unfortunately, when function approximation such as neural networks is used to represent the value function in large-scale problems, the critical contraction property is generally lost (e.g., Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Van Roy, 1997), except in rather restricted cases (e.g., Gordon, 1995; Tsitsiklis & Van Roy, 1997). This instability issue is not only one of the core theoretical challenges in RL, but also has broad practical significance, given the growing popularity of algorithms like DQN (Mnih et al., 2015), A3C (Mnih et al., 2016) and their many variants (e.g., Gu et al., 2016; Schulman et al., 2016; Wang et al., 2016; Wu et al., 2017), whose stability largely depends on the contraction property. The instability becomes even harder to avoid when training data (transitions) are sampled from an off-policy distribution, a situation known as the deadly triad (Sutton & Barto, 2018, Sec. 11.3).

*Equal contribution. ¹Department of Computer Science, University of Texas at Austin. ²Google Research. Correspondence to: Yihao Feng <[email protected]>.

Real-world Sequential Decision Making workshop at ICML 2019. Copyright 2019 by the author(s).

The brittleness of the Bellman operator's contraction property has inspired a number of works that reformulate value function learning as an optimization problem, where standard algorithms like stochastic gradient descent can be used to minimize the objective without the risk of divergence (under mild assumptions). One of the earliest attempts is residual gradient, or RG (Baird, 1995), which minimizes squared temporal differences. The algorithm is convergent, but its objective is not necessarily a good proxy due to a well-known "double sample" problem. As a result, it may converge to an inferior solution; see Sections 2 and 6 for further details and numerical examples. This drawback is inherited by similar algorithms like PCL (Nachum et al., 2017; 2018).

Another line of work seeks alternative objective functions whose minimization leads to the desired value functions (Sutton et al., 2009; Maei, 2011; Liu et al., 2015; Dai et al., 2017). Most existing works apply either to linear approximation or to evaluation of a fixed policy. An exception is the SBEED algorithm (Dai et al., 2018), which transforms the Bellman equation into a saddle-point problem. While SBEED is provably convergent under fairly standard conditions, it relies on solving a minimax problem, whose optimization can be rather challenging in practice, especially with nonconvex approximation classes like neural networks.


In this paper, we propose a novel loss function for value function learning. It avoids the double-sample problem (unlike RG), and can be easily estimated and optimized using sampled transitions (in both on- and off-policy scenarios). This is made possible by leveraging an important property of integrally strictly positive definite kernels (Stewart, 1976; Sriperumbudur et al., 2010). The new objective allows us to derive simple yet effective algorithms that approximate the value function without risking instability or divergence (unlike TD algorithms) and without solving a more sophisticated saddle-point problem (unlike SBEED). Our approach also allows great flexibility in choosing the value function approximation class, including nonlinear ones like neural networks. Experiments in several benchmarks demonstrate the effectiveness of our method for both policy evaluation and optimization. We focus on the batch setting (or the growing-batch setting with a growing replay buffer), and leave the online setting for future work.

2. Background

This section starts with necessary notation and background, then reviews two representative algorithms that work with general, nonlinear (differentiable) function classes.

Notation. A Markov decision process (MDP) is denoted by M = ⟨S, A, P, R, γ⟩, where S is a (possibly infinite) state space, A an action space, P(s′|s, a) the transition probability, R(s, a) the average immediate reward, and γ ∈ (0, 1) a discount factor. The value function of a policy π : S → P_A (a distribution over actions for each state),

V^π(s) := E[ ∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_t ∼ π(·|s_t) ],

measures the expected long-term return from a state. It is well known that V = V^π is the unique solution to the Bellman equation (Puterman, 1994), V = B^π V, where B^π : R^S → R^S is the Bellman operator, defined by

B^π V(s) := E_{a∼π(·|s), s′∼P(·|s,a)}[ R(s, a) + γ V(s′) | s ].

While we develop and analyze our approach mostly for B^π with a fixed π (policy evaluation), we will also extend it to policy optimization, where the corresponding Bellman operator is

B V(s) := max_a E_{s′∼P(·|s,a)}[ R(s, a) + γ V(s′) | s, a ].

The unique fixed point of B is known as the optimal value function, denoted V*; that is, B V* = V*.

Our work builds on an alternative to the fixed-point view above: given some fixed distribution µ whose support is S, V^π is the unique minimizer of the squared Bellman error

L_2(V) := ‖B^π V − V‖²_µ = E_{s∼µ}[ (B^π V(s) − V(s))² ].

Denote by R^π V := B^π V − V the Bellman error operator. Given a set D = {(s_i, a_i, r_i, s′_i)}_{1≤i≤n} of transitions with a_i ∼ π(·|s_i), the Bellman operator at state s_i can be approximated by bootstrapping: B̂^π V(s_i) := r_i + γ V(s′_i). Clearly, E[ B̂^π V(s_i) | s_i ] = B^π V(s_i). In the literature, B̂^π V_θ(s_i) − V_θ(s_i) is also known as the temporal difference (TD) error, whose expectation is the Bellman error. For notational convenience, we identify distributions with their probability density functions throughout this work.

Basic Algorithms. We are interested in estimating V^π from a parametric family {V_θ : θ ∈ Θ}, using data D. The residual gradient (RG) algorithm (Baird, 1995) minimizes the squared TD error,

L_RG(V_θ) := (1/n) ∑_{i=1}^n ( B̂^π V_θ(s_i) − V_θ(s_i) )²,    (1)

with gradient-descent update θ_{t+1} = θ_t − ε ∇_θ L_RG(V_{θ_t}), where

∇_θ L_RG(V_θ) = (2/n) ∑_{i=1}^n ( B̂^π V_θ(s_i) − V_θ(s_i) ) · ∇_θ( B̂^π V_θ(s_i) − V_θ(s_i) ).

However, the objective in (1) is a biased and inconsistent estimate of the squared Bellman error: since

E_{s∼µ}[ L_RG(V) ] = L_2(V) + E_{s∼µ}[ var( B̂^π V(s) | s ) ] ≠ L_2(V),

there is an extra term involving the conditional variance of the empirical Bellman operator, which does not vanish unless the state transitions are deterministic. As a result, RG can converge to incorrect value functions (see also Section 6). With random transitions, correcting the bias requires double samples, i.e., at least two independent samples of (r, s′) for the same (s, a) pair, to estimate the conditional variance.
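
To make the double-sample issue concrete, here is a minimal NumPy sketch (ours, not from the paper) of the RG loss (1) and its exact gradient for a linear value function V_θ(s) = θᵀφ(s); the feature map, transitions, and step size below are toy placeholders.

```python
import numpy as np

def phi(s):
    # toy 2-d feature map for a scalar state (illustrative only)
    return np.array([1.0, s])

def rg_loss_and_grad(theta, transitions, gamma=0.99):
    """RG loss (1) and its full gradient for V_theta(s) = theta @ phi(s).

    Averaging squared *sampled* TD errors is a biased estimate of the squared
    Bellman error when transitions are stochastic (the double-sample issue)."""
    n = len(transitions)
    loss, grad = 0.0, np.zeros_like(theta)
    for s, r, s_next in transitions:
        td = r + gamma * theta @ phi(s_next) - theta @ phi(s)  # sampled Bellman (TD) error
        g_td = gamma * phi(s_next) - phi(s)                    # RG differentiates through the target
        loss += td ** 2 / n
        grad += 2.0 * td * g_td / n
    return loss, grad

# toy usage with fabricated transitions (s, r, s')
rng = np.random.default_rng(0)
data = [(float(s), rng.normal(), rng.normal()) for s in rng.normal(size=50)]
theta = np.zeros(2)
for _ in range(200):
    theta -= 0.05 * rg_loss_and_grad(theta, data)[1]
```

Because the bootstrapped target is differentiated through, the noise in s′ enters the squared error quadratically, which is exactly the conditional-variance term discussed above.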

More popular algorithms in the literature are instead based on fixed-point iterations, using B̂^π to construct a target value for updating V_θ(s_i). An example is fitted value iteration, or FVI (Bertsekas & Tsitsiklis, 1996; Munos & Szepesvári, 2008), which includes the empirically successful DQN and its variants as special cases, and serves as a key component of many state-of-the-art actor-critic algorithms. In its basic form, FVI starts from an initial θ_0 and iteratively updates the parameter by

θ_{t+1} = argmin_{θ∈Θ} { L^{(t+1)}_FVI(V_θ) := (1/n) ∑_{i=1}^n ( V_θ(s_i) − B̂^π V_{θ_t}(s_i) )² }.    (2)

Different from RG, when gradient-based methods are applied to solve (2), the current parameter θ_t is treated as a constant:

∇_θ L^{(t+1)}_FVI(V_θ) = (2/n) ∑_{i=1}^n ( V_θ(s_i) − B̂^π V_{θ_t}(s_i) ) ∇_θ V_θ(s_i).

TD(0) (Sutton, 1988) may be viewed as a stochastic version of FVI, where a single sample (i.e., n = 1) is drawn randomly (either from a stream of transitions or from a replay buffer) to estimate the gradient of (2).

Being fixed-point iteration methods, FVI-style algorithms do not optimize any objective function, and their convergence is guaranteed only in rather restricted cases (e.g., Gordon, 1995; Tsitsiklis & Van Roy, 1997; Antos et al., 2008). Such divergent behavior is well known and empirically observed (Baird, 1995; Boyan & Moore, 1995); see Section 6 for more numerical examples. It creates substantial difficulty in parameter tuning and model selection in practice.
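
For contrast, a minimal sketch (again illustrative, with made-up data and the same toy linear setup as before) of the FVI/TD-style semi-gradient: the bootstrapped target is computed with a frozen parameter θ_t and receives no gradient.

```python
import numpy as np

def phi(s):
    return np.array([1.0, s])  # same toy feature map as in the previous sketch

def fvi_semi_gradient(theta, theta_target, transitions, gamma=0.99):
    """Gradient of the inner objective in (2): the target r + gamma * V_{theta_t}(s')
    is held fixed, so only grad V_theta(s) appears (a 'semi-gradient')."""
    grad = np.zeros_like(theta)
    for s, r, s_next in transitions:
        target = r + gamma * theta_target @ phi(s_next)  # frozen (target-network) value
        grad += 2.0 * (theta @ phi(s) - target) * phi(s) / len(transitions)
    return grad

rng = np.random.default_rng(1)
data = [(float(s), rng.normal(), rng.normal()) for s in rng.normal(size=50)]
theta = np.zeros(2)
for _ in range(20):          # outer FVI iterations
    theta_target = theta.copy()
    for _ in range(50):      # inner gradient steps on (2)
        theta -= 0.05 * fvi_semi_gradient(theta, theta_target, data)
```

The inner loop does minimize an objective, but the outer target swap is a fixed-point iteration, which is where convergence guarantees can be lost.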

3. Kernel Loss for Policy Evaluation

Much of the algorithmic challenge described earlier lies in the difficulty of estimating the squared Bellman error from data. In this section, we address this difficulty by proposing a new loss function that is more amenable to statistical estimation from empirical data. Proofs are deferred to the appendix.

Our framework relies on an integrally strictly positive definite (ISPD) kernel K : S × S → R, a symmetric bivariate function satisfying

‖f‖²_K := ∫_{S×S} K(s, s̄) f(s) f(s̄) ds ds̄ > 0

for any nonzero L²-integrable function f. We call ‖f‖_K the K-norm of f. Many commonly used kernels are ISPD; for example, the Gaussian RBF kernel K(s, s̄) = exp(−‖s − s̄‖²₂ / h). More discussion of ISPD kernels can be found in Stewart (1976) and Sriperumbudur et al. (2010).
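
As a quick numerical sanity check (ours, not from the paper), the Gram matrix of a Gaussian RBF kernel on a finite set of distinct states is positive definite, a finite-sample reflection of the ISPD property:

```python
import numpy as np

def rbf_gram(S, h=0.5):
    """Gaussian RBF Gram matrix K[i, j] = exp(-||s_i - s_j||^2 / h) for rows of S."""
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / h)

S = np.random.default_rng(0).normal(size=(20, 3))  # 20 illustrative 3-d states
print(np.linalg.eigvalsh(rbf_gram(S)).min() > 0)   # True: all eigenvalues are positive
```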

3.1. The New Loss Function

Recall that R^π V = B^π V − V is the Bellman error operator. Our new loss function is defined by

L_K(V) = ‖R^π V‖²_{K,µ} := E_{s,s̄∼µ}[ K(s, s̄) · R^π V(s) · R^π V(s̄) ],    (3)

where µ is any positive density function over states, and s, s̄ ∼ µ means that s and s̄ are drawn i.i.d. from µ. Here ‖·‖_{K,µ} is the K-norm under measure µ; it is easy to show that ‖f‖_{K,µ} = ‖fµ‖_K. Note that µ can be either the visitation distribution of policy π (the on-policy case) or some other distribution (the off-policy case); our approach handles both cases in a unified way. The following theorem shows that the loss L_K is consistent:

Theorem 3.1. Let K be an ISPD kernel and assume µ(s) > 0 for all s ∈ S. Then L_K(V) ≥ 0 for any V, and L_K(V) = 0 if and only if V = V^π. In other words, V^π = argmin_V L_K(V).

The next result relates the kernel loss to a "dual" kernel norm of the value function error V − V^π.

Theorem 3.2. Under the same assumptions as Theorem 3.1, we have L_K(V) = ‖V − V^π‖²_{K*,µ}, where ‖·‖_{K*,µ} is the K*-norm under measure µ with a "dual" kernel K*(s′, s̄′) defined by

K*(s′, s̄′) := E_{s,s̄∼d*_{π,µ}}[ K(s′, s̄′) + γ² K(s, s̄) − γ( K(s′, s̄) + K(s, s̄′) ) | s′, s̄′ ],

where the expectation notation is shorthand for E_{s∼d*_{π,µ}}[ f(s) | s′ ] = ∫ f(s) d*_{π,µ}(s|s′) ds, with

d*_{π,µ}(s|s′) := ∑_a π(a|s) P(s′|s, a) µ(s) / µ(s′).

The norm involves the quantity d*_{π,µ}(s|s′), which may be heuristically viewed as a "backward" conditional probability of state s given that the next state is s′ (note that d*_{π,µ}(s|s′) is not normalized to sum to one unless µ = d_π).

Empirical Estimation. The key advantage of the new loss L_K is that it can be easily estimated and optimized from observed transitions, without requiring double samples. Given a set of empirical data D = {(s_i, a_i, r_i, s′_i)}_{1≤i≤n}, one way to estimate L_K is the so-called V-statistic,

L̂_K(V_θ) := (1/n²) ∑_{1≤i,j≤n} K(s_i, s_j) · R̂^π V_θ(s_i) · R̂^π V_θ(s_j),    (4)

where R̂^π V_θ(s_i) := B̂^π V_θ(s_i) − V_θ(s_i) is the sampled TD error from Section 2. Similarly, the gradient ∇_θ L_K(V_θ) = 2 E_µ[ K(s, s̄) · R^π V_θ(s) · ∇_θ R^π V_θ(s̄) ] can be estimated by

∇_θ L̂_K(V_θ) := (2/n²) ∑_{1≤i,j≤n} K(s_i, s_j) · R̂^π V_θ(s_i) · ∇_θ R̂^π V_θ(s_j).

Remark. An alternative is the U-statistic, which removes the diagonal (i = j) terms from the pairwise average in (4). Following standard statistical approximation theory (e.g., Serfling, 2009), both the U- and V-statistics provide consistent estimates of the expected quadratic quantity when the sample is weakly dependent and satisfies certain mixing conditions (e.g., Denker & Keller, 1983; Beutner & Zähle, 2012); this often amounts to requiring that {s_i} form a Markov chain that converges to its stationary distribution µ sufficiently fast. With i.i.d. samples, the U-statistic is unbiased but may have higher variance than the V-statistic. In our experiments, we observe that the V-statistic works better than the U-statistic.
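
A minimal NumPy sketch (illustrative only) of the V-statistic (4) and its U-statistic variant; the TD errors δ_i = B̂^π V_θ(s_i) − V_θ(s_i) and the Gram matrix are assumed to be precomputed, and in practice the same quadratic form would be written inside an autodiff framework so that ∇_θ L̂_K is obtained automatically.

```python
import numpy as np

def kernel_loss(td_errors, K, u_statistic=False):
    """Empirical kernel loss: (1/n^2) * sum_{i,j} K[i, j] * delta_i * delta_j.

    td_errors: length-n vector of sampled TD errors delta_i
    K:         n x n Gram matrix K(s_i, s_j)
    The U-statistic variant drops the diagonal terms and rescales by n(n-1)."""
    n = len(td_errors)
    if u_statistic:
        K = K - np.diag(np.diag(K))
        return td_errors @ K @ td_errors / (n * (n - 1))
    return td_errors @ K @ td_errors / n ** 2

# toy usage with fabricated TD errors and a Gaussian RBF Gram matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                     # placeholder states
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 0.5)
deltas = rng.normal(size=100)                                     # placeholder TD errors
print(kernel_loss(deltas, K), kernel_loss(deltas, K, u_statistic=True))
```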

Remark. Another advantage of our kernel loss is that L_K(V) = 0 if and only if V = V^π. Therefore, the magnitude of the empirical loss L̂_K(V) reflects the closeness of V to the true value function V^π. In fact, by using methods from kernel-based hypothesis testing (e.g., Gretton et al., 2012; Liu et al., 2016; Chwialkowski et al., 2016), one can design statistically calibrated tests of whether V = V^π has been achieved, which may be useful for designing efficient exploration strategies. In this work, we focus on estimating V^π and leave testing value function proximity as future work.

3.2. Interpretations of the Kernel Loss

We now provide some insights into the new loss function, based on two interpretations.

Eigenfunction Interpretation. Mercer's theorem implies the decomposition

K(s, s̄) = ∑_{i=1}^∞ λ_i e_i(s) e_i(s̄),    (5)

of any continuous positive definite kernel on a compact domain, where {e_i}_{i=1}^∞ is a countable set of orthonormal eigenfunctions w.r.t. µ (i.e., E_{s∼µ}[ e_i(s) e_j(s) ] = 1{i = j}) and {λ_i}_{i=1}^∞ are the corresponding eigenvalues. For ISPD kernels, all eigenvalues are positive: λ_i > 0 for all i.

The following shows that L_K is a squared projected Bellman error in the space spanned by {e_i}_{i=1}^∞.

Proposition 3.3. If (5) holds, then L_K(V) = ∑_{i=1}^∞ λ_i ( E_{s∼µ}[ R^π V(s) · e_i(s) ] )². Furthermore, if {e_i} is a complete orthonormal basis of the L² space under measure µ, then the L² loss is L_2(V) = ∑_{i=1}^∞ ( E_{s∼µ}[ R^π V(s) · e_i(s) ] )². Therefore, L_K(V) ≤ λ_max L_2(V), where λ_max := max_i λ_i.

This result shows that the eigenvalue λ_i controls the contribution to L_K of the Bellman error's projection onto the eigenfunction e_i. It may be tempting to set λ_i ≡ 1, in which case L_K(V) = L_2(V), but then the Mercer expansion in (5) can diverge to infinity, resulting in an ill-defined kernel K(s, s̄). To avoid this, the eigenvalues must decay to zero fast enough that ∑_{i=1}^∞ λ_i < ∞. Therefore, the kernel loss L_K(V) can be viewed as prioritizing the projections onto the eigenfunctions with larger eigenvalues. For typical kernels such as the Gaussian RBF kernel, these dominant eigenfunctions are low-frequency (and hence smooth) Fourier bases, which may intuitively be more relevant for practical purposes than the higher-frequency bases.
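
To illustrate the prioritization numerically (our example, not from the paper), the spectrum of a Gaussian RBF Gram matrix on samples from a toy distribution µ decays rapidly, so only the leading (smooth) eigenfunctions carry substantial weight in L_K:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(500, 1))                          # samples from a toy mu
K = np.exp(-((S - S.T) ** 2) / 0.5)                    # Gaussian RBF Gram matrix
lam = np.sort(np.linalg.eigvalsh(K / len(S)))[::-1]    # empirical Mercer eigenvalues
print(lam[:5] / lam[0])                                # rapid decay of lambda_i
```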

RKHS Interpretation. The squared Bellman error has the variational form

L_2(V) = max_f ( E_{s∼µ}[ R^π V(s) · f(s) ] )²   s.t.   E_{s∼µ}[ f(s)² ] ≤ 1,    (6)

which involves finding a function f in the unit L² ball whose inner product with R^π V is maximal. Our kernel loss has a similar interpretation, with a different unit ball.

Any positive definite kernel K(s, s̄) is associated with a reproducing kernel Hilbert space (RKHS) H_K, the Hilbert space consisting of (the closure of) the linear span of K(·, s) for s ∈ S, which satisfies the reproducing property f(x) = ⟨f, K(·, x)⟩_{H_K} for any f ∈ H_K. RKHSs have been widely used as a powerful tool in machine learning and statistics; see Berlinet & Thomas-Agnan (2011) and Muandet et al. (2017) for overviews.

Proposition 3.4. Let H_K be the RKHS of kernel K(s, s̄). Then

L_K(V) = max_{f∈H_K} { ( E_{s∼µ}[ R^π V(s) · f(s) ] )² : ‖f‖_{H_K} ≤ 1 }.    (7)

Since the RKHS is a subset of the L² space consisting of smooth functions, we again see that L_K(V) emphasizes the projections onto smooth basis functions, matching the intuition from Proposition 3.3. It also draws a connection to recent primal-dual reformulations of the Bellman equation (Dai et al., 2017; 2018), which formulate V^π as a saddle point of the minimax problem

min_V max_f E_{s∼µ}[ 2 R^π V(s) · f(s) − f(s)² ].    (8)

This is equivalent to minimizing L_2(V) as in (6), except that the L² constraint is replaced by a quadratic penalty term. When only samples are available, the expectation in (8) is replaced by its empirical version. If the optimization domain of f is unconstrained, solving the empirical (8) reduces to the empirical L² loss (1), which yields inconsistent estimation. Therefore, existing works further constrain the optimization of f in (8) to either an RKHS (Dai et al., 2017) or a neural network class (Dai et al., 2018), and hence derive a minimax strategy for learning V. Unfortunately, this is substantially more expensive than our method due to the cost of jointly updating another neural network f; the minimax procedure may also make training less stable and harder to converge in practice.
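
To see why an unconstrained inner maximization collapses (8) back to the squared Bellman error (a short derivation of ours, not stated explicitly in the paper), maximize pointwise over f(s):

max_f E_{s∼µ}[ 2 R^π V(s) f(s) − f(s)² ] = E_{s∼µ}[ max_{f(s)} ( 2 R^π V(s) f(s) − f(s)² ) ] = E_{s∼µ}[ (R^π V(s))² ] = L_2(V),

with the maximum attained at f(s) = R^π V(s). Applying the same argument to the empirical version of (8), with f unconstrained on the sample points, recovers the biased empirical loss (1).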

3.3. Connection to Temporal Difference (TD) Methods

We now instantiate our algorithm in the tabular and linear cases to gain further insight. Interestingly, our loss coincides with previous work in these cases and, as a result, leads to the same value function as several classic algorithms. Hence, the approach developed here may be considered a strict extension of those algorithms to much more general, nonlinear function approximation classes.

Again, let D be a set of n transitions sampled from distribution µ, and let a linear approximation be used: V_θ(s) = θᵀφ(s), where φ : S → R^d is a feature function and θ ∈ R^d is the parameter to be learned. The TD solution θ_TD, in either the on- or off-policy case, can be found by various algorithms (e.g., Sutton, 1988; Boyan, 1999; Sutton et al., 2009; Dann et al., 2014), and its theoretical properties have been studied extensively (e.g., Tsitsiklis & Van Roy, 1997; Lazaric et al., 2012).

Corollary 3.5. When a linear kernel of the form K(s, s̄) = φ(s)ᵀφ(s̄) is used, minimizing the kernel objective (4) gives the TD solution θ_TD.

Remark. The result follows from the observation that our loss becomes the Norm of the Expected TD Update (NEU) in the linear case (Dann et al., 2014), whose minimizer coincides with θ_TD. Moreover, in finite-state MDPs, the corollary includes tabular TD as a special case, obtained by using one-hot (indicator) features to represent states. In this case, the TD solution coincides with that of a model-based approach (Parr et al., 2008) known as certainty equivalence (Kumar & Varaiya, 1986). It follows that our algorithm includes certainty equivalence as a special case in finite-state problems.
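
To spell out the observation (our rephrasing of the cited result): with the linear kernel K(s, s̄) = φ(s)ᵀφ(s̄) and sampled TD errors δ_i, the empirical loss (4) is exactly the squared norm of the expected TD update,

L̂_K(θ) = (1/n²) ∑_{i,j} δ_i φ(s_i)ᵀ φ(s_j) δ_j = ‖ (1/n) ∑_{i=1}^n δ_i φ(s_i) ‖²₂,

which is the NEU objective analyzed by Dann et al. (2014); its minimizer is derived in closed form in Appendix A.5.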

4. Kernel Loss for Policy Optimization

There are different ways to extend our approach to policy optimization. One is to use the kernel loss (3) inside an existing algorithm, as an alternative to RG or TD for learning V^π(s). For example, our loss fits naturally into an actor-critic algorithm, where we replace the critic update (often implemented by TD(λ) or a variant) with our method and leave the actor update unchanged. Another, more general way is to design a kernelized loss for V(s) and the policy π(a|s) jointly, so that policy optimization can be solved in a single optimization procedure. Here, we take the first approach, leveraging our method to improve the critic update step of Trust-PCL (Nachum et al., 2018).

Trust-PCL is based on a temporal/path consistency condition resulting from policy smoothing (Nachum et al., 2017). We start with the smoothed Bellman operator, defined by

B_λ V(s) := max_{π(·|s)∈P_A} E_π[ R(s, a) + γ V(s′) + λ H(π | s) | s ],    (9)

where P_A is the set of distributions over the action space A, the conditional expectation E_π[· | s] denotes a ∼ π(·|s) and s′ ∼ P(·|s, a), λ > 0 is a smoothing parameter, and H is a state-conditional entropy term, H(π | s) := −∑_{a∈A} π(a|s) log π(a|s). Intuitively, B_λ is a smoothed approximation of B. It is known that B_λ is a γ-contraction (Fox et al., 2016), so it has a unique fixed point V*_λ. Furthermore, with λ = 0 we recover the standard Bellman operator, and λ smoothly controls ‖V*_λ − V*‖_∞ (Dai et al., 2018).

The entropy regularization above implies the following path consistency condition. Let π*_λ be an optimal policy in (9) for B_λ, which yields V*_λ. Then (V, π) = (V*_λ, π*_λ) uniquely solves

V(s) = R(s, a) + γ E_{s′|s,a}[ V(s′) ] − λ log π(a|s)   for all (s, a) ∈ S × A.

This property inspires a natural extension of the kernel loss (3) to the controlled case:

L_K(V) = E_{s,s̄∼µ, a∼π(·|s), ā∼π(·|s̄)}[ K([s, a], [s̄, ā]) · R^{π,λ} V(s, a) · R^{π,λ} V(s̄, ā) ],

where R^{π,λ} V(s, a) is given by

R^{π,λ} V(s, a) = R(s, a) + γ E_{s′|s,a}[ V(s′) ] − λ log π(a|s) − V(s).

Given a set of transitions D = {(s_i, a_i, r_i, s′_i)}_{1≤i≤n}, the objective can be estimated by

L̂_K(V_θ) = (1/n²) ∑_{1≤i,j≤n} K([s_i, a_i], [s_j, a_j]) R̂_i R̂_j,

with R̂_i = r_i + γ V_θ(s′_i) − λ log π_θ(a_i|s_i) − V_θ(s_i). The U-statistic version and multi-step bootstrapping can be obtained similarly (Nachum et al., 2017).
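
A minimal NumPy sketch of this estimator (ours; the value function, policy log-density, kernel, and data below are random placeholders, and a real implementation would compute the same quadratic form inside an autodiff framework):

```python
import numpy as np

def kloss_control(V, log_pi, batch, kernel, gamma=0.995, lam=0.01):
    """Empirical kernel loss for the entropy-regularized residual
    R_i = r_i + gamma*V(s'_i) - lam*log pi(a_i|s_i) - V(s_i),
    paired through a kernel over state-action pairs."""
    S, A, R, S_next = batch                     # arrays of states, actions, rewards, next states
    res = R + gamma * V(S_next) - lam * log_pi(S, A) - V(S)
    K = kernel(np.concatenate([S, A], axis=1))  # Gram matrix over [s, a] pairs
    n = len(res)
    return res @ K @ res / n ** 2

def rbf_gram(X, h=1.0):
    return np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / h)

# toy usage with random placeholders for V, pi, and a batch of transitions
rng = np.random.default_rng(0)
S, A = rng.normal(size=(64, 3)), rng.normal(size=(64, 1))
R, S_next = rng.normal(size=64), rng.normal(size=(64, 3))
V = lambda s: 0.1 * s.sum(axis=1)                   # placeholder value function
log_pi = lambda s, a: -0.5 * (a ** 2).sum(axis=1)   # placeholder (unnormalized) Gaussian log-density
print(kloss_control(V, log_pi, (S, A, R, S_next), rbf_gram))
```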

5. Related Work

In this work, we studied value function learning, one of the most studied and fundamental problems in reinforcement learning. The dominant approach is based on fixed-point iterations (Bertsekas & Tsitsiklis, 1996; Szepesvári, 2010; Sutton & Barto, 2018), which can risk instability and even divergence when function approximation is used, as discussed in the introduction.

Our approach exemplifies more recent efforts that aim to improve the stability of value function learning by reformulating it as an optimization problem. Our key innovation is the use of a kernel method to estimate the squared Bellman error, which is otherwise hard to estimate directly from samples; this avoids the double-sample issue left unaddressed by prior algorithms like residual gradient (Baird, 1995) and PCL (Nachum et al., 2017; 2018). As a result, our algorithm is consistent: it finds the true value function given enough data (and a sufficiently expressive function approximation class). Furthermore, the solution found by our algorithm minimizes the projected Bellman error, as in prior works when specialized to the same settings (Sutton et al., 2009; Maei et al., 2010; Liu et al., 2015; Macua et al., 2015). However, our algorithm is more general: it can use nonlinear value function classes and extends naturally to policy optimization. Compared to nonlinear GTD2/TDC (Maei et al., 2009), our method is simpler (it requires no local linear expansion) and empirically more effective (as demonstrated in the next section).


Figure 1. Modified example of Tsitsiklis & Van Roy (1997). [Panels: (a) our MDP example; (b) MSE vs. iteration; (c) ‖w − w*‖ vs. iteration. Curves: K-loss, TD(0), FVI, RG.]

Figure 2. Results on Puddle World. [Panels: (a) MSE vs. epochs; (b) Bellman error vs. epochs; (c) L2/K-loss vs. MSE; (d) L2/K-loss vs. Bellman error. Curves: GTD2 (nonlinear), TD(0), FVI, RG, SBEED, K-loss.]

As discussed in Section 3, our method is related to the recently proposed SBEED algorithm (Dai et al., 2018), which shares many advantages with this work. However, SBEED requires solving a minimax problem that can be rather challenging in practice. In contrast, our algorithm only needs to solve a minimization problem, for which a wide range of powerful methods exist (e.g., Bertsekas, 2016). Note that there exist other saddle-point formulations for RL, which have so far focused on finite-state MDPs or linear value function approximation (Chen et al., 2018; Wang, 2017).

Finally, kernel methods have been widely used in machine learning (e.g., Schölkopf & Smola, 2001; Muandet et al., 2017). In RL, kernels have been used either to model transition probabilities (Ormoneit & Sen, 2002) or to represent the value function (e.g., Xu et al., 2005; 2007; Taylor & Parr, 2009). These works differ significantly from our method: they use kernels to specify the function class of value functions or transition models, whereas we leverage kernels to design a proper loss function that addresses the double-sampling problem, while placing no constraint on the approximation class used to represent the value function. Our approach is thus expected to be more flexible and scalable in practice, especially when combined with neural networks.

6. Experiments

We compare our method (labelled "K-loss" in all experiments) with several representative baselines on both classic examples and popular benchmark problems, for both policy evaluation and policy optimization.

6.1. Modified Example of Tsitsiklis & Van Roy

Fig. 1(a) shows a modified version of the classic example of Tsitsiklis & Van Roy (1997), obtained by making transitions stochastic. It consists of 5 states, including 4 nonterminal states (circles) and 1 terminal state (square), and a single action. The arrows represent transitions between states. The value function estimate is linear in the weight vector w = [w1, w2, w3]; for example, the values of the leftmost and bottom-right states are w1 and 2w3, respectively. Furthermore, we set γ = 1, so V(s) is exact with the optimal weight w* = [0.8, 1.0, 0]. In the experiment, we randomly collect 2,000 transition tuples for training. We use a linear kernel in our method, so that it finds the TD solution (Corollary 3.5).

Fig. 1(b,c) show the learning curves of the mean squared error (‖V − V*‖²) and the weight error (‖w − w*‖) of the different algorithms over iterations. The results are consistent with theory: our method converges to the true weight w*, while both FVI and TD(0) diverge, and RG converges to a wrong solution.

6.2. Policy Evaluation with Neural Networks

While popular in the recent RL literature, neural networks have long been known to cause instability in value function learning. Here we revisit the classic divergence example of Puddle World (Boyan & Moore, 1995) and demonstrate the stability of our method. Experimental details are given in Appendix B.1.


Figure 3. Policy evaluation results on CartPole and Mountain Car. [Panels: (a) CartPole MSE; (b) CartPole Bellman error; (c) Mountain Car MSE; (d) Mountain Car Bellman error, all vs. epochs. Curves: GTD2 (nonlinear), TD(0), FVI, RG, SBEED, K-loss.]

Figure 4. Results of variants of Trust-PCL on the MuJoCo benchmark: average return vs. million environment steps. [Panels: (a) Swimmer; (b) InvertedDoublePendulum; (c) Ant; (d) InvertedPendulum. Curves: K-loss, TD(0), FVI, RG.]

Fig. 2 summarizes the results of using a neural network as the value function, for two metrics: ‖V − V*‖²₂ and ‖BV − V‖²₂, both evaluated on the training transitions. First, as shown in Fig. 2(a,b), our method works well, while residual gradient converges to inferior solutions. In contrast, FVI and TD(0) exhibit unstable, oscillating behavior and can even diverge, which is consistent with past findings (Boyan & Moore, 1995). Nonlinear GTD2 (Maei et al., 2009) and SBEED (Dai et al., 2017; 2018) do not find better solutions than our method in terms of MSE.

Second, Fig. 2(c,d) show how an algorithm's training objective correlates with the MSE and with the empirical Bellman error of the value function estimate, respectively. Our kernel loss appears to be a good proxy for learning the value function, with respect to both MSE and Bellman error. In contrast, the L2 loss (used by residual gradient) does not correlate well, which also explains why residual gradient has been observed to work poorly in practice.

Fig. 3 shows more results on value function learning for CartPole and Mountain Car, which again demonstrate that our method generally performs better than the other methods.

6.3. Policy Optimization

To demonstrate the use of our method in policy optimization, we combine it with Trust-PCL and compare against variants of Trust-PCL combined with FVI, TD(0), and RG. To evaluate all four methods fairly, we use the Trust-PCL (Nachum et al., 2018) framework and its public code in our experiments; we only modify the training of V_θ(s) for each method and keep everything else the same as in the original release. Experimental details can be found in Appendix B.2.

We evaluate the four methods on the MuJoCo benchmark and report their best performance in Figure 4 (averaged over five random seeds). K-loss consistently outperforms all other methods, learning better policies with less data. Note that we only modify the value function update inside Trust-PCL, which is relatively easy to implement. We expect that many other algorithms can be improved in a similar way, by learning the value function with our kernel loss.

7. Conclusion

This paper studies the fundamental problem of solving Bellman equations with parametric value functions. We propose a novel kernel loss that is easy to estimate and optimize using sampled transitions. Empirical results show that, compared to prior algorithms, our method is convergent, produces more accurate value functions, and can be easily adapted for policy optimization. These promising results open the door to many interesting directions for future work, including finite-sample analysis, adaptation to online RL, and uncertainty estimation for exploration.

References

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 3rd edition, 2016.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Beutner, E. and Zähle, H. Deriving the asymptotic distribution of U- and V-statistics of dependent data using weighted empirical processes. Bernoulli, pp. 803–822, 2012.

Boyan, J. A. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 49–56, 1999.

Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pp. 369–376, 1995.

Chen, Y., Li, L., and Wang, M. Scalable bilinear π-learning using state and action features. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 833–842, 2018.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings, 2016.

Dai, B., He, N., Pan, Y., Boots, B., and Song, L. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pp. 1458–1467, 2017.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 1133–1142, 2018.

Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.

Denker, M. and Keller, G. On U-statistics and von Mises' statistics for weakly dependent processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 64(4):505–522, 1983.

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration. In Advances in Neural Information Processing Systems 21, pp. 441–448, 2008.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.

Gordon, G. J. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268, 1995.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Gu, S., Lillicrap, T. P., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 2829–2838, 2016.

Kumar, P. and Varaiya, P. Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, 1986.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, pp. 3041–3074, 2012.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 504–513, 2015.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276–284, 2016.

Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2015.

Maei, H. R. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2011.

Maei, H. R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems 22, pp. 1204–1212, 2009.

Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp. 719–726, 2010.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 1928–1937, 2016.

Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141, 2017.

Munos, R. and Szepesvári, C. Finite-time bounds for sampling-based fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 2772–2782, 2017.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, 2018.

Ormoneit, D. and Sen, S. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pp. 752–759, 2008.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations, 2016.

Serfling, R. J. Approximation Theorems of Mathematical Statistics, volume 162. John Wiley & Sons, 2009.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

Stewart, J. Positive definite functions and generalizations, an historical survey. The Rocky Mountain Journal of Mathematics, 6(3):409–434, 1976.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2nd edition, 2018.

Sutton, R. S., Maei, H., Precup, D., Bhatnagar, S., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 993–1000, 2009.

Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.

Taylor, G. and Parr, R. Kernelized value function approximation for reinforcement learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 1017–1024, 2009.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.

Wang, M. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. CoRR, abs/1710.06100, 2017.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 1995–2003, 2016.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems 30, pp. 5285–5294, 2017.

Xu, X., Xie, T., Hu, D., and Lu, X. Kernel least-squares temporal difference learning. International Journal of Information and Technology, 11(9):54–63, 2005.

Xu, X., Hu, D., and Lu, X. Kernel-based least-squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992, 2007.


Appendix

A. Proofs for Section 3

A.1. Proof of Theorem 3.1

The assertion that L_K(V) ≥ 0 for all V is immediate from the definition. For the second part, we have

L_K(V) = 0 ⟺ ‖R^π V‖_{K,µ} = 0
        ⟺ ‖(R^π V) · µ‖_K = 0
        ⟺ R^π V(s) µ(s) = 0 for all s ∈ S    (since K is an ISPD kernel)
        ⟺ R^π V(s) = 0 for all s ∈ S         (since µ(s) > 0 for all s)
        ⟺ V = V^π.

A.2. Proof of Theorem 3.2

Define δ := V − V^π to be the value function error. Furthermore, let I be the identity operator (I V = V), and let

P^π V(s) := E_{a∼π(·|s), s′∼P(·|s,a)}[ γ V(s′) | s ]

be the state-transition part of the Bellman operator, without the local reward term R(s, a).

Note that R^π V^π = B^π V^π − V^π = 0 by the Bellman equation, so

R^π V = R^π V − R^π V^π = (P^π V − V) − (P^π V^π − V^π) = (P^π − I)(V − V^π) = (P^π − I) δ.

Therefore,

L_K(V) = E_µ[ R^π V(s) · R^π V(s̄) · K(s, s̄) ]
       = E_µ[ (I − P^π) δ(s) · (I − P^π) δ(s̄) · K(s, s̄) ]
       = E_{(s,s′),(s̄,s̄′)∼d_{π,µ}}[ ( δ(s) − γ δ(s′) ) · ( δ(s̄) − γ δ(s̄′) ) · K(s, s̄) ],

where E_{d_{π,µ}}[·] denotes the expectation under the joint distribution

d_{π,µ}(s, s′) := µ(s) ∑_{a∈A} π(a|s) P(s′|s, a),

with (s, s′) and (s̄, s̄′) drawn independently. Expanding the quadratic form above, we have

L_K(V) = E_{d_{π,µ}}[ δ(s) K(s, s̄) δ(s̄) − γ δ(s′) K(s, s̄) δ(s̄) − γ δ(s) K(s, s̄) δ(s̄′) + γ² δ(s′) K(s, s̄) δ(s̄′) ]
       = E_µ[ δ(s′) K*(s′, s̄′) δ(s̄′) ],

where K*(s′, s̄′) is as defined in the theorem statement,

K*(s′, s̄′) = E_{d*_{π,µ}}[ K(s′, s̄′) − γ( K(s′, s̄) + K(s, s̄′) ) + γ² K(s, s̄) | (s′, s̄′) ],

with the expectation taken with respect to the "backward" conditional probability

d*_{π,µ}(s | s′) := ∑_{a∈A} π(a|s) P(s′|s, a) µ(s) / µ(s′),

which can be heuristically viewed as the distribution of state s given that its next state is s′ when following d_{π,µ}(s, s′).


A.3. Proof of Proposition 3.3

Using the eigendecomposition (5), we have

L_K(V) = E_µ[ R^π V(s) K(s, s̄) R^π V(s̄) ]
       = E_µ[ R^π V(s) ( ∑_{i=1}^∞ λ_i e_i(s) e_i(s̄) ) R^π V(s̄) ]
       = ∑_{i=1}^∞ λ_i ( E_µ[ R^π V(s) e_i(s) ] )².

The decomposition of L_2(V) follows directly from Parseval's identity.

A.4. Proof of Proposition 3.4

The reproducing property of the RKHS implies f(s) = ⟨f, K(s, ·)⟩_{H_K} for any f ∈ H_K. Therefore,

E_µ[ R^π V(s) f(s) ] = E_µ[ R^π V(s) ⟨f, K(s, ·)⟩_{H_K} ] = ⟨ f, E_µ[ R^π V(s) K(s, ·) ] ⟩_{H_K} = ⟨ f, f* ⟩_{H_K},

where we have defined f*(·) := E_µ[ R^π V(s) K(s, ·) ]. Maximizing ⟨f, f*⟩_{H_K} subject to ‖f‖_{H_K} := √⟨f, f⟩_{H_K} ≤ 1 yields f = f* / ‖f*‖_{H_K}. Therefore,

max_{f∈H_K : ‖f‖_{H_K}≤1} ( E_{s∼µ}[ R^π V(s) f(s) ] )² = ( ⟨ f* / ‖f*‖_{H_K}, f* ⟩_{H_K} )² = ‖f*‖²_{H_K}.

Further, we can show that

‖f*‖²_{H_K} = ⟨ f*, f* ⟩_{H_K} = ⟨ E_µ[ R^π V(s) K(s, ·) ], E_µ[ R^π V(s̄) K(s̄, ·) ] ⟩_{H_K} = E_µ[ R^π V(s) K(s, s̄) R^π V(s̄) ],

where the last step follows from the reproducing property, K(s, s̄) = ⟨K(s, ·), K(s̄, ·)⟩_{H_K}. This completes the proof, by the definition of L_K(V).

A.5. Proof of Corollary 3.5

Under the conditions of the corollary, the kernel loss becomes the Norm of the Expected TD Update (NEU), whose minimizer coincides with the TD solution (Dann et al., 2014). For completeness, we provide a self-contained proof.

Since we are estimating the value function of a fixed policy, we ignore the actions, so the set of transitions is D = {(s_i, r_i, s′_i)}_{1≤i≤n}. Define the following vector and matrices:

r = [r_1; r_2; … ; r_n] ∈ R^{n×1},
X = [φ(s_1); φ(s_2); … ; φ(s_n)] ∈ R^{n×d},
X′ = [φ(s′_1); φ(s′_2); … ; φ(s′_n)] ∈ R^{n×d},

and Z = X − γX′, where d is the feature dimension. Then the TD solution is given by

θ_TD = (XᵀZ)⁻¹ Xᵀ r.

Note that this covers both the on-policy and the off-policy case, as in many previous algorithms with linear value function approximation (Dann et al., 2014); the difference lies only in whether s_i is sampled from the state occupation distribution of the target policy or not.

Define δ ∈ R^{n×1} to be the TD-error vector, δ = r − Zθ, whose entries are

δ_i = r_i + γ V(s′_i) − V(s_i) = r_i + θᵀ( γφ(s′_i) − φ(s_i) ).

With a linear kernel, our objective function becomes

ℓ(θ) = (1/n²) ∑_{i,j} δ_i K(s_i, s_j) δ_j = (1/n²) δᵀ X Xᵀ δ = (1/n²) (r − Zθ)ᵀ X Xᵀ (r − Zθ).

Its gradient is given by

∇ℓ(θ) = (1/n²) ∇( θᵀ Zᵀ X Xᵀ Z θ − 2 rᵀ X Xᵀ Z θ + const ) = (2/n²) ( Zᵀ X Xᵀ Z θ − Zᵀ X Xᵀ r ).

Setting ∇ℓ = 0 gives the solution obtained by minimizing our kernel loss:¹

θ_KBE = (Zᵀ X Xᵀ Z)⁻¹ Zᵀ X Xᵀ r.

Therefore,

θ_KBE − θ_TD = ( (ZᵀXXᵀZ)⁻¹ ZᵀX − (XᵀZ)⁻¹ ) Xᵀ r
             = ( (ZᵀXXᵀZ)⁻¹ ZᵀX (XᵀZ) − I ) (XᵀZ)⁻¹ Xᵀ r
             = (I − I) (XᵀZ)⁻¹ Xᵀ r = 0.

¹For simplicity, assume all involved d × d matrices are non-singular, as is typical in analyses of TD algorithms. Without this assumption, we may either add L2-regularization to XXᵀ (Farahmand et al., 2008), for which the same equivalence between TD and our method can be proved, or show that the solutions lie in an affine subspace of R^d but the corresponding value functions are identical.
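
The identity θ_KBE = θ_TD can also be checked numerically; the following sketch (with fabricated random features and rewards, purely illustrative) builds X, X′, and Z and compares the two closed-form solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 200, 5, 0.9
X = rng.normal(size=(n, d))        # rows phi(s_i)
X_next = rng.normal(size=(n, d))   # rows phi(s'_i)
r = rng.normal(size=n)             # rewards
Z = X - gamma * X_next

theta_td = np.linalg.solve(X.T @ Z, X.T @ r)
theta_kbe = np.linalg.solve(Z.T @ X @ X.T @ Z, Z.T @ X @ X.T @ r)
print(np.allclose(theta_td, theta_kbe))   # True (when X^T Z is non-singular)
```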

B. Experiment Details

B.1. Policy Evaluation

We compare our method with representative policy evaluation methods, including TD(0), FVI, RG, nonlinear GTD2 (Maei et al., 2009), and SBEED (Dai et al., 2017; 2018), on three stochastic environments: Puddle World, CartPole, and Mountain Car. The following are the details of the policy evaluation experiments.

Network Structure. We parameterize the value function V_θ(s) with a fully connected neural network with one hidden layer of 80 units and ReLU activations. For the test function f(s) in SBEED, we use a small neural network with 10 hidden units and ReLU activations.

Data Collection. For each environment, we randomly collect 5000 independent transition tuples, with states drawn uniformly from the state space, using a policy π learned by policy optimization; this is the policy whose value function V^π(s) we want to learn.

Estimating the true value function V^π(s). To evaluate and compare all methods, we approximate the true value function by finely discretizing the state space and then applying tabular value iteration on the discretized MDP. Specifically, we discretize the state space into a 25 × 25 grid for Puddle World, 20 × 25 discrete states for CartPole, and 30 × 25 discrete states for Mountain Car.

Training Details. For each environment and each policy evaluation method, we train the value function V_θ(s) on the collected 5000 transition tuples for 2000 epochs (3000 for Mountain Car), with a batch size of n = 150 per epoch, using the Adam optimizer. We search the learning rate in {0.003, 0.001, 0.0003} for all methods and report the best result, averaged over 10 trials with different random seeds. For our method, we use a Gaussian RBF kernel K(s_i, s_j) = exp(−‖s_i − s_j‖²₂ / h²) with bandwidth h = 0.5. For FVI, we update the target network at the end of each training epoch. For SBEED, we perform 10 gradient ascent updates on the test function f(s) and 1 gradient descent update on V_θ(s) at each iteration. We fix the discount factor to γ = 0.98 for all environments and policy evaluation methods.


B.2. Policy Optimization

In this section we describe the experimental setup for policy optimization, regarding implementation and hyperparameter search. The code of Trust-PCL is available on GitHub.² Algorithm 1 gives pseudocode; the main change compared to Trust-PCL is the update of the value parameters θ. Note that, as in previous work, we use the d-step version of the Bellman operator, an immediate extension of the d = 1 case described in the main text.

B.2.1. Network Architectures

We use fully connected feed-forward neural networks to represent both the policy and the value network. The policy π_θ is represented by a neural network with 64 × 64 hidden layers and tanh activations. At each time step t, the next action a_t is sampled from a Gaussian distribution N(µ_θ(s_t), σ_θ). The value network V_θ(s) is represented by a neural network with 64 × 64 hidden layers and tanh activations; given the observation s_t, it produces a single scalar value. All methods share the same policy and value network architectures.

B.2.2. Training Details

We average over the best 5 of 6 randomly seeded training runs and evaluate each method using the mean µ_θ(s) of the diagonal Gaussian policy π_θ. Since Trust-PCL is off-policy, we collect experience and train on batches sampled from a replay buffer. At each training iteration, we first collect T = 10 timesteps of experience and add them to the replay buffer; then both the policy and value parameters are updated in a single gradient step using the Adam optimizer (with the learning rate found by search), on a minibatch sampled randomly from the replay buffer. For Trust-PCL with FVI updates of V_θ(s), which requires a target network to estimate the value of the final state of each path, we use an exponentially moving average with smoothing constant τ = 0.99 to update the target value network weights, as is common in prior work (Mnih et al., 2015). For Trust-PCL with TD(0), we directly use the current value network V_θ(s) to estimate the final states, except that we do not perform gradient updates through the final-state estimates. For Trust-PCL with RG and with K-loss, which have explicit objectives, we directly perform gradient descent to optimize both policy and value parameters.

B.2.3. Hyperparameter Search

We follow the same hyperparameter search procedure as Nachum et al. (2018) for the FVI-, TD(0)-, and RG-based variants of Trust-PCL.³ We search the maximum divergence ε between π_θ and the lagged policy in {0.001, 0.0005, 0.002}, the parameter learning rate in {0.001, 0.0003, 0.0001}, and the rollout length d ∈ {1, 5, 10}. We also search over the entropy coefficient λ, either keeping it at a constant 0 (thus no exploration) or decaying it from 0.1 to 0.0 by a smoothed exponential rate of 0.1 every 2500 training iterations. For each hyperparameter setting, we average the best 5 of 6 seeds and report the best performance for these methods.

For our proposed K-loss, we also search the maximum divergence ε but keep the learning rate at 0.001. Additionally, for K-loss we use a Gaussian RBF kernel K([s_i, a_i], [s_j, a_j]) = exp(−(‖s_i − s_j‖²₂ + ‖a_i − a_j‖²₂)/h) and take the bandwidth to be h = (α × med)², where med denotes the median pairwise distance and we search α ∈ {0.1, 0.01, 1/√(log B)}, with B = 64 the gradient-update batch size. We fix the discount to γ = 0.995 for all environments and use batch size B = 64 at each training iteration.

²https://github.com/tensorflow/models/tree/master/research/pcl_rl
³See the README file in https://github.com/tensorflow/models/tree/master/research/pcl_rl


Algorithm 1  K-Loss for PCL

Input: rollout step d, batch size B, coefficients λ, τ.
Initialize V_θ(s), π_φ(a|s), and an empty replay buffer RB(β). Set φ̃ = φ.
repeat
  // Collect samples
  Sample P steps s_{t:t+P} ∼ π_φ from the environment.
  Insert s_{t:t+P} into RB(β).

  // Train
  Sample a batch {s^{(k)}_{t:t+d}, a^{(k)}_{t:t+d}, r^{(k)}_{t:t+d}}_{k=1}^B from RB(β), containing a total of Q transitions (B ≈ Q/d).
  Compute
    Δθ = ∇_θ (1/B²) ∑_{1≤i,j≤B} K([s_i, a_i], [s_j, a_j]) R̂_i R̂_j,
    Δφ = −(1/B) ∑_{1≤i≤B} R̂_i ∑_{t=0}^{d−1} ∇_φ log π_φ(a_{t+i} | s_{t+i}),
  where
    R̂_i = −V_θ(s_i) + γ^d V_θ(s_{i+d}) + ∑_{t=0}^{d−1} γ^t ( r_{i+t} − (λ+τ) log π_φ(a_{t+i} | s_{t+i}) + τ log π_φ̃(a_{t+i} | s_{t+i}) ).
  Update θ and φ using Adam with Δθ, Δφ.

  // Update auxiliary variables
  Update φ̃ = α φ̃ + (1 − α) φ.
until convergence

Figure 5. More results of variants of Trust-PCL on the MuJoCo benchmark (in addition to Figure 4): average return vs. million environment steps. [Panels: (a) Walker2d; (b) HalfCheetah. Curves: K-loss, TD(0), FVI, RG.]