Stochastic Variance-Reduced Hamilton Monte Carlo Methods · Methods Difan Zou z and Pan Xu yz and Quanquan Gu x February 9, 2018 Abstract We propose a fast stochastic Hamilton Monte

Stochastic Variance-Reduced Hamilton Monte Carlo Methods

Difan Zou * 1 Pan Xu * 1 Quanquan Gu 1

Abstract

We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde{O}\big(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.

1. Introduction

Past decades have witnessed increasing attention to Markov Chain Monte Carlo (MCMC) methods in modern machine learning problems (Andrieu et al., 2003). An important family of Markov Chain Monte Carlo algorithms, called the Langevin Monte Carlo method (Neal et al., 2011), was proposed based on Langevin dynamics (Parisi, 1981). Langevin dynamics was used for modeling the dynamics of molecular systems, and can be described by the following Itô stochastic differential equation (SDE) (Øksendal, 2003),

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2\beta}\,dB_t, \tag{1.1}$$

where $X_t$ is a $d$-dimensional stochastic process, $t \ge 0$ denotes the time index, $\beta > 0$ is the temperature parameter, and $B_t$ is the standard $d$-dimensional Brownian motion. Under certain assumptions on the drift coefficient $\nabla f$, Chiang et al. (1987) showed that the distribution of $X_t$ in (1.1)

*Equal contribution. 1 Department of Computer Science, University of California, Los Angeles, CA 90095, USA. Correspondence to: Quanquan Gu <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

converges to its stationary distribution, a.k.a. the Gibbs measure $\pi_\beta \propto \exp(-f(x)/\beta)$. Note that $\pi_\beta$ is smooth and log-concave (resp. strongly log-concave) if $f$ is smooth and convex (resp. strongly convex). A typical way to sample from the density $\pi_\beta$ is to apply the Euler-Maruyama discretization scheme (Kloeden & Platen, 1992) to (1.1), which yields

$$X_{k+1} = X_k - \eta\,\nabla f(X_k) + \sqrt{2\eta\beta}\,\epsilon_k, \tag{1.2}$$

where $\epsilon_k \sim N(0, I_{d\times d})$ is a standard Gaussian random vector, $I_{d\times d}$ is the $d \times d$ identity matrix, and $\eta > 0$ is the step size. The update (1.2) is often referred to as the Langevin Monte Carlo (LMC) method. Dalalyan (2014) and Durmus & Moulines (2016b) proved that, with a properly chosen step size, LMC can sample from the density $\pi_\beta \propto e^{-f/\beta}$ to arbitrary precision in total variation (TV) distance. The non-asymptotic convergence of LMC has also been studied in Dalalyan (2017); Dalalyan & Karagulyan (2017); Durmus et al. (2017), which show that the LMC algorithm can achieve $\epsilon$-precision in 2-Wasserstein distance after $\tilde{O}(\kappa^2 d/\epsilon^2)$ iterations if $f$ is $L$-smooth and $\mu$-strongly convex, where $\kappa = L/\mu$ is the condition number.
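As an illustration of the LMC update (1.2), the following is a minimal NumPy sketch. The function names `lmc_step` and `run_lmc`, the callable `grad_f`, and the seeding scheme are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

def lmc_step(x, grad_f, eta, beta, rng):
    """One Euler-Maruyama step of (1.2): x - eta * grad_f(x) + sqrt(2 * eta * beta) * noise."""
    noise = rng.standard_normal(x.shape)
    return x - eta * grad_f(x) + np.sqrt(2.0 * eta * beta) * noise

def run_lmc(grad_f, x0, eta, beta, num_steps, seed=0):
    """Iterate lmc_step from x0; x0 may hold many independent chains as rows."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = lmc_step(x, grad_f, eta, beta, rng)
    return x

# Example: f(x) = ||x||^2 / 2, so grad_f(x) = x and the continuous-time target is N(0, beta * I).
samples = run_lmc(lambda x: x, np.zeros((5000, 2)), eta=0.05, beta=1.0, num_steps=400)
```

For this quadratic example the discrete chain is a linear recursion whose stationary variance is $1/(1-\eta/2)$ rather than exactly 1, which makes the discretization bias of a fixed step size visible.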

In order to accelerate the convergence of Langevin dynamics (1.1) and improve its mixing time to the unique stationary distribution, Hamiltonian dynamics (Duane et al., 1987; Neal et al., 2011) was proposed, which is also known as underdamped Langevin dynamics and is defined by the following system of SDEs

$$dV_t = -\gamma V_t\,dt - u\nabla f(X_t)\,dt + \sqrt{2\gamma u}\,dB_t,$$
$$dX_t = V_t\,dt, \tag{1.3}$$

where $\gamma > 0$ is the friction parameter, $u$ denotes the inverse mass, $X_t, V_t \in \mathbb{R}^d$ are the position and velocity of the continuous-time dynamics respectively, and $B_t$ is the Brownian motion. Let $W_t = (X_t^\top, V_t^\top)^\top$. Under mild assumptions on the drift coefficient $\nabla f(x)$, the distribution of $W_t$ converges to a unique invariant distribution $\pi_w \propto e^{-f(x) - \|v\|_2^2/(2u)}$ (Neal et al., 2011), whose marginal distribution on $X_t$, denoted by $\pi$, is proportional to $e^{-f(x)}$. Similar to the numerical approximation of the Langevin dynamics in (1.2), one can also apply the same Euler-Maruyama discretization scheme to Hamiltonian dynamics in (1.3), which gives rise to Hamiltonian Monte

arXiv:1802.04791v2 [stat.ML] 19 Oct 2020


Carlo (HMC) method

$$v_{k+1} = v_k - \gamma\eta v_k - \eta u \nabla f(x_k) + \sqrt{2\gamma u \eta}\,\epsilon_k,$$
$$x_{k+1} = x_k + \eta v_k. \tag{1.4}$$

The update (1.4) provides an alternative way to sample from the target distribution $\pi \propto e^{-f(x)}$. While HMC has been observed to outperform LMC in a number of empirical studies (Chen et al., 2014; 2015), there was no non-asymptotic convergence analysis of the HMC method until the very recent work by Cheng et al. (2017)¹. In particular, Cheng et al. (2017) proposed a variant of HMC based on coupling techniques, and showed that it achieves $\epsilon$ sampling accuracy in 2-Wasserstein distance within $\tilde{O}(\kappa^2 d^{1/2}/\epsilon)$ iterations for a smooth and strongly convex function $f$. This improves upon the convergence rate of LMC by a factor of $O(d^{1/2}/\epsilon)$.
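The Euler-Maruyama update (1.4) can be sketched in a few lines of NumPy. This is only an illustration of the plain discretization, not the coupling-based variant of Cheng et al. (2017); the function name `hmc_euler_step` and the callable `grad_f` are hypothetical.

```python
import numpy as np

def hmc_euler_step(x, v, grad_f, eta, gamma, u, rng):
    """One step of (1.4); both updates use the old pair (x_k, v_k)."""
    noise = rng.standard_normal(v.shape)
    v_next = v - gamma * eta * v - eta * u * grad_f(x) + np.sqrt(2.0 * gamma * u * eta) * noise
    x_next = x + eta * v
    return x_next, v_next

# Example: a few steps on f(x) = ||x||^2 / 2 with friction gamma = 2 and inverse mass u = 1.
rng = np.random.default_rng(0)
x, v = np.ones(3), np.zeros(3)
for _ in range(10):
    x, v = hmc_euler_step(x, v, lambda z: z, eta=0.05, gamma=2.0, u=1.0, rng=rng)
```

Note that the position update uses $v_k$ rather than $v_{k+1}$, exactly as written in (1.4).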

Both LMC and HMC are gradient-based Monte Carlo methods and are effective in sampling from smooth and strongly log-concave distributions. However, they can be slow if the evaluation of the gradient is computationally expensive, especially on large datasets. This motivates using stochastic gradients instead of the full gradient in LMC and HMC, which gives rise to Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011; Ahn et al., 2012; Durmus & Moulines, 2016b; Dalalyan, 2017) and the Stochastic Gradient Hamilton Monte Carlo (SG-HMC) method (Chen et al., 2014; Ma et al., 2015; Chen et al., 2015), respectively. For smooth and strongly log-concave distributions, Dalalyan & Karagulyan (2017); Dalalyan (2017) proved that the convergence rate of SGLD is $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$, where $\sigma^2$ denotes the upper bound of the variance of the stochastic gradient. Cheng et al. (2017) proposed a variant of SG-HMC and proved that it converges after $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$ iterations. It is worth noting that although using stochastic gradient evaluations reduces the per-iteration cost, it comes at the price that the convergence rates of SGLD and SG-HMC are slower than those of LMC and HMC. Thus, a natural question is:

Does there exist an algorithm that can leverage stochastic gradients, but also achieve a faster rate of convergence?

In this paper, we answer this question affirmatively, when the function $f$ can be written as the finite sum of $n$ smooth component functions $f_i$

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x). \tag{1.5}$$

It is worth noting that the finite-sum structure is prevalent in machine learning, as the log-likelihood function of a dataset (e.g., $f$) is the sum of the log-likelihood over each data point (e.g., $f_i$) in the dataset. We propose a stochastic variance-reduced HMC (SVR-HMC), which incorporates the variance reduction technique into stochastic HMC. Our algorithm is inspired by the recent advance in stochastic optimization (Roux et al., 2012; Johnson & Zhang, 2013; Xiao & Zhang, 2014; Defazio et al., 2014; Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei & Jordan, 2016; Lei et al., 2017), which uses semi-stochastic gradients to accelerate the optimization of finite-sum functions and to improve the runtime complexity of full gradient methods. We also notice that the variance reduction technique has already been employed on SGLD in recent work (Dubey et al., 2016; Baker et al., 2017). Nevertheless, that work does not show an improvement in terms of the dependence on the accuracy $\epsilon$.

¹ In Cheng et al. (2017), the sampling method in (1.4) is also called the underdamped Langevin MCMC algorithm.

In detail, the proposed SVR-HMC uses a multi-epoch scheme to reduce the variance of the stochastic gradient. At the beginning of each epoch, it computes the full gradient or an estimate of the full gradient based on the entire data. Within each epoch, it performs semi-stochastic gradient updates and outputs the last iterate as the warm-start point for the next epoch. Thorough experiments on both synthetic and real data demonstrate the advantage of our proposed algorithm.

Our Contributions. The major contributions of our work are highlighted as follows.

• We propose a new algorithm, SVR-HMC, that incorporates the variance-reduction technique into HMC. Our algorithm does not require the variance of the stochastic gradient to be bounded. We prove that SVR-HMC has a better gradient complexity than the state-of-the-art LMC and HMC methods for sampling from smooth and strongly log-concave distributions, when the error is measured by 2-Wasserstein distance. In particular, to achieve $\epsilon$ sampling error in 2-Wasserstein distance, our algorithm only needs $\tilde{O}\big(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3}\big)$ component gradient evaluations. This improves upon the state-of-the-art result of Cheng et al. (2017), which is $\tilde{O}(n\kappa^2 d^{1/2}/\epsilon)$, in a large regime.

• We generalize the analysis of SVR-HMC to sampling from smooth and general log-concave distributions by adding a diminishing regularizer. We prove that the gradient complexity of SVR-HMC to achieve $\epsilon$-accuracy in 2-Wasserstein distance is $\tilde{O}(n + d^{11/2}/\epsilon^6 + d^{11/3} n^{2/3}/\epsilon^4)$. To the best of our knowledge, this is the first convergence result of LMC methods in 2-Wasserstein distance.

Notation. We denote the discrete update by the lower case symbol $x_k$ and the continuous-time dynamics by the upper case symbol $X_t$. We denote by $\|x\|_2$ the Euclidean norm of a vector $x \in \mathbb{R}^d$. For a random vector $X_t \in \mathbb{R}^d$ (or $x_k \in \mathbb{R}^d$), we denote its probability distribution function by $P(X_t)$ (or $P(x_k)$). We denote by $E_u(X)$ the expectation of $X$ under the probability measure $u$. The squared 2-Wasserstein distance between probability measures $u$ and $v$ is

$$W_2^2(u, v) = \inf_{\zeta \in \Gamma(u, v)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|X_u - X_v\|_2^2 \, d\zeta(X_u, X_v),$$

where $\Gamma(u, v)$ is the set of all joint distributions with $u$ and $v$ being the marginal distributions. We use $a_n = O(b_n)$ to denote $a_n \le C b_n$ for some constant $C > 0$ independent of $n$, and use $a_n = \tilde{O}(b_n)$ to hide logarithmic terms of $b_n$. We denote $a_n \lesssim b_n$ ($a_n \gtrsim b_n$) if $a_n$ is less than (larger than) $b_n$ up to a constant. We use $a \wedge b$ to denote $\min\{a, b\}$.
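For two Gaussian measures the infimum in the definition of $W_2$ has a well-known closed form, $W_2^2 = \|m_1 - m_2\|_2^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$, which gives a convenient sanity check for the definition. The sketch below evaluates it with NumPy; the helper names `psd_sqrt` and `gaussian_w2` are hypothetical, and the paper does not state that this formula is how its experiments estimate the distance.

```python
import numpy as np

def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def gaussian_w2(mean1, cov1, mean2, cov2):
    """2-Wasserstein distance between N(mean1, cov1) and N(mean2, cov2)."""
    root2 = psd_sqrt(cov2)
    cross = psd_sqrt(root2 @ cov1 @ root2)
    squared = np.sum((mean1 - mean2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))

# Example: equal covariances, so the distance is just the distance between the means.
shift_only = gaussian_w2(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2))
```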

2. Related Work

In this section, we briefly review the relevant work in the literature.

Langevin Monte Carlo (LMC) methods (a.k.a. Unadjusted Langevin Algorithms) and their Metropolis-adjusted versions have been studied in a number of papers (Roberts & Tweedie, 1996; Roberts & Rosenthal, 1998; Stramer & Tweedie, 1999a;b; Jarner & Hansen, 2000; Roberts & Stramer, 2002), and have been proved to attain asymptotic exponential convergence. In the past few years, numerous studies have emerged on proving the non-asymptotic convergence of LMC methods. Dalalyan (2014) first established a theoretical guarantee for approximate sampling using the Langevin Monte Carlo method for strongly log-concave and smooth distributions, proving a rate of $O(d/\epsilon^2)$ for the LMC algorithm with warm start in total variation (TV) distance. This result was later extended to the Wasserstein metric by Dalalyan & Karagulyan (2017); Durmus & Moulines (2016b), where the same convergence rate in 2-Wasserstein distance holds without the warm start assumption. Recently, Cheng & Bartlett (2017) also proved an $O(d/\epsilon)$ convergence rate of the LMC algorithm in KL-divergence. The stochastic gradient based LMC method, also known as stochastic gradient Langevin dynamics (SGLD), was originally proposed for Bayesian posterior sampling (Welling & Teh, 2011; Ahn et al., 2012). Dalalyan (2017); Dalalyan & Karagulyan (2017) analyzed the convergence rate of SGLD based on both unbiased and biased stochastic gradients. In particular, they proved that the gradient complexity of unbiased SGLD is $O(\kappa^2 d/\epsilon^2)$, and showed that it may not converge to the target distribution if the stochastic gradient has non-negligible bias. The SGLD algorithm has also been applied to nonconvex optimization. Raginsky et al. (2017) analyzed the non-asymptotic convergence rate of SGLD. Zhang et al. (2017) provided a theoretical guarantee for SGLD in terms of the hitting time to a first- and second-order stationary point. Xu et al. (2017) provided an analysis framework for the global convergence of LMC, SGLD and its variance-reduced variant based on the ergodicity of the discrete-time algorithm.

In order to improve the convergence rates of LMC methods, the Hamiltonian Monte Carlo (HMC) method was proposed (Duane et al., 1987; Neal et al., 2011), which introduces a momentum term in its dynamics. To deal with large datasets, stochastic gradient HMC has been proposed for Bayesian learning (Chen et al., 2014; Ma et al., 2015). Chen et al. (2015) investigated generic stochastic gradient MCMC algorithms with high-order integrators, and provided a comprehensive convergence analysis. For strongly log-concave and smooth distributions, a non-asymptotic convergence guarantee was proved by Cheng et al. (2017) for underdamped Langevin MCMC, which is a variant of the stochastic gradient HMC method.

Our proposed algorithm is motivated by the stochastic variance reduced gradient (SVRG) algorithm, which was first proposed in Johnson & Zhang (2013) and later extended to different problem setups (Xiao & Zhang, 2014; Defazio et al., 2014; Reddi et al., 2016; Allen-Zhu & Hazan, 2016; Lei & Jordan, 2016; Lei et al., 2017). Inspired by this line of research, Dubey et al. (2016) applied the variance reduction technique to stochastic gradient Langevin dynamics, and proved a slightly tighter convergence bound than that of SGLD. Nevertheless, the dependence of the convergence rate on the sampling accuracy $\epsilon$ is not improved. Thus, it remains open whether the variance reduction technique can indeed improve the convergence rate of MCMC methods. Our work answers this question in the affirmative and provides rigorously faster rates of convergence for sampling from log-concave and smooth density functions.

For ease of comparison, we summarize the gradient complexity² in 2-Wasserstein distance for different gradient-based Monte Carlo methods in Table 1. Evidently, for sampling from smooth and strongly log-concave distributions, SVR-HMC outperforms all existing algorithms.

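Table 1. Gradient complexity of gradient-based Monte Carlo algorithms in 2-Wasserstein distance for sampling from smooth and strongly log-concave distributions.

| Methods | Gradient Complexity |
| --- | --- |
| LMC (Dalalyan, 2017) | $\tilde{O}\big(n\kappa^2 d/\epsilon^2\big)$ |
| HMC (Cheng et al., 2017) | $\tilde{O}\big(n\kappa^2 d^{1/2}/\epsilon\big)$ |
| SGLD (Dalalyan, 2017) | $\tilde{O}\big(\kappa^2\sigma^2 d/\epsilon^2\big)$ |
| SG-HMC (Cheng et al., 2017) | $\tilde{O}\big(\kappa^2\sigma^2 d/\epsilon^2\big)$ |
| SVR-HMC (this paper) | $\tilde{O}\big(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3}\big)$ |

² The gradient complexity is defined as the number of stochastic gradient evaluations needed to achieve $\epsilon$ sampling accuracy.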


3. The Proposed Algorithm

In this section, we propose a novel HMC algorithm that leverages variance-reduced stochastic gradients to sample from the target distribution $\pi = e^{-f(x)}/Z$, where $Z = \int e^{-f(x)}\,dx$ is the partition function.

Recall that the function $f(x)$ has the finite-sum structure in (1.5). When $n$ is large, the full gradient $\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x)$ in (1.4) can be expensive to compute. Thus, the stochastic gradient is often used to improve the computational complexity per iteration. However, due to the non-diminishing variance of the stochastic gradient, the convergence rate of gradient-based MC methods using stochastic gradients is often no better than that of gradient-based MC methods using the full gradient.

In order to overcome the drawback of the stochastic gradient and achieve a faster rate of convergence, we propose a Stochastic Variance-Reduced Hamiltonian Monte Carlo algorithm (SVR-HMC), which leverages the advantages of both HMC and variance reduction. The outline of the algorithm is displayed in Algorithm 1. The algorithm proceeds in multiple epochs. At the beginning of each epoch, it computes the full gradient of $f$ at a snapshot $\tilde{x}_j$ of the iterate. Then it performs the following update for both the velocity and the position variables in each epoch

$$v_{k+1} = v_k - \gamma\eta v_k - \eta u g_k + \epsilon_k^v,$$
$$x_{k+1} = x_k + \eta v_k + \epsilon_k^x, \tag{3.1}$$

where $\gamma, \eta, u > 0$ are tuning parameters, and $g_k$ is a semi-stochastic gradient that is an unbiased estimator of $\nabla f(x_k)$, defined as follows,

$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \nabla f(\tilde{x}_j), \tag{3.2}$$

where $i_k$ is uniformly sampled from $\{1, \ldots, n\}$, and $\tilde{x}_j$ is a snapshot of $x_k$ that is only updated every $m$ iterations, such that $k = jm + l$ for some $l = 0, \ldots, m-1$. Here $\epsilon_k^v$ and $\epsilon_k^x$ are Gaussian random vectors with zero mean and covariance matrices equal to

$$E[\epsilon_k^v(\epsilon_k^v)^\top] = u\big(1 - e^{-2\gamma\eta}\big)\cdot I_{d\times d},$$
$$E[\epsilon_k^x(\epsilon_k^x)^\top] = \frac{u}{\gamma^2}\big(2\gamma\eta + 4e^{-\gamma\eta} - e^{-2\gamma\eta} - 3\big)\cdot I_{d\times d},$$
$$E[\epsilon_k^v(\epsilon_k^x)^\top] = \frac{u}{\gamma}\big(1 - 2e^{-\gamma\eta} + e^{-2\gamma\eta}\big)\cdot I_{d\times d}, \tag{3.3}$$

where $I_{d\times d}$ is the $d \times d$ identity matrix.
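Because every covariance block in (3.3) is a multiple of the identity, each coordinate pair of the velocity and position noise can be drawn independently from the same 2 x 2 Gaussian. One possible way to do this, sketched below, is to take a Cholesky factor of that 2 x 2 matrix. The function names `noise_covariance` and `sample_noise` are hypothetical, and the paper does not specify how its implementation samples these vectors.

```python
import numpy as np

def noise_covariance(eta, gamma, u):
    """Per-coordinate 2 x 2 covariance of (velocity noise, position noise) from (3.3)."""
    a = gamma * eta
    c_vv = u * (1.0 - np.exp(-2.0 * a))
    c_xx = (u / gamma**2) * (2.0 * a + 4.0 * np.exp(-a) - np.exp(-2.0 * a) - 3.0)
    c_vx = (u / gamma) * (1.0 - 2.0 * np.exp(-a) + np.exp(-2.0 * a))
    return np.array([[c_vv, c_vx], [c_vx, c_xx]])

def sample_noise(d, eta, gamma, u, rng):
    """Draw d-dimensional velocity and position noise vectors with the covariances in (3.3)."""
    chol = np.linalg.cholesky(noise_covariance(eta, gamma, u))
    standard = rng.standard_normal((2, d))
    eps_v, eps_x = chol @ standard  # row 0: velocity noise, row 1: position noise
    return eps_v, eps_x

# Example: empirical check of the covariance with many coordinates.
rng = np.random.default_rng(0)
eps_v, eps_x = sample_noise(200000, eta=0.1, gamma=2.0, u=1.0, rng=rng)
empirical = np.cov(np.vstack([eps_v, eps_x]))
```

For small $\gamma\eta$ the position-noise variance behaves like $\frac{2}{3}u\gamma\eta^3$, so for very small step sizes a series expansion may be numerically safer than the closed form used here.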

The idea of the semi-stochastic gradient has been successfully used in stochastic optimization in machine learning to reduce the variance of the stochastic gradient and obtain faster convergence rates (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Reddi et al., 2016; Allen-Zhu & Hazan, 2016; Lei & Jordan, 2016; Lei et al., 2017). Apart from the semi-stochastic gradient, the second update formula in (3.1) also differs from the direct Euler-Maruyama discretization (1.4) of Hamiltonian dynamics due to the additional Gaussian noise term $\epsilon_k^x$. This additional Gaussian noise term is pivotal in our theoretical analysis to obtain faster convergence rates for our algorithm than for LMC methods. A similar idea has been used in Cheng et al. (2017) to prove the faster rate of convergence of HMC (underdamped MCMC) compared with LMC.

Algorithm 1 Stochastic Variance-Reduced Hamiltonian Monte Carlo (SVR-HMC)

initialization: $x_0 = 0$, $v_0 = 0$, $\tilde{x}_0 = x_0$
for $j = 0, \ldots, \lceil K/m \rceil$
    $\tilde{g} = \nabla f(\tilde{x}_j)$
    for $l = 0, \ldots, m-1$
        $k = jm + l$
        uniformly sample $i_k \in [n]$
        $g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \tilde{g}$
        $x_{k+1} = x_k + \eta v_k + \epsilon_k^x$
        $v_{k+1} = v_k - \gamma\eta v_k - \eta u g_k + \epsilon_k^v$
        if $l = m-1$
            $\tilde{x}_{j+1} = x_{k+1}$
        end
    end for
end for
output: $x_K$
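Algorithm 1 can be turned into a short NumPy routine. The following is a minimal sketch rather than the authors' implementation: the name `svr_hmc`, the callback signature `grad_fi(i, x)`, the Cholesky-based noise sampling, and the choice to stop after exactly $K$ inner iterations are all assumptions made for this illustration.

```python
import numpy as np

def svr_hmc(grad_fi, n, d, eta, gamma, u, m, K, seed=0):
    """Run K iterations of SVR-HMC and return x_K.

    grad_fi(i, x) is assumed to return the gradient of component f_i at x.
    """
    rng = np.random.default_rng(seed)
    a = gamma * eta
    # Per-coordinate 2 x 2 covariance of (velocity noise, position noise), see (3.3).
    c_vv = u * (1.0 - np.exp(-2.0 * a))
    c_xx = (u / gamma**2) * (2.0 * a + 4.0 * np.exp(-a) - np.exp(-2.0 * a) - 3.0)
    c_vx = (u / gamma) * (1.0 - 2.0 * np.exp(-a) + np.exp(-2.0 * a))
    chol = np.linalg.cholesky(np.array([[c_vv, c_vx], [c_vx, c_xx]]))

    x = np.zeros(d)
    v = np.zeros(d)
    k = 0
    while k < K:
        # Start of an epoch: take a snapshot and compute its full gradient.
        snapshot = x.copy()
        full_grad = sum(grad_fi(i, snapshot) for i in range(n)) / n
        for _ in range(m):
            if k == K:
                break
            i = int(rng.integers(n))
            # Semi-stochastic gradient (3.2).
            g = grad_fi(i, x) - grad_fi(i, snapshot) + full_grad
            eps_v, eps_x = chol @ rng.standard_normal((2, d))
            # Update (3.1): both lines use the old velocity v_k.
            x_next = x + eta * v + eps_x
            v = v - gamma * eta * v - eta * u * g + eps_v
            x = x_next
            k += 1
    return x

# Example: f_i(x) = ||x - c_i||^2 / 2, so the target is N(mean of c_i, I) and L = 1, u = 1/L = 1.
centers = np.random.default_rng(1).normal(3.0, 1.0, size=(20, 2))
finals = np.array([
    svr_hmc(lambda i, x: x - centers[i], n=20, d=2, eta=0.05, gamma=2.0, u=1.0, m=20, K=400, seed=s)
    for s in range(30)
])
```

In the example the average of the 30 final iterates should land near the mean of the `centers`, since the chain is linear and its expectation contracts to the target mean.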

4. Main Theory

In this section, we analyze the convergence of our proposed algorithm in 2-Wasserstein distance between the distribution of the iterate in Algorithm 1 and the target distribution $\pi \propto e^{-f}$.

Following the recent work of Durmus & Moulines (2016a); Dalalyan & Karagulyan (2017); Dalalyan (2017); Cheng et al. (2017), we use the 2-Wasserstein distance to measure the convergence rate of Algorithm 1, since it directly provides the level of approximation of the first and second order moments (Dalalyan, 2017; Dalalyan & Karagulyan, 2017). It is arguably more suitable for characterizing the quality of approximate sampling algorithms than other distance metrics such as the total variation distance. In addition, while Algorithm 1 updates both the position variable $x_k$ and the velocity variable $v_k$, only the convergence rate of the position variable $x_k$ is of central interest.

4.1. SVR-HMC for Sampling from Strongly Log-concave Distributions

We first present the convergence rate and gradient complexity of SVR-HMC when $f$ is smooth and strongly convex, i.e., the target distribution $\pi \propto e^{-f}$ is smooth and strongly log-concave. We start with the following formal assumptions on the negative log density function.


Assumption 4.1 (Smoothness). There exists a constant $L > 0$, such that for any $x, y \in \mathbb{R}^d$, the following holds for any $i$,

$$\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L\|x - y\|_2.$$

Under Assumption 4.1, it can be easily verified that the function $f(x)$ is also $L$-smooth.

Assumption 4.2 (Strong Convexity). There exists a constant $\mu > 0$, such that for any $x, y \in \mathbb{R}^d$, the following holds,

$$f(x) - f(y) \ge \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|_2^2. \tag{4.1}$$

Note that the strong convexity assumption is only made on the finite-sum function $f$, not on the individual component functions $f_i$.

Theorem 4.3. Suppose Assumptions 4.1 and 4.2 hold. Let $P(x_K)$ denote the distribution of the last iterate $x_K$, and let $\pi \propto e^{-f(x)}$ denote the stationary distribution of (1.3). Set $u = 1/L$, $\gamma = 2$, $x_0 = 0$, $v_0 = 0$ and $\eta = O\big(1/\kappa \wedge 1/(\kappa^{1/3} n^{2/3})\big)$. Assume $\|x^*\|_2 \le R$ for some constant $R > 0$, where $x^* = \arg\min_x f(x)$ is the global minimizer of the function $f(x)$. Then the output of Algorithm 1 satisfies

$$W_2\big(P(x_K), \pi\big) \le e^{-K\eta/(2\kappa)} w_0 + 4\eta\kappa\big(2\sqrt{D_1} + \sqrt{D_2}\big) + 2\sqrt{\kappa D_3}\, m\, \eta^{3/2}, \tag{4.2}$$

where $w_0 = W_2\big(P(x_0), \pi\big)$, $\kappa = L/\mu$ is the condition number, $\eta$ is the step size, and $m$ denotes the epoch (i.e., inner loop) length of Algorithm 1. $D_1$, $D_2$ and $D_3$ are defined as follows,

$$D_1 = \Big(\frac{8\eta^2}{5} + \frac{4}{3}\Big) U_v + \frac{4}{3L} U_f + \frac{16 d\eta}{3L},$$
$$D_2 = 13 U_v + \frac{8 U_f}{L} + \frac{28 d\eta}{L}, \qquad D_3 = U_v + 4ud,$$

in which the parameters $U_v$ and $U_f$ are of order $O(d/\mu)$ and $O(d\kappa)$, respectively.

Remark 4.4. The convergence analyses of existing stochastic Langevin Monte Carlo methods (Dalalyan & Karagulyan, 2017; Zhang et al., 2017) and stochastic Hamiltonian Monte Carlo methods (Chen et al., 2014; 2015; Cheng et al., 2017) require bounded variance of the stochastic gradient, i.e., the inequality $E_i[\|\nabla f_i(x) - \nabla f(x)\|_2^2] \le \sigma^2$ holds uniformly for all $x \in \mathbb{R}^d$. In contrast, our analysis does not need this assumption, which implies that our algorithm is applicable to a larger class of target density functions.

In the following corollary, by providing a specific choice of the step size $\eta$ and the epoch length $m$, we present the gradient complexity of Algorithm 1 in 2-Wasserstein distance.

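Corollary 4.5. Under the same conditions as in Theorem 4.3, let $m = n$ and $\eta = O\big(\epsilon/(\kappa d^{1/2}) \wedge \epsilon^{2/3}/(\kappa^{1/3} d^{1/3} n^{2/3})\big)$. Then the output of Algorithm 1 satisfies $W_2\big(P(x_K), \pi\big) \le \epsilon$ after

$$\tilde{O}\Big(n + \frac{\kappa^2 d^{1/2}}{\epsilon} + \frac{n^{2/3}\kappa^{4/3} d^{1/3}}{\epsilon^{2/3}}\Big) \tag{4.3}$$

stochastic gradient evaluations.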

Remark 4.6. Recall that the gradient complexity of HMC is $\tilde{O}(n\kappa^2 d^{1/2}/\epsilon)$ and the gradient complexity of SG-HMC is $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$, both of which were recently proved in Cheng et al. (2017). It can be seen from Corollary 4.5 that the gradient complexity of our SVR-HMC algorithm has a better dependence on the dimension $d$.

Note that the gradient complexity of SVR-HMC in (4.3) depends on the relationship between the sample size $n$ and the precision parameter $\epsilon$. To make a thorough comparison with existing algorithms, we discuss our result for SVR-HMC in the following three regimes:

• When $n \lesssim \kappa d^{1/4}/\epsilon^{1/2}$, the gradient complexity of our algorithm is dominated by $\tilde{O}(\kappa^2 d^{1/2}/\epsilon)$, which is lower than that of the HMC algorithm by a factor of $O(n)$ and lower than that of the SG-HMC algorithm by a factor of $O(d^{1/2}/\epsilon)$.

• When $\kappa d^{1/4}/\epsilon^{1/2} \lesssim n \lesssim \kappa d\sigma^3/\epsilon^2$, the gradient complexity of our algorithm is dominated by $\tilde{O}(n^{2/3}\kappa^{4/3} d^{1/3}/\epsilon^{2/3})$. It improves that of HMC by a factor of $O(n^{1/3}\kappa^{2/3} d^{1/6}/\epsilon^{1/3})$, and is lower than that of SG-HMC by a factor of $O(\kappa^{2/3} d^{2/3}\sigma^2 n^{-2/3}/\epsilon^{4/3})$. Plugging the upper bound of $n$ into (4.3) yields $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$ gradient complexity, which still matches that of SG-HMC.

• When $n \gtrsim \kappa^4 d/\epsilon^2$, i.e., the sample size is extremely large, the gradient complexity of our algorithm is dominated by $\tilde{O}(n)$. It is still lower than that of HMC by a factor of $O(\kappa^2 d^{1/2}/\epsilon)$. Nonetheless, our algorithm has a higher gradient complexity than SG-HMC due to the extremely large sample size. This suggests that SG-HMC (Cheng et al., 2017) is the most suitable algorithm in this regime.
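The regime boundaries above can be checked numerically by evaluating the three terms of (4.3) and reporting the largest one. The sketch below ignores constants and logarithmic factors, so it only illustrates the scaling; the function names and dictionary keys are hypothetical.

```python
def svr_hmc_complexity_terms(n, kappa, d, eps):
    """The three terms of (4.3), up to constants and logarithmic factors."""
    return {
        "n": float(n),
        "kappa^2 d^(1/2) / eps": kappa**2 * d**0.5 / eps,
        "n^(2/3) kappa^(4/3) d^(1/3) / eps^(2/3)":
            n ** (2.0 / 3.0) * kappa ** (4.0 / 3.0) * d ** (1.0 / 3.0) / eps ** (2.0 / 3.0),
    }

def dominant_term(n, kappa, d, eps):
    """Name of the largest term, i.e. the one that governs the gradient complexity."""
    terms = svr_hmc_complexity_terms(n, kappa, d, eps)
    return max(terms, key=terms.get)

# Example with kappa = 10, d = 100, eps = 0.1:
# kappa * d^(1/4) / eps^(1/2) = 100 and kappa^4 * d / eps^2 = 1e8 separate the regimes.
small_n = dominant_term(10, 10.0, 100, 0.1)
medium_n = dominant_term(10**5, 10.0, 100, 0.1)
huge_n = dominant_term(10**9, 10.0, 100, 0.1)
```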

Moreover, from Corollary 4.5 we know that the optimal learning rate for SVR-HMC is of order $O(\epsilon^{2/3}/(\kappa^{1/3} d^{1/3} n^{2/3}))$, while the optimal learning rate for SG-HMC is of order $O(\epsilon^2/(\sigma^2 d\kappa))$ (Dalalyan, 2017), which is smaller than the learning rate of SVR-HMC when $n \le \kappa d\sigma^3/\epsilon^2$. This observation aligns with the effect of variance reduction in the field of optimization.


4.2. SVR-HMC for Sampling from General Log-concave Distributions

In this section, we extend the analysis of the proposed algorithm SVR-HMC to sampling from distributions that are only general log-concave but not strongly log-concave.

In detail, we want to sample from the distribution $\pi \propto e^{-f(x)}$, where $f$ is general convex and $L$-smooth. We follow a similar idea to Dalalyan (2014) and construct a strongly log-concave distribution by adding a quadratic regularizer to the convex and $L$-smooth function $f$, which yields

$$\bar{f}(x) = f(x) + \lambda\|x\|_2^2/2,$$

where $\lambda > 0$ is a regularization parameter. Clearly, $\bar{f}$ is $\lambda$-strongly convex and $(L + \lambda)$-smooth. Then we can apply Algorithm 1 to the function $\bar{f}$, which amounts to sampling from the modified target distribution $\bar{\pi} \propto e^{-\bar{f}}$. We will obtain a sequence $\{x_k\}_{k=0,\ldots,K}$, whose distribution converges to the unique stationary distribution of Hamiltonian dynamics (1.3), denoted by $\bar{\pi}$. According to Neal et al. (2011), $\bar{\pi}$ is proportional to $e^{-\bar{f}(x)}$, i.e.,

$$\bar{\pi} \propto \exp\big(-\bar{f}(x)\big) = \exp\Big(-f(x) - \frac{\lambda}{2}\|x\|_2^2\Big).$$
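In code, the quadratic regularizer amounts to adding $\lambda x$ to every component gradient, because adding $\lambda\|x\|_2^2/2$ to each $f_i$ adds the same term to their average $f$. The wrapper below is a small sketch under that reading; the name `regularized_grad_fi` and the callback signature `grad_fi(i, x)` are hypothetical.

```python
import numpy as np

def regularized_grad_fi(grad_fi, lam):
    """Wrap component gradients of f so they become component gradients of f + lam * ||x||^2 / 2."""
    def grad(i, x):
        return grad_fi(i, x) + lam * x
    return grad

# Example: a component with zero gradient, so only the regularizer contributes.
flat_grad = lambda i, x: np.zeros_like(x)
reg_grad = regularized_grad_fi(flat_grad, lam=0.5)
value = reg_grad(0, np.array([2.0, 4.0]))
```

The wrapped callback can then be passed to any sampler that only needs component gradients, with the smoothness constant taken as $L + \lambda$.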

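Denote the distribution of $x_k$ by $P(x_k)$. By the triangle inequality, we have

$$W_2\big(P(x_k), \pi\big) \le W_2\big(P(x_k), \bar{\pi}\big) + W_2\big(\bar{\pi}, \pi\big). \tag{4.4}$$

The first term on the right-hand side is controlled by Theorem 4.3 applied to $\bar{f}$. Therefore, to bound the 2-Wasserstein distance between $P(x_k)$ and the desired distribution $\pi$, we only need to upper bound the 2-Wasserstein distance between the two Gibbs distributions $\bar{\pi}$ and $\pi$. Before we present our theoretical characterization of this distance, we first lay down the following assumption.

Assumption 4.7. Regarding the distribution $\pi \propto e^{-f}$, its fourth-order moment is upper bounded, i.e., there exists a constant $\bar{U}$ such that $E_\pi[\|x\|_2^4] \le \bar{U} d^2$.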

The following theorem spells out the convergence rate of SVR-HMC for sampling from a general log-concave distribution.

Theorem 4.8. Under Assumptions 4.1 and 4.7, in order to sample from a general log-concave density $\pi \propto e^{-f(x)}$, the output of Algorithm 1 when applied to $\bar{f}(x) = f(x) + \lambda\|x\|_2^2/2$ satisfies $W_2\big(P(x_k), \pi\big) \le \epsilon$ after

$$\tilde{O}\Big(n + \frac{d^{11/2}}{\epsilon^6} + \frac{d^{11/3} n^{2/3}}{\epsilon^4}\Big)$$

gradient evaluations.

Regarding sampling from a smooth and general log-concave distribution, to the best of our knowledge, there is no existing theoretical analysis of the convergence of LMC algorithms in 2-Wasserstein distance. Yet convergence analyses of LMC methods in total variation distance (Dalalyan, 2014; Durmus et al., 2017) and KL-divergence (Cheng & Bartlett, 2017) have recently been established. In detail, Dalalyan (2014) proved a convergence rate of $O(d^3/\epsilon^4)$ in total variation distance for LMC with general log-concave distributions, which implies $O(nd^3/\epsilon^4)$ gradient complexity. Durmus et al. (2017) improved the gradient complexity of LMC in total variation distance to $O(nd^5/\epsilon^2)$. Cheng & Bartlett (2017) proved the convergence of LMC in KL-divergence, which attains $O(nd/\epsilon^3)$ gradient complexity. It is worth noting that our convergence rate in 2-Wasserstein distance is not directly comparable to the aforementioned existing results.

5. Experiments

In this section, we compare the proposed algorithm (SVR-HMC) with state-of-the-art MCMC algorithms for Bayesian learning. To compare the convergence rates of different MCMC algorithms, we conduct experiments on both synthetic data and real data.

We compare our algorithm with SGLD (Welling & Teh, 2011), VR-SGLD (Dubey et al., 2016), HMC (Cheng et al., 2017) and SG-HMC (Cheng et al., 2017).

5.1. Simulation Based on Synthetic Data

On the synthetic data, we construct each component function to be $f_i(x) = (x - a_i)^\top \Sigma (x - a_i)/2$, where $a_i$ is a Gaussian random vector drawn from the distribution $N(2, 4 \times I_{d\times d})$, and $\Sigma$ is a positive definite symmetric matrix with maximum eigenvalue $L = 3/2$ and minimum eigenvalue $\mu = 2/3$. Note that each random vector $a_i$ leads to a particular component function $f_i(x)$. Then it can be observed that the target density $\pi \propto \exp\big(-\frac{1}{n}\sum_{i=1}^{n} f_i(x)\big) \propto \exp\big(-(x - \bar{a})^\top \Sigma (x - \bar{a})/2\big)$ is a multivariate Gaussian distribution with mean $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$ and covariance matrix $\Sigma^{-1}$. Moreover, the negative log density $f(x)$ is $L$-smooth and $\mu$-strongly convex.
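A synthetic problem of this form can be generated as follows. The paper fixes only the extreme eigenvalues of $\Sigma$ and the distribution of the $a_i$, so the random orthogonal basis, the linearly spaced spectrum, the reading of the mean "2" as a vector with all entries equal to 2, and the function name `make_synthetic_problem` are assumptions of this sketch.

```python
import numpy as np

def make_synthetic_problem(n, d, seed=0, L=1.5, mu=2.0 / 3.0):
    """Build centers a_i, a matrix Sigma with spectrum in [mu, L], and component gradients.

    Requires d >= 2 so that both extreme eigenvalues are attained.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(loc=2.0, scale=2.0, size=(n, d))  # N(2, 4 I): standard deviation 2
    basis, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis (assumed)
    eigvals = np.linspace(mu, L, d)                        # assumed spacing of the spectrum
    sigma = (basis * eigvals) @ basis.T
    sigma = 0.5 * (sigma + sigma.T)                        # symmetrize against round-off

    def grad_fi(i, x):
        # Gradient of f_i(x) = (x - a_i)^T Sigma (x - a_i) / 2.
        return sigma @ (x - centers[i])

    return centers, sigma, grad_fi

# Example: the full gradient vanishes at the mean of the centers, the mode of the target.
centers, sigma, grad_fi = make_synthetic_problem(n=50, d=10)
mode = centers.mean(axis=0)
full_grad_at_mode = sum(grad_fi(i, mode) for i in range(50)) / 50
```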

In our simulation, we investigate different dimensions $d$ and numbers of component functions $n$, and show the 2-Wasserstein distance between the target distribution $\pi$ and that of the output of different algorithms with respect to the number of data passes. In order to estimate the 2-Wasserstein distance between the distribution of each iterate and the target one, we repeat all algorithms 20,000 times and obtain 20,000 random samples for each algorithm in each iteration. In Figure 1, we present the convergence results for three HMC based algorithms (HMC, SG-HMC and SVR-HMC). It is evident that SVR-HMC performs the best among these three algorithms when $n$ is not too large, and its performance becomes close to that of SG-HMC when the number of component functions is increased. This


[Figure 1: eight log-scale convergence plots, with panels (a) d = 10, n = 50; (b) d = 10, n = 100; (c) d = 10, n = 1000; (d) d = 10, n = 5000; (e) d = 50, n = 50; (f) d = 50, n = 100; (g) d = 50, n = 1000; (h) d = 50, n = 5000.]

Figure 1. Numerical results for synthetic data, where we compare 3 different algorithms and show their convergence performance in 2-Wasserstein distance. (a)-(h) correspond to different dimensions d and sample sizes n.

phenomenon is well aligned with our theoretical analysis, since the gradient complexity of our algorithm can be worse than that of SG-HMC when the sample size $n$ is extremely large.

5.2. Bayesian Logistic Regression for Classification

Table 2. Summary of datasets for Bayesian classification

| Dataset | pima | a3a | gisette | mushroom |
| --- | --- | --- | --- | --- |
| n (training) | 384 | 3185 | 6000 | 4062 |
| n (test) | 384 | 29376 | 1000 | 4062 |
| d | 8 | 122 | 5000 | 112 |

Now, we apply our algorithm to Bayesian logistic regression problems. In logistic regression, given $n$ i.i.d. examples $\{a_i, y_i\}_{i=1,\ldots,n}$, where $a_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the features and binary labels respectively, the probability mass function of $y_i$ given the feature $a_i$ is modeled as $p(y_i | a_i, x) = 1/\big(1 + e^{-y_i x^\top a_i}\big)$, where $x \in \mathbb{R}^d$ is the regression parameter. Considering the prior $p(x) = N(0, \lambda^{-1} I)$, the posterior distribution takes the form

$$p(x | A, Y) \propto p(Y | A, x)\, p(x) = \prod_{i=1}^{n} p(y_i | a_i, x)\, p(x),$$

where $A = [a_1, a_2, \ldots, a_n]^\top$ and $Y = [y_1, y_2, \ldots, y_n]^\top$. The posterior distribution can be written as $p(x | A, Y) \propto e^{-f(x)}$ with $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ as in (1.5), where each $f_i(x)$ is of the following form

$$f_i(x) = n \log\big(1 + \exp(-y_i x^\top a_i)\big) + \frac{\lambda}{2}\|x\|_2^2.$$
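The component function $f_i$ and its gradient, $\nabla f_i(x) = -\,n\, y_i\, a_i/(1 + e^{y_i x^\top a_i}) + \lambda x$, can be evaluated stably as sketched below. The code assumes labels in $\{-1, +1\}$, which is what the stated likelihood $1/(1 + e^{-y_i x^\top a_i})$ requires; the function names are hypothetical.

```python
import numpy as np

def logistic_fi(x, a_i, y_i, n, lam):
    """f_i(x) = n * log(1 + exp(-y_i * x^T a_i)) + lam / 2 * ||x||^2, with y_i in {-1, +1}."""
    margin = y_i * (a_i @ x)
    return n * np.logaddexp(0.0, -margin) + 0.5 * lam * (x @ x)

def logistic_grad_fi(x, a_i, y_i, n, lam):
    """Gradient of logistic_fi with respect to x."""
    margin = y_i * (a_i @ x)
    # 1 / (1 + exp(margin)) written with tanh so large margins do not overflow.
    weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
    return -n * y_i * weight * a_i + lam * x

# Example: compare the analytic gradient with central finite differences.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
a_i = rng.standard_normal(4)
analytic = logistic_grad_fi(x, a_i, -1.0, 10, 0.5)
step = 1e-6
numeric = np.array([
    (logistic_fi(x + step * e, a_i, -1.0, 10, 0.5) - logistic_fi(x - step * e, a_i, -1.0, 10, 0.5)) / (2 * step)
    for e in np.eye(4)
])
```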

We use four binary classification datasets from Libsvm (Chang & Lin, 2011) and the UCI machine learning repository (Lichman, 2013), which are summarized in Table 2. Note that pima and mushroom do not have test data in their original version, and we split them into 50% for training and 50% for test. Following Welling & Teh (2011); Chen et al. (2014; 2015), we report the sample path average and discard the first 50 iterations as burn-in. It is worth noting that we observe a similar convergence comparison of the different algorithms for a larger burn-in period ($= 10^4$). We run each algorithm 20 times and report the averaged results for comparison. Note that variance reduction based algorithms (i.e., VR-SGLD and SVR-HMC) require the first data pass to compute one full gradient. Therefore, in Figure 2, the plots of VR-SGLD and SVR-HMC start from the second data pass while the plots of SGLD and SG-HMC start from the first data pass. It can be clearly seen that our proposed algorithm is able to converge faster than SGLD and SG-HMC on all datasets, which validates our theoretical analysis of the convergence rate. In addition, although there is no existing non-asymptotic theoretical guarantee for VR-SGLD when the target distribution is strongly log-concave, from Figure 2 we can observe that SVR-HMC also outperforms VR-SGLD on these four datasets, which again demonstrates the superior performance of our algorithm. This clearly shows the advantage of our algorithm for Bayesian learning.


[Figure 2: four plots comparing SGLD, SGHMC, VR-SGLD and SVR-HMC, with panels (a) pima; (b) a3a; (c) gisette; (d) mushroom.]

Figure 2. Comparison of different algorithms for Bayesian logistic regression, where the y axis shows the negative log-likelihood on the test data, and the x axis is the number of data passes. (a)-(d) correspond to 4 datasets.

[Figure 3 panels: (a) geographical, (b) noise, (c) parkinson, (d) toms. The y-axis is on a logarithmic scale.]

Figure 3. Comparison of different algorithms for Bayesian linear regression, where the y-axis is the mean square error on the test data and the x-axis is the number of data passes. Panels (a)-(d) correspond to the 4 datasets.

Table 3. Test error of different algorithms for Bayesian classification after 10 entire data passes on 4 datasets.

| Dataset | pima | a3a | gisette | mushroom |
|---|---|---|---|---|
| SGLD | 0.2314 ± 0.0044 | 0.1594 ± 0.0018 | 0.0098 ± 0.0009 | (6.647 ± 2.251) × 10^-4 |
| SGHMC | 0.2306 ± 0.0079 | 0.1591 ± 0.0044 | 0.0096 ± 0.0006 | (5.916 ± 2.734) × 10^-4 |
| VR-SGLD | 0.2299 ± 0.0056 | 0.1572 ± 0.0012 | 0.0105 ± 0.0006 | (7.755 ± 3.231) × 10^-4 |
| SVR-HMC | 0.2289 ± 0.0043 | 0.1570 ± 0.0019 | 0.0093 ± 0.0011 | (6.278 ± 3.149) × 10^-4 |

Table 4. Summary of datasets for Bayesian linear regression.

| Dataset | geographical | noise | parkinson | toms |
|---|---|---|---|---|
| n | 1059 | 1503 | 5875 | 45730 |
| d | 69 | 5 | 21 | 96 |

5.3. Bayesian Linear Regression

We also apply our algorithm to Bayesian linear regression and compare it with the baseline algorithms. Similar to Bayesian classification, given i.i.d. examples $\{a_i, y_i\}_{i=1,\dots,n}$ with $y_i \in \mathbb{R}$, the likelihood of Bayesian linear regression is $p(y_i | a_i, x) = N(x^\top a_i, \sigma_a^2)$ and the prior is $N(0, \lambda^{-1} I)$. We use 4 datasets, which are summarized in Table 4. In our experiment, we set $\sigma_a^2 = 1$ and $\lambda = 1$, and normalize the original data. In addition, we split each dataset evenly into training and test data. As before, we compute the sample path average while treating the first 50 iterates as burn-in. We report the mean square errors on the test data for these 4 datasets in Figure 3 for the different algorithms. Our algorithm is faster than all the other baseline algorithms on all the datasets, which further illustrates the advantage of our algorithm for Bayesian learning.
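The regression model above can be turned into the component gradients that a stochastic gradient sampler needs. The following is a minimal sketch, assuming the potential is written as a finite sum $f(x) = \sum_i f_i(x)$ with the Gaussian prior split evenly across the $n$ components; this splitting and the function names `component_grad` and `full_grad` are illustrative assumptions, not taken from the paper.

```python
def component_grad(x, a_i, y_i, n, sigma2=1.0, lam=1.0):
    """Gradient of f_i(x) = (y_i - x^T a_i)^2 / (2*sigma2) + lam * ||x||^2 / (2*n).

    Assumption: the prior N(0, I/lam) is shared equally by the n components,
    so that sum_i f_i is the negative log-posterior up to a constant.
    """
    residual = sum(xj * aj for xj, aj in zip(x, a_i)) - y_i
    return [residual * aj / sigma2 + lam * xj / n for xj, aj in zip(x, a_i)]


def full_grad(x, A, y, sigma2=1.0, lam=1.0):
    """Full gradient, obtained by summing all n component gradients."""
    n = len(A)
    g = [0.0] * len(x)
    for a_i, y_i in zip(A, y):
        g_i = component_grad(x, a_i, y_i, n, sigma2, lam)
        g = [gj + gij for gj, gij in zip(g, g_i)]
    return g


# Toy data: the posterior mean solves (A^T A + lam * I) x = A^T y.
A = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
grad_at_mean = full_grad([0.5, 1.0], A, y)
```

With $\sigma_a^2 = \lambda = 1$ as in the experiments, the full gradient vanishes at the posterior mean of the toy data, which is a quick sanity check on the signs.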

6. Conclusions and Future Work

We propose a stochastic variance-reduced Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm enjoys a faster rate of convergence and better gradient complexity than state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions. Experiments on both synthetic and real data verify the superior performance of our algorithm. In the future, we will extend our algorithm to non-log-concave distributions and study symplectic integration techniques such as leap-frog integration for Bayesian posterior sampling.


Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1618948, IIS-1652539 and SaTC CNS-1717950. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Ahn, S., Balan, A. K., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine learning, 50(1-2):5–43, 2003.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439, 2017.

Bakry, D., Gentil, I., and Ledoux, M. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.

Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):27, 2011.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278–2286, 2015.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.

Cheng, X. and Bartlett, P. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.

Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

Chiang, T.-S., Hwang, C.-R., and Sheu, S. J. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.

Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. arXiv preprint arXiv:1412.7392, 2014.

Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752, 2017.

Dalalyan, A. S. and Karagulyan, A. G. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154–1162, 2016.

Durmus, A. and Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. 2016a.

Durmus, A. and Moulines, E. Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016b.

Durmus, A., Moulines, E., et al. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.

Eberle, A., Guillin, A., and Zimmer, R. Couplings and quantitative contraction rates for Langevin dynamics. arXiv preprint arXiv:1703.01617, 2017.

Jarner, S. F. and Hansen, E. Geometric ergodicity of Metropolis algorithms. Stochastic processes and their applications, 85(2):341–361, 2000.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kloeden, P. E. and Platen, E. Higher-order implicit strong numerical schemes for stochastic differential equations. Journal of statistical physics, 66(1):283–314, 1992.

Lei, L. and Jordan, M. I. Less than a single pass: Stochastically controlled stochastic gradient method. arXiv preprint arXiv:1609.03261, 2016.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2345–2355, 2017.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Øksendal, B. Stochastic differential equations. In Stochastic differential equations, pp. 65–84. Springer, 2003.

Parisi, G. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314–323, 2016.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4(4):337–357, 2002.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363, 1996.

Roux, N. L., Schmidt, M., and Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.

Stramer, O. and Tweedie, R. Langevin-type models I: Diffusions with given stationary distributions and their discretizations. Methodology and Computing in Applied Probability, 1(3):283–306, 1999a.

Stramer, O. and Tweedie, R. Langevin-type models II: Self-targeting candidates for MCMC algorithms. Methodology and Computing in Applied Probability, 1(3):307–328, 1999b.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Xu, P., Chen, J., Zou, D., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. arXiv preprint arXiv:1707.06618, 2017.

Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.



A. Proof of Main Theory

In this section, we present our theoretical analysis of the proposed SVR-HMC algorithm. Before we present the proof of our main theorem, we introduce some notation for ease of presentation. We use $\mathcal{S}_\eta$ to denote the one-step SVR-HMC update in (3.1) with step size $\eta$, i.e., $x_{k+1} = \mathcal{S}_\eta x_k$ and $v_{k+1} = \mathcal{S}_\eta v_k$. Similarly, we define an operator $\mathcal{G}_\eta$ which also performs a one-step update with step size $\eta$, but replaces the semi-stochastic gradient in (3.1) with the full gradient. Specifically, we have

$$
\begin{aligned}
\mathcal{G}_\eta v_k &= v_k - \gamma\eta v_k - \eta u \nabla f(x_k) + \epsilon_k^v, \\
\mathcal{G}_\eta x_k &= x_k + \eta v_k + \epsilon_k^x,
\end{aligned} \tag{A.1}
$$

for any $x_k, v_k \in \mathbb{R}^d$, where $\epsilon_k^v$ and $\epsilon_k^x$ are the same as defined in Algorithm 1. Next, we define an operator $\mathcal{L}_\eta$ which represents the integration of the continuous dynamics (1.3) over a time interval of length $\eta$. Specifically, for any starting point $V_0$ and $X_0$, integrating (1.3) over the time interval $(0, \eta)$ yields the following equations:

$$
V_t = \mathcal{L}_t V_0 = V_0 e^{-\gamma t} - u\bigg(\int_0^t e^{-\gamma(t-s)} \nabla f(X_s)\,ds\bigg) + \sqrt{2\gamma u}\int_0^t e^{-\gamma(t-s)}\,dB_s, \tag{A.2}
$$

$$
X_t = \mathcal{L}_t X_0 = X_0 + \int_0^t V_s\,ds. \tag{A.3}
$$

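(A.2) and (A.3) give an implicit solution of the dynamics (1.3), which can be easily verified by taking derivatives of these two equations (Cheng et al., 2017). The following lemma characterizes the mean and covariance of the Brownian motion terms.

Lemma A.1. (Cheng et al., 2017) The additive Brownian motion in (A.2), denoted by $\epsilon^v = \sqrt{2\gamma u}\int_0^t e^{-\gamma(t-s)}\,dB_s$, has mean 0 and covariance matrix

$$
\mathbb{E}[\epsilon^v(\epsilon^v)^\top] = 2\gamma u\,\mathbb{E}\bigg[\int_0^t e^{-\gamma(t-s)}\,dB_s \int_0^t e^{-\gamma(t-s)}\,dB_s^\top\bigg] = u(1 - e^{-2\gamma t}) \cdot I_{d\times d}.
$$

Note that there also exists a hidden Brownian motion term in (A.3), which comes from the velocity $V_s$, denoted by $\epsilon^x = \sqrt{2\gamma u}\int_0^t\int_0^s e^{-\gamma(s-r)}\,dB_r\,ds$, having mean 0 and covariance matrix

$$
\begin{aligned}
\mathbb{E}[\epsilon^x(\epsilon^x)^\top] &= 2\gamma u\,\mathbb{E}\bigg[\int_0^t\int_0^s e^{-\gamma(s-r)}\,dB_r\,ds \int_0^t\int_0^s e^{-\gamma(s-r)}\,dB_r^\top\,ds\bigg] \\
&= \frac{u}{\gamma^2}\big(2\gamma t + 4e^{-\gamma t} - e^{-2\gamma t} - 3\big) \cdot I_{d\times d}.
\end{aligned}
$$

In addition, $\epsilon^v$ and $\epsilon^x$ have the following cross-covariance

$$
\begin{aligned}
\mathbb{E}[\epsilon^v(\epsilon^x)^\top] &= 2\gamma u\,\mathbb{E}\bigg[\int_0^t e^{-\gamma(t-s)}\,dB_s \int_0^t\int_0^s e^{-\gamma(s-r)}\,dB_r^\top\,ds\bigg] \\
&= \frac{u}{\gamma}\big(1 - 2e^{-\gamma t} + e^{-2\gamma t}\big) \cdot I_{d\times d}.
\end{aligned}
$$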

Recall the independent Gaussian random vectors $\epsilon_k^v$ and $\epsilon_k^x$ used in each iteration of Algorithm 1. They all have zero mean, and the covariance matrices defined in (3.3) have exactly the same form as the covariance matrices in Lemma A.1 when $t = \eta$. Due to this property, we use a synchronous coupling technique that ensures the Gaussian random vectors in each one-step update of the discrete algorithm, i.e., $\mathcal{S}_\eta x$ and $\mathcal{S}_\eta v$, are exactly the additive Brownian motion terms in the one-step integration of the continuous dynamics $\mathcal{L}_\eta x$ and $\mathcal{L}_\eta v$. The shared Brownian motions between $\mathcal{S}_\eta v$ and $\mathcal{L}_\eta v$ ($\mathcal{S}_\eta x$ and $\mathcal{L}_\eta x$) are pivotal to our analysis. Similar coupling techniques are also used in Eberle et al. (2017); Cheng et al. (2017).
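To make the noise structure concrete, the following sketch draws a correlated pair of noise vectors with the per-coordinate covariances of Lemma A.1 at $t = \eta$ and applies one full-gradient update of the form (A.1). It is an illustration under stated assumptions rather than the authors' implementation: the function names, the use of a 2x2 Cholesky factor per coordinate, and the `grad_f` callback are all hypothetical choices.

```python
import math
import random


def noise_covariances(gamma, u, t):
    """Per-coordinate variances and cross-covariance from Lemma A.1."""
    e1 = math.exp(-gamma * t)
    e2 = math.exp(-2.0 * gamma * t)
    var_v = u * (1.0 - e2)
    var_x = u / gamma ** 2 * (2.0 * gamma * t + 4.0 * e1 - e2 - 3.0)
    cov_vx = u / gamma * (1.0 - 2.0 * e1 + e2)
    return var_v, var_x, cov_vx


def sample_noise(d, gamma, u, eta, rng):
    """Draw (eps_v, eps_x) in R^d; coordinates are i.i.d., each pair correlated."""
    var_v, var_x, cov_vx = noise_covariances(gamma, u, eta)
    # Cholesky factor of the 2x2 matrix [[var_v, cov_vx], [cov_vx, var_x]].
    l11 = math.sqrt(var_v)
    l21 = cov_vx / l11
    l22 = math.sqrt(max(var_x - l21 * l21, 0.0))
    eps_v, eps_x = [], []
    for _ in range(d):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        eps_v.append(l11 * g1)
        eps_x.append(l21 * g1 + l22 * g2)
    return eps_v, eps_x


def full_gradient_step(x, v, grad_f, gamma, u, eta, eps_v, eps_x):
    """One application of the full-gradient update (A.1) to (x, v)."""
    g = grad_f(x)
    v_next = [vi - gamma * eta * vi - eta * u * gi + ev
              for vi, gi, ev in zip(v, g, eps_v)]
    x_next = [xi + eta * vi + ex for xi, vi, ex in zip(x, v, eps_x)]
    return x_next, v_next


rng = random.Random(0)
eps_v, eps_x = sample_noise(d=3, gamma=2.0, u=1.0, eta=0.1, rng=rng)
x1, v1 = full_gradient_step([1.0, 0.0, -1.0], [0.0, 0.0, 0.0],
                            lambda x: x, 2.0, 1.0, 0.1, eps_v, eps_x)
```

Here `grad_f = lambda x: x` corresponds to the toy potential $f(x) = \|x\|_2^2/2$. Passing the noise in explicitly mirrors the coupling argument: the same pair can be reused for a second chain.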

A.1. Proof of Theorem 4.3

We first lay down some technical lemmas that are useful in our proof. The first lemma characterizes the discretization errorbetween the full gradient-based HMC update in (A.1) and the solutions of continuous Hamiltonian dynamics (1.3).

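Lemma A.2. Under Assumptions 4.1 and 4.2, consider the one-step discrete update (A.1) and the Langevin diffusion (A.2)-(A.3) starting from the point $(x_k, v_k)$. Then the discretization errors for position and velocity are bounded by

$$
\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2] \le \eta^4\bigg[\bigg(\frac{2\gamma^2 + 2uL}{3}\bigg)U_v + \frac{4u^2L}{3}U_f + \frac{8u^2L\gamma d\eta}{3}\bigg] \triangleq D_1\eta^4,
$$

$$
\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2] \le \eta^4\bigg[\bigg(\frac{3\gamma^4}{4} + u^2L^2\bigg)U_v + \bigg(\frac{3u^2\gamma^2L}{2} + 4u^3L^2\bigg)U_f + 4u^3L^2\eta\gamma d\bigg] \triangleq D_2\eta^4,
$$

where the parameters $U_v$ and $U_f$ are of order $O(d/\mu)$ and $O(d\kappa)$, respectively.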

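The difference between our SVR-HMC update and the full gradient-based HMC update in (A.1) can be characterized by the following lemma.

Lemma A.3. Under Assumptions 4.1 and 4.2, for any $x_k, v_k \in \mathbb{R}^d$, we have

$$
\mathbb{E}[\|\mathcal{S}_\eta x_k - \mathcal{G}_\eta x_k\|_2^2] = 0, \tag{A.4}
$$

$$
\mathbb{E}[\|\mathcal{S}_\eta v_k - \mathcal{G}_\eta v_k\|_2^2] \le 2\eta^4m^2u^2L^2(U_v + \gamma ud) \triangleq D_3u^2L^2m^2\eta^4. \tag{A.5}
$$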

The following lemma shows the contraction property of the diffusion operator in terms of the coupled $\ell_2$ norm.

Lemma A.4. (Cheng et al., 2017) Under Assumptions 4.1 and 4.2, let $z = (x^\top, (x + v)^\top)^\top \in \mathbb{R}^{2d}$ and $\mathcal{L}_t z = ((\mathcal{L}_t x)^\top, (\mathcal{L}_t x + \mathcal{L}_t v)^\top)^\top$. Set $\gamma = 2$ and $u = 1/L$ in (A.2)-(A.3). Starting from two different points $z_1$ and $z_2$, the continuous-time dynamics after time $t$ satisfy

$$
\mathbb{E}[\|\mathcal{L}_t z_1 - \mathcal{L}_t z_2\|_2^2] \le e^{-t/\kappa}\,\mathbb{E}[\|z_1 - z_2\|_2^2],
$$

where the diffusion operators on $z_1$ and $z_2$ share the same Brownian motion, and $\kappa = L/\mu$ denotes the condition number.

For the operator $\mathcal{L}_\eta$, we denote by $\mathcal{L}_\eta^k x = \mathcal{L}_\eta \circ \mathcal{L}_\eta \circ \cdots \circ \mathcal{L}_\eta x$ the result after $\mathcal{L}_\eta$ operates $k$ times starting at $x$. The following lemma is useful for characterizing the distance $\mathbb{E}[\|z_k - \mathcal{L}_\eta^k z^\pi\|_2^2]$ through a recursive argument, where $z^\pi = \big((x^\pi)^\top, (x^\pi + v^\pi)^\top\big)^\top$.

Lemma A.5. (Dalalyan & Karagulyan, 2017) Let $A$, $B$ and $C$ be given non-negative numbers such that $A \in (0, 1)$. Assume that the sequence of non-negative numbers $\{x_k\}_{k=0,1,2,\dots}$ satisfies the recursive inequality

$$
x_{k+1}^2 \le [(1 - A)x_k + C]^2 + B^2,
$$

for every integer $k \ge 0$. Then, for all integers $k \ge 0$,

$$
x_k \le (1 - A)^k x_0 + \frac{C}{A} + \frac{B}{\sqrt{A}}.
$$
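As a quick numerical illustration of Lemma A.5 (not a proof), the sketch below iterates the recursion with equality, which is the largest sequence the hypothesis allows, and compares each iterate with the stated bound. The helper names and the parameter values are arbitrary choices for illustration.

```python
import math


def iterate_worst_case(x0, A, B, C, steps):
    """Iterate x_{k+1} = sqrt(((1 - A) x_k + C)^2 + B^2), i.e. the recursion with equality."""
    xs = [x0]
    for _ in range(steps):
        xs.append(math.sqrt(((1.0 - A) * xs[-1] + C) ** 2 + B ** 2))
    return xs


def lemma_a5_bound(x0, A, B, C, k):
    """Right-hand side of Lemma A.5 at index k."""
    return (1.0 - A) ** k * x0 + C / A + B / math.sqrt(A)


xs = iterate_worst_case(x0=10.0, A=0.1, B=1.0, C=0.5, steps=200)
holds = all(x <= lemma_a5_bound(10.0, 0.1, 1.0, 0.5, k) for k, x in enumerate(xs))
```

In the proof of Theorem 4.3 the roles are $1 - A = e^{-\eta/(2\kappa)}$, $C = (2\sqrt{D_1} + \sqrt{D_2})\eta^2$ and $B = \sqrt{D_3}\,m\eta^2$.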

Lemma A.6. For any two random vectors $X, Y \in \mathbb{R}^d$, the following holds:

$$
\mathbb{E}[\|X + Y\|_2^2] \le \Big(\sqrt{\mathbb{E}[\|X\|_2^2]} + \sqrt{\mathbb{E}[\|Y\|_2^2]}\Big)^2.
$$

Based on all the above lemmas, we are now ready to prove Theorem 4.3.

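Proof of Theorem 4.3. Let $z^\pi$ denote a random variable with distribution $\pi_z$. Then we have

$$
\begin{aligned}
\mathbb{E}[\|z_{k+1} - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2] &= \mathbb{E}[\|z_{k+1} - \mathcal{G}_\eta z_k + \mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2] \\
&= \mathbb{E}[\|z_{k+1} - \mathcal{G}_\eta z_k\|_2^2 + 2\langle z_{k+1} - \mathcal{G}_\eta z_k, \mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\rangle + \|\mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2] \\
&= \mathbb{E}[\|z_{k+1} - \mathcal{G}_\eta z_k\|_2^2 + \|\mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2],
\end{aligned} \tag{A.6}
$$

where the last equality follows from the fact that $\mathbb{E}[\langle z_{k+1} - \mathcal{G}_\eta z_k, \mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\rangle] = \mathbb{E}\big[\mathbb{E}_{i_k}[\langle z_{k+1} - \mathcal{G}_\eta z_k, \mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\rangle]\big] = 0$ and $\mathbb{E}_{i_k}[z_{k+1}] = \mathcal{G}_\eta z_k$. Note that $z_{k+1} = (x_{k+1}^\top, (x_{k+1} + v_{k+1})^\top)^\top$, thus

\begin{align}
\mathbb{E}[\|z_{k+1} - \mathcal{G}_\eta z_k\|_2^2] &= \mathbb{E}[\|x_{k+1} - \mathcal{G}_\eta x_k\|_2^2] + \mathbb{E}[\|x_{k+1} + v_{k+1} - \mathcal{G}_\eta(x_k + v_k)\|_2^2] \notag \\
&= \mathbb{E}[\|v_{k+1} - \mathcal{G}_\eta v_k\|_2^2] \tag{A.7} \\
&\le D_3m^2\eta^4, \tag{A.8}
\end{align}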


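where the second equality follows from $x_{k+1} = \mathcal{G}_\eta x_k$, and the inequality follows from Lemma A.3 and the fact that $uL = 1$. The second term on the R.H.S. of (A.6) can be further bounded as follows:

$$
\begin{aligned}
\mathbb{E}[\|\mathcal{G}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2] &= \mathbb{E}[\|\mathcal{G}_\eta z_k - \mathcal{L}_\eta z_k + \mathcal{L}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2] \\
&\le \Big(\sqrt{\mathbb{E}[\|\mathcal{G}_\eta z_k - \mathcal{L}_\eta z_k\|_2^2]} + \sqrt{\mathbb{E}[\|\mathcal{L}_\eta z_k - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2]}\Big)^2 \\
&\le \Big(\sqrt{\mathbb{E}[\|\mathcal{G}_\eta z_k - \mathcal{L}_\eta z_k\|_2^2]} + e^{-\eta/(2\kappa)}\sqrt{\mathbb{E}[\|z_k - \mathcal{L}_\eta^k z^\pi\|_2^2]}\Big)^2,
\end{aligned} \tag{A.9}
$$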

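where the first inequality holds due to Lemma A.6 and the second inequality follows from Lemma A.4. We further have

\begin{align}
\sqrt{\mathbb{E}[\|\mathcal{G}_\eta z_k - \mathcal{L}_\eta z_k\|_2^2]} &= \sqrt{\mathbb{E}[\|\mathcal{G}_\eta(v_k + x_k) - \mathcal{L}_\eta(v_k + x_k)\|_2^2] + \mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2]} \notag \\
&\le \sqrt{\mathbb{E}[\|\mathcal{G}_\eta(v_k + x_k) - \mathcal{L}_\eta(v_k + x_k)\|_2^2]} + \sqrt{\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2]} \notag \\
&\le 2\sqrt{\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2]} + \sqrt{\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2]} \tag{A.10} \\
&\le 2\sqrt{D_1}\eta^2 + \sqrt{D_2}\eta^2, \tag{A.11}
\end{align}

where the first inequality is due to $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, the second inequality is due to (B.8), and the last inequality comes from Lemma A.2. Here $D_1$ and $D_2$ are constants which are both of order $O(d/\mu)$. Denote $w_{k+1}^2 = \mathbb{E}[\|z_{k+1} - \mathcal{L}_\eta^{k+1} z^\pi\|_2^2]$. Substituting (A.8), (A.9) and (A.11) into (A.6) yields

$$
w_{k+1}^2 \le \big(e^{-\eta/(2\kappa)}w_k + 2\sqrt{D_1}\eta^2 + \sqrt{D_2}\eta^2\big)^2 + D_3m^2\eta^4. \tag{A.12}
$$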

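Then, by Lemma A.5, $w_k$ can be bounded by

$$
w_k \le e^{-k\eta/(2\kappa)}w_0 + \frac{2\sqrt{D_1}\eta^2 + \sqrt{D_2}\eta^2}{1 - e^{-\eta/(2\kappa)}} + \frac{\sqrt{D_3}m\eta^2}{\sqrt{1 - e^{-\eta/(2\kappa)}}}.
$$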

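Note that the above results rely on the shared Brownian motion in the discrete update and the continuous Langevin diffusion, i.e., we assume identical Brownian motion sequences are used in the updates $z_k = \mathcal{S}_\eta^k z_0$ and $\mathcal{L}_\eta^k z^\pi$. Since $z^\pi$ follows the stationary distribution $\pi_z$, $\mathcal{L}_\eta^k z^\pi$ follows $\pi_z$ as well. According to the definition of the 2-Wasserstein distance, we have

$$
\begin{aligned}
W_2\big(P(z_k), \pi_z\big) &= \bigg(\inf_{\zeta \in \Gamma(z_k, \mathcal{L}_\eta^k z^\pi)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|z_k - \mathcal{L}_\eta^k z^\pi\|_2^2\,d\zeta(z_k, \mathcal{L}_\eta^k z^\pi)\bigg)^{1/2} \\
&\le \sqrt{\mathbb{E}[\|z_k - \mathcal{L}_\eta^k z^\pi\|_2^2]} \\
&= w_k,
\end{aligned}
$$

which further implies that

$$
W_2\big(P(z_K), \pi_z\big) \le w_K \le e^{-K\eta/(2\kappa)}w_0 + \frac{2\sqrt{D_1}\eta^2 + \sqrt{D_2}\eta^2}{1 - e^{-\eta/(2\kappa)}} + \frac{\sqrt{D_3}m\eta^2}{\sqrt{1 - e^{-\eta/(2\kappa)}}}. \tag{A.13}
$$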

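Let $K\eta = T$, and note that $1 - e^{-\eta/(2\kappa)} \ge \eta/(4\kappa)$ when $0 < \eta/\kappa \le 1$. Therefore, we have

$$
W_2\big(P(z_K), \pi_z\big) \le w_K \le e^{-T/(2\kappa)}w_0 + 4\eta\kappa\big(2\sqrt{D_1} + \sqrt{D_2}\big) + 2\sqrt{\kappa D_3}\,m\eta^{3/2}. \tag{A.14}
$$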
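For completeness, the elementary step from (A.13) to (A.14) can be spelled out as follows. This is a short worked derivation using only $e^{-y} \le 1 - y + y^2/2$ for $y \ge 0$; the shorthand $y = \eta/(2\kappa)$ is introduced here for illustration.

```latex
% Write y = \eta/(2\kappa), so that 0 < y \le 1/2 whenever 0 < \eta/\kappa \le 1.
\begin{align*}
1 - e^{-y} &\ge y - \frac{y^2}{2} = y\Big(1 - \frac{y}{2}\Big) \ge \frac{3y}{4} \ge \frac{y}{2} = \frac{\eta}{4\kappa}, \\
\frac{(2\sqrt{D_1} + \sqrt{D_2})\eta^2}{1 - e^{-\eta/(2\kappa)}}
  &\le \frac{4\kappa}{\eta}\,(2\sqrt{D_1} + \sqrt{D_2})\eta^2
  = 4\eta\kappa\,(2\sqrt{D_1} + \sqrt{D_2}), \\
\frac{\sqrt{D_3}\,m\eta^2}{\sqrt{1 - e^{-\eta/(2\kappa)}}}
  &\le \sqrt{\frac{4\kappa}{\eta}}\,\sqrt{D_3}\,m\eta^2
  = 2\sqrt{\kappa D_3}\,m\eta^{3/2}.
\end{align*}
```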

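Moreover, note that

$$
\begin{aligned}
W_2\big(P(x_K), \pi\big) &= \mathbb{E}_{\Gamma(x_K, x^\pi)}[\|x_K - x^\pi\|_2^2] \\
&\le \mathbb{E}_{\Gamma(x_K, v_K, x^\pi, v^\pi)}[\|x_K - x^\pi\|_2^2 + \|x_K + v_K - x^\pi - v^\pi\|_2^2] \\
&= W_2\big(P(z_K), \pi_z\big).
\end{aligned}
$$

Substituting the above into (A.14) directly yields the claim in Theorem 4.3.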


A.2. Proof of Corollary 4.5

Now we present the calculation of gradient complexity of our algorithm.

We first present the following lemma, which characterizes the expectation $\mathbb{E}[\|x^\pi - x^*\|_2^2]$, where $x^* = \arg\min_x f(x)$ is the global minimizer of the function $f$.

Lemma A.7 (Proposition 1 in Durmus & Moulines (2016b)). Let $x^* = \arg\min_x f(x)$ denote the global minimizer of the function $f$, and let $x^\pi$ be a random vector with distribution $\pi \propto e^{-f(x)}$. Then the following holds:

$$
\mathbb{E}[\|x^\pi - x^*\|_2^2] \le \frac{d}{\mu}.
$$

Then we are going to prove Corollary 4.5.

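Proof of Corollary 4.5. We first let $w_0e^{-T/(2\kappa)} = \epsilon/3$, which implies that $T = 2\kappa\log(3w_0/\epsilon)$. Note that $x^*$ is the minimizer of $f$ and by assumption we have $\|x_0 - x^*\|_2 \le R$. Recalling the definition of $w_k$, we have

$$
w_0 = \mathbb{E}[\|x_0 - x^\pi\|_2^2] = \mathbb{E}[\|x_0 - x^* + x^* - x^\pi\|_2^2] \le 2\mathbb{E}[\|x^\pi - x^*\|_2^2] + 2\|x_0 - x^*\|_2^2 \le \frac{2d}{\mu} + 2R,
$$

where the last inequality comes from Lemma A.7. Then we obtain $T = \tilde{O}(\kappa)$, where the $\tilde{O}(\cdot)$ notation hides the logarithmic terms in $\epsilon$, $d$, $\mu$ and $R$. We then rewrite (A.14) as follows:

$$
W_2\big(P(z_K), \pi_z\big) \le e^{-T/(2\kappa)}w_0 + C_2\eta + C_3m\eta^{3/2}, \tag{A.15}
$$

where $C_2 = O(\kappa(d/\mu)^{1/2})$ and $C_3 = O\big((\kappa d/\mu)^{1/2}\big)$. We then let

$$
C_2\eta = \frac{\epsilon}{3}, \quad \text{and} \quad C_3m\eta^{3/2} = \frac{\epsilon}{3},
$$

and solve for $\eta$, which leads to

$$
\eta = \min\bigg\{\frac{\epsilon}{3C_2}, \frac{\epsilon^{2/3}}{(3C_3m)^{2/3}}\bigg\}.
$$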

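Thus, the total iteration number satisfies

$$
K = \frac{T}{\eta} \le \frac{3TC_2}{\epsilon} + \frac{T(3C_3m)^{2/3}}{\epsilon^{2/3}}.
$$

In terms of gradient complexity, we have

$$
T_g = K + n\Big(1 \vee \frac{K}{m}\Big) \le K + \frac{Kn}{m} + n.
$$

Substituting $C_2$, $C_3$ and $T$ into the above equation, and letting $m = n$, we obtain

$$
T_g \le 2K + n = \tilde{O}\bigg(\frac{\kappa^2(d/\mu)^{1/2}}{\epsilon} + \frac{\kappa^{4/3}(d/\mu)^{1/3}n^{2/3}}{\epsilon^{2/3}} + n\bigg). \tag{A.16}
$$

When $\mu$ and $L$ appear individually, they can be treated as constants. Thus we arrive at the result in Corollary 4.5.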
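The parameter choices in this proof can be collected into a small calculator. The sketch below takes the constants $C_2$, $C_3$ and the time horizon $T$ as given inputs (their actual values depend on problem constants that the proof only tracks up to order), and it rounds the iteration and epoch counts up to integers, which is an assumption of this illustration; the function name `svr_hmc_schedule` is hypothetical.

```python
import math


def svr_hmc_schedule(eps, C2, C3, T, n, m=None):
    """Step size, iteration count and gradient count following the proof of Corollary 4.5.

    eps: target accuracy; C2, C3: constants from (A.15); T: time horizon K * eta;
    n: number of component functions; m: epoch length (defaults to n).
    """
    m = n if m is None else m
    # eta = min{ eps / (3 C2), eps^(2/3) / (3 C3 m)^(2/3) }
    eta = min(eps / (3.0 * C2), (eps / (3.0 * C3 * m)) ** (2.0 / 3.0))
    K = math.ceil(T / eta)
    # One component gradient per iteration plus one full gradient (n components)
    # per epoch, with at least one epoch: T_g = K + n * (1 v K/m).
    T_g = K + n * max(1, math.ceil(K / m))
    return eta, K, T_g


eta, K, T_g = svr_hmc_schedule(eps=0.3, C2=1.0, C3=1.0, T=10.0, n=100)
```

With $m = n$ the epoch term contributes at most $K + n$ component gradients, which is where the bound $T_g \le 2K + n$ comes from.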

A.3. Proof of Theorem 4.8

In this section, we prove the convergence result of SVR-HMC for sampling from a general log-concave distribution. Note that a $\mu$-strongly log-concave distribution $\pi \propto e^{-f}$ must satisfy a logarithmic Sobolev inequality with constant $C_{LS} = 1/\mu$ (Raginsky et al., 2017). We first present the following two useful lemmas.

Lemma A.8. (Dalalyan, 2014) Let $f$ and $\bar{f}$ be two functions such that $f(x) \le \bar{f}(x)$ for all $x \in \mathbb{R}^d$, and suppose $e^{-f}$ and $e^{-\bar{f}}$ are both integrable. Then the Kullback-Leibler (KL) divergence between the distributions $\pi \propto e^{-f}$ and $\bar{\pi} \propto e^{-\bar{f}}$ satisfies

$$
\mathrm{KL}(\pi \| \bar{\pi}) \le \frac{1}{2}\int_{\mathbb{R}^d}\big(f(x) - \bar{f}(x)\big)^2\,d\pi(x).
$$

Lemma A.9. (Bakry et al., 2013) If a stationary distribution $\pi_1$ satisfies a logarithmic Sobolev inequality with constant $C_{LS}$, then for any probability measure $\pi_2$, it follows that

$$
W_2(\pi_1, \pi_2) \le \sqrt{2C_{LS}\,\mathrm{KL}(\pi_2 \| \pi_1)}.
$$

In what follows, we leverage the above two lemmas to analyze the convergence rate of SVR-HMC for sampling from general log-concave distributions. Based on Assumption 4.7, we have

$$
\int_{\mathbb{R}^d}\big(f(x) - \bar{f}(x)\big)^2\,d\pi(x) = \frac{\lambda^2}{4}\int_{\mathbb{R}^d}\|x\|_2^4\,d\pi(x) \le \frac{\lambda^2Ud^2}{4},
$$

where $U > 0$ is an absolute constant. Then, by Lemma A.8, we immediately have

$$
\mathrm{KL}(\pi \| \bar{\pi}) \le \frac{\lambda^2Ud^2}{8}.
$$

From Lemma A.9, the 2-Wasserstein distance $W_2(\bar{\pi}, \pi)$ is upper bounded by

$$
W_2(\bar{\pi}, \pi) \le \sqrt{2C_{LS}\,\mathrm{KL}(\pi \| \bar{\pi})} \le \frac{\sqrt{\lambda Ud^2}}{2}, \tag{A.17}
$$

where we use the fact that the probability measure $\bar{\pi}$ satisfies a logarithmic Sobolev inequality with constant $C_{LS} = 1/\lambda$ due to the strong convexity of $\bar{f}$. By the triangle inequality for the 2-Wasserstein distance, for any distribution $p$, we have $W_2(p, \pi) \le W_2(p, \bar{\pi}) + W_2(\bar{\pi}, \pi)$. Thus, we can run our algorithm on the distribution $\bar{\pi} \propto e^{-\bar{f}}$ and obtain an approximate sample $X$ which achieves the $\epsilon$-precision requirement in $W_2(P(X), \pi)$, as long as we ensure $W_2(P(X), \bar{\pi}) \le \epsilon/2$ and $W_2(\bar{\pi}, \pi) \le \epsilon/2$. According to (A.17), the requirement $W_2(\bar{\pi}, \pi) \le \epsilon/2$ suggests that the parameter $\lambda$ should be selected such that $\lambda \le \epsilon^2/(Ud^2) = O(\epsilon^2/d^2)$. Based on the above discussion, we are ready to prove Theorem 4.8 as follows.

Proof of Theorem 4.8. Note that we run Algorithm 1 on the approximate density $\bar{\pi} \propto e^{-\bar{f}}$, where $\bar{f}(x) = f(x) + \lambda\|x - x^*\|_2^2/2$ and $\lambda = O(\epsilon^2/d^2)$. It can be readily seen that the function $\bar{f}(x)$ is $(L + \lambda)$-smooth and $\lambda$-strongly convex. Thus, we can directly replace the parameter $\mu$ in (A.16) with $\lambda$, and obtain

$$
T_g = \tilde{O}\bigg(\frac{d^{1/2}}{\epsilon\lambda^{5/2}} + \frac{d^{1/3}n^{2/3}}{\epsilon^{2/3}\lambda^{5/3}} + n\bigg),
$$

where we treat the smoothness parameter $L + \lambda$ as a constant of order $O(1)$ when it appears individually. Plugging $\lambda = O(\epsilon^2/d^2)$ into the above equation, we have

$$
T_g = \tilde{O}\bigg(n + \frac{d^{11/2}}{\epsilon^6} + \frac{d^{11/3}n^{2/3}}{\epsilon^4}\bigg),
$$

which completes the proof.
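The exponent bookkeeping in the last substitution is easy to get wrong by hand, so the following sketch redoes it with exact rational arithmetic. Each term is represented only by its exponents of $(d, \epsilon, \lambda, n)$; this tuple encoding and the helper name `substitute_lambda` are illustrative assumptions.

```python
from fractions import Fraction as F


def substitute_lambda(term):
    """Replace lam by eps^2 / d^2 in a monomial d^a * eps^b * lam^c * n^e.

    term: exponents (a, b, c, e); returns the exponents of (d, eps, n).
    Since lam^c = eps^(2c) * d^(-2c), the d exponent drops by 2c and the
    eps exponent rises by 2c.
    """
    a, b, c, e = term
    return (a - 2 * c, b + 2 * c, e)


# d^(1/2) / (eps * lam^(5/2))
first = substitute_lambda((F(1, 2), F(-1), F(-5, 2), F(0)))
# d^(1/3) * n^(2/3) / (eps^(2/3) * lam^(5/3))
second = substitute_lambda((F(1, 3), F(-2, 3), F(-5, 3), F(2, 3)))
```

The two results correspond to the terms $d^{11/2}/\epsilon^6$ and $d^{11/3}n^{2/3}/\epsilon^4$ in the final bound.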

B. Proof of Technical Lemmas

In this section, we prove the technical lemmas used in the proof of our main theorems. We first present some useful lemmas that will be used in our analysis.

Lemma B.1. Under Assumptions 4.1 and 4.2, the solution of the Hamiltonian Langevin dynamics in (A.2)-(A.3) satisfies

$$
\begin{aligned}
\mathbb{E}[\|V_t\|_2^2] &\le 2u\big[f(X_0) - f(x^*) + \gamma dt\big] + \|V_0\|_2^2, \\
\mathbb{E}[f(X_t)] &\le f(X_0) + \frac{\|V_0\|_2^2}{2u} + \gamma dt, \\
\mathbb{E}[\|\nabla f(X_t)\|_2^2] &\le 2L\bigg(f(X_0) - f(x^*) + \frac{\|V_0\|_2^2}{2u} + \gamma dt\bigg),
\end{aligned}
$$

where $x^* = \arg\min_x f(x)$ denotes the global minimizer of the function $f(x)$.


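Lemma B.2. Consider the iterates $x_k$ and $z_k$ in Algorithm 1. Let $u = 1/L$ and $\gamma = 2$, choose $\eta = O\big(1/\kappa \wedge 1/(\kappa^{1/3}m^{2/3})\big)$, and assume that $\eta^2 \le \log(2)/(36\kappa K)$. Then we have the following uniform upper bounds on $\mathbb{E}[\|x_k - x^*\|_2^2]$, $\mathbb{E}[f(x_k)] - f(x^*)$ and $\mathbb{E}[\|v_k\|_2^2]$:

$$
\begin{aligned}
\mathbb{E}[\|x_k - x^*\|_2^2] &\le \frac{42d}{\mu} + 24\|x^*\|_2^2 + \frac{8d}{L} \triangleq U_x, \\
\mathbb{E}[f(x_k)] - f(x^*) &\le 21d\kappa + 12L\|x^*\|_2^2 + 4d \triangleq U_f, \\
\mathbb{E}[\|v_k\|_2^2] &\le \frac{80d}{\mu} + 48\|x^*\|_2^2 + \frac{18d}{L} \triangleq U_v.
\end{aligned}
$$

Moreover, it can be seen that $U_x$ and $U_v$ are both of order $O(d/\mu)$, and $U_f$ is of order $O(d\kappa)$.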

B.1. Proof of Lemma A.2

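Proof of Lemma A.2. In the discrete update (3.1), the added Gaussian noises $\epsilon_k^x$ and $\epsilon_k^v$ have mean 0 and satisfy

$$
\begin{aligned}
\mathbb{E}[\epsilon_k^v(\epsilon_k^v)^\top] &= u(1 - e^{-2\gamma\eta}) \cdot I_{d\times d}, \\
\mathbb{E}[\epsilon_k^x(\epsilon_k^x)^\top] &= \frac{u}{\gamma^2}\big(2\gamma\eta + 4e^{-\gamma\eta} - e^{-2\gamma\eta} - 3\big) \cdot I_{d\times d}, \\
\mathbb{E}[\epsilon_k^v(\epsilon_k^x)^\top] &= \frac{u}{\gamma}\big(1 - 2e^{-\gamma\eta} + e^{-2\gamma\eta}\big) \cdot I_{d\times d},
\end{aligned} \tag{B.1}
$$

which are identical to those of the Brownian motions in the Langevin diffusion (A.2) and (A.3) with time $t = \eta$ by Lemma A.1. Note that when $0 < x < 1$, we have $1 - x \le \exp(-x) \le 1 - x + x^2/2$. Thus, assuming $2\gamma\eta \le 1$, we obtain

$$
\mathbb{E}[\|\epsilon_k^v\|_2^2] \le 2\gamma u\eta d, \quad \mathbb{E}[\|\epsilon_k^x\|_2^2] \le 2u\eta^2d, \quad \text{and} \quad \mathbb{E}[\langle\epsilon_k^v, \epsilon_k^x\rangle] \le 2\gamma u\eta^2d. \tag{B.2}
$$

Therefore, we are able to apply the synchronous coupling argument, i.e., to consider shared Brownian terms in both the discrete update (3.1) and the Langevin diffusion.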
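The three bounds in (B.2) can be checked numerically against the exact traces implied by (B.1). The sketch below does this for a given $(\gamma, u, \eta)$ with $2\gamma\eta \le 1$; the function name and the default $d = 1$ are illustrative assumptions.

```python
import math


def b2_bounds_hold(gamma, u, eta, d=1):
    """Compare the exact traces from (B.1) with the upper bounds in (B.2)."""
    x = gamma * eta
    tr_v = d * u * (1.0 - math.exp(-2.0 * x))
    tr_x = d * u / gamma ** 2 * (2.0 * x + 4.0 * math.exp(-x) - math.exp(-2.0 * x) - 3.0)
    tr_vx = d * u / gamma * (1.0 - math.exp(-x)) ** 2
    return (tr_v <= 2.0 * gamma * u * eta * d
            and tr_x <= 2.0 * u * eta ** 2 * d
            and tr_vx <= 2.0 * gamma * u * eta ** 2 * d)


# Step sizes satisfying 2 * gamma * eta <= 1 for gamma = 2.
checks = [b2_bounds_hold(2.0, 0.5, eta, d=10) for eta in (0.001, 0.01, 0.1, 0.25)]
```

For small $\gamma\eta$ the position-noise trace behaves like $\tfrac{2}{3}u\gamma\eta^3 d$, so the bound $2u\eta^2d$ is loose by a factor of order $\gamma\eta$.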

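First, we bound the discretization error in the velocity variable $v$. Let $X_0 = x_k$, $V_0 = v_k$, $X_s = \mathcal{L}_s x_k$ and $V_s = \mathcal{L}_s v_k$. Based on (3.1) and (A.2), we have

$$
\begin{aligned}
\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2] &= \mathbb{E}\bigg[\bigg\|v_k(1 - \gamma\eta - e^{-\gamma\eta}) + \frac{u}{\gamma}(1 - \gamma\eta - e^{-\gamma\eta})\nabla f(x_k) \\
&\qquad\quad + u\int_0^\eta e^{-\gamma(\eta-s)}[\nabla f(X_s) - \nabla f(X_0)]\,ds\bigg\|_2^2\bigg] \\
&\le 3\mathbb{E}\bigg[\frac{\gamma^4\eta^4}{4}\|v_k\|_2^2 + \frac{u^2\gamma^2\eta^4}{4}\|\nabla f(x_k)\|_2^2 + u^2\bigg\|\int_0^\eta e^{-\gamma(\eta-s)}[\nabla f(X_s) - \nabla f(X_0)]\,ds\bigg\|_2^2\bigg],
\end{aligned} \tag{B.3}
$$

where the inequality follows from the facts that $|1 - x - e^{-x}| \le x^2/2$ when $0 \le x \le 1$ and $\|x + y + z\|_2^2 \le 3(\|x\|_2^2 + \|y\|_2^2 + \|z\|_2^2)$. For the third term on the R.H.S. of (B.3), we have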

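$$
\begin{aligned}
\mathbb{E}\bigg[\bigg\|\int_0^\eta e^{-\gamma(\eta-s)}[\nabla f(X_s) - \nabla f(X_0)]\,ds\bigg\|_2^2\bigg] &\le \eta\,\mathbb{E}\bigg[\int_0^\eta \big\|e^{-\gamma(\eta-s)}[\nabla f(X_s) - \nabla f(X_0)]\big\|_2^2\,ds\bigg] \\
&\le \eta\,\mathbb{E}\bigg[\int_0^\eta \big\|\nabla f(X_s) - \nabla f(X_0)\big\|_2^2\,ds\bigg] \\
&\le \eta L^2\bigg[\int_0^\eta \mathbb{E}\big\|X_s - X_0\big\|_2^2\,ds\bigg],
\end{aligned}
$$

where the first inequality follows from the inequality $\|\int_0^t x(s)\,ds\|_2^2 \le t\int_0^t \|x(s)\|_2^2\,ds$, the second inequality is due to $\exp(-x) \le 1$ for $x \ge 0$, and the last inequality follows from Assumption 4.1. Noting that $dX_s = V_s\,ds$, we further have

$$
\eta L^2\bigg[\int_0^\eta \mathbb{E}\big\|X_s - X_0\big\|_2^2\,ds\bigg] = \eta L^2\bigg[\int_0^\eta \mathbb{E}\bigg\|\int_0^s V_r\,dr\bigg\|_2^2\,ds\bigg] \le \eta L^2\bigg[\int_0^\eta s\int_0^s \mathbb{E}\|V_r\|_2^2\,dr\,ds\bigg],
$$

where the last inequality is due to the fact that $\|\int_0^t x(s)\,ds\|_2^2 \le t\int_0^t \|x(s)\|_2^2\,ds$. By Lemma B.1, we know that

$$
\mathbb{E}[\|V_r\|_2^2] \le 2u\,\mathbb{E}[f(X_0) - f(x^*) + 2\eta\gamma d] + \mathbb{E}[\|V_0\|_2^2]
$$

for $r \le \eta$. Thus, it follows that

$$
\eta L^2\bigg[\int_0^\eta s\int_0^s \mathbb{E}\|V_r\|_2^2\,dr\,ds\bigg] \le \frac{\eta^4L^2\big[2u\,\mathbb{E}[f(X_0) - f(x^*) + 2\eta\gamma d] + \mathbb{E}[\|V_0\|_2^2]\big]}{3}.
$$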

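Substituting the above into (B.3), we obtain

$$
\begin{aligned}
\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2] &\le \eta^4\,\mathbb{E}\bigg[\frac{3\gamma^4}{4}\|V_0\|_2^2 + \frac{3u^2\gamma^2}{4}\|\nabla f(X_0)\|_2^2 + u^2L^2\big[2u[f(X_0) - f(x^*) + 2\eta\gamma d] + \|V_0\|_2^2\big]\bigg] \\
&\le \eta^4\bigg[\bigg(\frac{3\gamma^4}{4} + u^2L^2\bigg)\mathbb{E}[\|v_k\|_2^2] + \bigg(\frac{3u^2\gamma^2L}{2} + 2u^3L^2\bigg)\mathbb{E}\big[f(x_k) - f(x^*)\big] + 4u^3L^2\eta\gamma d\bigg],
\end{aligned} \tag{B.4}
$$

where the second inequality follows from the facts that $V_0 = v_k$, $X_0 = x_k$ and $\|\nabla f(x)\|_2^2 \le 2L\big(f(x) - f(x^*)\big)$. Next, we bound the discretization error in the position variable $x$. Note that the randomness of $X_\eta$ comes from the Brownian term in the velocity, and can also be regarded as an additive Gaussian noise, i.e., $\sqrt{2\gamma u}\int_0^\eta dt\int_0^t e^{-\gamma(t-s)}\,dB_s$. We use the identical random variable in the discrete update (3.1), which implies that the coupling technique can still be used when computing the discretization error in $x_k$. Let $\widetilde{V}_t = V_0e^{-\gamma t} - u\int_0^t e^{-\gamma(t-s)}\nabla f(X_s)\,ds$ denote the velocity in (A.2) without its Brownian term. We have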

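$$
\begin{aligned}
\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2] &= \mathbb{E}\bigg[\bigg\|\int_0^\eta \big(V_0 - \widetilde{V}_t\big)\,dt\bigg\|_2^2\bigg] \le \eta\int_0^\eta \mathbb{E}[\|V_0 - \widetilde{V}_t\|_2^2]\,dt \\
&= \eta\int_0^\eta \mathbb{E}\bigg[\bigg\|V_0(1 - e^{-\gamma t}) + u\int_0^t e^{-\gamma(t-s)}\nabla f(X_s)\,ds\bigg\|_2^2\bigg]\,dt \\
&\le \eta\int_0^\eta \bigg\{2\gamma^2t^2\,\mathbb{E}[\|V_0\|_2^2] + 2u^2\,\mathbb{E}\bigg[\bigg\|\int_0^t e^{-\gamma(t-s)}\nabla f(X_s)\,ds\bigg\|_2^2\bigg]\bigg\}\,dt \\
&\le \frac{2\gamma^2\eta^4}{3}\mathbb{E}[\|V_0\|_2^2] + 2u^2\eta\int_0^\eta t\int_0^t \mathbb{E}[\|\nabla f(X_s)\|_2^2]\,ds\,dt,
\end{aligned}
$$

where $\widetilde{V}_t = V_0e^{-\gamma t} - u\int_0^t e^{-\gamma(t-s)}\nabla f(X_s)\,ds$.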

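From Lemma B.1, it can be seen that

$$
\mathbb{E}\|\nabla f(X_s)\|_2^2 \le 2L\bigg(\mathbb{E}[f(X_0) - f(x^*)] + \frac{\mathbb{E}[\|v_k\|_2^2]}{2u} + 2\gamma d\eta\bigg)
$$

for any $s \le \eta$, thus we have

$$
2u^2\eta\int_0^\eta t\int_0^t \mathbb{E}[\|\nabla f(X_s)\|_2^2]\,ds\,dt \le \frac{4u^2L\eta^4}{3}\bigg(\mathbb{E}[f(X_0) - f(x^*)] + \frac{\mathbb{E}[\|v_k\|_2^2]}{2u} + 2\gamma d\eta\bigg).
$$

Then, replacing $V_0$ and $X_0$ by $v_k$ and $x_k$ respectively, the discretization error in $x_k$ is bounded by

$$
\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2] \le \eta^4\bigg[\bigg(\frac{2\gamma^2 + 2uL}{3}\bigg)\mathbb{E}[\|v_k\|_2^2] + \frac{4u^2L}{3}\mathbb{E}[f(x_k) - f(x^*)] + \frac{8u^2L\gamma d\eta}{3}\bigg]. \tag{B.5}
$$

Finally, by Lemma B.2, we have uniform bounds $U_v$ and $U_f$ on $\mathbb{E}[\|v_k\|_2^2]$ and $\mathbb{E}[f(x_k)] - f(x^*)$. Substituting these bounds into (B.4) and (B.5) completes the proof.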



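Proof of Lemma A.3. Note that the update for $x_k$ does not contain the gradient term, which implies $\mathcal{S}_\eta x_k = \mathcal{G}_\eta x_k$ and $\mathbb{E}[\|\mathcal{S}_\eta x_k - \mathcal{G}_\eta x_k\|_2^2] = 0$. In the sequel, we mainly consider the velocity variable. Applying the coupling argument, it can be directly observed that

$$
\begin{aligned}
\mathbb{E}[\|\mathcal{S}_\eta v_k - \mathcal{G}_\eta v_k\|_2^2] &= \eta^2u^2\,\mathbb{E}\Big[\big\|\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) - \big(\nabla f(x_k) - \nabla f(\tilde{x}_j)\big)\big\|_2^2\Big] \\
&\le \eta^2u^2\,\mathbb{E}\Big[\big\|\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j)\big\|_2^2\Big] \\
&\le \eta^2u^2L^2\,\mathbb{E}\Big[\big\|x_k - \tilde{x}_j\big\|_2^2\Big],
\end{aligned} \tag{B.6}
$$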

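where the first inequality follows from the fact that $\mathbb{E}[\|x - \mathbb{E}[x]\|_2^2] \le \mathbb{E}[\|x\|_2^2]$, and the second inequality follows from Assumption 4.1. Note that by (3.1), we have

$$
\begin{aligned}
\mathbb{E}[\|x_k - \tilde{x}_j\|_2^2] &= \mathbb{E}\bigg[\bigg\|\sum_{r=jm}^{jm+l-1}(\eta v_r + \epsilon_r^x)\bigg\|_2^2\bigg] \\
&\le 2\mathbb{E}\bigg[\bigg\|\sum_{r=jm}^{jm+l-1}\eta v_r\bigg\|_2^2\bigg] + 2\mathbb{E}\bigg[\bigg\|\sum_{r=jm}^{jm+l-1}\epsilon_r^x\bigg\|_2^2\bigg] \\
&= 2\eta^2\,\mathbb{E}\bigg[\bigg\|\sum_{r=jm}^{jm+l-1}v_r\bigg\|_2^2\bigg] + 2\sum_{r=jm}^{jm+l-1}\mathbb{E}[\|\epsilon_r^x\|_2^2] \\
&\le 2l\eta^2\sum_{r=jm}^{jm+l-1}\mathbb{E}[\|v_r\|_2^2] + 4lu\eta^2d,
\end{aligned} \tag{B.7}
$$

where the first inequality is due to $(a + b)^2 \le 2(a^2 + b^2)$, the second equality is due to the independence among the Gaussian random variables $\epsilon_r^x$, and the last inequality is due to $\big(\sum_{i=1}^n a_i\big)^2 \le n\sum_{i=1}^n a_i^2$ and (B.2). Let $U_v$ denote the uniform upper bound of $\mathbb{E}[\|v_k\|_2^2]$ for all $0 \le k \le K$. Then (B.7) can be further relaxed as follows:

$$
\mathbb{E}\|x_k - \tilde{x}_j\|_2^2 \le 2\eta^2(l^2U_v + 2lud) \le 2\eta^2(m^2U_v + 2mud).
$$

Since $m \le m^2$, we complete the proof by substituting the above inequality into (B.6) and setting $D_3 = 2(U_v + 2ud)$, i.e.,

$$
\mathbb{E}[\|\mathcal{S}_\eta v_k - \mathcal{G}_\eta v_k\|_2^2] \le 2\eta^4u^2L^2m^2(U_v + 2ud) \triangleq D_3u^2L^2m^2\eta^4.
$$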


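Proof of Lemma A.6. Note that for random vectors $X$ and $Y$, we have

$$
\big(\mathbb{E}[\langle X, Y\rangle]\big)^2 = \bigg(\sum_{i=1}^d \mathbb{E}X_iY_i\bigg)^2 \le \bigg(\sum_{i=1}^d (\mathbb{E}X_i^2)^{1/2}(\mathbb{E}Y_i^2)^{1/2}\bigg)^2 \le \bigg(\sum_{i=1}^d \mathbb{E}X_i^2\bigg)\bigg(\sum_{i=1}^d \mathbb{E}Y_i^2\bigg) = \mathbb{E}[\|X\|_2^2]\,\mathbb{E}[\|Y\|_2^2],
$$

where the first and second inequalities follow from Hölder's inequality and the Cauchy-Schwarz inequality, respectively. Thus, it follows that

$$
\begin{aligned}
\mathbb{E}[\|X + Y\|_2^2] &= \mathbb{E}[\|X\|_2^2 + \|Y\|_2^2 + 2\langle X, Y\rangle] \\
&\le \mathbb{E}[\|X\|_2^2] + \mathbb{E}[\|Y\|_2^2] + 2\sqrt{\mathbb{E}[\|X\|_2^2]\,\mathbb{E}[\|Y\|_2^2]} = \Big(\sqrt{\mathbb{E}[\|X\|_2^2]} + \sqrt{\mathbb{E}[\|Y\|_2^2]}\Big)^2,
\end{aligned} \tag{B.8}
$$

which completes the proof.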
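Inequality (B.8) holds for any joint distribution of $X$ and $Y$, including the empirical distribution of a finite sample, so it can be checked exactly on simulated data. The following sketch does this for a deliberately correlated pair; the sample size, dimension and correlation structure are arbitrary choices for illustration.

```python
import math
import random


def l2_norm_in_expectation(samples):
    """sqrt(E ||X||_2^2) under the empirical (uniform) distribution on samples."""
    return math.sqrt(sum(sum(c * c for c in s) for s in samples) / len(samples))


rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(500)]
# Y is correlated with X, so the cross term E<X, Y> is not zero.
Y = [[0.7 * xi + rng.gauss(0.0, 2.0) for xi in x] for x in X]
S = [[a + b for a, b in zip(x, y)] for x, y in zip(X, Y)]

lhs = l2_norm_in_expectation(S) ** 2
rhs = (l2_norm_in_expectation(X) + l2_norm_in_expectation(Y)) ** 2
```

Because the empirical measure is itself a probability distribution, `lhs <= rhs` holds up to floating-point rounding rather than only approximately.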


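C. Proof of Auxiliary Lemmas

In this section, we prove extra lemmas used in our proof.

C.1. Proof of Lemma B.1

Proof. We consider the Lyapunov function $\mathcal{E}_t = \mathbb{E}[f(X_t) + \|V_t\|_2^2/(2u)]$, which corresponds to the expected total energy of the dynamical system. By Ito's lemma, we have

$$
\begin{aligned}
\frac{d\mathcal{E}_t}{dt} &= \frac{1}{dt}\big[\mathbb{E}\langle\nabla_{V_t}\mathcal{E}_t, dV_t\rangle + \mathbb{E}\langle\nabla_{X_t}\mathcal{E}_t, dX_t\rangle\big] + \gamma u\,\mathbb{E}\langle\nabla_{V_t}^2\mathcal{E}_t, I\rangle \\
&= \frac{1}{u}\mathbb{E}[-\gamma\|V_t\|_2^2 - u\langle V_t, \nabla f(X_t)\rangle] + \mathbb{E}[\langle\nabla f(X_t), V_t\rangle] + \gamma d \\
&= \gamma d - \frac{\gamma}{u}\mathbb{E}[\|V_t\|_2^2] \\
&\le \gamma d,
\end{aligned}
$$

where in the third equality we use the martingale property of $dB_t$. Thus, we have $\mathcal{E}_t \le t\gamma d + \mathcal{E}_0$. Adding the term $-f(x^*)$ to both sides, we have

$$
\mathbb{E}[f(X_t) - f(x^*)] + \mathbb{E}[\|V_t\|_2^2]/(2u) \le t\gamma d + f(X_0) - f(x^*) + \|V_0\|_2^2/(2u). \tag{C.1}
$$

Note that both terms $\mathbb{E}[\|V_t\|_2^2]$ and $\mathbb{E}[f(X_t) - f(x^*)]$ are non-negative, which immediately implies that

$$
\begin{aligned}
\mathbb{E}[\|V_t\|_2^2] &\le 2u\big[f(X_0) - f(x^*) + \gamma dt\big] + \|V_0\|_2^2, \\
\mathbb{E}[f(X_t)] &\le f(X_0) + \frac{\|V_0\|_2^2}{2u} + \gamma dt.
\end{aligned}
$$

Moreover, note that $x^* = \arg\min_x f(x)$ and thus

$$
\begin{aligned}
f(x^*) - f(x) &\le f(x - \nabla f(x)/L) - f(x) \\
&\le \langle\nabla f(x), -\nabla f(x)/L\rangle + \|\nabla f(x)\|_2^2/(2L) \\
&= -\|\nabla f(x)\|_2^2/(2L),
\end{aligned}
$$

where the second inequality is due to Assumption 4.1, which further implies that

$$
\mathbb{E}[\|\nabla f(X_t)\|_2^2] \le 2L\,\mathbb{E}[f(X_t) - f(x^*)].
$$

By (C.1) we have

$$
\mathbb{E}[f(X_t) - f(x^*)] \le t\gamma d + f(X_0) - f(x^*) + \|V_0\|_2^2/(2u), \tag{C.2}
$$

which further indicates

$$
\mathbb{E}[\|\nabla f(X_t)\|_2^2] \le 2L\bigg(f(X_0) - f(x^*) + \frac{\|V_0\|_2^2}{2u} + \gamma dt\bigg).
$$

Thus, we complete the proof.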

C.2. Proof of Lemma B.2

To prove Lemma B.2, we need the following lemma.

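Lemma C.1. Under Assumptions 4.1 and 4.2, when $\eta \le 1/(2\gamma)$, the expectations $\mathbb{E}[f(x_k)]$ and $\mathbb{E}[\|v_k\|_2^2]$ are upper bounded as follows:

$$
\begin{aligned}
\mathbb{E}[f(x_k)] &\le \frac{1}{1 - \gamma\eta}\bigg[e^{G_1T\eta}\mathcal{E}_0 + T(\gamma d + G_0\eta) + \frac{1}{2}T^2\eta G_1e^{G_1T\eta}(\gamma d + \eta G_0)\bigg], \\
\mathbb{E}[\|v_k\|_2^2] &\le 2u\bigg[e^{G_1T\eta}\mathcal{E}_0 + T(\gamma d + G_0\eta) + \frac{1}{2}T^2\eta G_1e^{G_1T\eta}(\gamma d + \eta G_0) - f(x^*)\bigg],
\end{aligned}
$$

where $T = K\eta$ denotes the length of time, $G_0 = 6u\,\mathbb{E}[\|\nabla f_i(x^*)\|] + 2L\gamma ud - 18uL\kappa f(x^*)$, and $G_1 = 36uL\kappa$.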


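Proof of Lemma B.2. We first prove the upper bound for $\mathbb{E}[\|x_k - x^*\|_2^2]$. Applying the triangle inequality yields

$$
\mathbb{E}[\|x_k - x^*\|_2^2] \le 2\mathbb{E}[\|x_k - x^\pi\|_2^2] + 2\mathbb{E}[\|x^\pi - x^*\|_2^2] \le 2w_k^2 + \frac{2d}{\mu}, \tag{C.3}
$$

where the second inequality comes from Lemma A.7 and $w_k = \big(\mathbb{E}[\|x_k - x^\pi\|_2^2] + \mathbb{E}[\|x_k + v_k - x^\pi - v^\pi\|_2^2]\big)^{1/2}$. According to (A.6), (A.7), (A.9) and (A.10), we have

$$
w_{k+1}^2 \le \Big(e^{-\eta/(2\kappa)}w_k + 2\sqrt{\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2]} + \sqrt{\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2]}\Big)^2 + \mathbb{E}[\|\mathcal{S}_\eta v_k - \mathcal{G}_\eta v_k\|_2^2].
$$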

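By (B.4), (B.5), (B.6) and (B.7), we have

$$
\begin{aligned}
\mathbb{E}[\|\mathcal{G}_\eta x_k - \mathcal{L}_\eta x_k\|_2^2] &\le \eta^4\bigg[\bigg(\frac{2\gamma^2 + 2uL}{3}\bigg)U_v + \frac{4u^2L}{3}U_f + \frac{8u^2L\gamma d\eta}{3}\bigg] = D_1\eta^4, \\
\mathbb{E}[\|\mathcal{G}_\eta v_k - \mathcal{L}_\eta v_k\|_2^2] &\le \eta^4\bigg[\bigg(\frac{3\gamma^4}{4} + u^2L^2\bigg)U_v + \bigg(\frac{3u^2\gamma^2L}{2} + 4u^3L^2\bigg)U_f + 4u^3L^2\eta\gamma d\bigg] = D_2\eta^4, \\
\mathbb{E}[\|\mathcal{S}_\eta v_k - \mathcal{G}_\eta v_k\|_2^2] &\le 2(U_v + 2ud)m^2u^2L^2\eta^4 = D_3m^2u^2L^2\eta^4,
\end{aligned} \tag{C.4}
$$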

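where $U_v$ and $U_f$ denote any uniform upper bounds for $\mathbb{E}[\|v_k\|_2^2]$ and $\mathbb{E}[f(x_k) - f(x^*)]$, respectively. Applying Lemma A.5 yields

$$
\begin{aligned}
w_k &\le e^{-k\eta/(2\kappa)}w_0 + \frac{2\sqrt{D_1}\eta^2 + \sqrt{D_2}\eta^2}{1 - e^{-\eta/(2\kappa)}} + \frac{\sqrt{D_3}m\eta^2}{\sqrt{1 - e^{-\eta/(2\kappa)}}} \\
&\le w_0 + 4\eta\kappa\big(2\sqrt{D_1} + \sqrt{D_2}\big) + 2\sqrt{\kappa D_3}\,m\eta^{3/2},
\end{aligned} \tag{C.5}
$$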

where we use the fact that e−kη/(2κ) < 1 and 1 − e−η/(2κ) ≥ η/(4κ) when 0 < η/κ ≤ 1. It is then left to show theorder of D1, D2 and D3. To this end, we need to find uniform upper bounds for E[‖vk‖22] and E[f(xk)− f(x∗)] by (C.4),namely, we need to find the order of Uv and Uf . In the following, we will show this by applying Lemma C.1. Denote T asT = kη and consider sufficiently small η such that G1Tη ≤ log(2), G0η ≤ γd and γη ≤ 1/2, by Lemma C.1 we obtainthe following upper bounds for E[‖vk‖22] and E[f(xk)]− f(x∗)

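\[
\begin{aligned}
\mathbb{E}[f(x_k)] - f(x^*) &\le 2\big(2\mathcal{E}_0 + 2T\gamma d + 2\log(2)\,T\gamma d\big) + |f(x^*)| \le 4(\mathcal{E}_0 + 2T\gamma d) + |f(x^*)| = U_f, \\
\mathbb{E}[\|v_k\|_2^2] &\le 2u\big(2\mathcal{E}_0 + 4T\gamma d + |f(x^*)|\big) \le uU_f = U_v.
\end{aligned}
\]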

In addition, since $u = 1/L$ and $\gamma = 2$, we can write $U_f = O(Td) = O(\kappa d\log(1/\epsilon))$ and $U_v = O(d\log(1/\epsilon)/\mu)$. Recalling the definitions of $D_1$, $D_2$ and $D_3$ in (C.4), for sufficiently small $\eta < 1/(2\gamma) = 1/4$ we have

\[
\begin{aligned}
D_1 &\le 10U_v/3 + 4uU_f/3 + 4ud/3 \le 5U_v + 2ud, \\
D_2 &\le 13U_v + 10uU_f + 2ud = 23U_v + 2ud, \\
D_3 &\le 2U_v + 4ud.
\end{aligned} \tag{C.6}
\]
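As a sanity check on the constants in (C.6), the following sketch evaluates the coefficients of $U_v$, $U_f$ and $d$ in $D_1$ and $D_2$ with exact rational arithmetic. It assumes the reading $(2\gamma^2 + 2uL)/3$ for the first coefficient in (C.4), the setting $u = 1/L$, $\gamma = 2$ from the text, and the boundary step size $\eta = 1/4$. The value $L = 5$ and the helper name `d_coefficients` are illustrative choices, not part of the paper.

```python
from fractions import Fraction as F

def d_coefficients(gamma, u, L, eta):
    """Coefficients of (U_v, U_f, d) in D1 and D2, read off from (C.4)."""
    d1 = ((2 * gamma**2 + 2 * u * L) / 3,
          4 * u**2 * L / 3,
          8 * u**2 * L * gamma * eta / 3)
    d2 = (3 * gamma**4 / 4 + u**2 * L**2,
          3 * u**2 * gamma**2 * L / 2 + 4 * u**3 * L**2,
          4 * u**3 * L**2 * eta * gamma)
    return d1, d2

L = F(5)          # illustrative smoothness constant (assumption)
u = 1 / L         # u = 1/L as in the text
gamma = F(2)      # gamma = 2 as in the text
eta = F(1, 4)     # boundary of the condition eta < 1/(2*gamma)

d1, d2 = d_coefficients(gamma, u, L, eta)
print("D1 coefficients (U_v, U_f, d):", d1)
print("D2 coefficients (U_v, U_f, d):", d2)
```

With $U_v = uU_f$, the $U_v$ and $U_f$ contributions to $D_1$ combine to $(10/3 + 4/3)U_v \le 5U_v$, and those to $D_2$ combine to $(13 + 10)U_v = 23U_v$, which matches the right-hand sides of (C.6).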

We choose the step size $\eta$ in (C.5) such that $4\eta\kappa\big(2\sqrt{D_1} + \sqrt{D_2}\big) \le \sqrt{d/\mu}$ and $2\sqrt{\kappa D_3}\,m\eta^{3/2} \le \sqrt{d/\mu}$. To this end, we let

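\[
\eta \le \min\Bigg\{\frac{1}{4\kappa\big(2\sqrt{D_1\mu/d} + \sqrt{D_2\mu/d}\big)},\ \Big(\frac{1}{2m\sqrt{\kappa D_3\mu/d}}\Big)^{2/3}\Bigg\} = O\big(1/\kappa \wedge 1/(\kappa^{1/3}m^{2/3})\big),
\]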

where the last equality follows from (C.6). Then by (C.5) we have

\[
w_k^2 \le \big(w_0 + 2\sqrt{d/\mu}\big)^2 \le 2w_0^2 + \frac{8d}{\mu}.
\]
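The step-size choice can be illustrated numerically. The sketch below solves the two constraints imposed on (C.5) for $\eta$ (the second one, $2\sqrt{\kappa D_3}\,m\eta^{3/2} \le \sqrt{d/\mu}$, gives an exponent of $2/3$ when solved for $\eta$) and then checks the resulting bound on $w_k^2$. The function name `step_size_bound` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def step_size_bound(D1, D2, D3, kappa, m, d, mu):
    """Largest eta satisfying both step-size constraints imposed on (C.5)."""
    r = math.sqrt(d / mu)
    # Constraint 1: 4*eta*kappa*(2*sqrt(D1) + sqrt(D2)) <= sqrt(d/mu)
    eta_drift = r / (4 * kappa * (2 * math.sqrt(D1) + math.sqrt(D2)))
    # Constraint 2: 2*sqrt(kappa*D3)*m*eta**1.5 <= sqrt(d/mu)
    eta_var = (r / (2 * m * math.sqrt(kappa * D3))) ** (2 / 3)
    return min(eta_drift, eta_var)

# Illustrative values only (assumptions, not taken from the paper).
D1, D2, D3 = 50.0, 200.0, 30.0
kappa, m, d, mu = 10.0, 100, 20, 0.5
w0 = 3.0

eta = step_size_bound(D1, D2, D3, kappa, m, d, mu)
r = math.sqrt(d / mu)
drift_term = 4 * eta * kappa * (2 * math.sqrt(D1) + math.sqrt(D2))
variance_term = 2 * math.sqrt(kappa * D3) * m * eta ** 1.5
wk_upper = w0 + drift_term + variance_term   # right-hand side of (C.5)
print(eta, drift_term, variance_term, wk_upper)
```

Since both terms are at most $\sqrt{d/\mu}$, the value `wk_upper` is at most $w_0 + 2\sqrt{d/\mu}$, and squaring with $(a + b)^2 \le 2a^2 + 2b^2$ gives the displayed bound $2w_0^2 + 8d/\mu$.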


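Now we deal with $w_0$. Note that $x_0 = 0$ and $v_0 = 0$. By the definition of $w_k$, we have

\[
\begin{aligned}
w_0^2 &= \mathbb{E}\big[\|x^\pi - x_0\|_2^2 + \|x^\pi + v^\pi - x_0 - v_0\|_2^2\big] \\
&\le 3\mathbb{E}[\|x^\pi\|_2^2] + 2\mathbb{E}[\|v^\pi\|_2^2] \\
&\le \frac{6d}{\mu} + 6\|x^*\|_2^2 + \frac{2d}{L},
\end{aligned}
\]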

where the first inequality comes from the triangle inequality, and in the second inequality we use the facts that $\mathbb{E}[\|x^\pi\|_2^2] \le 2\mathbb{E}[\|x^\pi - x^*\|_2^2] + 2\|x^*\|_2^2$, $\mathbb{E}[\|v^\pi\|_2^2] = (2\pi u)^{-d/2}\int_{\mathbb{R}^d}\|v\|_2^2\exp\big(-\|v\|_2^2/(2u)\big)\,dv = ud = d/L$, and $\mathbb{E}[\|x^\pi - x^*\|_2^2] \le d/\mu$ by Lemma A.7. Applying (C.3) we further have

\[
\mathbb{E}[\|x_k - x^*\|_2^2] \le 2w_k^2 + \frac{2d}{\mu} \le 2\Big(2w_0^2 + \frac{8d}{\mu}\Big) + \frac{2d}{\mu} \le \frac{42d}{\mu} + 24\|x^*\|_2^2 + \frac{8d}{L},
\]

which completes the proof for the upper bound of $\mathbb{E}[\|x_k - x^*\|_2^2]$. Moreover, according to Assumption 4.1, we have

\[
\mathbb{E}[f(x_k)] - f(x^*) \le \frac{L\,\mathbb{E}[\|x_k - x^*\|_2^2]}{2} \le 21d\kappa + 12L\|x^*\|_2^2 + 4d.
\]

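In the following, we prove the uniform upper bound on $\mathbb{E}[\|v_k\|_2^2]$. Similar to the proof of $U_x$, we have

\[
\begin{aligned}
\mathbb{E}[\|v_k\|_2^2] &= \mathbb{E}[\|v_k - v^\pi + v^\pi\|_2^2] \le 2\mathbb{E}[\|v^\pi\|_2^2] + 2\mathbb{E}[\|v^\pi - v_k\|_2^2] \\
&\le 2\mathbb{E}[\|v^\pi\|_2^2] + 4\mathbb{E}[\|v^\pi - v_k + x^\pi - x_k\|_2^2] + 4\mathbb{E}[\|x^\pi - x_k\|_2^2] \\
&= 2\mathbb{E}[\|v^\pi\|_2^2] + 4w_k^2.
\end{aligned}
\]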

Since $w_k^2 \le 2w_0^2 + 8d/\mu \le 20d/\mu + 12\|x^*\|_2^2 + 4d/L$ and $\mathbb{E}[\|v^\pi\|_2^2] = d/L$, we have

\[
\mathbb{E}[\|v_k\|_2^2] \le \frac{80d}{\mu} + \frac{18d}{L} + 48\|x^*\|_2^2 \triangleq U_v,
\]

which completes our proof.
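The numerical constants in this proof can be traced mechanically. The sketch below represents each bound by its coefficients on $(d/\mu,\ \|x^*\|_2^2,\ d/L)$ and replays the chain of inequalities above. The helper names `scale` and `add` and the tuple encoding are illustrative devices, not part of the paper.

```python
from fractions import Fraction as F

def scale(c, vec):
    """Multiply every coefficient by the constant c."""
    return tuple(c * v for v in vec)

def add(a, b):
    """Add two coefficient tuples entry by entry."""
    return tuple(x + y for x, y in zip(a, b))

# Coefficient order: (d/mu, ||x*||^2, d/L)
w0_sq = (F(6), F(6), F(2))                            # bound on w_0^2
wk_sq = add(scale(2, w0_sq), (F(8), F(0), F(0)))      # 2*w_0^2 + 8d/mu
x_bound = add(scale(2, wk_sq), (F(2), F(0), F(0)))    # 2*w_k^2 + 2d/mu, via (C.3)
v_bound = add((F(0), F(0), F(2)), scale(4, wk_sq))    # 2*E||v^pi||^2 + 4*w_k^2

# Multiplying x_bound by L/2 turns the basis into (d*kappa, L*||x*||^2, d).
f_bound = scale(F(1, 2), x_bound)

print("w_k^2   :", wk_sq)
print("x bound :", x_bound)
print("f bound :", f_bound)
print("v bound :", v_bound)
```

The four tuples correspond to $20d/\mu + 12\|x^*\|_2^2 + 4d/L$, $42d/\mu + 24\|x^*\|_2^2 + 8d/L$, $21d\kappa + 12L\|x^*\|_2^2 + 4d$ and $80d/\mu + 48\|x^*\|_2^2 + 18d/L$ respectively.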

C.3. Proof of Lemma C.1

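Proof. Recall the discrete update form (3.1) and the proposed SVR-HMC algorithm. Let $k = jm + l$; we first rewrite the $l$-th update in the $j$-th epoch as follows,

\[
\begin{aligned}
x_{k+1} &= x_k + \eta v_k + \epsilon_k^x, \\
v_{k+1} &= v_k - \gamma\eta v_k - \eta u g_k + \epsilon_k^v,
\end{aligned} \tag{C.7}
\]

where $g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \nabla f(\tilde{x}_j)$.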

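In order to show the upper bounds of $\mathbb{E}[f(x_k)]$ and $\mathbb{E}[\|v_k\|_2^2]$, we consider the Lyapunov function $\mathcal{E}_k = \mathbb{E}[(1 - \gamma\eta)f(x_k) + \|v_k\|_2^2/(2u)]$. In what follows, we aim to establish the relationship between $\mathcal{E}_{k+1}$ and $\mathcal{E}_k$. To begin with, we deal with $\mathbb{E}[f(x_{k+1})]$, which can be upper bounded by

\[
\begin{aligned}
\mathbb{E}[f(x_{k+1})] &\le \mathbb{E}\Big[f(x_k) + \eta\langle v_k, \nabla f(x_k)\rangle + \frac{L\|\eta v_k + \epsilon_k^x\|_2^2}{2}\Big] \\
&= \mathbb{E}\Big[f(x_k) + \eta\langle v_k, \nabla f(x_k)\rangle + \frac{L\eta^2\|v_k\|_2^2}{2}\Big] + \frac{L}{2}\mathbb{E}[\|\epsilon_k^x\|_2^2].
\end{aligned} \tag{C.8}
\]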

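In terms of $\mathbb{E}[\|v_{k+1}\|_2^2]$, we have

\[
\mathbb{E}[\|v_{k+1}\|_2^2] = \mathbb{E}[\|v_k - \gamma\eta v_k - \eta u g_k + \epsilon_k^v\|_2^2] = \mathbb{E}[\|v_k - \gamma\eta v_k - \eta u g_k\|_2^2] + \mathbb{E}[\|\epsilon_k^v\|_2^2]. \tag{C.9}
\]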


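As for the first term on the R.H.S. of the above equation, we have

\[
\mathbb{E}[\|v_k - \gamma\eta v_k - \eta u g_k\|_2^2] = \mathbb{E}[\|(1 - \gamma\eta)v_k\|_2^2] - 2(1 - \gamma\eta)\eta u\,\mathbb{E}[\langle v_k, g_k\rangle] + \eta^2u^2\,\mathbb{E}[\|g_k\|_2^2].
\]

Note that

\[
\mathbb{E}[\langle v_k, g_k\rangle] = \mathbb{E}[\langle v_k, \mathbb{E}_{i_k}g_k\rangle] = \mathbb{E}[\langle v_k, \nabla f(x_k)\rangle],
\]

which immediately implies

\[
\begin{aligned}
\mathbb{E}[\|v_k - \gamma\eta v_k - \eta u g_k\|_2^2] &= (1 - \gamma\eta)^2\mathbb{E}[\|v_k\|_2^2] - 2(1 - \gamma\eta)\eta u\,\mathbb{E}[\langle v_k, \nabla f(x_k)\rangle] \\
&\quad + \eta^2u^2\,\mathbb{E}[\|\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \nabla f(\tilde{x}_j)\|_2^2] \\
&\le (1 - \gamma\eta)\mathbb{E}[\|v_k\|_2^2] - 2(1 - \gamma\eta)\eta u\,\mathbb{E}[\langle v_k, \nabla f(x_k)\rangle] \\
&\quad + 3\eta^2u^2\,\mathbb{E}\big[\|\nabla f_{i_k}(x_k)\|_2^2 + \|\nabla f_{i_k}(\tilde{x}_j)\|_2^2 + \|\nabla f(\tilde{x}_j)\|_2^2\big],
\end{aligned} \tag{C.10}
\]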

where the inequality follows from the fact that $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$ and that $0 < 1 - \eta\gamma < 1$, so $(1 - \gamma\eta)^2 \le 1 - \gamma\eta$. Combining (C.8), (C.9) and (C.10), we obtain

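\[
\begin{aligned}
\mathcal{E}_{k+1} &= \mathbb{E}\Big[(1 - \gamma\eta)f(x_{k+1}) + \frac{\|v_{k+1}\|_2^2}{2u}\Big] \\
&\le (1 - \gamma\eta)\mathbb{E}[f(x_k)] + \frac{1 - \gamma\eta + Lu\eta^2(1 - \gamma\eta)}{2u}\mathbb{E}[\|v_k\|_2^2] \\
&\quad + \frac{3\eta^2u}{2}\mathbb{E}\big[\|\nabla f_{i_k}(x_k)\|_2^2 + \|\nabla f_{i_k}(\tilde{x}_j)\|_2^2 + \|\nabla f(\tilde{x}_j)\|_2^2\big] \\
&\quad + \frac{(1 - \gamma\eta)L}{2}\mathbb{E}[\|\epsilon_k^x\|_2^2] + \frac{\mathbb{E}[\|\epsilon_k^v\|_2^2]}{2u}.
\end{aligned} \tag{C.11}
\]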
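The identity $\mathbb{E}_{i_k}g_k = \nabla f(x_k)$ that removes the stochastic gradient from the cross term in (C.10) can be checked numerically. The sketch below assumes that $f$ is the average of its components, $f = \frac{1}{n}\sum_i f_i$, and that $i_k$ is drawn uniformly. The quadratic components $f_i(x) = \frac{1}{2}a_i\|x - c_i\|_2^2$ and all function names are hypothetical, chosen only so that the gradients are easy to write down.

```python
import random

def grad_fi(i, x, scales, centers):
    """Gradient of the hypothetical component f_i(x) = 0.5 * a_i * ||x - c_i||^2."""
    return [scales[i] * (xc - cc) for xc, cc in zip(x, centers[i])]

def full_grad(x, scales, centers):
    """Gradient of f = (1/n) * sum_i f_i (averaging convention is an assumption)."""
    n = len(scales)
    total = [0.0] * len(x)
    for i in range(n):
        gi = grad_fi(i, x, scales, centers)
        total = [t + g / n for t, g in zip(total, gi)]
    return total

def svrg_grad(i, x, x_snap, scales, centers):
    """Semi-stochastic gradient g_k with component index i and snapshot x_snap."""
    gi_x = grad_fi(i, x, scales, centers)
    gi_snap = grad_fi(i, x_snap, scales, centers)
    g_snap = full_grad(x_snap, scales, centers)
    return [p - q + r for p, q, r in zip(gi_x, gi_snap, g_snap)]

random.seed(0)
n, dim = 8, 3
scales = [random.uniform(0.5, 2.0) for _ in range(n)]
centers = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(dim)]
x_snap = [random.gauss(0, 1) for _ in range(dim)]

# Averaging g_k over all indices is the exact expectation under uniform sampling.
all_g = [svrg_grad(i, x, x_snap, scales, centers) for i in range(n)]
mean_g = [sum(g[c] for g in all_g) / n for c in range(dim)]
target = full_grad(x, scales, centers)
print(mean_g)
print(target)
```

The two printed vectors agree up to floating-point rounding, because the snapshot terms $-\nabla f_{i}(\tilde{x}_j) + \nabla f(\tilde{x}_j)$ average to zero over $i$.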

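From (B.2), we know that

\[
\mathbb{E}[\|\epsilon_k^v\|_2^2] \le 2\gamma ud\eta, \quad\text{and}\quad \mathbb{E}[\|\epsilon_k^x\|_2^2] \le 2ud\eta^2.
\]

We bound the gradient norm term as follows:

\[
\begin{aligned}
\mathbb{E}[\|\nabla f_i(x_k)\|_2^2] &\le 2\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x^*)\|_2^2] + 2\mathbb{E}[\|\nabla f_i(x^*)\|_2^2] \\
&\le 2L^2\mathbb{E}[\|x_k - x^*\|_2^2] + 2\mathbb{E}[\|\nabla f_i(x^*)\|_2^2] \\
&\le \frac{4L^2}{\mu}\mathbb{E}[f(x_k) - f(x^*)] + 2\mathbb{E}[\|\nabla f_i(x^*)\|_2^2].
\end{aligned}
\]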

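Upper bounds of $\|\nabla f_{i_k}(\tilde{x}_j)\|_2^2$ and $\|\nabla f(\tilde{x}_j)\|_2^2$ can be established in the same way. Then (C.11) can be further bounded by

\[
\begin{aligned}
\mathcal{E}_{k+1} &\le (1 - \gamma\eta)\mathbb{E}[f(x_k)] + \frac{1 - \gamma\eta + Lu\eta^2}{2u}\mathbb{E}[\|v_k\|_2^2] \\
&\quad + 6\eta^2uL\kappa\,\mathbb{E}[f(x_k) + 2f(\tilde{x}_j) - 3f(x^*)] + 6\eta^2u\,\mathbb{E}[\|\nabla f_i(x^*)\|_2^2] + d\eta(\gamma + Lu\eta) \\
&\le (1 - \gamma\eta + 6\eta^2uL\kappa)\mathbb{E}[f(x_k)] + \frac{1 - \gamma\eta + Lu\eta^2}{2u}\mathbb{E}[\|v_k\|_2^2] + 12\eta^2uL\kappa\,\mathbb{E}[f(\tilde{x}_j)] \\
&\quad + \eta\gamma d + \eta^2\big[6u\,\mathbb{E}[\|\nabla f_i(x^*)\|_2^2] + Lud - 18uL\kappa f(x^*)\big].
\end{aligned} \tag{C.12}
\]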

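Note that we have assumed $\gamma\eta \le 1/2$, which further implies that

\[
\begin{aligned}
(1 - \gamma\eta + 6\eta^2uL\kappa)\mathbb{E}[f(x_k)] + \frac{1 - \gamma\eta + Lu\eta^2}{2u}\mathbb{E}[\|v_k\|_2^2] &\le \max\Big\{\frac{1 - \gamma\eta + 6\eta^2Lu\kappa}{1 - \gamma\eta},\ 1 - \gamma\eta + Lu\eta^2\Big\}\mathcal{E}_k \\
&\le (1 + 12\eta^2uL\kappa)\mathcal{E}_k,
\end{aligned}
\]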

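where in the second inequality we use the fact that $(1 - \gamma\eta + a)/(1 - \gamma\eta) \le 1 + 2a$ for any $a > 0$ and $0 < \gamma\eta \le 1/2$. Moreover, since $0 < \gamma\eta \le 1/2$, we have $\mathbb{E}[f(\tilde{x}_j)] \le 2(1 - \gamma\eta)\mathbb{E}[f(\tilde{x}_j)] + \mathbb{E}[\|v_{jm}\|_2^2]/u = 2\mathcal{E}_{jm}$, where we used the fact that $\tilde{x}_j = x_{jm}$. Therefore (C.12) turns into

\[
\mathcal{E}_{k+1} \le (1 + 12\eta^2uL\kappa)\mathcal{E}_k + 24\eta^2uL\kappa\,\mathcal{E}_{jm} + \eta\gamma d + \eta^2G_0, \tag{C.13}
\]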


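where $G_0 = 6u\mathbb{E}[\|\nabla f_i(x^*)\|_2^2] + 2Lud - 18uL\kappa f(x^*)$. Note that the inequality (C.13) can be relaxed to

\[
\mathcal{E}_{k+1} \le (1 + 36\eta^2uL\kappa)\max\{\mathcal{E}_k, \mathcal{E}_{jm}\} + \eta\gamma d + \eta^2G_0. \tag{C.14}
\]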
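The two elementary steps behind (C.13) and (C.14) can be spelled out as follows. This is a short worked derivation under the stated condition $0 < \gamma\eta \le 1/2$; the shorthand symbols $a$, $c$ and $M$ are introduced here only for readability.

```latex
% Step 1: the factor multiplying the Lyapunov function, with a = 6\eta^2 uL\kappa.
% Since 1 - \gamma\eta \ge 1/2, dividing by it at most doubles a.
\[
\frac{1 - \gamma\eta + a}{1 - \gamma\eta}
  = 1 + \frac{a}{1 - \gamma\eta}
  \le 1 + 2a
  = 1 + 12\eta^2 uL\kappa .
\]

% Step 2: from (C.13) to (C.14), with c = \eta^2 uL\kappa > 0 and
% M = \max\{\mathcal{E}_k, \mathcal{E}_{jm}\}. Both coefficients are positive,
% so each Lyapunov value may be replaced by M.
\[
(1 + 12c)\,\mathcal{E}_k + 24c\,\mathcal{E}_{jm}
  \le (1 + 12c)\,M + 24c\,M
  = (1 + 36c)\,M .
\]
```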

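We then consider two cases, $\mathcal{E}_k \ge \mathcal{E}_{jm}$ and $\mathcal{E}_{jm} > \mathcal{E}_k$, and analyze the upper bound of $\mathcal{E}_{k+1}$ in each.

Case I: $\mathcal{E}_k \ge \mathcal{E}_{jm}$. The inequality (C.14) reduces to

\[
\mathcal{E}_{k+1} \le (1 + 36\eta^2uL\kappa)\mathcal{E}_k + \eta\gamma d + \eta^2G_0,
\]

which immediately implies that

\[
\begin{aligned}
\mathcal{E}_k &\le (1 + 36\eta^2uL\kappa)^k\mathcal{E}_0 + (\eta\gamma d + \eta^2G_0)\sum_{i=0}^{k-1}(1 + 36\eta^2uL\kappa)^i \\
&= (1 + 36\eta^2uL\kappa)^k\mathcal{E}_0 + (\eta\gamma d + \eta^2G_0)\frac{(1 + 36\eta^2uL\kappa)^k - 1}{36\eta^2uL\kappa}.
\end{aligned}
\]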

Let $G_1 = 36uL\kappa$. It is easy to verify the following fact for any $G_1\eta^2 > 0$:

\[
(1 + G_1\eta^2)^k = \exp\big(k\log(1 + G_1\eta^2)\big) \le \exp\big(kG_1\eta^2\big).
\]

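Then, $\mathcal{E}_k$ can be further bounded as

\[
\begin{aligned}
\mathcal{E}_k &\le (1 + G_1\eta^2)^k\mathcal{E}_0 + (\eta\gamma d + \eta^2G_0)\frac{(1 + G_1\eta^2)^k - 1}{G_1\eta^2} \\
&\le e^{G_1k\eta^2}\mathcal{E}_0 + (\eta\gamma d + \eta^2G_0)\frac{e^{G_1k\eta^2} - 1}{G_1\eta^2} \\
&\le e^{G_1k\eta^2}\mathcal{E}_0 + (\eta\gamma d + \eta^2G_0)\frac{G_1k\eta^2 + e^{G_1k\eta^2}G_1^2k^2\eta^4/2}{G_1\eta^2} \\
&= e^{G_1k\eta^2}\mathcal{E}_0 + k\eta\gamma d + k\eta^2G_0 + \frac{1}{2}k^2\eta^3G_1e^{G_1k\eta^2}(\gamma d + \eta G_0),
\end{aligned} \tag{C.15}
\]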

where the third inequality holds because $h(y) \le h(0) + h'(0)y + \max_{s\in[0,y]}h''(s)\,y^2/2$ for any $C^2$ function $h$, applied here to $h(y) = e^y$ with $y = G_1k\eta^2$.
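The Case I argument can be illustrated by iterating the recursion with equality, which is the worst case it allows, and comparing the result with the right-hand side of (C.15). The sketch assumes $\mathcal{E}_0 \ge 0$ and $\eta\gamma d + \eta^2G_0 \ge 0$. The function names and all numeric values are hypothetical and serve only as an illustration.

```python
import math

def iterate_recursion(E0, G1, eta, c, k):
    """Run E_{i+1} = (1 + G1*eta^2) * E_i + c for k steps (equality case)."""
    E = E0
    for _ in range(k):
        E = (1 + G1 * eta**2) * E + c
    return E

def closed_form(E0, G1, eta, c, k):
    """Geometric-sum form: (1+a)^k * E0 + c * ((1+a)^k - 1) / a with a = G1*eta^2."""
    a = G1 * eta**2
    growth = (1 + a) ** k
    return growth * E0 + c * (growth - 1) / a

def bound_c15(E0, G1, G0, gamma, d, eta, k):
    """Right-hand side of (C.15)."""
    e = math.exp(G1 * k * eta**2)
    return (e * E0 + k * eta * gamma * d + k * eta**2 * G0
            + 0.5 * k**2 * eta**3 * G1 * e * (gamma * d + eta * G0))

# Illustrative values only (assumptions, not taken from the paper).
E0, G1, G0, gamma, d, eta, k = 3.0, 36.0, 5.0, 2.0, 10, 0.01, 500
c = eta * gamma * d + eta**2 * G0

worst_case = iterate_recursion(E0, G1, eta, c, k)
geometric = closed_form(E0, G1, eta, c, k)
upper = bound_c15(E0, G1, G0, gamma, d, eta, k)
print(worst_case, geometric, upper)
```

The iterated value matches the geometric-sum expression up to rounding, and both stay below the (C.15) bound. The gap comes from the two relaxations $(1 + G_1\eta^2)^k \le e^{G_1k\eta^2}$ and the second-order Taylor bound on $e^y - 1$.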

Case II: $\mathcal{E}_{jm} > \mathcal{E}_k$. To obtain the upper bound of $\mathcal{E}_k$, we still need to apply (C.14) recursively. However, note that $jm \le k$, which implies that fewer than $k$ recursions are needed. Thus, (C.15) remains true.

Finally, using the facts that $\mathbb{E}[f(x_k)] - f(x^*) \ge 0$ and $\mathbb{E}[\|v_k\|_2^2] \ge 0$ together with the definition of $\mathcal{E}_k$, and replacing $k$ in (C.15) by $K$, we arrive at the bounds stated in this lemma.
