
Deep Generative Learning via Schrödinger Bridge



Gefei Wang 1 Yuling Jiao 2 Qian Xu 3 Yang Wang 1 4 Can Yang 1 4

Abstract

We propose to learn a generative model via entropy interpolation with a Schrödinger Bridge. The generative learning task can be formulated as interpolating between a reference distribution and a target distribution based on the Kullback-Leibler divergence. At the population level, this entropy interpolation is characterized via an SDE on [0, 1] with a time-varying drift term. At the sample level, we derive our Schrödinger Bridge algorithm by plugging the drift term estimated by a deep score estimator and a deep density ratio estimator into the Euler-Maruyama method. Under some mild smoothness assumptions of the target distribution, we prove the consistency of both the score estimator and the density ratio estimator, and then establish the consistency of the proposed Schrödinger Bridge approach. Our theoretical results guarantee that the distribution learned by our approach converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and indicate that the generative model via Schrödinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning. We demonstrate its usefulness in image interpolation and image inpainting.

1. Introduction

Deep generative models have achieved enormous success in learning the underlying high-dimensional data distribution from samples. They have various applications in machine learning, like image-to-image translation (Zhu et al., 2017;

1 Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China  2 School of Mathematics and Statistics, Wuhan University, Wuhan, China  3 AI Group, WeBank Co., Ltd., Shenzhen, China  4 Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications, The Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Yuling Jiao <[email protected]>, Can Yang <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Choi et al., 2020), semantic image editing (Zhu et al., 2016; Shen et al., 2020) and audio synthesis (Van Den Oord et al., 2016; Prenger et al., 2019). Most existing generative models seek to learn a nonlinear function that transforms a simple reference distribution to the target distribution as the data generating mechanism. They can be categorized as either likelihood-based models or implicit generative models.

Likelihood-based models, such as variational auto-encoders (VAEs) (Kingma & Welling, 2014) and flow-based methods (Dinh et al., 2015), optimize the negative log-likelihood or its surrogate loss, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the target distribution and the generated distribution. Although their ability to learn flexible distributions is restricted by the way they model the probability density, many works have been proposed to alleviate this problem and have achieved appealing results (Makhzani et al., 2016; Tolstikhin et al., 2018; Razavi et al., 2019; Dinh et al., 2017; Papamakarios et al., 2017; Kingma & Dhariwal, 2018; Behrmann et al., 2019). As a representative of implicit generative models, generative adversarial networks (GANs) use a min-max game objective to learn the target distribution. It has been shown that the vanilla GAN (Goodfellow et al., 2014) minimizes the Jensen-Shannon (JS) divergence between the target distribution and the generated distribution. To generalize the vanilla GAN, researchers have considered other criteria, including more general f-divergences (Nowozin et al., 2016), the 1-Wasserstein distance (Arjovsky et al., 2017) and maximum mean discrepancy (MMD) (Bińkowski et al., 2018). Meanwhile, recent progress on designing network architectures (Radford et al., 2016; Zhang et al., 2019) and training techniques (Karras et al., 2018; Brock et al., 2019) has enabled GANs to produce impressive high-quality images.

Despite the extraordinary performance of generative models (Razavi et al., 2019; Kingma & Dhariwal, 2018; Brock et al., 2019; Karras et al., 2019), there still exists a gap between the empirical success and the theoretical justification of these methods. For likelihood-based models, consistency results require that the data distribution lies within the model family, which often fails to hold in practice (Kingma & Welling, 2014). Recently, new generative models have been developed from different perspectives, such as gradient flows in a measure space, in which GAN can be covered as a special case (Gao et al., 2019; Arbel et al., 2019), and


stochastic differential equations (SDEs) (Song & Ermon, 2019; 2020; Song et al., 2021). To push a simple initial distribution to the target one, however, these methods (Gao et al., 2019; Arbel et al., 2019; Liutkus et al., 2019; Song & Ermon, 2019; 2020; Block et al., 2020) require the evolving time to go to infinity at the population level. Therefore, these methods require a strong assumption to achieve model consistency: the target must be log-concave or satisfy the log-Sobolev inequality.

To fill this gap, we propose a Schrödinger Bridge approach to learning generative models. The Schrödinger Bridge tackles the problem by interpolating from a reference distribution to a target distribution based on the Kullback-Leibler divergence. The Schrödinger Bridge can be formulated via an SDE on a finite time interval [0, 1] with a time-varying drift term. At the population level, we can solve the SDE using the standard Euler-Maruyama method. At the sample level, we derive our Schrödinger Bridge algorithm by plugging the drift term into the Euler-Maruyama method, where the drift term can be accurately estimated by a deep score network. The major contributions of this work are as follows:

• From the theoretical perspective, we prove the consistency of the Schrödinger Bridge approach under some mild smoothness assumptions on the target distribution. Our theory guarantees that the learned distribution converges to the target. To achieve model consistency, existing theories rely on strong assumptions, e.g., the target must be log-concave or satisfy some error bound conditions, such as the log-Sobolev inequality. These assumptions may not hold in practice.

• From the algorithmic perspective, we develop a novel two-stage approach to make the theory of the Schrödinger Bridge work in practice, where the first stage effectively learns a smoothed version of the target distribution and the second stage drives the smoothed one to the target distribution. Figure 1 gives an overview of our two-stage algorithm.

• Through synthetic data, we demonstrate that our Schrödinger Bridge approach can stably learn multimodal distributions, while GANs are often highly unstable and prone to miss modes (Che et al., 2017). We also show that the proposed approach achieves comparable performance with state-of-the-art GANs on benchmark data.

In summary, we believe that our work suggests a new formulation of generative models.

Figure 1. Overview of our two-stage algorithm. Stage 1 drives samples at 0 (left) to a smoothed data distribution (middle), and stage 2 learns the underlying target data distribution (right) with samples produced by stage 1. Stage 1 and stage 2 are achieved through two different Schrödinger Bridges with theoretically guaranteed performance.

2. Background

Let us first recall some background on the Schrödinger Bridge problem, adapted from Leonard (2014) and Chen et al. (2020).

Let $\Omega = C([0,1], \mathbb{R}^d)$ be the space of $\mathbb{R}^d$-valued continuous functions on the time interval $[0,1]$. Denote by $X = (X_t)_{t\in[0,1]}$ the canonical process on $\Omega$, where $X_t(\omega) = \omega_t$ for $\omega = (\omega_s)_{s\in[0,1]} \in \Omega$. The canonical $\sigma$-field on $\Omega$ is $\mathcal{F} = \sigma(X_t, t\in[0,1])$, generated by the sets $\{\omega : X_t(\omega)\in H\}$ with $H\in\mathcal{B}(\mathbb{R}^d)$ and $t\in[0,1]$. Denote by $\mathcal{P}(\Omega)$ the space of probability measures on the path space $\Omega$, and by $\mathbb{W}^x_\tau \in \mathcal{P}(\Omega)$ the Wiener measure with variance $\tau$ whose initial marginal is $\delta_x$. The law of the reversible Brownian motion is then defined as $\mathbb{P}^\tau = \int \mathbb{W}^x_\tau \, dx$, which is an unbounded measure on $\Omega$. One can observe that $\mathbb{P}^\tau$ has a marginal coinciding with the Lebesgue measure $\mathcal{L}$ at each $t$.

Schrödinger (1932) studied the problem of finding the most likely random evolution between two continuous probability distributions $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$; this study is nowadays called the Schrödinger Bridge problem (SBP). SBP can be formulated as seeking a probability law on a path space that interpolates between $\mu$ and $\nu$ and is close to the prior law of the Brownian diffusion in the sense of relative entropy (Jamison, 1975; Leonard, 2014), i.e., finding a path measure $Q^*\in\mathcal{P}(\Omega)$ with marginals $Q^*_t = (X_t)_{\#}Q^* = Q^* \circ X_t^{-1}$, $t\in[0,1]$, such that

$$Q^* \in \arg\min_{Q\in\mathcal{P}(\Omega)} D_{\mathrm{KL}}(Q\,\|\,\mathbb{P}^\tau), \quad \text{subject to } Q_0 = \mu,\ Q_1 = \nu,$$

where $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$, the relative entropy is $D_{\mathrm{KL}}(Q\,\|\,\mathbb{P}^\tau) = \int \log\big(\tfrac{dQ}{d\mathbb{P}^\tau}\big)\, dQ$ if $Q \ll \mathbb{P}^\tau$ (i.e., $Q$ is absolutely continuous w.r.t. $\mathbb{P}^\tau$), and $D_{\mathrm{KL}}(Q\,\|\,\mathbb{P}^\tau) = \infty$ otherwise. The following result characterizes the solution to SBP.

Theorem 1 (Leonard, 2014). If $\mu, \nu \ll \mathcal{L}$, then SBP admits a unique solution $Q^* = f^*(X_0)\, g^*(X_1)\, \mathbb{P}^\tau$, where $f^*, g^*$ are $\mathcal{L}$-measurable nonnegative functions on $\mathbb{R}^d$ satisfying the Schrödinger system
$$f^*(\mathbf{x})\, \mathbb{E}_{\mathbb{P}^\tau}[g^*(X_1)\mid X_0 = \mathbf{x}] = \frac{d\mu}{d\mathcal{L}}(\mathbf{x}), \quad \mathcal{L}\text{-a.e.},$$
$$g^*(\mathbf{y})\, \mathbb{E}_{\mathbb{P}^\tau}[f^*(X_0)\mid X_1 = \mathbf{y}] = \frac{d\nu}{d\mathcal{L}}(\mathbf{y}), \quad \mathcal{L}\text{-a.e.}$$

Besides $Q^*$, we can also characterize the density of the time-marginals of $Q^*$, i.e., $\frac{dQ^*_t}{d\mathcal{L}}(\mathbf{x})$.

Let $q(\mathbf{x})$ and $p(\mathbf{y})$ be the densities of $\mu$ and $\nu$, respectively, and let
$$h_\tau(s,\mathbf{x},t,\mathbf{y}) = [2\pi\tau(t-s)]^{-d/2}\exp\Big(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\tau(t-s)}\Big)$$
be the transition density of the Wiener process. Then we have $\mathbb{E}_{\mathbb{P}^\tau}[f^*(X_0)\mid X_1 = \mathbf{y}] = \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, f_0(\mathbf{x})\, d\mathbf{x}$ and $\mathbb{E}_{\mathbb{P}^\tau}[g^*(X_1)\mid X_0 = \mathbf{x}] = \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, g_1(\mathbf{y})\, d\mathbf{y}$. The above Schrödinger system is thus equivalent to
$$f^*(\mathbf{x}) \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, g_1(\mathbf{y})\, d\mathbf{y} = q(\mathbf{x}), \qquad g^*(\mathbf{y}) \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, f_0(\mathbf{x})\, d\mathbf{x} = p(\mathbf{y}).$$

Denote $f_0(\mathbf{x}) = f^*(\mathbf{x})$, $g_1(\mathbf{y}) = g^*(\mathbf{y})$,
$$f_1(\mathbf{y}) = \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, f_0(\mathbf{x})\, d\mathbf{x}, \qquad g_0(\mathbf{x}) = \int h_\tau(0,\mathbf{x},1,\mathbf{y})\, g_1(\mathbf{y})\, d\mathbf{y}.$$

The Schrödinger system in Theorem 1 can also be characterized by
$$q(\mathbf{x}) = f_0(\mathbf{x})\, g_0(\mathbf{x}), \qquad p(\mathbf{y}) = f_1(\mathbf{y})\, g_1(\mathbf{y}),$$
together with the following forward and backward time-harmonic equations on $(0,1)\times\mathbb{R}^d$ (Chen et al., 2020):
$$\partial_t f_t(\mathbf{x}) = \frac{\tau}{2}\Delta f_t(\mathbf{x}), \qquad \partial_t g_t(\mathbf{x}) = -\frac{\tau}{2}\Delta g_t(\mathbf{x}).$$

Let $q_t$ denote the marginal density of $Q^*_t$, i.e., $q_t(\mathbf{x}) = \frac{dQ^*_t}{d\mathcal{L}}(\mathbf{x})$. It can then be represented (Chen et al., 2020) as the product $q_t(\mathbf{x}) = f_t(\mathbf{x})\, g_t(\mathbf{x})$.

There are also dynamic formulations of SBP. Let $\mathcal{U}$ consist of admissible Markov controls with finite energy. The following theorem shows that the vector field
$$\mathbf{u}^*_t = \tau \mathbf{v}^*_t = \tau \nabla_{\mathbf{x}} \log g_t(\mathbf{x}) = \tau \nabla_{\mathbf{x}} \log \int h_\tau(t,\mathbf{x},1,\mathbf{y})\, g_1(\mathbf{y})\, d\mathbf{y} \qquad (1)$$
solves the following stochastic control problem.

Theorem 2 (Dai Pra, 1991).
$$\mathbf{u}^*_t(\mathbf{x}) \in \arg\min_{\mathbf{u}\in\mathcal{U}} \mathbb{E}\Big[\int_0^1 \frac{1}{2}\|\mathbf{u}_t\|^2\, dt\Big] \quad \text{s.t.} \quad d\mathbf{x}_t = \mathbf{u}_t\, dt + \sqrt{\tau}\, d\mathbf{w}_t, \quad \mathbf{x}_0 \sim q(\mathbf{x}),\ \mathbf{x}_1 \sim p(\mathbf{x}). \qquad (2)$$

According to Theorem 2, the dynamics determined by the SDE in (2) with the time-varying drift term $\mathbf{u}^*_t$ in (1) make particles sampled from the initial distribution $\mu$ evolve to particles drawn from the target distribution $\nu$ within the unit time interval. This property is exactly what we need in generative learning, because we want to learn the underlying target distribution $\nu$ by pushing forward a simple reference distribution $\mu$. Theorem 2 also indicates that such a solution has minimum energy in terms of the quadratic cost.

3. Generative Learning via Schrödinger Bridge

In generative learning, we observe i.i.d. data $\mathbf{x}_1,\ldots,\mathbf{x}_n$ from an unknown distribution $p_{\mathrm{data}} \in \mathcal{P}(\mathbb{R}^d)$. The underlying distribution $p_{\mathrm{data}}$ often has multiple modes or lies on a low-dimensional manifold, which makes it difficult to learn from a simple distribution such as a Gaussian or a Dirac measure supported on a single point. To make the generative learning task easier to handle, we can first learn a smoothed version of $p_{\mathrm{data}}$ from the simple reference distribution, namely
$$q_\sigma(\mathbf{x}) = \int p_{\mathrm{data}}(\mathbf{y})\, \Phi_\sigma(\mathbf{x}-\mathbf{y})\, d\mathbf{y},$$
where $\Phi_\sigma(\cdot)$ is the density of $\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ and the variance of the Gaussian noise, $\sigma^2$, controls the smoothness of $q_\sigma$. We then learn $p_{\mathrm{data}}$ starting from $q_\sigma$. At the population level, this idea can be realized via Schrödinger Bridges from the viewpoint of the stochastic control problem (Theorem 2). To be precise, we have the following theorem.

Theorem 3. Define the density ratio $f(\mathbf{x}) = \frac{q_\sigma(\mathbf{x})}{\Phi_{\sqrt{\tau}}(\mathbf{x})}$. Then for the SDE
$$d\mathbf{x}_t = \tau \nabla \log \mathbb{E}_{\mathbf{z}\sim\Phi_{\sqrt{\tau}}}\big[f(\mathbf{x}_t + \sqrt{1-t}\,\mathbf{z})\big]\, dt + \sqrt{\tau}\, d\mathbf{w}_t \qquad (3)$$
with initial condition $\mathbf{x}_0 = \mathbf{0}$, we have $\mathbf{x}_1 \sim q_\sigma(\mathbf{x})$. And, for the SDE
$$d\mathbf{x}_t = \sigma^2 \nabla \log q_{\sqrt{1-t}\,\sigma}(\mathbf{x}_t)\, dt + \sigma\, d\mathbf{w}_t \qquad (4)$$
with initial condition $\mathbf{x}_0 \sim q_\sigma(\mathbf{x})$, we have $\mathbf{x}_1 \sim p_{\mathrm{data}}(\mathbf{x})$.

According to Theorem 3, at the population level, the target $p_{\mathrm{data}}$ can be learned from the Dirac mass supported at $\mathbf{0}$ through the two SDEs (3) and (4) over the unit time interval [0, 1]. The main feature of the SDEs (3) and (4) is that both drift terms are time-varying, which is different from classical Langevin SDEs with time-invariant drift terms (Song & Ermon, 2019; 2020). The benefit of time-varying drift terms is that the dynamics in (3) and (4) push the initial distributions to the target distributions within a unit time interval, whereas the classical Langevin SDE requires time to go to infinity.


3.1. Estimation of the drift terms

Based on Theorem 3, we can run the Euler-Maruyama method to solve the SDEs (3) and (4) and obtain particles approximately drawn from the targets (Higham, 2001). However, the drift terms in Theorem 3 depend on the underlying target. To make the Euler-Maruyama method practical, we need to estimate the two drift terms in (3) and (4). In Eq. (3), some calculation shows that
$$\nabla \log \mathbb{E}_{\mathbf{z}\sim\Phi_{\sqrt{\tau}}}\big[f(\mathbf{x} + \sqrt{1-t}\,\mathbf{z})\big] = \frac{\mathbb{E}_{\mathbf{z}\sim\Phi_{\sqrt{\tau}}}\big[f(\mathbf{x} + \sqrt{1-t}\,\mathbf{z})\, \nabla\log f(\mathbf{x} + \sqrt{1-t}\,\mathbf{z})\big]}{\mathbb{E}_{\mathbf{z}\sim\Phi_{\sqrt{\tau}}}\big[f(\mathbf{x} + \sqrt{1-t}\,\mathbf{z})\big]}, \qquad (5)$$
and $\nabla \log f(\mathbf{x}) = \nabla \log q_\sigma(\mathbf{x}) + \mathbf{x}/\tau$.

Let $\hat{f}$ and $\widehat{\nabla \log q_\sigma}$ be the estimators of the density ratio $f$ and the score of $q_\sigma(\mathbf{x})$, respectively. After plugging them into (5), we can obtain an estimator of the drift term in (3) by computing the expectation with a Monte Carlo approximation.
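As an illustration, below is a minimal Python sketch of this Monte Carlo drift estimate; ratio_fn (an estimate of f), score_fn (an estimate of ∇ log qσ), and n_mc are illustrative placeholders and the sketch is not the authors' released implementation.

import torch

def stage1_drift(x, t, ratio_fn, score_fn, sigma, tau, n_mc=32):
    # Monte Carlo estimate of the drift of SDE (3) via Eq. (5):
    #   tau * grad log E_z[ f(x + sqrt(1-t) z) ],  z ~ N(0, tau I),
    # using grad log f(y) = grad log q_sigma(y) + y / tau.  Independent batches
    # could be used for the numerator and denominator, as in Algorithm 1.
    z = (tau ** 0.5) * torch.randn(n_mc, *x.shape)
    y = x.unsqueeze(0) + (1.0 - t) ** 0.5 * z          # perturbed evaluation points
    w = ratio_fn(y)                                    # estimated ratio f_hat(y), shape (n_mc,)
    grad_log_f = score_fn(y, sigma) + y / tau          # estimated grad log f_hat(y)
    num = (w.unsqueeze(-1) * grad_log_f).mean(dim=0)   # estimate of E[f * grad log f]
    return tau * num / w.mean()                        # tau * E[f grad log f] / E[f]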

Now we consider obtaining the estimator of the density ratio $f$ by minimizing the logistic regression loss
$$L_{\mathrm{logistic}}(r) = \mathbb{E}_{q_\sigma(\mathbf{x})} \log\big(1+\exp(-r(\mathbf{x}))\big) + \mathbb{E}_{\Phi_{\sqrt{\tau}}(\mathbf{x})} \log\big(1+\exp(r(\mathbf{x}))\big).$$
By setting the first variation to zero, the optimal solution is given by
$$r^*(\mathbf{x}) = \log\frac{q_\sigma(\mathbf{x})}{\Phi_{\sqrt{\tau}}(\mathbf{x})}.$$
Therefore, given samples $\tilde{\mathbf{x}}_1,\ldots,\tilde{\mathbf{x}}_n$ from $q_\sigma(\mathbf{x})$, which can be obtained by adding Gaussian noise drawn from $\Phi_\sigma$ to $\mathbf{x}_1,\ldots,\mathbf{x}_n \sim p_{\mathrm{data}}$, and samples $\mathbf{z}_1,\ldots,\mathbf{z}_n$ from $\Phi_{\sqrt{\tau}}$, we can estimate the density ratio $f(\mathbf{x})$ by
$$\hat{f}(\mathbf{x}) = \exp\big(r_{\hat{\phi}}(\mathbf{x})\big), \qquad (6)$$
where $r_{\hat{\phi}} \in \mathcal{NN}_\phi$ is the neural network that minimizes the empirical loss
$$r_{\hat{\phi}} \in \arg\min_{r_\phi \in \mathcal{NN}_\phi} \frac{1}{n} \sum_{i=1}^n \big[\log\big(1+\exp(-r_\phi(\tilde{\mathbf{x}}_i))\big) + \log\big(1+\exp(r_\phi(\mathbf{z}_i))\big)\big]. \qquad (7)$$
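To make Eqs. (6)-(7) concrete, here is a minimal PyTorch-style sketch of the density-ratio objective; the RatioNet architecture and all hyperparameters are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: train r_phi so that exp(r_phi(x)) estimates f(x) = q_sigma(x) / Phi_sqrt(tau)(x).
class RatioNet(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ratio_loss(r_phi, x_data, sigma, tau):
    x_tilde = x_data + sigma * torch.randn_like(x_data)   # samples from q_sigma
    z = (tau ** 0.5) * torch.randn_like(x_data)           # samples from Phi_sqrt(tau)
    # log(1 + exp(-r)) = softplus(-r), log(1 + exp(r)) = softplus(r), as in Eq. (7)
    return (F.softplus(-r_phi(x_tilde)) + F.softplus(r_phi(z))).mean()

# Usage sketch:
# net = RatioNet(dim=2); opt = torch.optim.Adam(net.parameters(), lr=1e-4)
# opt.zero_grad(); loss = ratio_loss(net, x_batch, sigma=1.0, tau=2.0); loss.backward(); opt.step()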

Next, we consider estimating the time-varying drift term in (4), i.e., $\nabla \log q_{\sqrt{1-t}\,\sigma}(\mathbf{x})$ for $t\in[0,1]$. To do so, we build a deep network as the score estimator of $\nabla \log q_{\tilde{\sigma}}(\mathbf{x})$ with $\tilde{\sigma}$ varying in $[0,\sigma]$. Vincent (2011) showed that explicitly matching the score by minimizing the objective
$$\frac{1}{2}\,\mathbb{E}_{q_{\tilde{\sigma}}(\tilde{\mathbf{x}})} \big\|s_\theta(\tilde{\mathbf{x}}, \tilde{\sigma}) - \nabla_{\tilde{\mathbf{x}}} \log q_{\tilde{\sigma}}(\tilde{\mathbf{x}})\big\|^2$$
is equivalent to minimizing the denoising score matching objective
$$\frac{1}{2}\,\mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \mathbb{E}_{\mathcal{N}(\tilde{\mathbf{x}};\mathbf{x},\tilde{\sigma}^2\mathbf{I})} \big\|s_\theta(\tilde{\mathbf{x}},\tilde{\sigma}) - \nabla_{\tilde{\mathbf{x}}}\log q_{\tilde{\sigma}}(\tilde{\mathbf{x}}\mid\mathbf{x})\big\|^2 = \frac{1}{2}\,\mathbb{E}_{p_{\mathrm{data}}(\mathbf{x})} \mathbb{E}_{\mathcal{N}(\tilde{\mathbf{x}};\mathbf{x},\tilde{\sigma}^2\mathbf{I})} \Big\|s_\theta(\tilde{\mathbf{x}},\tilde{\sigma}) + \frac{\tilde{\mathbf{x}}-\mathbf{x}}{\tilde{\sigma}^2}\Big\|^2.$$
Thus, following Song & Ermon (2019; 2020), we build the score estimator as
$$s_{\hat{\theta}}(\cdot,\cdot) \in \arg\min_{s_\theta\in\mathcal{NN}_\theta} L(\theta), \qquad (8)$$
$$L(\theta) = \frac{1}{m}\sum_{j=1}^m \lambda(\sigma_j)\, L_{\sigma_j}(\theta), \qquad (9)$$
$$L_{\sigma_j}(\theta) = \frac{1}{n}\sum_{i=1}^n \Big\|s_\theta(\mathbf{x}_i + \mathbf{z}_i, \sigma_j) + \frac{\mathbf{z}_i}{\sigma_j^2}\Big\|^2,$$
where the variances $\sigma_j^2$, $j = 1,\ldots,m$, are i.i.d. samples from $\mathrm{Uniform}[0,\sigma^2]$ with sample size $m$, $\lambda(\sigma_j) = \sigma_j^2$ is a nonnegative scaling factor that keeps all the summands in (9) on the same scale, and $\mathbf{z}_i$, $i = 1,\ldots,n$, are i.i.d. from $\Phi_{\sigma_j}$.
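A minimal sketch of this training objective follows, assuming a score network score_net(x, sigma) that takes the noise level as an additional input; the value of m, the batching, and the clamping of very small noise levels are our own illustrative choices.

import torch

def dsm_loss(score_net, x_data, sigma_max=1.0, m=16):
    # Denoising score matching loss of Eqs. (8)-(9): draw sigma_j^2 ~ Uniform[0, sigma_max^2],
    # perturb the data with N(0, sigma_j^2 I), regress the score against -z / sigma_j^2,
    # and weight each term by lambda(sigma_j) = sigma_j^2.
    losses = []
    for _ in range(m):
        sigma_j = (torch.rand(()) * sigma_max ** 2).sqrt().clamp_min(1e-3)  # avoid sigma_j ~ 0
        z = sigma_j * torch.randn_like(x_data)
        target = -z / sigma_j ** 2                 # -grad of the Gaussian log-density
        pred = score_net(x_data + z, sigma_j)
        per_sample = ((pred - target) ** 2).flatten(1).sum(dim=1)
        losses.append(sigma_j ** 2 * per_sample.mean())
    return torch.stack(losses).mean()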

Finally, we establish the consistency of the deep density ratio estimator $\hat{f}(\mathbf{x}) = \exp(r_{\hat{\phi}}(\mathbf{x}))$ and the deep score estimator $\widehat{\nabla \log q_{\tilde{\sigma}}}(\mathbf{x}) = s_{\hat{\theta}}(\mathbf{x}, \tilde{\sigma})$ in Theorem 4 and Theorem 5, respectively.

Theorem 4. Assume that the support of $p_{\mathrm{data}}(\mathbf{x})$ is contained in a compact set, and that $f(\mathbf{x})$ is Lipschitz continuous and bounded. Set the depth $\mathcal{D}$, width $\mathcal{W}$, and size $\mathcal{S}$ of $\mathcal{NN}_\phi$ as
$$\mathcal{D} = O(\log n), \quad \mathcal{W} = O\big(n^{\frac{d}{2(2+d)}}/\log n\big), \quad \mathcal{S} = O\big(n^{\frac{d-2}{d+2}} \log(n)^{-3}\big).$$
Then $\mathbb{E}\big[\|\hat{f}(\mathbf{x}) - f(\mathbf{x})\|_{L^2(p_{\mathrm{data}})}\big] \to 0$ as $n\to\infty$.

Theorem 5. Assume that $p_{\mathrm{data}}(\mathbf{x})$ is differentiable with bounded support, and that $\nabla \log q_{\tilde{\sigma}}(\mathbf{x})$ is Lipschitz continuous and bounded for $(\tilde{\sigma},\mathbf{x}) \in [0,\sigma]\times\mathbb{R}^d$. Set the depth $\mathcal{D}$, width $\mathcal{W}$, and size $\mathcal{S}$ of $\mathcal{NN}_\theta$ as
$$\mathcal{D} = O(\log n), \quad \mathcal{W} = O\big(\max\{n^{\frac{d}{2(2+d)}}/\log n,\, d\}\big), \quad \mathcal{S} = O\big(d\, n^{\frac{d-2}{d+2}} \log(n)^{-3}\big).$$
Then $\mathbb{E}\big[\big\|\,\|\widehat{\nabla \log q_{\tilde{\sigma}}}(\mathbf{x}) - \nabla \log q_{\tilde{\sigma}}(\mathbf{x})\|_2\,\big\|_{L^2(q_{\tilde{\sigma}})}\big] \to 0$ as $m, n \to \infty$.


Algorithm 1 Sampling

Input: f̂(·), s_θ̂(·, ·), τ, σ, N1, N2, N3
Initialize particles as x_0 = 0.

Stage 1:
for k = 0 to N1 − 1 do
    Sample {z_i}_{i=1}^{2N3} and ε_k from N(0, I)
    x̃_i = x_k + √(τ(1 − k/N1)) z_i,  i = 1, ..., 2N3
    b(x_k) = [Σ_{i=1}^{N3} f̂(x̃_i) (s_θ̂(x̃_i, σ) + √((1 − k/N1)/τ) z_i)] / [Σ_{i=N3+1}^{2N3} f̂(x̃_i)] + x_k/τ
    x_{k+1} = x_k + (τ/N1) b(x_k) + √(τ/N1) ε_k
end for
Set x_0 = x_{N1}.

Stage 2:
for k = 0 to N2 − 1 do
    Sample ε_k from N(0, I)
    b(x_k) = s_θ̂(x_k, √(1 − k/N2) σ)
    x_{k+1} = x_k + (σ²/N2) b(x_k) + (σ/√N2) ε_k
end for
return x_{N2}

3.2. Schrödinger Bridge Algorithm

With the two estimators $\hat{f}$ and $s_{\hat{\theta}}$, we can use the Euler-Maruyama method to approximate numerical solutions of the SDEs (3) and (4). Let N1 and N2 be the numbers of uniform grid points on the time interval [0, 1]. In stage 1, we start from 0 and run Euler-Maruyama for (3) with the estimated density ratio and score plugged into the drift term, obtaining samples that approximately follow qσ. In stage 2, we start from these samples and run another Euler-Maruyama for (4) with the estimated time-varying drift term. We summarize our two-stage Schrödinger Bridge algorithm in Algorithm 1.
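A compact Python sketch of this sampling procedure is given below; it assumes trained drift and score estimators (drift_fn could be the Monte Carlo helper sketched in Section 3.1) and is an illustration rather than the released code.

import torch

def sample_two_stage(drift_fn, score_fn, d, sigma=1.0, tau=2.0, n1=1000, n2=1000):
    # drift_fn(x, t): estimate of the drift of SDE (3); score_fn(x, s): estimate of grad log q_s(x).
    x = torch.zeros(d)                      # stage 1 starts from the Dirac mass at 0
    dt = 1.0 / n1
    for k in range(n1):                     # stage 1: SDE (3), 0 -> (approximately) q_sigma
        x = x + dt * drift_fn(x, k / n1) + (tau * dt) ** 0.5 * torch.randn(d)
    dt = 1.0 / n2
    for k in range(n2):                     # stage 2: SDE (4), anneal the noise level down to 0
        level = (1.0 - k / n2) ** 0.5 * sigma
        x = x + sigma ** 2 * dt * score_fn(x, level) + sigma * dt ** 0.5 * torch.randn(d)
    return x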

Interestingly, the second stage of our proposed Schrödinger Bridge Algorithm 1 recovers the reverse-time Variance Exploding (VE) SDE algorithm proposed in Song et al. (2021) if their annealing scheme is chosen to be linear, i.e., σ²(t) = σ² · t. From this point of view, our Schrödinger Bridge algorithm also provides a deeper understanding of annealed score-based sampling: the reverse-time VE SDE algorithm (with a proper annealing scheme) proposed by Song et al. (2021) is equivalent to the Schrödinger Bridge SDE (4).
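For completeness, here is a brief sketch of this equivalence (our paraphrase, following the reverse-time SDE formula of Song et al. (2021), not a derivation reproduced from this paper). The forward VE SDE with the linear schedule $\sigma^2(t) = \sigma^2 t$ is
$$d\mathbf{x}_t = \sqrt{\tfrac{d\sigma^2(t)}{dt}}\, d\mathbf{w}_t = \sigma\, d\mathbf{w}_t,$$
so its marginal at time $t$ is the smoothed density $q_{\sigma\sqrt{t}}$. The corresponding reverse-time SDE has squared diffusion coefficient $\sigma^2$ and drift $-\sigma^2\nabla\log q_{\sigma\sqrt{t}}(\mathbf{x}_t)$; substituting the reversed time $s = 1-t$ yields
$$d\mathbf{x}_s = \sigma^2 \nabla \log q_{\sigma\sqrt{1-s}}(\mathbf{x}_s)\, ds + \sigma\, d\mathbf{w}_s, \qquad \mathbf{x}_0 \sim q_\sigma,$$
which is exactly SDE (4).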

3.3. Consistency of the Schrödinger Bridge Algorithm

Let
$$D_1(t,\mathbf{x}) = \nabla \log \mathbb{E}_{\mathbf{z}\sim\Phi_{\sqrt{\tau}}}\big[f(\mathbf{x} + \sqrt{1-t}\,\mathbf{z})\big], \qquad D_2(t,\mathbf{x}) = \nabla \log q_{\sqrt{1-t}\,\sigma}(\mathbf{x})$$
be the drift terms, and denote
$$h_{\sigma,\tau}(\mathbf{x}_1,\mathbf{x}_2) = \exp\big(\|\mathbf{x}_1\|^2\big)\, p_{\mathrm{data}}(\mathbf{x}_1 + \sigma\mathbf{x}_2).$$
We now establish the consistency of our Schrödinger Bridge algorithm, which drives a simple distribution to the target one. To this end, we need the following assumptions.

Assumption 1. $\mathrm{supp}(p_{\mathrm{data}})$ is contained in a ball with radius $R$, and $p_{\mathrm{data}} > c > 0$ on its support.

Assumption 2. $\|D_i(t,\mathbf{x})\|_2 \le C_1(1 + \|\mathbf{x}\|_2)$ for all $\mathbf{x}\in \mathrm{supp}(p_{\mathrm{data}})$ and $t\in[0,1]$, where $C_1\in\mathbb{R}$ is a constant.

Assumption 3. $\|D_i(t_1,\mathbf{x}_1) - D_i(t_2,\mathbf{x}_2)\| \le C_2(\|\mathbf{x}_1 - \mathbf{x}_2\| + |t_1 - t_2|^{1/2})$ for all $\mathbf{x}_1,\mathbf{x}_2 \in \mathrm{supp}(p_{\mathrm{data}})$ and $t_1,t_2\in[0,1]$, where $C_2\in\mathbb{R}$ is another constant.

Assumption 4. $h_{\sigma,\tau}(\mathbf{x}_1,\mathbf{x}_2)$, $\nabla_{\mathbf{x}_1} h_{\sigma,\tau}(\mathbf{x}_1,\mathbf{x}_2)$, $p_{\mathrm{data}}$ and $\nabla p_{\mathrm{data}}$ are $L$-Lipschitz functions.

Theorem 6. Under Assumptions 1-4,
$$\mathbb{E}\big[\mathcal{W}_2\big(\mathrm{Law}(\mathbf{x}_{N_2}),\, p_{\mathrm{data}}\big)\big] \to 0, \quad \text{as } n, N_1, N_2, N_3 \to \infty,$$
where $\mathcal{W}_2$ is the 2-Wasserstein distance between two distributions.

The consistency of the proposed Schrödinger Bridge algorithm relies only on mild assumptions (such as smoothness and boundedness), without the restrictive technical requirements that the target distribution be log-concave or satisfy the log-Sobolev inequality (Gao et al., 2021; Arbel et al., 2019; Liutkus et al., 2019; Block et al., 2020).

4. Related Work

We discuss connections and differences between our Schrödinger Bridge approach and existing related works.

Most existing generative models, such as VAEs, GANs and flow-based methods, parameterize a transform map with a neural network G that minimizes an integral probability metric. Clearly, they are quite different from our proposal.

Recently, particle methods derived from the perspective of gradient flows in measure spaces or SDEs have been studied (Johnson & Zhang, 2018; Gao et al., 2019; Arbel et al., 2019; Song & Ermon, 2019; 2020; Song et al., 2021). Here we clarify the main differences between our Schrödinger Bridge approach and the above-mentioned particle methods. The proposals in (Johnson & Zhang, 2018; Gao et al., 2019; Arbel et al., 2019) are derived based on the surrogate of the geodesic interpolation (Gao et al., 2021; Liutkus et al., 2019; Song & Ermon, 2019). They utilize the invariant measure of SDEs to model the generative task, resulting in an iteration scheme that looks similar to our Schrödinger Bridge.


However, the main difference is that the drift terms of the Langevin SDEs in (Song & Ermon, 2019; 2020; Block et al., 2020) are time-invariant, in contrast to the time-varying drift term in our formulation. As shown in Theorem 3, the benefit of the time-varying drift term is essential: the SDE of the Schrödinger Bridge runs on a unit time interval [0, 1] and recovers the target distribution at the terminal time. In contrast, the evolution measures of the above-mentioned methods (Gao et al., 2019; Arbel et al., 2019; Song & Ermon, 2019; 2020; Block et al., 2020; Gao et al., 2021) only converge to the target when time goes to infinity. Hence, some technical requirements are imposed on the target distribution, such as log-concavity or the log-Sobolev inequality, to guarantee the consistency of the Euler-Maruyama discretization. These assumptions may often be too strong to hold in real data analysis. We propose a two-stage approach to make the Schrödinger Bridge formulation work in practice: we drive the Dirac distribution to a smoothed version of the underlying distribution pdata in stage 1 and then learn pdata from the smoothed version in stage 2. Interestingly, the second stage of the proposed Schrödinger Bridge algorithm recovers the reverse-time Variance Exploding SDE algorithm (VE SDE) (Song et al., 2021) when their annealing scheme is linear, i.e., σ²(t) = σ² · t. Therefore, the analysis developed here also provides a theoretical justification of why the reverse-time VE SDE algorithm works well. Their original setting, however, is σ²(t) = (σ²_max)^t · (σ²_min)^{1−t}. This implies that the end-time distribution of the reverse-time VE SDE is still a smoothed one (with noise level σ_min), which creates a barrier to establishing consistency. Another fundamental difference between our approach and the reverse-time VE SDE is that the reverse-time VE SDE also needs a smoothed distribution as its input in theory, but in practice it only approximately uses large Gaussian noise as the initialization of the denoising process. Stage 1 ensures that our algorithm learns samples from the smoothed data distribution in unit time, which is necessary for model consistency.

5. Experiments

In this section, we first employ two-dimensional toy examples to show the ability of our algorithm to learn multimodal distributions, which may not satisfy the log-Sobolev inequality. Next, we show that our algorithm is able to generate realistic image samples. We also demonstrate the effectiveness of our approach via image interpolation and image inpainting. We use two benchmark datasets, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). For CelebA, the images are center-cropped and resized to 64 × 64. Both datasets are normalized by first rescaling the pixel values to [0, 1] and then subtracting a mean vector x̄ estimated using 50,000 samples to center the data distributions at the origin. In our algorithm, the particles start from δ0. To improve the performance, it is helpful to align the sample mean with the origin. After generation, we add the image mean x̄ back to the generated samples. More details on the hyperparameter settings and network architectures, and some additional experiments, are provided in the supplementary material. The code for reproducing all our experiments is available at https://github.com/YangLabHKUST/DGLSB.

5.1. Setup

For the noise level σ, we set σ = 1.0 for the generative tasks, including both the 2D example and CIFAR-10. In fact, the performance of our algorithm is insensitive to the choice of σ when σ lies in a reasonable range (results with other σ values are shown in the supplementary material). We find that the performance of our algorithm is often among the best when setting σ = 1.0 for 32 × 32 images. The reason is that a very small σ cannot make qσ smooth enough and harms the performance of stage 1, while a very large σ makes it more difficult for stage 2 to anneal the noise level down. For larger images like CelebA, as the dimensionality of the samples is higher, we increase the noise level to σ = 2.0. We also compare the results obtained by varying the variance of the Wiener measure τ for image generation. The numbers of grid points are chosen as N1 = N2 = 1,000 for stage 1 and stage 2. We use sample size N3 = 1 to estimate the drift term in stage 1 for both the 2D toy examples and real images. In general, we find that a larger sample size N3 does not significantly improve sample quality.

5.2. Learning 2D Multimodal Distributions

We demonstrate that our algorithm can effectively learn multimodal distributions. The target we adopt is a mixture of Gaussians with 6 components. Each component has a mean at distance 5.0 from the origin and a variance of 0.01, as shown in Fig. 2(a). The components are relatively far away from each other. Learning this multimodal distribution is very challenging for GANs because it may not satisfy the log-Sobolev inequality. Fig. 2(b) shows the failure of the vanilla GAN, where several modes are missed. In contrast, Fig. 2(c) and 2(d) show that our algorithm is able to stably generate samples from the multimodal distribution without ignoring any of the modes. In Fig. 3, we compare the ground-truth velocity fields induced by the drift terms D1(t, x) and D2(t, x) with the estimated velocity fields at the end of each stage. Our estimated drift terms are close to the ground truth except in regions with nearly zero probability density.
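For reference, here is a minimal sketch of how such a six-component toy target can be constructed and sampled; the means and variance follow the description above, while the equal angular spacing of the components on a circle is our reading of Fig. 2(a) and the function name is illustrative.

import numpy as np

def sample_toy_mixture(n, radius=5.0, var=0.01, n_modes=6, seed=0):
    # Mixture of 6 Gaussians: means at distance `radius` from the origin
    # (placed uniformly on a circle), each with isotropic variance `var`.
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(n_modes) / n_modes
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    idx = rng.integers(n_modes, size=n)
    return means[idx] + np.sqrt(var) * rng.standard_normal((n, 2))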

5.3. Effectiveness of Two Stages for Image Generation

Fig. 4 shows the particle evolution on CIFAR-10 in our algorithm, where the two stages are annotated with the corresponding colors. It shows that our two-stage approach


Figure 2. KDE plots for the mixture of Gaussians with 5,000 samples. (a) Ground truth. (b) Distribution learned by the vanilla GAN. (c) Distribution learned by the proposed method after stage 1 (τ = 5.0). (d) Distribution learned by the proposed method after stage 2.

Figure 3. Velocity fields. (a) and (b): Ground-truth velocity fields at the end of stages 1 and 2. (c) and (d): Estimated velocity fields at the end of stages 1 and 2.

provides a valid path for the particles to move from the origin to the target distribution. A natural question is: what are the roles of stage 1 and stage 2 in the generative modeling, respectively? In this subsection, we design experiments to answer this question.

Figure 4. Particle evolution on CIFAR-10. The column in the center indicates particles obtained after stage 1.

We first evaluate the role of stage 1. For this purpose, we skip stage 1 and simply run stage 2 using non-informative Gaussian noise as the initial condition. Fig. 5 shows that the approach using only stage 2 generates worse image samples than the proposed two-stage approach. These results indicate that the role of stage 1 is to provide a better initial reference for stage 2. The role of stage 2 is easier to check: it is a Schrödinger Bridge from qσ(x) to the target distribution pdata(x). In Fig. 6, we perturb real images with Gaussian noise of variance σ² = 1.0. Our stage 2 anneals the noise level to zero and drives the particles to the data distribution. Moreover, Fig. 6 also indicates that stage 2 not only recovers the original images, but also generates images with a certain degree of diversity.

Figure 5. Comparison with random image samples. (a) Samples produced by our algorithm with τ = 2.0 (FID = 12.32). (b), (c), (d) Samples produced by stage 2 taking Gaussian noise with variance 1.0 (FID = 32.60), 1.5 (FID = 24.76), and 2.0 (FID = 51.21) as input, respectively.

Figure 6. Denoising with stage 2 for perturbed real images.

5.4. Results

In this subsection, we evaluate our proposed approach on benchmark datasets. Fig. 7 presents the generated samples of our algorithm on CIFAR-10 and CelebA. Visually, our algorithm produces high-fidelity image samples that are competitive with real images. For quantitative evaluation, we employ the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to compare our method with other benchmark methods.

We first compare the FID and IS on the CIFAR-10 dataset, with τ increasing from 1.0 to 4.0, using 50,000 generated samples. Note that τ is the variance of the prior Wiener measure in stage 1, so it controls the behavior of the particle evolution from δ0 to qσ and has an impact on the numerical results. To make the prior reasonable, we let τ_min = σ² = 1.0. The reason is that, if the particles strictly followed the prior law of the Brownian diffusion with variance τ in stage 1, the end-time marginal would be N(0, τI). A good choice of the prior should make N(0, τI) close to the end-time marginal qσ that we are interested in. As shown in Table 1, our


Table 1. FID and Inception Score on CIFAR-10 for τ ∈ [1, 4].

τ     1.0    1.5    2.0    2.5    3.0    3.5    4.0
FID   37.20  20.49  12.32  12.90  13.97  14.49  14.67
IS    6.52   7.65   8.14   7.99   7.98   8.03   8.10

Table 2. FID and Inception Scores on CIFAR-10.

MODELS          FID     IS
WGAN-GP         36.4    7.86±0.07
SN-SMMDGAN      25.0    7.3±0.1
SNGAN           21.7    8.22±0.05
NCSN            25.32   8.87±0.12
NCSNv2          10.87   8.40±0.07
OURS            12.32   8.14±0.07

algorithm achieves the best performance at τ = 2.0. The results also indicate that our algorithm is stable with respect to the variance of the prior Wiener measure τ when τ ≥ 2.0. In general, reasonable choices of τ result in relatively good generation performance.

Table 2 presents the FID and IS of our algorithm evaluated with 50,000 samples, together with other state-of-the-art generative models, including WGAN-GP (Gulrajani et al., 2017), SN-SMMDGAN (Arbel et al., 2018), SNGAN (Miyato et al., 2018), NCSN (Song & Ermon, 2019) and NCSNv2 (Song & Ermon, 2020), on CIFAR-10. Our algorithm attains an FID of 12.32 and an Inception Score of 8.14, which are competitive with the baseline methods. These quantitative results demonstrate the effectiveness of our algorithm.

Figure 7. Random samples on CIFAR-10 (σ = 1.0, τ = 2.0) and CelebA (σ = 2.0, τ = 8.0).

5.5. Image Interpolation and Inpainting with Stage 2

To demonstrate the usefulness of the proposed algorithm, we consider image interpolation and inpainting tasks.

Interpolating images linearly in the data distribution pdata would induce artifacts. However, if we perturb the linear interpolation with Gaussian noise of variance σ² and then use our stage 2 to denoise, we are able to obtain an interpolation without such artifacts. We find σ² = 0.4 suitable for the image interpolation task on CelebA. Fig. 8 shows the image interpolation results. Our algorithm produces smooth image interpolations by gradually changing facial attributes.
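A minimal sketch of this interpolation procedure follows (our illustration, not the released code), writing out the stage-2 loop directly; score_fn and the endpoint images x_a, x_b are placeholders.

import torch

def interpolate_with_stage2(x_a, x_b, score_fn, weights, sigma2=0.4, n2=1000):
    # Perturb each linear interpolation with N(0, sigma2 I) noise, then run the
    # stage-2 dynamics so that the output lies close to the data distribution.
    sigma = sigma2 ** 0.5
    dt = 1.0 / n2
    outputs = []
    for w in weights:
        x = (1.0 - w) * x_a + w * x_b + sigma * torch.randn_like(x_a)
        for k in range(n2):
            level = (1.0 - k / n2) ** 0.5 * sigma
            x = x + sigma ** 2 * dt * score_fn(x, level) + sigma * dt ** 0.5 * torch.randn_like(x)
        outputs.append(x)
    return outputs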

Figure 8. Image interpolation on CelebA. The first and last columns correspond to real images.

The second stage can also be utilized for image inpainting with a little modification, inspired by the image inpainting algorithm with annealed Langevin dynamics in (Song & Ermon, 2019). Let m be a mask with entries in {0, 1}, where 0 corresponds to missing pixels. The idea for inpainting is very similar to interpolation. We treat x ⊙ m + σε as a sample from qσ, where ε ∼ N(0, I). Thus, we can use stage 2 to obtain samples from pdata. The image inpainting procedure is given in Algorithm 2, and the results are presented in Fig. 9. Notice that we perturb y with √(1 − (k+1)/N2) σz at the end of each iteration. This is because the k-th iteration in stage 2 can be regarded as a one-step Schrödinger Bridge from q_{√(1−k/N2) σ} to q_{√(1−(k+1)/N2) σ}. Thus, the particles are supposed to follow q_{√(1−(k+1)/N2) σ}(x) after the k-th iteration.

Figure 9. Image inpainting on CelebA. The leftmost column contains real images. Each occluded image is followed by three inpainting samples.


Algorithm 2 Inpainting with stage 2

Input: y = x ⊙ m, m
Sample z ~ N(0, I)
x_0 = y + σz
for k = 0 to N2 − 1 do
    Sample ε_k ~ N(0, I)
    b(x_k) = s_θ̂(x_k, √(1 − k/N2) σ)
    x_{k+1} = x_k + (σ²/N2) b(x_k) + (σ/√N2) ε_k
    x_{k+1} = x_{k+1} ⊙ (1 − m) + (y + √(1 − (k+1)/N2) σz) ⊙ m
end for
return x_{N2}

6. Conclusion

We propose to learn a generative model via entropy interpolation with a Schrödinger Bridge. At the population level, this entropy interpolation can be characterized via an SDE on [0, 1] with a time-varying drift term. We derive a two-stage Schrödinger Bridge algorithm by plugging the drift term, estimated by a deep score estimator and a deep density ratio estimator, into the Euler-Maruyama method. Under some smoothness assumptions on the target distribution, we prove the consistency of the proposed Schrödinger Bridge approach, guaranteeing that the learned distribution converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and demonstrate that the generative model via Schrödinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning.

7. Acknowledgement

We thank the reviewers for their valuable comments. This work is supported in part by the National Key Research and Development Program of China [grant 208AAA0101100], the National Science Foundation of China [grant 11871474], the research fund of KLATASDSMOE, the Hong Kong Research Grant Council [grants 16307818, 16301419, 16308120], the Guangdong-Hong Kong-Macao Joint Laboratory [grant 2020B1212030001], the Hong Kong Innovation and Technology Fund [PRP/029/19FX], the Hong Kong University of Science and Technology (HKUST) [startup grant R9405, Z0428 from the Big Data Institute] and the HKUST-WeBank Joint Lab project. The computational task for this work was partially performed using the X-GPU cluster supported by the RGC Collaborative Research Fund: C6021-19EF.

References

Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems, pp. 6701–6711, 2018.
Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, pp. 6481–6491, 2019.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In International Conference on Machine Learning, pp. 573–582, 2019.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
Block, A., Mroueh, Y., and Rakhlin, A. Generative modeling with denoising auto-encoders and Langevin sampling. arXiv preprint arXiv:2002.00107, 2020.
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. In International Conference on Learning Representations, 2017.
Chen, Y., Georgiou, T. T., and Pavon, M. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge. arXiv preprint arXiv:2005.10963, 2020.
Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197, 2020.
Dai Pra, P. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329, 1991.
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In International Conference on Learning Representations, 2015.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.
Gao, Y., Jiao, Y., Wang, Y., Wang, Y., Yang, C., and Zhang, S. Deep generative learning via variational gradient flow. In International Conference on Machine Learning, pp. 2093–2101, 2019.
Gao, Y., Huang, J., Jiao, Y., Liu, J., Lu, X., and Yang, Z. Generative learning with Euler particle transport. In Annual Conference on Mathematical and Scientific Machine Learning, volume 145, pp. 1–33, 2021.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
Higham, D. J. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43(3):525–546, 2001.
Jamison, B. The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32(4):323–331, 1975.
Johnson, R. and Zhang, T. Composite functional gradient learning of generative adversarial models. In International Conference on Machine Learning, pp. 2371–2379, 2018.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Leonard, C. A survey of the Schrödinger problem and some of its connections with optimal transport. Dynamical Systems, 34(4):1533–1574, 2014.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision, pp. 3730–3738, 2015.
Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., Stöter, F.-R., Chaudhuri, K., and Salakhutdinov, R. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pp. 4104–4113, 2019.
Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In ICLR Workshop, 2016.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2335–2344, 2017.
Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3617–3621, 2019.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.
Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14837–14847, 2019.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.
Schrödinger, E. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. In Annales de l'institut Henri Poincaré, volume 2, pp. 269–310, 1932.
Shen, Y., Gu, J., Tang, X., and Zhou, B. Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9243–9252, 2020.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.
Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, 2020.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pp. 125–125, 2016.
Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363, 2019.
Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613, 2016.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.