Page 1: Sinkhorn AutoEncoders - the data. Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) belong to the former class, by learning

Sinkhorn AutoEncoders

Giorgio Patrini*• (UvA Bosch Delta Lab)
Samarth Bhargav (University of Amsterdam)
Rianne van den Berg*† (University of Amsterdam)
Max Welling (University of Amsterdam)
Patrick Forre (University of Amsterdam)
Tim Genewein‡ (Bosch Center for Artificial Intelligence)
Marcello Carioni (KFU Graz)
Frank Nielsen (Ecole Polytechnique, Sony CSL)


Optimal transport offers an alternative to maximum likelihood for learning generative autoencoding models. We show that minimizing the p-Wasserstein distance between the generator and the true data distribution is equivalent to the unconstrained min-min optimization of the p-Wasserstein distance between the encoder aggregated posterior and the prior in latent space, plus a reconstruction error. We also identify the role of its trade-off hyperparameter as the capacity of the generator: its Lipschitz constant. Moreover, we prove that optimizing the encoder over any class of universal approximators, such as deterministic neural networks, is enough to come arbitrarily close to the optimum. We therefore advertise this framework, which holds for any metric space and prior, as a sweet-spot of current generative autoencoding objectives.

We then introduce the Sinkhorn auto-encoder (SAE), which approximates and minimizes the p-Wasserstein distance in latent space via backpropagation through the Sinkhorn algorithm. SAE directly works on samples, i.e. it models the aggregated posterior as an implicit distribution, with no need for a reparameterization trick for gradient estimation. SAE is thus able to work with different metric spaces and priors with minimal adaptations.

We demonstrate the flexibility of SAE on latent spaces with different geometries and priors and compare with other methods on benchmark data sets.

*Equal contributions. •Now at Deeptrace. †Now at Google Brain. ‡Now at DeepMind.


Unsupervised learning aims at finding the underlying rules that govern a given data distribution $P_X$. It can be approached by learning to mimic the data generation process, or by finding an adequate representation of the data. Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) belong to the former class, by learning to transform noise into an implicit distribution that matches the given one. AutoEncoders (AE) (Hinton and Salakhutdinov, 2006) are of the latter type, by learning a representation that maximizes the mutual information between the data and its reconstruction, subject to an information bottleneck. Variational AutoEncoders (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) provide both a generative model, i.e. a prior distribution $P_Z$ on the latent space with a decoder $G(X|Z)$ that models the conditional likelihood, and an encoder $Q(Z|X)$ approximating the posterior distribution of the generative model. Optimizing the exact marginal likelihood is intractable in latent variable models such as VAEs. Instead one maximizes the Evidence Lower BOund (ELBO) as a surrogate. This objective trades off a reconstruction error of the input distribution $P_X$ and a regularization term that aims at minimizing the Kullback-Leibler (KL) divergence from the approximate posterior $Q(Z|X)$ to the prior $P_Z$.

An alternative principle for learning generative autoencoders comes from the theory of Optimal Transport (OT) (Villani, 2008), where the usual KL-divergence $\mathrm{KL}(P_X, P_G)$ is replaced by OT-cost divergences $W_c(P_X, P_G)$, among which the p-Wasserstein distances $W_p$ are proper metrics. In Tolstikhin et al. (2018) and Bousquet et al. (2017) it was shown that the objective $W_c(P_X, P_G)$ can be re-written as the minimization of the reconstruction error of the input $P_X$ over all probabilistic encoders


$Q(Z|X)$ constrained to the condition of matching the aggregated posterior $Q_Z$, the average (approximate) posterior $\mathbb{E}_{P_X}[Q(Z|X)]$, to the prior $P_Z$ in the latent space. In Wasserstein AutoEncoders (WAE) (Tolstikhin et al., 2018), it was suggested, following the standard optimization principles, to softly enforce that constraint via a penalization term depending on a choice of a divergence $D(Q_Z, P_Z)$ in latent space. For any such choice of divergence this leads to the minimization of a lower bound of the original objective, leaving the question about the status of the original objective open. Nonetheless, WAE empirically improves upon VAE for the two choices made there, namely either a Maximum Mean Discrepancy (MMD) (Gretton et al., 2007; Sriperumbudur et al., 2010, 2011), or an adversarial loss (GAN), again both in latent space.

We contribute to the formal analysis of autoencoders with OT. First, using the Monge-Kantorovich equivalence (Villani, 2008), we show that (in non-degenerate cases) the objective $W_c(P_X, P_G)$ can be reduced to the minimization of the reconstruction error of $P_X$ over any class containing the class of all deterministic encoders $Q(Z|X)$, again constrained to $Q_Z = P_Z$. Second, when restricted to the p-Wasserstein distance $W_p(P_X, P_G)$, and by using a combination of the triangle inequality and a form of data processing inequality for the generator $G$, we show that the soft and unconstrained minimization of the reconstruction error of $P_X$ together with the penalization term $\gamma \cdot W_p(Q_Z, P_Z)$ is actually an upper bound to the original objective $W_p(P_X, P_G)$, where the regularization/trade-off hyperparameter $\gamma$ needs to match at least the capacity of the generator $G$, i.e. its Lipschitz constant. This suggests that using a p-Wasserstein metric $W_p(Q_Z, P_Z)$ in latent space in the WAE setting (Tolstikhin et al., 2018) is a preferable choice.

Third, we show that the minimum of that objective can be approximated from above by any class of universal approximators for $Q(Z|X)$ to arbitrarily small error. In case we choose the $L^p$-norms $\|\cdot\|_p$ and corresponding p-Wasserstein distances $W_p$, one can use the results of Hornik (1991) to show that any class of probabilistic encoders $Q(Z|X)$ that contains the class of deterministic neural networks has all those desired properties. This justifies the use of such classes in practice. Note that analogous results for the latter for other divergences and function classes are unknown.

Fourth, as a corollary we get the folklore claim that matching the aggregated posterior $Q_Z$ and prior $P_Z$ is a necessary condition for learning the true data distribution $P_X$ in rigorous mathematical terms. Any deviation will thus be punished with poorer performance. Altogether, we have addressed and answered the open questions in (Tolstikhin et al., 2018; Bousquet et al., 2017) in detail and highlighted the sweet-spot framework for generative autoencoder models based on Optimal Transport (OT) for any metric space and any prior distribution $P_Z$, with special emphasis on Euclidean spaces, $L^p$-norms and neural networks.

The theory supports practical innovations. We are now in a position to learn deterministic autoencoders, $Q(Z|X)$, $G(X|Z)$, by minimizing a reconstruction error for $P_X$ and the p-Wasserstein distance on the latent space between samples of the aggregated posterior and the prior $W_p(Q_Z, P_Z)$. The computation of the latter is known to be difficult and costly (cf. the Hungarian algorithm (Kuhn, 1955)). A fast approximate solution is provided by the Sinkhorn algorithm (Cuturi, 2013), which uses an entropic relaxation. We follow Frogner et al. (2015) and Genevay et al. (2018) by exploiting the differentiability of the Sinkhorn iterations, and unroll them for backpropagation. In addition we correct for the entropic bias of the Sinkhorn algorithm (Genevay et al., 2018; Feydy et al., 2018). Altogether, we call our method the Sinkhorn AutoEncoder (SAE).

The Sinkhorn AutoEncoder is agnostic to the analytical form of the prior, as it optimizes a sample-based cost function which is aware of the geometry of the latent space. Furthermore, as a byproduct of using deterministic networks, it models the aggregated posterior as an implicit distribution (Mohamed and Lakshminarayanan, 2016) with no need of the reparametrization trick for learning the encoder (Kingma and Welling, 2013). Therefore, with essentially no change in the algorithm, we can learn models with normally distributed priors and aggregated posteriors, as well as distributions living on manifolds such as hyperspheres (Davidson et al., 2018) and probability simplices.

In our experiments we explore how well the Sinkhorn AutoEncoder performs on the benchmark datasets MNIST and CelebA with different prior distributions $P_Z$ and geometries in latent space, e.g. the Gaussian in Euclidean space or the uniform distribution on a hypersphere. Furthermore, we compare the SAE to the VAE (Kingma and Welling, 2013), to the WAE-MMD (Tolstikhin et al., 2018) and other methods of approximating the p-Wasserstein distance in latent space like the Hungarian algorithm (Kuhn, 1955)


and the Sliced Wasserstein AutoEncoder (Kolouri et al., 2018). We also explore the idea of matching the aggregated posterior $Q_Z$ to a standard Gaussian prior $P_Z$ via the fact that the 2-Wasserstein distance has a closed form for Gaussian distributions: we estimate the mean and covariance of $Q_Z$ on minibatches and use the loss $W_2(Q_Z, P_Z)$ for backpropagation. Finally, we train SAE on MNIST with a probability simplex as a latent space and visualize the matching of the aggregate posterior and the prior.



We follow Tolstikhin et al. (2018) and denote with $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ the sample spaces and with $X, Y, Z$ and $P_X, P_Y, P_Z$ the corresponding random variables and distributions. Given a map $F: \mathcal{X} \to \mathcal{Y}$ we denote by $F_\#$ the push-forward map acting on a distribution $P$ as $P \circ F^{-1}$. If $F(Y|X)$ is non-deterministic we define the push-forward $F(Y|X)_\# P_X$ of a distribution $P_X$ as the induced marginal of the joint distribution $F(Y|X) P_X$. For any measurable non-negative cost $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+ \cup \{\infty\}$, one can define the following OT-cost between distributions $P_X$ and $P_Y$ via:

$$W_c(P_X, P_Y) = \inf_{\Gamma \in \Pi(P_X, P_Y)} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X, Y)], \tag{1}$$

where $\Pi(P_X, P_Y)$ is the set of all joint distributions that have $P_X$ and $P_Y$ as the marginals. The elements from $\Pi(P_X, P_Y)$ are called couplings from $P_X$ to $P_Y$. If $c(x, y) = d(x, y)^p$ for a metric $d$ and $p \ge 1$ then $W_p := \sqrt[p]{W_c}$ is called the p-Wasserstein distance.

Let $P_X$ denote the true data distribution on $\mathcal{X}$. We define a latent variable model as follows: we fix a latent space $\mathcal{Z}$ and a prior distribution $P_Z$ on $\mathcal{Z}$ and consider the conditional distribution $G(X|Z)$ (the decoder) parameterized by a neural network $G$. Together they specify a generative model as $G(X|Z) P_Z$. The induced marginal will be denoted by $P_G$. Learning $P_G$ such that it approximates the true $P_X$ is then defined as:

$$\min_G W_c(P_X, P_G). \tag{2}$$

Because of the infimum over $\Pi(P_X, P_G)$ inside $W_c$, this is intractable. To rewrite this objective we consider the posterior distribution $Q(Z|X)$ (the encoder) and its aggregated posterior $Q_Z$:

$$Q_Z = Q(Z|X)_\# P_X = \mathbb{E}_{X \sim P_X} Q(Z|X), \tag{3}$$

the induced marginal of the joint $Q(Z|X) P_X$.


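Tolstikhin et al. (2018) show that if the decoder $G(X|Z)$ is deterministic, i.e. $P_G = G_\# P_Z$, or in other words, if all stochasticity of the generative model is captured by $P_Z$, then:

$$W_c(P_X, P_G) = \inf_{Q(Z|X):\, Q_Z = P_Z} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))]. \tag{4}$$

Learning the generative model $G$ with the WAE amounts to the objective:

$$\min_G \min_{Q(Z|X)} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))] + \beta \cdot D(Q_Z, P_Z), \tag{5}$$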

where $\beta > 0$ is a Lagrange multiplier and $D$ is any divergence measure on probability distributions on $\mathcal{Z}$. The specific choice for $D$ is left open. WAE uses either MMD (Gretton et al., 2012) or a discriminator trained adversarially for $D$. As discussed in Bousquet et al. (2017), Eq. 5 is a lower bound of Eq. 4 for any choice of $D$ and any value of $\beta > 0$. Minimizing this lower bound does not ensure a minimization of the original objective of Eq. 4.


We improve upon the analysis of Tolstikhin et al. (2018) of generative autoencoders in the framework of Optimal Transport in several ways. Our contributions can be summarized by the following theorem, upon which we will comment directly after.

Theorem 2.1. Let $\mathcal{X}$, $\mathcal{Z}$ be endowed with any metrics and $p \ge 1$. Let $P_X$ be a non-atomic distribution¹ and $G(X|Z)$ be a deterministic generator/decoder that is $\gamma$-Lipschitz. Then we have the equality:

$$W_p(P_X, P_G) = \inf_{Q \in \mathcal{F}} \sqrt[p]{\mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}\left[d(X, G(Z))^p\right]} + \gamma \cdot W_p(Q_Z, P_Z), \tag{6}$$

where $\mathcal{F}$ is any class of probabilistic encoders that at least contains a class of universal approximators. If $\mathcal{X}$, $\mathcal{Z}$ are Euclidean spaces endowed with the $L^p$-norms $\|\cdot\|_p$ then a valid minimal choice for $\mathcal{F}$ is the class of all deterministic neural network encoders $Q$ (here written as a function), for which the objective reduces to:

$$W_p(P_X, P_G) = \inf_{Q \in \mathcal{F}} \sqrt[p]{\mathbb{E}_{X \sim P_X}\left[\|X - G(Q(X))\|_p^p\right]} + \gamma \cdot W_p(Q_Z, P_Z).$$

¹ A probability measure is non-atomic if every point in its support has zero measure. It is important to distinguish between the empirical data distribution $\hat{P}_X$, which is always atomic, and the underlying true distribution $P_X$, which is the only one we need to assume to be non-atomic.

The proof of Theorem 2.1 can be found in Appendix A, B and C. It uses the following three arguments:

i.) It is the Monge-Kantorovich equivalence (Villani, 2008) for non-atomic $P_X$ that allows us to restrict to deterministic encoders $Q(Z|X)$. This is a first theoretical improvement over Eq. 4 from Tolstikhin et al. (2018).

ii.) The upper bound can be achieved by a simple triangle inequality:

$$W_p(P_X, P_G) \le W_p(P_X, \tilde{P}_X) + W_p(\tilde{P}_X, P_G),$$

where $\tilde{P}_X := G_\# Q(Z|X)_\# P_X = G_\# Q_Z$ is the reconstruction of $P_X$. Note that the triangle inequality is not available for other general cost functions or divergences. This might be a reason for the difficulty of getting upper bounds in such settings. On the other hand, if a divergence satisfies the triangle inequality then one can use the same argument to arrive at new variational optimization objectives and principles.

iii.) We then prove the data processing inequality for the $W_p$-distance:

$$W_p(G_\# Q_Z, G_\# P_Z) \le \gamma \cdot W_p(Q_Z, P_Z),$$

with any $\gamma \ge \|G\|_{\mathrm{Lip}}$, the Lipschitz constant of $G$. Such an inequality is available and known for several other divergences, usually with $\gamma = 1$.

Putting all three pieces together we immediately arrive at the equality (upper and lower bound) of the first part of Theorem 2.1. This insight directly suggests that using the divergence $W_p(Q_Z, P_Z)$ in latent space with a hyperparameter $\gamma \ge \|G\|_{\mathrm{Lip}}$ in the WAE setting is a preferable choice. These are two further improvements over Tolstikhin et al. (2018). Note that if $G$ is a neural network with activation function $g$ with $\|g'\|_\infty \le 1$ (e.g. ReLU, sigmoid, tanh, etc.) and weight matrices $(B_\ell)_{\ell=1,\dots,L}$, then $G$ is $\gamma$-Lipschitz for any $\gamma \ge \|B_1\|_p \cdots \|B_L\|_p$, where the latter is the product of the $L^p$-matrix norms (cf. Balan et al. (2017)).

iv.) For the second part of Theorem 2.1 we use the universal approximator property of neural networks (Hornik, 1991) and the compatibility of the $L^p$-norm $\|\cdot\|_p$ with the p-Wasserstein distance $W_p$. Proving such statements for other divergences seems to require much more effort (if possible at all).
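The upper-bound direction of Theorem 2.1 can be assembled from arguments ii.) and iii.) in a few lines. The following is only a sketch under the assumptions of the theorem (deterministic, $\gamma$-Lipschitz $G$); here $\tilde{P}_X = G_\# Q_Z$ denotes the reconstruction of $P_X$ and $\Gamma$ an arbitrary coupling of $Q_Z$ and $P_Z$. The complete argument, including the lower bound, is the one given in the appendices.

```latex
\begin{align*}
W_p(P_X, P_G)
  &\le W_p(P_X, \tilde{P}_X) + W_p(\tilde{P}_X, P_G)
     && \text{(triangle inequality)} \\
  &= W_p(P_X, \tilde{P}_X) + W_p(G_\# Q_Z, G_\# P_Z)
     && \text{(} \tilde{P}_X = G_\# Q_Z,\ P_G = G_\# P_Z \text{)} \\
  &\le \sqrt[p]{\mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}
        \left[ d(X, G(Z))^p \right]}
     + \gamma \cdot W_p(Q_Z, P_Z).
\end{align*}
% First term: the joint law of (X, G(Z)) is one particular coupling of
% P_X and \tilde{P}_X, so its cost upper-bounds the infimum over couplings.
% Second term: for any coupling \Gamma of (Q_Z, P_Z), the push-forward
% (G \times G)_\# \Gamma couples (G_\# Q_Z, G_\# P_Z), and
% d(G(z), G(z'))^p \le \gamma^p \, d(z, z')^p; take the infimum over \Gamma.
```

Since the bound holds for every encoder $Q$, it also holds for the infimum over $Q \in \mathcal{F}$.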

When the encoders are restricted to be neural networks of limited capacity, e.g. if their architecture is fixed, then enforcing $Q_Z \approx P_Z$ might not be feasible in the general case of dimensionality mismatch between $\mathcal{X}$ and $\mathcal{Z}$ (Rubenstein et al., 2018). In fact, since the class of deterministic neural networks (of limited capacity) is much smaller than the class of deterministic measurable maps, one might consider adding noise to the output, i.e. use stochastic networks instead. Nonetheless, neural networks can approximate any measurable map up to arbitrarily small error (Hornik, 1991). Furthermore, in practice the encoder $Q(Z|X)$ maps from the high dimensional data space $\mathcal{X}$ to the much lower dimensional latent space $\mathcal{Z}$, suggesting that the task of matching distributions in the lower dimensional latent space $\mathcal{Z}$ should be feasible. Also, in view of Theorem 2.1 it follows that learning deterministic autoencoders is sufficient to approach the theoretical bound and thus will be our empirical choice.

Theorem 2.1 certifies that failing to match aggregated posterior and prior makes learning the data distribution impossible. Matching in latent space should be seen as being as fundamental as minimizing the reconstruction error, a fact known about the performance of VAE (Hoffman and Johnson, 2016; Higgins et al., 2017; Alemi et al., 2018; Rosca et al., 2018). This necessary condition for learning the data distribution turns out to be also sufficient, assuming that the set of encoders is expressive enough to nullify the reconstruction error.

With the help of Theorem 2.1 we arrive at the following unconstrained min-min optimization objective over deterministic decoder and encoder neural networks ($Q$ written as a function here):

$$\min_G \min_Q \sqrt[p]{\mathbb{E}_{X \sim P_X}\left[\|X - G(Q(X))\|_p^p\right]} + \gamma \cdot W_p(Q_Z, P_Z),$$

with $\gamma \ge \|G\|_{\mathrm{Lip}}$ for all occurring $G$.
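The Lipschitz bound on the decoder can be checked numerically on a toy network. The sketch below is an illustration only: it assumes $p = 1$, for which the induced matrix norm is the maximum absolute column sum, and a bias-free ReLU network; the function names and the network shape are hypothetical and not taken from the paper's implementation.

```python
import random


def induced_1_norm(B):
    """Induced L1 matrix norm: maximum absolute column sum of B."""
    return max(sum(abs(row[j]) for row in B) for j in range(len(B[0])))


def lipschitz_upper_bound(weights):
    """Product of the induced L1 norms of all weight matrices."""
    bound = 1.0
    for B in weights:
        bound *= induced_1_norm(B)
    return bound


def relu_network(weights, x):
    """Bias-free ReLU network; ReLU follows every layer except the last."""
    h = list(x)
    for k, B in enumerate(weights):
        h = [sum(w * a for w, a in zip(row, h)) for row in B]
        if k < len(weights) - 1:
            h = [max(0.0, a) for a in h]
    return h


def l1_distance(x, y):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy decoder from a 3-dimensional latent space to a 2-dimensional output.
rng = random.Random(0)
weights = [
    [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)],  # 4 x 3
    [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)],  # 2 x 4
]
gamma = lipschitz_upper_bound(weights)
```

Any value at least as large as `gamma` is then a valid trade-off hyperparameter for this toy decoder in the sense of Theorem 2.1; for $p = 2$ one would replace the column-sum norm with the spectral norm.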



Even though the theory supports the use of the p-Wasserstein distance $W_p(Q_Z, P_Z)$ in latent space, it is notoriously hard to compute or estimate. In practice, we will need to approximate $W_p(Q_Z, P_Z)$ via samples from $Q_Z$ (and $P_Z$). The sample version $W_p(\hat{Q}_Z, \hat{P}_Z)$ with $\hat{P}_Z = \frac{1}{M} \sum_{m=1}^{M} \delta_{z_m}$ and $\hat{Q}_Z = \frac{1}{M} \sum_{m=1}^{M} \delta_{\tilde{z}_m}$ has an exact solution, which can be computed using the Hungarian algorithm (Kuhn, 1955) in near $O(M^3)$ time (time complexity). Furthermore, $W_p(\hat{Q}_Z, \hat{P}_Z)$ will differ from $W_p(Q_Z, P_Z)$ by about $O(M^{-1/k})$ (sample complexity), where $k$ is the dimension of $\mathcal{Z}$ (Weed and Bach, 2017). Both complexity measures are unsatisfying in practice, but they can be improved via entropy regularization (Cuturi, 2013), which we will explain next.

Following Genevay et al. (2018, 2019); Feydy et al. (2018) we define the entropy regularized OT cost with $\varepsilon \ge 0$:

$$\tilde{S}_{c,\varepsilon}(P_X, P_Y) := \inf_{\Gamma \in \Pi(P_X, P_Y)} \mathbb{E}_{(X,Y) \sim \Gamma}[c(X, Y)] + \varepsilon \cdot \mathrm{KL}(\Gamma, P_X \otimes P_Y). \tag{7}$$

This is in general not a divergence due to its entropic bias. When we remove this bias we arrive at the Sinkhorn divergence:

$$S_{c,\varepsilon}(P_X, P_Y) := \tilde{S}_{c,\varepsilon}(P_X, P_Y) - \frac{1}{2}\left(\tilde{S}_{c,\varepsilon}(P_X, P_X) + \tilde{S}_{c,\varepsilon}(P_Y, P_Y)\right). \tag{8}$$

The Sinkhorn divergence has the following limiting behaviour:

$$S_{c,\varepsilon}(P_X, P_Y) \xrightarrow{\varepsilon \to 0} W_c(P_X, P_Y), \qquad S_{c,\varepsilon}(P_X, P_Y) \xrightarrow{\varepsilon \to \infty} \mathrm{MMD}_{-c}(P_X, P_Y).$$

This means that the Sinkhorn divergence $S_{c,\varepsilon}$ interpolates between OT-divergences and MMDs (Gretton et al., 2012). On the one hand, for small $\varepsilon$ it is known that $S_{c,\varepsilon}$ deviates from the initial objective $W_c$ by about $O(\varepsilon \log(1/\varepsilon))$ (Genevay et al., 2019). On the other hand, if $\varepsilon$ is big enough then $S_{c,\varepsilon}$ will have the more favourable sample complexity of $O(M^{-1/2})$ of MMDs, which is independent of the dimension, as was proven in Genevay et al. (2019). Furthermore, the Sinkhorn algorithm (Cuturi, 2013), which will be explained in section 3.3, allows for faster computation of the Sinkhorn divergence $S_{c,\varepsilon}$ with time complexity close to $O(M^2)$ (Altschuler et al., 2017). Therefore, if we balance $\varepsilon$ well, we are close to our original objective and at the same time have favourable computational and statistical properties.


Guided by the theoretical insights, we can restrict the WAE framework (Tolstikhin et al., 2018) to Sinkhorn divergences with cost $\tilde{c}$ in latent space and $c$ in data space to arrive at the objective:

$$\min_G \min_{Q(Z|X)} \mathbb{E}_{X \sim P_X} \mathbb{E}_{Z \sim Q(Z|X)}[c(X, G(Z))] + \beta \cdot S_{\tilde{c},\varepsilon}(Q_Z, P_Z), \tag{9}$$

with hyperparameters $\beta \ge 0$ and $\varepsilon \ge 0$.

Restricting further to p-Wasserstein distances, corresponding Sinkhorn divergences and deterministic en-/decoder neural networks, we arrive at the Sinkhorn AutoEncoder (SAE) objective:

$$\min_G \min_Q \sqrt[p]{\mathbb{E}_{X \sim P_X}\left[\|X - G(Q(X))\|_p^p\right]} + \gamma \cdot S_{\|\cdot\|_p^p,\varepsilon}(Q_Z, P_Z)^{\frac{1}{p}}, \tag{10}$$

which is then up to the $\varepsilon$-terms close to the original objective. Note that for computational reasons it is sometimes convenient to remove the p-th roots again. The inequality $\sqrt[p]{a} + \sqrt[p]{b} \le 2\sqrt[p]{a + b}$ shows that the additional loss is small, while still minimizing an upper bound (using $\beta := \gamma^p$).
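The elementary inequality used here, and the way it links the rooted objective to the root-free one, can be spelled out as follows. This is a short sketch assuming $a, b \ge 0$ and $p \ge 1$, with $a$ standing for the expected reconstruction cost and $S$ for the (non-negative) Sinkhorn divergence in latent space; the shorthand $a$, $b$, $S$ is introduced here only for illustration.

```latex
\begin{align*}
\sqrt[p]{a} &\le \sqrt[p]{a + b}, \qquad \sqrt[p]{b} \le \sqrt[p]{a + b}
  && \text{(} t \mapsto t^{1/p} \text{ is non-decreasing on } [0, \infty) \text{)} \\
\Longrightarrow \quad \sqrt[p]{a} + \sqrt[p]{b} &\le 2 \sqrt[p]{a + b}. \\[4pt]
\text{With } b = \gamma^p S: \quad
\sqrt[p]{a} + \gamma \cdot S^{1/p}
  &\le 2 \sqrt[p]{a + \beta \cdot S}, \qquad \beta := \gamma^p .
\end{align*}
```

Minimizing $a + \beta \cdot S$ therefore still minimizes an upper bound of the rooted objective, up to the constant factor 2 and the monotone p-th root.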


Now that we have the general Sinkhorn AutoEncoder optimization objective, we need to review how the Sinkhorn divergence $S_{\tilde{c},\varepsilon}(Q_Z, P_Z)$ can be estimated in practice by the Sinkhorn algorithm (Cuturi, 2013) using samples.

If we take $M$ samples each from $Q_Z$ and $P_Z$, we get the corresponding empirical (discrete) distributions concentrated on $M$ points: $\hat{P}_Z = \frac{1}{M} \sum_{m=1}^{M} \delta_{z_m}$ and $\hat{Q}_Z = \frac{1}{M} \sum_{m=1}^{M} \delta_{\tilde{z}_m}$. Then, the optimal coupling of the (empirical) entropy regularized OT-cost $\tilde{S}_{\tilde{c},\varepsilon}(\hat{Q}_Z, \hat{P}_Z)$ with $\varepsilon \ge 0$ is given by the matrix:

$$R^* := \arg\min_{R \in S_M} \frac{1}{M} \langle R, C \rangle_F - \varepsilon \cdot H(R), \tag{11}$$

where $C_{ij} = \tilde{c}(\tilde{z}_i, z_j)$ is the matrix associated to the cost $\tilde{c}$, $R$ is a doubly stochastic matrix as defined in $S_M = \{R \in \mathbb{R}^{M \times M}_{\ge 0} \mid R\mathbf{1} = \mathbf{1}, R^T\mathbf{1} = \mathbf{1}\}$, and $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product; $\mathbf{1}$ is the vector of ones and $H(R) = -\sum_{i,j=1}^{M} R_{i,j} \log R_{i,j}$ is the entropy of $R$.

Cuturi (2013) shows that the Sinkhorn Algorithm 1 (Sinkhorn, 1964) returns its $\varepsilon$-regularized optimum $R^*$ (see Eq. 11) in the limit $L \to \infty$, which is also unique due to strong convexity of the entropy. The Sinkhorn algorithm is a fixed point algorithm that is much faster than the Hungarian algorithm: it runs in nearly $O(M^2)$ time (Altschuler et al., 2017) and can be efficiently implemented with matrix multiplications; see Algorithm 1.

Algorithm 1 Sinkhorn
  Input: $\{\tilde{z}_i\}_{i=1}^{M} \sim Q_Z$, $\{z_j\}_{j=1}^{M} \sim P_Z$, $\varepsilon$, $L$
  $\forall i, j: C_{ij} = \tilde{c}(\tilde{z}_i, z_j)$
  $K = \exp(-C/\varepsilon)$, $u \leftarrow \mathbf{1}$    # elem-wise exp
  repeat until convergence, but at most $L$ times:
    $v \leftarrow \mathbf{1}/(K^\top u)$    # elem-wise division
    $u \leftarrow \mathbf{1}/(K v)$
  $R^* \leftarrow \mathrm{Diag}(u)\, K\, \mathrm{Diag}(v)$    # plus rounding step
  Output: $R^*$, $C$.

For better differentiability properties we deviate from Eq. 8 and use the unbiased sharp Sinkhorn loss (Luise et al., 2018; Genevay et al., 2018) by dropping the entropy terms (only) in the evaluations:

$$\bar{S}_{\tilde{c},\varepsilon}(\hat{Q}_Z, \hat{P}_Z) := \frac{1}{M} \langle R^*, C \rangle_F - \frac{1}{2M}\left(\langle R^*_{\hat{Q}_Z}, C_{\hat{Q}_Z} \rangle_F + \langle R^*_{\hat{P}_Z}, C_{\hat{P}_Z} \rangle_F\right), \tag{12}$$

where the indices $\hat{Q}_Z$, $\hat{P}_Z$ refer to Eq. 11 applied to the samples from $Q_Z$ in both arguments and then $P_Z$ in both arguments, respectively.

Since this only deviates from Eq. 8 in $\varepsilon$-terms we still have all the mentioned properties, e.g. that the optimum of this Sinkhorn distance approaches the optimum of the OT-cost with the stated rate (Genevay et al., 2018; Cominetti and San Martín, 1994; Weed, 2018). Furthermore, for numerical stability we use the Sinkhorn algorithm in log-space (Chizat et al., 2016; Schmitzer, 2016). In order to round the $R$ that results from a finite number $L$ of Sinkhorn iterations to a doubly stochastic matrix, we use the procedure described in Algorithm 2 of Altschuler et al. (2017).

The smaller the $\varepsilon$, the smaller the entropy and the better the approximation of the OT-cost. At the same time, a larger number of steps $O(L)$ is needed to converge, while the rate of convergence remains linear in $L$ (Genevay et al., 2018). Note that all Sinkhorn operations are differentiable. Therefore, when the distance is used as a cost function, we can unroll $O(L)$ iterations and backpropagate (Genevay et al., 2018). In conclusion, we obtain a differentiable surrogate for the OT-cost between empirical distributions; the approximation arises from sampling, entropy regularization and the finite amount of steps in place of convergence.
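The iterations of Algorithm 1 and the debiased loss of Eq. 12 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it uses the squared Euclidean cost, works outside log-space, skips the rounding step, always runs exactly `n_iters` iterations, and does not provide gradients. All function and parameter names are hypothetical.

```python
import math
import random


def cost_matrix(xs, ys):
    """Squared Euclidean cost C[i][j] = ||x_i - y_j||_2^2."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]


def sinkhorn_plan(C, eps, n_iters):
    """Run n_iters Sinkhorn iterations on K = exp(-C / eps).

    Returns R = Diag(u) K Diag(v). Rows sum to 1 after the final u update;
    columns sum to 1 only approximately unless the iterations have converged.
    """
    M = len(C)
    K = [[math.exp(-C[i][j] / eps) for j in range(M)] for i in range(M)]
    u = [1.0] * M
    v = [1.0] * M
    for _ in range(n_iters):
        # v <- 1 / (K^T u), element-wise division
        v = [1.0 / sum(K[i][j] * u[i] for i in range(M)) for j in range(M)]
        # u <- 1 / (K v), element-wise division
        u = [1.0 / sum(K[i][j] * v[j] for j in range(M)) for i in range(M)]
    return [[u[i] * K[i][j] * v[j] for j in range(M)] for i in range(M)]


def transport_cost(xs, ys, eps, n_iters):
    """(1/M) <R, C>_F for the Sinkhorn plan between two equal-size samples."""
    C = cost_matrix(xs, ys)
    R = sinkhorn_plan(C, eps, n_iters)
    M = len(C)
    return sum(R[i][j] * C[i][j] for i in range(M) for j in range(M)) / M


def sharp_sinkhorn_loss(zq, zp, eps=1.0, n_iters=200):
    """Cross term minus half of the two self terms, following Eq. 12."""
    cross = transport_cost(zq, zp, eps, n_iters)
    self_q = transport_cost(zq, zq, eps, n_iters)
    self_p = transport_cost(zp, zp, eps, n_iters)
    return cross - 0.5 * (self_q + self_p)


# Two small 2D samples standing in for encoder outputs and prior samples.
random.seed(0)
zq = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(8)]
zp = [[random.gauss(3.0, 1.0), random.gauss(3.0, 1.0)] for _ in range(8)]
loss = sharp_sinkhorn_loss(zq, zp)
```

For small `eps` the entries of `K` underflow, which is one reason a log-space variant is preferable in practice; in an autodiff framework the same loop can be unrolled to obtain gradients with respect to the encoder outputs.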

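Algorithm 2 SAE Training round
  Input: encoder weights $A$, decoder weights $B$, $\varepsilon$, $L$, $\beta$
  Minibatch: $x = \{x_i\}_{i=1}^{M} \sim P_X$, $z = \{z_j\}_{j=1}^{M} \sim P_Z$
  $\tilde{z} \leftarrow Q_A(x)$, $\tilde{x} \leftarrow G_B(\tilde{z})$
  $D = \frac{1}{M} \|x - \tilde{x}\|_p^p$
  $S = \mathrm{SinkhornLoss}(\tilde{z}, z, \varepsilon, L)$    # 3 x Alg. 1 + Eq. 12
  Update: $A, B$ with gradient $\nabla_{(A,B)}(D + \beta \cdot S)$.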


To train the Sinkhorn AutoEncoder with encoder $Q_A$, decoder $G_B$ and with weights $A$, $B$, resp., we sample minibatches $x = \{x_i\}_{i=1}^{M}$ from the data distribution $P_X$ and $z = \{z_i\}_{i=1}^{M}$ from the prior $P_Z$. After encoding $x$ into $\tilde{z} = Q_A(x)$ we then run the Sinkhorn Algorithm 1 three times (for the pairs $(\tilde{z}, z)$, $(\tilde{z}, \tilde{z})$ and $(z, z)$) to find the optimal couplings and then compute the unbiased Sinkhorn loss via Eq. 12. Note that the $L$ Sinkhorn steps in Algorithm 1 are differentiable. The weights can then be updated via (auto-)differentiation through the Sinkhorn steps (together with the gradient of the reconstruction loss). One training round is summarized in Algorithm 2. Small $\varepsilon$ and large $L$ worsen the numerical stability of the Sinkhorn algorithm. In most experiments, both $c$ and $\tilde{c}$ will be $\|\cdot\|_2^2$. Experimentally we found that the re-calculation of the three optimal couplings at each iteration is not a significant overhead.

SAE can in principle work with arbitrary priors. Theonly requirement coming from the Sinkhorn is theability to generate samples. The choice should bemotivated by the desired geometric properties of thelatent space.


The 2-Wasserstein distance $W_2(Q_Z, P_Z)$ has a closed form in Euclidean space if both $Q_Z$ and $P_Z$ are Gaussian (Peyré and Cuturi, 2018, Rem. 2.31):

$$W_2^2(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)) = \|\mu_1 - \mu_2\|_2^2 + \operatorname{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\right)^{1/2}\right), \tag{13}$$

which will further simplify if $P_Z$ is standard Gaussian. Even though the aggregated posterior $Q_Z$ might not be Gaussian we use the above formula for matching and backpropagation, by estimating $\mu_1$ and $\Sigma_1$ on minibatches of $Q_Z$ via the standard formulas: $\mu_1 := \frac{1}{M} \sum_{i=1}^{M} z_i$ and $\Sigma_1 := \frac{1}{M-1} \sum_{i=1}^{M} (z_i - \mu_1)(z_i - \mu_1)^T$.
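For a standard Gaussian prior the closed form reduces to a sum over the eigenvalues of the estimated covariance. The short derivation below is a sketch; the symbols $\lambda_i$ (eigenvalues of $\Sigma_1$) and $k$ (latent dimension) are notation introduced here for illustration.

```latex
\begin{align*}
W_2^2\big(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(0, I)\big)
  &= \|\mu_1\|_2^2 + \operatorname{tr}\big(\Sigma_1 + I - 2\,\Sigma_1^{1/2}\big)
     && \text{(} \mu_2 = 0,\ \Sigma_2 = I \text{)} \\
  &= \|\mu_1\|_2^2 + \sum_{i=1}^{k} \big(\lambda_i + 1 - 2\sqrt{\lambda_i}\big) \\
  &= \|\mu_1\|_2^2 + \sum_{i=1}^{k} \big(\sqrt{\lambda_i} - 1\big)^2 .
\end{align*}
```

The loss vanishes exactly when the minibatch mean is zero and all eigenvalues of the minibatch covariance equal one, i.e. when the first two moments of the aggregated posterior match those of the prior.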

We refer to this method as W2GAE (Wasserstein Gaussian AutoEncoder). We will compare this method against SAE and other baselines, as discussed next in the related work section.

Figure 1: a) Swiss Roll and its b) squared and c) spherical embeddings learned by Sinkhorn encoders. MNIST embedded onto a 10D sphere viewed through t-SNE, with classes by colours: d) encoder only or e) encoder + decoder.


The Gaussian prior is common in VAEs for reasons of tractability. In fact, changing the prior and/or the approximate posterior distributions requires the use of tractable densities and the appropriate reparametrization trick. A hyperspherical prior is used by Davidson et al. (2018) with improved experimental performance; the algorithm models a Von Mises-Fisher posterior, with a non-trivial posterior sampling procedure and a reparametrization trick based on rejection sampling. Our implicit encoder distribution sidesteps these difficulties. Recent advances on variable reparametrization can also simplify these requirements (Figurnov et al., 2018). We are not aware of methods embedding on probability simplices, except the use of Dirichlet priors by the same Figurnov et al. (2018).

Hoffman and Johnson (2016) showed that the objective of a VAE does not force the aggregated posterior and prior to match, and that the mutual information of input and codes may be minimized instead. Just like the WAE, SAE avoids this effect by construction. Makhzani et al. (2015) and WAE improve latent matching by GAN/MMD. With the same goal, Alemi et al. (2017) and Tomczak and Welling (2017) introduce learnable priors in the form of a mixture of posteriors, which can be used in SAE as well.

The Sinkhorn (1964) algorithm gained interest after Cuturi (2013) showed its application for fast computation of Wasserstein distances. The algorithm has been applied to ranking (Adams and Zemel, 2011), domain adaptation (Courty et al., 2014), multi-label classification (Frogner et al., 2015), metric learning (Huang et al., 2016) and ecological inference (Muzellec et al., 2017). Santa Cruz et al. (2017); Linderman et al. (2018) used it for supervised combinatorial losses. Our use of the Sinkhorn algorithm for generative modeling is akin to that of Genevay et al. (2018), which matches data and model samples with adversarial training, and to Ambrogioni et al. (2018), which matches samples from the model joint distribution and a variational joint approximation. WAE and WGAN objectives are linked respectively to primal and dual formulations of OT (Tolstikhin et al., 2018).

Our approach for training the encoder alone qualifies as self-supervised representation learning (Donahue et al., 2017; Noroozi and Favaro, 2016; Noroozi et al., 2017). As in noise-as-target (NAT) (Bojanowski and Joulin, 2017) and in contrast to most other methods, we can sample pseudo labels (from the prior) independently from the input. In Appendix D we show a formal connection with NAT.

Another way of estimating the 2-Wasserstein distance in Euclidean space is the Sliced Wasserstein AutoEncoder (SWAE) (Kolouri et al., 2018). The main idea is to sample one-dimensional lines in Euclidean space and exploit the explicit form of the 2-Wasserstein distance in terms of cumulative distribution functions in the one-dimensional setting. We will compare our methods to SWAE as well.
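The one-dimensional closed form behind the sliced approach can be sketched in a few lines of Python. This is an illustrative sketch, not the SWAE implementation of Kolouri et al. (2018): for two equal-size samples in one dimension, the optimal coupling simply matches sorted values, and the sliced estimate averages this quantity over random unit directions. Function names, the number of projections and the seed are assumptions made for the example.

```python
import math
import random


def wasserstein2_sq_1d(a, b):
    """Squared 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches the sorted samples.
    """
    return sum((x - y) ** 2 for x, y in zip(sorted(a), sorted(b))) / len(a)


def random_direction(dim, rng):
    """Uniform random unit vector, obtained by normalising a Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein2_sq(zq, zp, n_projections=50, seed=0):
    """Average of 1D squared 2-Wasserstein distances over random projections."""
    rng = random.Random(seed)
    dim = len(zq[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        proj_q = [sum(t * c for t, c in zip(theta, z)) for z in zq]
        proj_p = [sum(t * c for t, c in zip(theta, z)) for z in zp]
        total += wasserstein2_sq_1d(proj_q, proj_p)
    return total / n_projections


# Example: a 3D sample and a shifted copy of it.
sample_rng = random.Random(1)
zq = [[sample_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(16)]
zp = [[c + 2.0 for c in z] for z in zq]
sw = sliced_wasserstein2_sq(zq, zp)
```

Each projection costs only a sort, which is why the sliced estimate is cheap compared to solving the full matching problem in the latent space.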



We demonstrate qualitatively that the Sinkhorn distance is a valid objective for unsupervised feature learning by training the encoder in isolation. The task consists of embedding the input distribution in a lower dimensional space, while preserving the local data geometry and minimizing the loss function $L = \frac{1}{M} \langle R^*, C \rangle_F$, with $\tilde{c}(z, z') = \|z - z'\|_2^2$. Here $M$ is the minibatch size.

We display the representation of a 3D Swiss Roll and MNIST. For the Swiss Roll we set $\varepsilon = 10^{-3}$, while for MNIST it is set to 0.5, and $L$ is picked to ensure convergence. For the Swiss Roll (Figure 1a), we use a 50-50 fully connected network with ReLUs.



                                MNIST                       CelebA
method        prior  cost       β      MMD    RE     FID    β      MMD    RE     FID
VAE           N      KL         1      0.28   12.22  11.4   1      0.20   94.19  55
β-VAE         N      KL         0.1    2.20   11.76  50.0   0.1    0.21   67.80  65
WAE           N      MMD        100    0.50   7.07   24.4   2000*  0.21   65.45  58
SWAE          N      SW         100    0.32   7.46   18.8   100    0.21   65.28  64
W2GAE (ours)  N      W_2^2      1      0.67   7.04   30.5   1      0.20   65.55  58
HAE (ours)    N      Hungarian  100    5.79   11.84  16.8   100    32.09  84.51  293
SAE (ours)    N      Sinkhorn   100    5.34   12.81  17.2   100    4.82   90.54  187
HVAE†         H      KL         1      0.25   12.73  21.5   -      -      -      -
WAE           H      MMD        100    0.24   7.88   22.3   2000*  0.25   66.54  59
SWAE          H      SW         100    0.24   7.80   27.6   100    0.41   63.64  80
HAE (ours)    H      Hungarian  100    0.23   8.69   12.0   100    0.26   63.49  58
SAE (ours)    H      Sinkhorn   100    0.25   8.59   12.5   100    0.24   63.97  56

Table 1: Results of the autoencoding task. Top 3 results for the FID scores are indicated with boldface numbers. We compute MMD in latent space to evaluate the matching between the aggregated posterior and prior. MMD results are reported times $10^2$. Note that MMD scores are not comparable for different priors. For SAE and the Gaussian prior, we used $\varepsilon = 10$ as lower values led to numerical instabilities. For the hypersphere we set $\varepsilon = 0.1$. *The value of $\beta = 2000$ is similar to the value $\lambda = 100$ as used in (Tolstikhin et al., 2018), as a prefactor of 0.05 was used there for the reconstruction cost. †Comparing with Davidson et al. (2018) in high-dimensional latent spaces turned out to be unfeasible, due to CPU-based computations.

Figures 1b, 1c show that the local geometry of the Swiss Roll is conserved in the new representational spaces (a square and a sphere). Figure 1d shows the t-SNE visualization (Maaten and Hinton, 2008) of the learned representation of the MNIST test set. With neither labels nor reconstruction error, we learn an embedding that is aware of class-wise clusters. Minimization of the Sinkhorn distance achieves this by encoding onto a d-dimensional hypersphere with a uniform prior, such that points are encouraged to map far apart. A contractive force is present due to the inductive prior of neural networks, which are known to be Lipschitz functions. On the one hand, points in the latent space disperse in order to fill up the sphere; on the other hand, points close in image space cannot be mapped too far from each other. As a result, local distances are conserved while the overall distribution is spread. When the encoder is combined with a decoder $G$ the contractive force is enlarged: they collaborate in learning a latent space which makes reconstruction possible despite finite capacity; see Figure 1e.


For the autoencoding task we compare SAE against (β)-VAE, HVAE, SWAE and WAE-MMD. We furthermore denote by HAE the model that matches the samples in latent space with the Hungarian algorithm. Where compatible, all methods are evaluated both on the hypersphere and with a standard normal prior. Results from our proposed W2GAE method, as discussed in Section 4 for Gaussian priors, are shown as well. We compute FID scores (Heusel et al., 2017) on CelebA and MNIST. For MNIST we use LeNet as proposed in (Binkowski et al., 2018). For details on the experimental setup, see Appendix E. The results for MNIST and CelebA are shown in Table 1. Extrapolations, interpolations and samples of WAE and SAE for CelebA are shown in Fig. 2. Visualizations for MNIST are shown in Appendix D. Interpolations on the hypersphere are defined on geodesics connecting points on the hypersphere. FID scores of SAE with a hyperspherical prior are on par with or better than the competing methods. Note that although the FID scores for the VAE are slightly better than those of SAE/HAE, the reconstruction error of the VAE is significantly higher. Surprisingly, the simple W2GAE method is on par with WAE on CelebA.
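Geodesic interpolation between two latent codes on the unit hypersphere can be sketched with the standard spherical linear interpolation formula. The function name and the example points below are hypothetical, and the sketch assumes unit-norm inputs that are not antipodal, since antipodal points have no unique geodesic.

```python
import numpy as np


def slerp(z0, z1, t):
    """Point at fraction t along the great-circle arc from z0 to z1 (unit vectors)."""
    cos_omega = np.clip(np.dot(z0, z1), -1.0, 1.0)
    omega = np.arccos(cos_omega)      # angle between the two endpoints
    if np.isclose(omega, 0.0):
        return z0.copy()              # endpoints coincide; nothing to interpolate
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)


z0 = np.array([1.0, 0.0, 0.0])
z1 = np.array([0.0, 1.0, 0.0])
# Five equally spaced points along the geodesic, endpoints included.
path = np.stack([slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 5)])
```

Unlike a straight line between the two codes, every interpolated point keeps unit norm, so it stays on the support of the hyperspherical prior before being decoded.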

For the Gaussian prior on CelebA, both HAE and SAE perform very poorly. In Appendix F we analyze the behaviour of the Hungarian algorithm in isolation for two sets of samples from high-dimensional Gaussian distributions. The Hungarian algorithm finds a better matching between samples from a smaller-variance Gaussian and samples from the standard normal distribution. This behaviour gets worse for higher dimensions, and also occurs for the Sinkhorn algorithm. This might be due to the fact that most probability mass of a high-dimensional isotropic Gaussian with standard deviation $\sigma$ lies on a thin annulus at radius $\sigma\sqrt{d}$ from its origin. For a finite number of samples the $L_2^2$ cost function can lead to a lower matching cost for samples between two annuli of different radii. This effect leads to an encoder with a variance lower than one. When sampling from the prior after training, this yields saturated sampled images. See Appendix D for reconstructions and samples for HAE with a Gaussian prior on CelebA. Note that neither SWAE nor W2GAE suffers from this


Figure 2: From left to right: CelebA extrapolations, interpolations, and samples. Models from Table 1: WAE with a Gaussian prior (top) and SAE with a uniform prior on the hypersphere (bottom).

problem in our experiments, even though these methods also provide an estimate of the 2-Wasserstein distance. For W2GAE this problem does start at even higher dimensions (Appendix F).
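One way to see why a finite-sample matching can favour a shrunken encoder is to compare two idealized extremes. The simplifying assumption, made here only for illustration, is that the aggregated posterior is itself an isotropic Gaussian $\mathcal{N}(0, \sigma^2 I_d)$, with $x$ drawn from it and $y$ drawn from the standard normal prior.

```latex
\begin{align*}
W_2^2\big(\mathcal{N}(0, \sigma^2 I_d),\, \mathcal{N}(0, I_d)\big) &= d\,(1 - \sigma)^2
  && \text{(optimal coupling; minimized at } \sigma = 1\text{)} \\
\mathbb{E}\,\lVert x - y \rVert_2^2 &= d\,\sigma^2 + d = d\,(1 + \sigma^2)
  && \text{(independent pairing; decreasing as } \sigma \to 0\text{)}
\end{align*}
```

With a fixed mini-batch size and growing $d$, samples become nearly orthogonal to each other, so even the best assignment gains little over an arbitrary pairing. The empirical matching cost then plausibly behaves more like the second expression than the first, which rewards a variance below one. This is a heuristic reading of the effect, not a result taken from Appendix F.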


We further demonstrate the flexibility of SAE by using Dirichlet priors on MNIST. The prior draws samples on the probability simplex; hence we constrain the encoder by a final softmax layer. We use priors that concentrate on the vertices with the purpose of clustering the digits. A 10-dimensional Dir(1/2) prior (Figure 3a) results in an embedding qualitatively similar to the uniform sphere (Figure 1e). With a more skewed prior Dir(1/5), the latent space could be organized such that each digit is mapped to a vertex, with little mass in the center. We found that in dimension 10 this is seldom the case, as multiple vertices can be taken by the same digit to model different styles, while other digits share the same vertex. We therefore experiment with a 16-dimensional Dir(1/5), which yields more disconnected clusters (Figure 3b); the effect is evident when showing the prior and the aggregated posterior that tries to cover it (Figure 3c). Figure 3d (leftmost and rightmost columns) shows that every digit 0-9 is indeed represented on one of the 16 vertices, while some digits are present with multiple styles, e.g. the 7. The central samples in the figure are the interpolations obtained by sampling on edges connecting vertices; no real data is autoencoded. Samples from the vertices appear much crisper than other prior samples (Figure 3e), a sign of mismatch between prior and aggregated posterior on areas with lower probability mass. Finally, we could even learn the Dirichlet hyperparameter(s) with a reparametrization trick (Figurnov et al., 2018) and let the data inform the model on the best prior.
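The ingredients of this setup can be sketched in a few lines: sampling a sparse Dirichlet prior, constraining codes to the simplex with a softmax, and interpolating along an edge between two vertices. The helper names, sample counts and the Dir(5) comparison prior are assumptions introduced for illustration, and random logits stand in for real encoder outputs.

```python
import numpy as np


def softmax(logits):
    """Map unconstrained encoder outputs to the probability simplex (row-wise)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def edge_interpolation(i, j, dim, n_steps):
    """Equally spaced points on the simplex edge from vertex i to vertex j."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * np.eye(dim)[i] + t * np.eye(dim)[j]


rng = np.random.default_rng(0)
dim = 16
sparse_prior = rng.dirichlet(np.full(dim, 1 / 5), size=2000)  # Dir(1/5): mass near the vertices
flat_prior = rng.dirichlet(np.full(dim, 5.0), size=2000)      # Dir(5): comparison, mass near the center
codes = softmax(rng.standard_normal((4, dim)))                # stand-in for encoder outputs
edge = edge_interpolation(0, 1, dim, n_steps=5)               # latent points to decode

# Average size of the largest coordinate: closer to 1 means closer to a vertex.
sparse_peak = sparse_prior.max(axis=1).mean()
flat_peak = flat_prior.max(axis=1).mean()
```

A concentration parameter below one pushes samples towards the vertices, so the largest coordinate is on average larger than under a prior with a concentration above one. Decoding the points in `edge` corresponds to the vertex interpolations shown in Figure 3d.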


We introduced a generative model built on the principles of Optimal Transport. Working with empirical Wasserstein distances and deterministic networks provides us with a flexible likelihood-free framework for latent variable modeling.

Figure 3: t-SNEs of SAE latent spaces on MNIST: (a) 10-dim Dir(1/2) and (b) 16-dim Dir(1/5) priors. For the latter: (c) aggregated posterior (red) vs. prior (blue), (d) vertices interpolation and (e) samples from the prior.



Adams, R. P. and Zemel, R. S. (2011). Ranking via Sinkhorn Propagation. arXiv preprint

Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. (2018). Fixing a Broken ELBO. In ICML.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep variational information bottleneck. In ICLR.

Altschuler, J., Weed, J., and Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In NIPS.

Ambrogioni, L., Guclu, U., Gucluturk, Y., Hinne, M., van Gerven, M. A., and Maris, E. (2018). Wasserstein Variational Inference. In NIPS.

Balan, R., Singh, M., and Zou, D. (2017). Lipschitz properties for deep convolutional networks. arXiv preprint arXiv:1701.05217.

Binkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. arXiv preprint arXiv:1801.01401.

Bojanowski, P. and Joulin, A. (2017). Unsupervised Learning by Predicting Noise. In ICML.

Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., and Schoelkopf, B. (2017). From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642.

Chizat, L., Peyre, G., Schmitzer, B., and Vialard, F.-X. (2016). Scaling Algorithms for Unbalanced Transport Problems. arXiv preprint

Cominetti, R. and San Martín, J. (1994). Asymptotic analysis of the exponential penalty trajectory in linear programming. Math. Programming, 67(2, Ser. A):169–187.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In KDD.

Cuturi, M. (2013). Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In NIPS.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical Variational Auto-Encoders. In UAI.

Donahue, J., Krahenbuhl, P., and Darrell, T. (2017). Adversarial feature learning. In ICLR.

Feydy, J., Sejourne, T., Vialard, F.-X., Amari, S.-i., Trouve, A., and Peyre, G. (2018). Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. arXiv preprint

Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit Reparameterization Gradients. In NIPS.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein Loss. In NIPS.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., Peyre, G., et al. (2019). Sample Complexity of Sinkhorn Divergences. In AISTATS.

Genevay, A., Peyre, G., Cuturi, M., et al. (2018). Learning Generative Models with Sinkhorn Divergences. In AISTATS.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. In NIPS.

Gretton, A., Borgwardt, K. M., Rasch, M., Scholkopf, B., and Smola, A. J. (2007). A kernel method for the two-sample-problem. In NIPS.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., and Weinberger, K. Q. (2016). Supervised word mover’s distance. In NIPS.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint



Kolouri, S., Martin, C. E., and Rohde, G. K. (2018). Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model. arXiv preprint

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Linderman, S. W., Mena, G. E., Cooper, H., Paninski, L., and Cunningham, J. P. (2018). Reparameterizing the Birkhoff Polytope for Variational Permutation Inference. In AISTATS.

Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. (2018). Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance. In NIPS.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. In ICLR.

Mohamed, S. and Lakshminarayanan, B. (2016). Learning in implicit generative models. In ICML.

Muzellec, B., Nock, R., Patrini, G., and Nielsen, F. (2017). Tsallis Regularized Optimal Transport and Ecological Inference. In AAAI.

Noroozi, M. and Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.

Noroozi, M., Pirsiavash, H., and Favaro, P. (2017). Representation learning by learning to count. In CVPR.

Peyre, G. and Cuturi, M. (2018). Computational Optimal Transport. arXiv preprint arXiv:1803.00567.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.

Rosca, M., Lakshminarayanan, B., and Mohamed, S. (2018). Distribution Matching in Variational Inference. arXiv preprint arXiv:1802.06847.

Rubenstein, P. K., Schoelkopf, B., and Tolstikhin, I. (2018). Wasserstein Auto-Encoders: Latent Dimensionality and Random Encoders. In ICLR


Santa Cruz, R., Fernando, B., Cherian, A., and Gould, S. (2017). Deeppermnet: Visual permutation learning. In CVPR.

Schmitzer, B. (2016). Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519.

Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Scholkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2018). Wasserstein Auto-Encoders. In ICLR.

Tomczak, J. M. and Welling, M. (2017). VAE with a VampPrior. In AISTATS.

Villani, C. (2008). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg.

Weed, J. (2018). An explicit analysis of the entropic penalty in linear programming. arXiv preprint

Weed, J. and Bach, F. (2017). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. In NIPS.
