
Learning Generative Models across Incomparable Spaces

Charlotte Bunne¹  David Alvarez-Melis²  Andreas Krause¹  Stefanie Jegelka²

¹Department of Computer Science, Eidgenössische Technische Hochschule (ETH), Zürich, Switzerland. ²Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, USA. Correspondence to: Charlotte Bunne <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Generative Adversarial Networks have shown remarkable success in learning a distribution that faithfully recovers a reference distribution in its entirety. However, in some cases, we may want to only learn some aspects (e.g., cluster or manifold structure), while modifying others (e.g., style, orientation or dimension). In this work, we propose an approach to learn generative models across such incomparable spaces, and demonstrate how to steer the learned distribution towards target properties. A key component of our model is the Gromov-Wasserstein distance, a notion of discrepancy that compares distributions relationally rather than absolutely. While this framework subsumes current generative models in identically reproducing distributions, its inherent flexibility allows application to tasks in manifold learning, relational learning and cross-domain learning.

1. Introduction

Generative Adversarial Networks (GANs, Goodfellow et al. (2014)) and their variations (Radford et al., 2016; Arjovsky et al., 2017; Li et al., 2017) are powerful models for learning complex distributions. Broadly, these methods rely on an adversary that compares samples from the true and learned distributions, giving rise to a notion of divergence between them. The divergences implied by current methods require the two distributions to be supported on sets that are identical or at the very least comparable; examples include Optimal Transport (OT) distances (Salimans et al., 2018; Genevay et al., 2018) or Integral Probability Metrics (IPMs) (Müller, 1997; Sriperumbudur et al., 2012; Mroueh et al., 2017). In all of these cases, the spaces over which the distributions are defined must have the same dimensionality (e.g., the space of 28 × 28-pixel vectors for MNIST), and the generated distribution that minimizes the objective has the same support as the reference one.


This is of course desirable when the goal is to generate samples that are indistinguishable from those of the reference distribution.

Many other applications, however, require modeling only topological or relational aspects of the reference distribution. In such cases, the absolute location of the data manifold is irrelevant (e.g., distributions over learned representations, such as word embeddings, are defined only up to rotations), or it is not available (e.g., if the data is accessible only as a weighted graph indicating similarities among sample points). Another reason for modeling only topological aspects is the desire to, e.g., change the appearance or style of the samples, or down-scale images. Divergences that directly compare samples from the two distributions, and hence most current generative models, do not apply to those settings.

In this work, we develop a novel class of generative models that can learn across incomparable spaces, e.g., spaces of different dimensionality or data type. Here, the relational information between samples, i.e., the topology of the reference data manifold, is preserved, but other characteristics, such as the ambient dimension, can vary. A key component of our approach is the Gromov-Wasserstein (GW) distance (Mémoli, 2011), a generalization of classic Optimal Transport distances to incomparable ground spaces. Instead of directly comparing points in the two spaces, the GW distance computes pairwise intra-space distances, and compares those distances across spaces, greatly increasing the modeling scope. Figure 1 illustrates the new model.

To realize this model, we address several challenges. First, we enable the use of the Gromov-Wasserstein distance in various learning settings by improving its robustness and ensuring unbiased learning. Similar to existing OT-based generative models (Salimans et al., 2018; Genevay et al., 2018), we leverage the differentiability of this distance to provide gradients for the generator. Second, for efficiency, we further parametrize it via a learnable adversary. The added flexibility of the GW distance necessitates constraining the adversary. To this end, we propose a novel orthogonality regularization, which might be of independent interest.



Figure 1. The Gromov-Wasserstein generative adversarial network (GW GAN) learns across incomparable spaces, such as different dimensions or data types (from graphs to Euclidean space). The key idea is that its learning objective is purely based on intra-space distances (e.g., pairwise distances d or shortest paths sp) in the generator and data space, respectively.

A final challenge, which doubles as one of the main advantages of this approach, arises from the added flexibility of the generator: it allows us to freely alter superficial characteristics of the generated distribution while still learning the basic structure of the reference distribution. We show examples of how to steer these additional degrees of freedom via regularization or adversaries in the model. The resulting model subsumes the traditional (i.e., same-space) adversarial models as a special case, but can do much more. For example, it learns cluster structure across spaces of different dimensionality and across different data types, e.g., from graphs to Euclidean space. Thus, our GW GAN can also be viewed as performing dimensionality reduction or manifold learning, but, departing from classical approaches to these problems, it recovers, in addition to the manifold structure of the data, the probability distribution defined over it. Moreover, we propose a general framework for stylistic modifications by integrating a style adversary; we demonstrate its use by changing the thickness of learned MNIST digits. In summary, this work provides a framework that substantially expands the potential applications of generative adversarial learning.

Contributions. We make the following contributions:

i. We introduce a new class of generative models that can learn distributions across different dimensionalities or data types.

ii. We demonstrate the model's range of applications by deploying it to manifold learning, relational learning and cross-domain learning tasks.

iii. More generally, our modifications of the Gromov-Wasserstein discrepancy enable its use as a loss function in various machine learning applications.

iv. Our new approach to approximately enforce orthogonality in neural networks based on the orthogonal Procrustes problem also applies beyond our model.

2. Model

Given a dataset of n observations {x1, ..., xn}, xi ∈ X, drawn from a reference distribution p ∈ P(X), we aim to learn a generative model gθ parametrized by θ purely based on relational and intra-structural characteristics of the dataset. The generative model gθ : Z → Y, typically a neural network, maps random noise z ∈ Z to a generator space Y that is independent of the data space X.

2.1. Gromov-Wasserstein Discrepancy

Learning generative models typically relies on a statistical divergence between the target distribution and the model's current estimate. Classical statistical divergences only apply when comparing distributions whose supports lie in the same metric space, or when at least a meaningful distance between points in the two supports can be computed. When the data space X and generator space Y are different, these divergences no longer apply.

Hence, instead, we will use a more suitable divergence measure. Rather than relying on a metric across the spaces, the Gromov-Wasserstein (GW) distance (Mémoli, 2011) compares distributions by computing a discrepancy between the metrics defined within each of the spaces. As a consequence, it is oblivious to specific characteristics or the dimensionality of the spaces.

Given n samples of the compared distributions p ∈ P(X) and q ∈ P(Y), the discrete formulation of the GW distance needs a similarity (or distance) matrix between the samples and a probability vector for each space, say (D, p) and (D̄, q), with (D, p) ∈ R^{n×n} × Σ_n, where Σ_n := {p ∈ R_+^n ; Σ_i p_i = 1} is the n-dimensional probability simplex.

Then, the GW discrepancy is

    GW(D, D̄, p, q) := min_{T ∈ U_{p,q}} E_{D,D̄}(T) := min_{T ∈ U_{p,q}} Σ_{i,j,k,l} L(D_{ik}, D̄_{jl}) T_{ij} T_{kl},     (1)

where U_{p,q} := {T ∈ (R_+)^{n×n} ; T 1_n = p, T^T 1_n = q} is the set of all couplings T between p and q. The loss function L in our case is L(a, b) = L2(a, b) := ½ |a − b|². If L = L2, then GW^{1/2} defines a (true) distance (Mémoli, 2011).
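For concreteness, the following NumPy sketch (our own illustration, not code from the paper) evaluates the discrete objective E_{D,D̄}(T) of Eq. (1) for the squared loss L = L2. For this loss the four-index sum collapses into a few matrix products, a decomposition due to Peyré et al. (2016); the helper and variable names are ours.

```python
import numpy as np

def gw_objective(D, D_bar, T):
    """E_{D,D_bar}(T) = 0.5 * sum_{i,j,k,l} (D_ik - D_bar_jl)^2 * T_ij * T_kl.

    For the squared loss the sum factorizes: the first two terms depend only on
    the marginals of T, and the cross term is the inner product <D, T D_bar T^T>_F.
    """
    p = T.sum(axis=1)                                  # first marginal of the coupling
    q = T.sum(axis=0)                                  # second marginal of the coupling
    const = (D ** 2 @ p) @ p + (D_bar ** 2 @ q) @ q    # marginal-only terms
    cross = np.sum(D * (T @ D_bar @ T.T))              # <D, T D_bar T^T>_F
    return 0.5 * const - cross

# Toy usage with the independence coupling T = p q^T (no optimization here).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(6, 2)), rng.normal(size=(8, 3))   # samples in incomparable spaces
D = np.linalg.norm(X[:, None] - X[None], axis=-1)         # intra-space distances in X
D_bar = np.linalg.norm(Y[:, None] - Y[None], axis=-1)     # intra-space distances in Y
p, q = np.full(6, 1 / 6), np.full(8, 1 / 8)
print(gw_objective(D, D_bar, np.outer(p, q)))
```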

2.2. Gromov-Wasserstein Generative Model

To learn across incomparable spaces, one key idea of our model is to use the Gromov-Wasserstein distance as a loss function to compare the generated and true distributions.


Algorithm 1 Training Algorithm of the Gromov-Wasserstein Generative Model.

Require: α: learning rate, n_g: number of generator iterations per adversary iteration, m: mini-batch size, N: number of training iterations, θ_0: initial parameters of generator gθ, ω_0 = (ω_0, ω̄_0): initial parameters of adversary fω
for t = 0 to N do
    sample X = (x_i)_{i=1}^m from the dataset
    sample Z = (z_j)_{j=1}^m ∼ N(0, 1), and set Y = (y_j)_{j=1}^m = gθ_t((z_j)_{j=1}^m)
    ∀(i, j): D^ω_{ij} := ‖fω(x_i) − fω(x_j)‖_2 and D̄^{ω̄}_{ij} := ‖fω̄(y_i) − fω̄(y_j)‖_2
    L = GW̄_ε(D^ω, D̄^{ω̄}, p, q), where p, q are uniform distributions     ▷ GW̄_ε is defined in Eq. (7)
    if t mod (n_g + 1) = 0 then
        L_reg ← L − R_β(fω(X), X) − R_β(fω̄(Y), Y)     ▷ R_β is defined in Eq. (5)
        ω_{t+1} ← ω_t + α × ∇_{ω_t} L_reg
    else
        θ_{t+1} ← θ_t − α × ∇_{θ_t} L
    end if
end for
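To make the alternating scheme concrete, here is a minimal PyTorch-style sketch of Algorithm 1. It is our own illustration: gw_eps and procrustes_penalty stand in for the normalized entropic loss of Eq. (7) and the regularizer R_β of Eq. (5) defined below, and the adversary fω with ω = (ω, ω̄) is split into two networks f_x and f_y, one per space. All names are ours, not taken from the authors' code.

```python
import torch

def pairwise_dist(F):
    """Euclidean intra-space distance matrix of a batch of feature vectors."""
    return torch.cdist(F, F, p=2)

def train(generator, f_x, f_y, data_loader, gw_eps, procrustes_penalty,
          n_g=10, lr=2e-4, noise_dim=64, n_epochs=1, device="cpu"):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.99))
    opt_f = torch.optim.Adam(list(f_x.parameters()) + list(f_y.parameters()),
                             lr=lr, betas=(0.5, 0.99))
    t = 0
    for _ in range(n_epochs):
        for x in data_loader:
            x = x.to(device)
            z = torch.randn(x.size(0), noise_dim, device=device)
            y = generator(z)
            fx, fy = f_x(x), f_y(y)                  # map both batches into feature space
            loss = gw_eps(pairwise_dist(fx), pairwise_dist(fy))  # uniform p, q assumed inside
            if t % (n_g + 1) == 0:                   # adversary step: ascend L - R_beta
                reg = procrustes_penalty(fx, x) + procrustes_penalty(fy, y)
                opt_f.zero_grad()
                (-(loss - reg)).backward()
                opt_f.step()
            else:                                    # generator step: descend L
                opt_g.zero_grad()
                loss.backward()
                opt_g.step()
            t += 1
```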

As in traditional adversarial approaches, we parametrize the generator gθ : Z → Y as a neural network that maps noise samples z to features y. We train gθ by using GW as a loss, i.e., for mini-batches X and Y of reference and generated samples, respectively, we compute pairwise distance matrices D and D̄ and solve the GW problem, taking p and q as uniform distributions.

While this procedure alone is often sufficient for simple problems, in high dimensions, the statistical efficiency of classical divergence measures can be poor and a large number of input samples is needed to achieve good discrimination between the generated and data distribution (Salimans et al., 2018). To improve discriminability, we learn the intra-space metrics adversarially. An adversary fω parametrized by ω maps data and generator samples into feature spaces in which we compute Euclidean intra-space distances:

    D^ω_{ij} := ‖fω(x_i) − fω(x_j)‖_2 ,   where fω : X → R^s,     (2)

with fω modeled by a neural network. The feature mapping may, for instance, reduce the dimensionality of X and extract important features. The original loss minimization problem of the generator thus becomes a minimax problem

    min_θ max_{ω = (ω, ω̄)} GW(D^ω, D̄^{ω̄}, p, q),     (3)

where D^ω and D̄^{ω̄} denote the pairwise distance matrices of samples originating from the reference and generator domain, respectively, mapped into the feature space via fω (Eq. (2)). We refer to our model as the GW GAN.

3. Training

We optimize the adversary fω and generator gθ in an alternating scheme, where we train the generator more frequently than the adversary to keep the adversarially learned distance function from becoming degenerate (Salimans et al., 2018). Algorithm 1 shows the GW GAN training algorithm.

While the training of standard GANs suffers from undamped oscillations and mode collapse (Metz et al., 2017; Salimans et al., 2016), following an argument of Salimans et al. (2018), the GW objective is well defined and statistically consistent if the adversarially learned intra-space distances D^ω are non-degenerate. Trained with the GW loss, the generator thus does not diverge even when the adversary fω is kept fixed. Empirical validation (see Appendix A) confirms this: we stopped updating the adversary fω while continuing to update the generator. Even with a fixed adversary, the generator further improved its learned distribution and did not diverge.

Note that Problem (3) makes very few assumptions on the spaces X and Y, requiring only that a metric be defined on them. This remarkable flexibility can be exploited to enforce various characteristics on the generated distribution. We discuss examples in Section 3.1. However, this same flexibility, combined with the added degrees of freedom due to the learned metric, demands that we regularize the adversary to ensure stable training and prevent it from overpowering the generator. We propose an effective method to do so in Section 3.2.

Moreover, using the Gromov-Wasserstein distance as a differentiable loss function for training a generative model requires modifying its original formulation to ensure robust and fast computation, unbiased gradients, and numerical stability, as described in detail in Section 3.3.

3.1. Constraining the Generator

The GW loss encourages the generator to recover the relational and geometric properties of the reference dataset, but leaves other global aspects undetermined.


We can thus shape the generated distribution by enforcing desired properties through constraints. For example, while any translation of a distribution would achieve the same GW loss, we can enforce centering around the origin by penalizing the norm of the generated samples. Figure 2a illustrates an example.

For computer vision tasks, we need to ensure that the generated samples still look like natural images. We found that a total variation regularization (Rudin et al., 1992) induces the right bias here and hence greatly improves the results (see Figures 2b, c, and d).

Moreover, the invariances of the GW loss allow for shaping stylistic characteristics of the generated samples by integrating design constraints into the learning process. In contrast, current generative models (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018; Li et al., 2017) cannot perform style transfer, as modifications in surface-level features of the generated samples conflict with their adversarially computed loss used for training the generator. We propose a modular framework which enables style transfer to the generated samples when given an additional style reference besides the provided data samples. We incorporate design constraints into the generator's objective via a style adversary c, i.e., any function that quantifies a certain style and thereby, as a penalty, enforces this style on the generated samples. The resulting objective is

    min_θ max_{ω = (ω, ω̄)} GW(D^ω, D̄^{ω̄}, p, q) − λ × c(gθ(z)).     (4)

As a result, the generator learns the structural content of the target distribution via the adversarially learned GW loss, and stylistic characteristics via the style adversary. We demonstrate this framework via the example of stylistic changes to learned MNIST digits in Section 4.4 and in Appendix G.
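A minimal sketch of how the style term of Eq. (4) can be attached to the generator objective; style_adversary is any differentiable scorer of the desired style (hypothetically, a classifier's probability of the "bold" class), and gw_loss is the structural loss already computed for the batch. Both names are our own illustration.

```python
import torch

def generator_objective(gw_loss, style_adversary, generated, lam=1.0):
    """Eq. (4) from the generator's side: structural GW term minus a weighted style score.

    Maximizing the style score (here, the mean score of the desired style on the
    generated batch) is implemented by subtracting it from the loss to be minimized.
    """
    return gw_loss - lam * style_adversary(generated).mean()
```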

3.2. Regularizing the Adversary

During training, the adversary maximizes the objective function (3). However, the GW distance is easily maximized by stretching the space and thus distorting the intra-space distances used for its computation. To avoid such arbitrary distortion of the space, we propose to regularize the adversary fω by (approximately) enforcing it to define a unitary transformation, thus restricting the magnitude of stretching it can do. Note that directly parametrizing fω as an orthogonal matrix would defeat its purpose, as the Frobenius norm is unitarily invariant. Instead, we allow fω to take a more general form, but limit its expansivity and contractivity through approximate orthogonality.

Previous work has explored various orthogonality-based regularization methods to stabilize neural network training (Vorontsov et al., 2017). Saxe et al. (2014) introduced a new class of random orthogonal initial conditions on the weights of neural networks, stabilizing the initial training phase. By enforcing the weight matrices to be Parseval tight frames, layerwise orthogonality constraints are introduced in Cisse et al. (2017); Brock et al. (2017; 2019); they penalize deviations of the weights from orthogonality via R_β(W_k) := β ‖W_k^T W_k − I‖²_F, where W_k are the weights of layer k and ‖·‖_F is the Frobenius norm.

However, these approaches enforce orthogonality on the weights of each layer rather than constraining the network fω in its entirety to function as an approximately orthogonal operator. An empirical comparison to these layerwise approaches (shown in Appendix D) reveals that, for the GW GAN, regularizing the full network is desirable. To enforce the approximation of fω as an orthogonal operator, we introduce a new orthogonal regularization approach, which ensures orthogonality of a network by minimizing the distance to the closest orthogonal matrix P*. The regularization term is defined as

    R_β(fω(X), X) := β ‖fω(X) − X P*^T‖²_F,     (5)

where P* is an orthogonal matrix that most closely maps X to fω(X), and β is a hyperparameter. The matrix P* = argmin_{P ∈ O(s)} ‖fω(X) − X P^T‖_F, where O(s) = {P ∈ R^{s×s} | P^T P = I} and s is the dimensionality of the feature space, can be obtained by solving an orthogonal Procrustes problem. If the dimensionality s of the feature space equals the input dimension, then P* has a closed-form solution P* = U V^T, where U and V are the left and right singular vectors of fω(X)^T X, i.e., U Σ V^T = SVD(fω(X)^T X) (Schönemann, 1966). Otherwise, we need to solve for P* with an iterative optimization method.

This novel Procrustes-based regularization principle for neural networks is remarkably flexible since it constrains global input-output behavior without making assumptions about specific layers or activations. It preserves the expressivity of the network while efficiently enforcing orthogonality. We use this orthogonal regularization principle for training the adversary of the GW generative model across different applications and network architectures.
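A minimal PyTorch sketch of the regularizer R_β in Eq. (5) for the case where the feature dimension equals the input dimension, so that P* has the closed-form Schönemann solution. Treating P* as a fixed target for the gradient is our simplifying choice, not a detail taken from the paper.

```python
import torch

def procrustes_penalty(fx, x, beta=1.0):
    """R_beta(f(X), X) = beta * ||f(X) - X P*^T||_F^2 with P* from orthogonal Procrustes.

    fx: features f_omega(X) of shape (n, s); x: inputs X of shape (n, s)
    (equal feature and input dimension assumed, so the closed form applies).
    """
    # Closed-form Procrustes solution: P* = U V^T with U S V^T = SVD(f(X)^T X).
    u, _, vT = torch.linalg.svd(fx.detach().T @ x)   # detach: treat P* as a fixed target
    p_star = u @ vT
    return beta * torch.sum((fx - x @ p_star.T) ** 2)
```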

3.3. Gromov-Wasserstein as a Loss Function

To serve as a robust training objective in general machine learning settings, we modify the naïve formulation of the Gromov-Wasserstein discrepancy in various ways.

Regularization of Gromov-Wasserstein. Optimal transport metrics and extensions such as the Gromov-Wasserstein distance are particularly appealing because they take into account the underlying geometry of the data when comparing distributions. However, their computational cost is prohibitive for large-scale machine learning problems.


More precisely, Problem (1) is a quadratic programming problem, and solving it directly is intractable for large n. Regularizing this objective with an entropy term results in significantly more efficient optimization (Peyré et al., 2016). The resulting smoothed problem can be solved through projected gradient descent methods, where the projection steps rely on the Sinkhorn-Knopp scaling algorithm (Cuturi, 2013). Concretely, the entropy-regularized version of the Gromov-Wasserstein discrepancy proposed by Peyré et al. (2016) has the form

    GW_ε(D, D̄, p, q) = min_{T ∈ U_{p,q}} E_{D,D̄}(T) − ε H(T),     (6)

where E_{D,D̄}(T) is defined in Equation (1), H(T) := −Σ_{ij} T_{ij} (log(T_{ij}) − 1) is the entropy of the coupling T, and ε is a parameter controlling the strength of the regularization. Besides leading to significant speedups, entropy smoothing of optimal transport discrepancies results in distances that are differentiable with respect to their inputs, making them a more convenient choice as loss functions for machine learning algorithms. Since the Gromov-Wasserstein distance as a loss function in generative models compares noisy features of the generator and the data by computing correspondences between intra-space distances, a soft rather than a hard alignment T might be a desirable property. The entropy smoothing yields couplings that are denser than their non-regularized counterparts, ideal for applications where soft alignments are desired (Cuturi & Peyré, 2016).

The effectiveness of entropy smoothing of the Gromov-Wasserstein discrepancy has been shown in other downstream applications such as shape correspondence (Solomon et al., 2016) or the alignment of word embedding spaces (Alvarez-Melis & Jaakkola, 2018).

Motivated by Salimans et al. (2018) and justified by the envelope theorem (Carter, 2001), we do not backpropagate the gradient through the iterative computation of the GW_ε coupling T (Problem (6)).

Normalization of Gromov-Wasserstein. With entropy regularization, GW_ε is no longer a distance, as the discrepancy between identical metric measure spaces is then no longer zero. Similar to the Wasserstein metric (Bellemare et al., 2017), the estimation of GW_ε from samples yields biased gradients. Inspired by Bellemare et al. (2017), we use a normalized entropy-regularized Gromov-Wasserstein discrepancy defined as

    GW̄_ε(D, D̄, p, q) := 2 × GW_ε(D, D̄, p, q) − GW_ε(D, D, p, p) − GW_ε(D̄, D̄, q, q).     (7)
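A sketch of the normalized loss in Eq. (7), assuming the POT (Python Optimal Transport) package is available and that ot.gromov.entropic_gromov_wasserstein2 exposes the entropic solver of Eq. (6) with the signature used below (it may differ across POT versions); the helper names are ours.

```python
import ot  # POT: Python Optimal Transport (assumed dependency)

def entropic_gw(D, D_bar, p, q, eps):
    """GW_eps of Eq. (6) via POT's projected-gradient / Sinkhorn solver."""
    return ot.gromov.entropic_gromov_wasserstein2(
        D, D_bar, p, q, loss_fun="square_loss", epsilon=eps)

def normalized_entropic_gw(D, D_bar, p, q, eps=5e-3):
    """Normalized discrepancy of Eq. (7): subtracts the two self-comparison bias terms."""
    return (2 * entropic_gw(D, D_bar, p, q, eps)
            - entropic_gw(D, D, p, p, eps)
            - entropic_gw(D_bar, D_bar, q, q, eps))
```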

Numerical Stability of Gromov-Wasserstein. Computing the entropy-regularized Gromov-Wasserstein formulation relies on a projected gradient algorithm (Peyré et al., 2016), in which each iteration involves a projection into the transportation polytope, efficiently computed with the Sinkhorn-Knopp algorithm (Cuturi, 2013), a matrix-scaling procedure that alternately updates marginal scaling variables. In the limit of vanishing regularization (ε → 0) these scaling factors diverge, resulting in numerical instabilities.

To improve the numerical stability, we compute GW_ε using a stabilized version of the Sinkhorn algorithm (Schmitzer, 2016). This significantly increases the robustness of the Gromov-Wasserstein computation. Performing the Sinkhorn updates in the log domain further increases the stability of the algorithm by avoiding numerical overflow while preserving its efficient matrix multiplication structure.

Normalizing the intra-space distances of the generated and the data samples, respectively, further improves the numerical stability of the Gromov-Wasserstein computation. However, to preserve information about the scale of the samples, we use the normalized distances for the Sinkhorn iterates, while the final loss is calculated using the original distances.
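One way to read this scale-preserving trick (our interpretation, not the authors' code): solve for the coupling on rescaled distance matrices, then evaluate the unregularized objective of Eq. (1) at the original scale. Since the gradient is not propagated through the coupling computation anyway (envelope theorem), the coupling can be treated as a constant; entropic_gw_plan is an assumed solver returning only the coupling T, and gw_objective is the evaluator sketched in Section 2.1.

```python
def gw_loss_scale_preserving(D, D_bar, p, q, eps, entropic_gw_plan, gw_objective):
    """Stability trick of Section 3.3: Sinkhorn iterations on rescaled distances,
    final loss evaluated on the original ones."""
    T = entropic_gw_plan(D / D.max(), D_bar / D_bar.max(), p, q, eps)  # coupling only
    return gw_objective(D, D_bar, T)  # Eq. (1) objective at the original scale
```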

4. Empirical Results

In this section, we empirically demonstrate the effectiveness of the GW GAN formulation and regularization, and illustrate its versatility by tackling various novel settings for generative modeling, including learning distributions across different dimensionalities, data types and styles.

4.1. Learning across Identical Spaces

As a sanity check, we first consider the special case where the two distributions are defined on identical spaces (i.e., the usual GAN setting). Specifically, we test the model's ability to recover 2D mixtures of Gaussians, a common proof-of-concept task for mode recovery (Che et al., 2017; Metz et al., 2017; Li et al., 2018). For the experiments on synthetic datasets, the generator and adversary architectures are multilayer perceptrons (MLPs) with ReLU activation functions. Figure 2a shows that the GW GAN reproduces a mixture of Gaussians with a learned adversary fω that stabilizes the learning. We observe that ℓ1-regularization indeed helps position the learned distributions around the origin. Comparative results with and without ℓ1-regularization are shown in Appendix B. As opposed to the OT GAN proposed by Salimans et al. (2018), our model robustly learns Gaussian mixtures with differing numbers of modes and arrangements (see Appendix E). The Appendix shows several training runs. While the generated distributions vary in orientation in the Euclidean plane, the cluster structure is clearly preserved.

To illustrate the ability of the GW GAN to generate images, we train the model on MNIST (LeCun et al., 1998), fashion-MNIST (Xiao et al., 2017) and gray-scale CIFAR-10 (Krizhevsky et al., 2014).


Figure 2. The Gromov-Wasserstein GAN can learn distributions of different dimensionality. a. Results of learning a mixture of Gaussian distributions with adversary fω (β = 1); ℓ1-regularization allows centering the distribution around the origin (ℓ1-penalty: λ = 0.001). Each plot shows 1000 generated samples. Learning to generate b. MNIST digits (β = 32), c. fashion-MNIST (β = 35) and d. gray-scale CIFAR-10 (β = 40), trained with total variation denoising (λ = 0.5).

Both generator and adversary follow the deep convolutional architecture introduced by Chen et al. (2016), whereby the adversary fω maps into R^s rather than applying a final tanh. To stabilize the initial training phase, the weights of the adversary network were initialized with random orthogonal matrices as proposed by Saxe et al. (2014). We train the model using Adam with a learning rate of 2 × 10⁻⁴, β1 = 0.5, β2 = 0.99 (Kingma & Ba, 2015). Figures 2b, c and d display generated images throughout the training process. The adversary was constrained to approximate an orthogonal operator. The results highlight the effectiveness of the orthogonal Procrustes regularization, which allows successful learning of complex distributions using different network architectures. Additional experiments on the influence of the adversary fω are provided in Appendix C.

Having validated the overall soundness of the GW GAN in traditional settings, we now demonstrate its usefulness in tasks that go beyond the scope of traditional generative adversarial models, namely, learning across spaces that are not directly comparable.

4.2. Learning across Dimensionalities

Arguably, the simplest instance of incomparable spaces is Euclidean spaces of different dimensionality. In this section, we investigate whether the Gromov-Wasserstein GAN can learn to generate a distribution defined on a space of different dimensionality than that of the reference. We consider both directions: generating into a lower-dimensional and into a higher-dimensional space. In this experimental setup, we compute intra-space distances using the Euclidean distance without a parametrized adversary. The generator network follows an MLP architecture with ReLU activation functions. The training task consists of translating between mixtures of Gaussian distributions in two and three dimensions. The results, shown in Figure 3, demonstrate that our model successfully recovers the global structure and relative distances of the modes of the reference distribution, despite the different dimensionality.

4.3. Learning across Data Modalities and Manifolds

Next, we consider distributions with more complex structure, and test whether our model is able to recover manifold structure in the generated distribution.


Figure 3. The GW GAN can be applied to generate samples of a. reduced and b. increased dimensionality compared to the target distribution. All plots show 1000 samples.

Figure 4. By learning from intra-space distances, the GW GAN learns the manifold structure of the data. a. The model can be applied to dimensionality reduction tasks and reproduce a three-dimensional S-curve in two dimensions; intra-space distances of the data samples are Floyd-Warshall shortest paths on the corresponding k-nearest neighbor graph. b. Similarly, it can map a graph into R². The plots display 500 samples.

Using the popular three-dimensional S-shaped dataset as an example, we define distances between the samples via shortest paths on their k-nearest neighbor graph, computed using the Floyd-Warshall algorithm (Floyd, 1962). For the generated distribution we use a space of the same intrinsic dimensionality (two) as the reference manifold. The results in Figure 4a show that the generated distribution learnt with the GW GAN successfully recovers the manifold structure of the data.

Taking the notion of incomparability further, we next consider a setting where the reference distribution is accessible only through relational information, i.e., a weighted graph without absolute representations of the samples. While conceptually very different from the previous scenarios, applying our model to this setting is just as simple. Once a notion of distance is defined over the reference graph, our model learns the distribution based on pairwise relations as before. Given merely a graph, we use pairwise shortest paths as the intra-space distance metric, and use the 2D Euclidean space for the generated distribution. Figure 4b shows that the GW GAN is able to successfully learn a distribution that approximately recovers the neighborhood structure of the reference graph.
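A sketch of how the intra-space distances used in these experiments can be computed: shortest paths on a k-nearest-neighbor graph via the Floyd-Warshall solver in SciPy. The specific parameter values (k = 10, 500 samples) are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def knn_geodesic_distances(X, k=10):
    """All-pairs shortest-path distances on the k-nearest-neighbor graph of X.

    k must be large enough for the graph to be connected; otherwise some
    entries of the returned matrix are infinite.
    """
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")  # sparse, weighted kNN graph
    return shortest_path(G, method="FW", directed=False)     # Floyd-Warshall geodesics

# Example: geodesic intra-space distances on the three-dimensional S-curve.
X, _ = make_s_curve(n_samples=500, random_state=0)
D = knn_geodesic_distances(X, k=10)
print(D.shape)  # (500, 500)
```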

4.4. Shaping Learned Distributions

The Gromov-Wasserstein GAN enjoys remarkable flexibility, allowing us to actively influence stylistic characteristics of the generated distribution.

While structure and content of the distribution are learned via the adversary fω, stylistic features can be introduced via a style adversary as outlined in Section 3.1. As a proof of concept of this modular framework, we learn MNIST digits and enforce their font style to be bold via additional design constraints. The style adversary is parametrized by a binary classifier trained on handwritten letters of the EMNIST dataset (Cohen et al., 2017), which were assigned thin and bold class labels l ∈ {0, 1}. The training objective of the generator gθ is augmented with the classification result of the trained binary classifier (Eq. (4)). Further details are provided in Appendix G. After the generator has satisfactorily learnt the data distribution based on training with the loss GW̄_ε, the style adversary c is activated. Figure 5 shows that the style adversary affects the generator to increase the thickness of the MNIST digits, while the structural content learned in the first stage is retained.


Figure 5. Cross-Style Generation. By decoupling topological information from superficial characteristics, our method allows for stylistic aspects to be enforced (boldness in this case, enforced with a style adversary) upon the generated distribution while preserving the principal characteristics of the reference distribution (MNIST). The panels show generated samples over training epochs, before and after the switch from adversary fω to the style adversary c.

5. Related Work

Generative adversarial models have been extensively studied and applied in various fields, including image synthesis (Brock et al., 2019), semantic image editing (Wang et al., 2018), style transfer (Zhu et al., 2017), and semi-supervised learning (Kingma et al., 2014). As the literature is extensive, we provide a brief overview of GANs and focus on selected approaches targeting tasks in cross-domain learning.

Generative Adversarial Networks. Goodfellow et al. (2014) proposed generative adversarial networks (GANs) as a zero-sum game between a generator and a discriminator, which learns to distinguish between generated and data samples. Despite their success and improvements in optimization, the training of GANs is difficult and unstable (Salimans et al., 2016; Arjovsky & Bottou, 2017). To remedy these issues, various extensions of this framework have been proposed, most of which seek to replace the game objective with more stable or general losses. These include using Maximum Mean Discrepancy (MMD) (Dziugaite et al., 2015; Li et al., 2017; Bińkowski et al., 2018), other IPMs (Mroueh et al., 2017; 2018), or Optimal Transport distances (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018). Due to their relevance, we discuss the latter in detail below. A crucial characteristic that distinguishes our approach from other generative models is its ability to learn across different domains and modalities.

GANs and Optimal Transport (OT). To compare probability distributions supported on low-dimensional manifolds in high-dimensional spaces, recent GAN variants integrate OT metrics in their training objective (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018; Gulrajani et al., 2017). Since OT metrics are computationally expensive, Arjovsky et al. (2017) use the dual formulation of the 1-Wasserstein distance. Other approaches approximate the primal via entropically smoothed generalizations of the Wasserstein distance (Salimans et al., 2018; Genevay et al., 2018). Our work departs from these methods in that it relies on a much more general instance of Optimal Transport (the Gromov-Wasserstein distance) as a loss function, which allows us to compare distributions even if cross-domain pairwise distances are not available.

GANs for Cross-Domain Learning. GANs have been successfully applied to style transfer between images (Isola et al., 2017; Karacan et al., 2016; Zhu et al., 2017), text-to-image synthesis (Reed et al., 2016; Zhang et al., 2017), visual manipulation (Zhu et al., 2016; Engel et al., 2018) or font style transfer (Azadi et al., 2018). However, to achieve this, these methods depend on conditional variables, training sets of aligned data pairs or cycle-consistency constraints. Kim et al. (2017) utilize two different, coupled GANs to discover cross-domain relations given unpaired data. However, the method's applicability is limited, as all images in one domain need to be representable by images in the other domain.

Gromov-Wasserstein Learning. Since its introduction by Mémoli (2011), the Gromov-Wasserstein discrepancy has found applications in many learning problems that rely on a coupling between different metric spaces. Being an effective method to solve matching problems, it has been used in shape and object matching (Mémoli, 2009; 2011; Solomon et al., 2016; Ezuz et al., 2017), for aligning word embedding spaces (Alvarez-Melis & Jaakkola, 2018) and for matching weighted directed networks (Chowdhury & Mémoli, 2018). Other recent applications of the GW distance include the computation of barycenters of a set of distance or kernel matrices (Peyré et al., 2016) and heterogeneous domain adaptation, where source and target samples are represented in different feature spaces (Yan et al., 2018). While relying on a shared tool, the GW discrepancy, this paper leverages it in a very different framework, generative modeling, where questions of efficiency, degrees of freedom, minimax objectives and end-to-end learning pose various challenges that need to be addressed to successfully use this tool.

6. Conclusion

In this paper, we presented a new generative model that can learn a distribution in a space that is different from, and even incomparable to, that of the reference distribution. Our model accomplishes this by relying on relational, rather than absolute, comparisons of samples via the Gromov-Wasserstein distance. Such disentanglement of data and generator spaces opens up a wide array of novel possibilities for generative modeling, as portrayed by our experiments on learning across different dimensional representations and learning across modalities (weighted graph to Euclidean representations). Validated here through simple experiments on digit thickness control, the use of crafted regularization losses on the generator to impose certain stylistic characteristics makes for an exciting avenue of future work.


Acknowledgements

This research was supported in part by NSF CAREER Award 1553284 and the Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. Charlotte Bunne was supported by the Zeno Karl Schindler Foundation. We thank Suvrit Sra for a question that initiated this research, and MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing computational resources.

References

Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein Alignment of Word Embedding Spaces. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2018.

Arjovsky, M. and Bottou, L. Towards Principled Methods for Training Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.

Azadi, S., Fisher, M., Kim, V. G., Wang, Z., Shechtman, E., and Darrell, T. Multi-Content GAN for Few-Shot Font Style Transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The Cramér Distance as a Solution to Biased Wasserstein Gradients. arXiv preprint arXiv:1705.10743, 2017.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.

Brock, A., Lim, T., Ritchie, J. M., and Weston, N. Neural Photo Editing with Introspective Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.

Brock, A., Donahue, J., and Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations (ICLR), 2019.

Carter, M. Foundations of Mathematical Economics. MIT Press, 2001.

Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode Regularized Generative Adversarial Networks. 2017.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval Networks: Improving Robustness to Adversarial Examples. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.

Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.

Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.

Cuturi, M. and Peyré, G. A Smoothed Dual Approach for Variational Wasserstein Problems. SIAM Journal on Imaging Sciences, 9, 2016.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via Maximum Mean Discrepancy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

Engel, J., Hoffman, M., and Roberts, A. Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models. In International Conference on Learning Representations (ICLR), 2018.

Ezuz, D., Solomon, J., Kim, V. G., and Ben-Chen, M. GWCNN: A Metric Alignment Layer for Deep Shape Analysis. In Computer Graphics Forum, volume 36. Wiley Online Library, 2017.

Floyd, R. W. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.

Genevay, A., Peyré, G., and Cuturi, M. Learning Generative Models with Sinkhorn Divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84. PMLR, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.


Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

Karacan, L., Akata, Z., Erdem, A., and Erdem, E. Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts. arXiv preprint arXiv:1612.00215, 2016.

Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In International Conference on Machine Learning (ICML), volume 70, 2017.

Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), volume 5, 2015.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 Dataset, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

Li, C., Alvarez-Melis, D., Xu, K., Jegelka, S., and Sra, S. Distributional Adversarial Networks. In International Conference on Learning Representations (ICLR), Workshop Track, 2018.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Mémoli, F. Spectral Gromov-Wasserstein Distances for Shape Matching. In Computer Vision Workshops (ICCV Workshops). IEEE, 2009.

Mémoli, F. Gromov-Wasserstein Distances and the Metric Approach to Object Matching. Foundations of Computational Mathematics, 11(4), 2011.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.

Mroueh, Y., Sercu, T., and Goel, V. McGAN: Mean and Covariance Feature Matching GAN. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.

Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. In International Conference on Learning Representations (ICLR), 2018.

Müller, A. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29(2), 1997.

Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein Averaging of Kernel and Distance Matrices. In International Conference on Machine Learning (ICML), volume 48, 2016.

Radford, A., Metz, L., and Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2016.

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning (ICML), volume 48, 2016.

Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4), 1992.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs Using Optimal Transport. In International Conference on Learning Representations (ICLR), 2018.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.

Schmitzer, B. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems. arXiv preprint arXiv:1610.06519, 2016.

Schönemann, P. H. A generalized solution of the Orthogonal Procrustes problem. Psychometrika, 31(1), 1966.

Solomon, J., Peyré, G., Kim, V. G., and Sra, S. Entropic Metric Alignment for Correspondence Problems. ACM Transactions on Graphics (TOG), 35(4), 2016.


Sriperumbudur, B., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6, 2012.

Vorontsov, E., Trabelsi, C., Kadoury, S., and Pal, C. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.

Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yan, Y., Li, W., Wu, H., Min, H., Tan, M., and Wu, Q. Semi-Supervised Optimal Transport for Heterogeneous Domain Adaptation. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In International Conference on Computer Vision (ICCV), 2017.

Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative Visual Manipulation on the Natural Image Manifold. In European Conference on Computer Vision (ECCV), 2016.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In International Conference on Computer Vision (ICCV), 2017.


Appendix

A. Training of the GW GAN with Fixed Adversary

Figure 6. Results demonstrating the consistency of the GW training objective when fixing the adversary fω. Even after holding the adversary fω fixed and stopping its training after 10000 iterations, the generator does not diverge and remains consistent. All plots display 1000 samples.

B. Influence of Generator Constraints on the GW GAN

Figure 7. The GW GAN recovers relational and geometric properties of the reference distribution; global aspects can be determined via constraints on the generator gθ. When learning a mixture of four Gaussians, we can enforce centering around the origin by penalizing the ℓ1-norm of the generated samples. The results show training the GW GAN with and without an ℓ1-penalty (λ = 0.001). While the GW GAN recovers all modes in both cases, learning with ℓ1-regularization centers the resulting distribution around the origin and determines its orientation in the Euclidean plane. All plots display 1000 samples.


C. Influence of the Adversary

Figure 8. Learning the intra-space metrics adversarially is crucial for high-dimensional applications. a. For simple applications such as generating 2D Gaussian distributions, the GW GAN performs well irrespective of the use of an adversary. b. However, for higher-dimensional inputs, such as generating MNIST digits, the use of the adversary is crucial. Without the adversary, the GW GAN is not able to recover the underlying target distributions, while the total variation regularization (tv-loss) effects a clustering of local intensities.


D. Comparison of the Effectiveness of Orthogonal Regularization Approaches

Figure 9. To avoid arbitrary distortion of the space by the adversary, we propose to regularize fω by (approximately) enforcing it to define a unitary transformation. We compare different methods of orthogonal regularization of neural networks, including random orthogonal initialization of the network's weights at the beginning of training (Saxe et al., 2014), and layerwise orthogonality constraints which penalize deviations of the weights from orthogonality, R_β(W_k) := β ‖W_k^T W_k − I‖²_F (Brock et al., 2017). Brock et al. (2019) remove the diagonal terms from the regularization, which enforces the weights to be orthogonal but does not constrain their norms, R_β(W_k) := β ‖W_k^T W_k ⊙ (1 − I)‖²_F. The approaches of Saxe et al. (2014) and Brock et al. (2017; 2019) are not able to tightly constrain fω; as a result, the adversary is able to stretch the space and thus maximize the Gromov-Wasserstein distance. Only the orthogonal Procrustes-based regularization introduced in this paper is able to effectively regularize the adversary fω, preventing arbitrary distortions of the intra-space distances in the feature space. All plots display 1000 samples.


E. Comparison of the GW GAN with Salimans et al. (2018)

Figure 10. Comparison of the performance of the GW GAN (this paper, ℓ1-penalty λ = 0.001) and the OT-based generative model proposed by Salimans et al. (2018) (OT GAN). The GW GAN learns mixtures of Gaussians with differing numbers of modes and arrangements. The approach of Salimans et al. (2018) is not able to recover a mixture of five Gaussians with a centered component. In the case of a mixture of four Gaussian distributions, both models are able to recover the reference distribution in a similar number of iterations. All plots display 1000 samples.

F. Comparison of Training Times

Model                                                                   Average Training Time (Seconds per Epoch)
Wasserstein GAN with Gradient Penalty (Gulrajani et al., 2017)          17.57 ± 2.07
Sinkhorn GAN (Genevay et al., 2018) (default configuration, ε = 0.1)    145.52 ± 1.90
Sinkhorn GAN (Genevay et al., 2018) (default configuration, ε = 0.005)  153.86 ± 1.64
GW GAN (this paper, ε = 0.005)                                          156.62 ± 1.06

Table 1. Training time comparison of PyTorch implementations of different GAN architectures. The generative models were trained on generating MNIST digits and their average training time per epoch was recorded. All experiments were performed on a single GPU for consistency.


G. Training Details of the Style Adversary

We introduce a novel framework which allows a modular application of style transfer tasks by integrating a style adversary into the architecture of the Gromov-Wasserstein GAN. In order to demonstrate the practicability of this modular framework, we learn MNIST digits and enforce their font style to be bold via additional design constraints. The style adversary is parametrized by a binary classifier trained on handwritten letters of the EMNIST dataset (Cohen et al., 2017), which were assigned bold and thin class labels based on the letterwise ℓ1-norm of each image. As the style adversary is trained on a different dataset, it is independent of the original learning task. The binary classifier is parametrized by a convolutional neural network and trained with a binary cross-entropy loss. The dataset, classification results for bold and thin letters, as well as the loss curve of training the binary classifier are shown in Figure 11.

Figure 11. Training of a binary classifier to discriminate between bold and thin letters. a. Training set of the EMNIST dataset including bold and thin letters. Output of the trained network for letters labelled as b. bold and c. thin. d. Loss curve corresponding to the training of the binary classifier (binary cross-entropy loss over training iterations).