



EE-559 – Deep learning

10. Generative Adversarial Networks

Francois Fleuret

https://fleuret.org/dlc/

[version of: May 17, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Adversarial generative models

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 2 / 81


A different approach to learning high-dimensional generative models is the Generative Adversarial Network (GAN), proposed by Goodfellow et al. (2014).

The idea behind GANs is to train two networks jointly:

• A generator G to map a Z following a [simple] fixed distribution to the desired “real” distribution, and

• a discriminator D to classify data points as “real” or “fake” (i.e. from G).

The approach is adversarial since the two networks have antagonistic objectives.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 3 / 81

A bit more formally, let X be the signal space and D the latent space dimension.

• The generator

  G : R^D → X

  is trained so that [ideally] if it gets a random normal-distributed Z as input, it produces a sample following the data distribution as output.

• The discriminator

  D : X → [0, 1]

  is trained so that if it gets a sample as input, it predicts whether it is genuine.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 4 / 81


If G is fixed, to train D given a set of “real points”

  x_n ∼ µ, n = 1, . . . , N,

we can generate

  z_n ∼ N(0, I), n = 1, . . . , N,

build a two-class data-set

  D = { (x_1, 1), . . . , (x_N, 1), (G(z_1), 0), . . . , (G(z_N), 0) },

whose first N pairs are real samples (∼ µ) and last N pairs are fake samples (∼ µ_G), and minimize the binary cross-entropy

  L(D) = − 1/(2N) ( Σ_{n=1}^{N} log D(x_n) + Σ_{n=1}^{N} log(1 − D(G(z_n))) )
       = − 1/2 ( E_{X∼µ}[ log D(X) ] + E_{X∼µ_G}[ log(1 − D(X)) ] ),

where µ is the true distribution of the data, and µ_G is the distribution of G(Z) with Z ∼ N(0, I).

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 5 / 81

The situation is slightly more complicated since we also want to optimize G to maximize D’s loss.

Goodfellow et al. (2014) provide an analysis of the resulting equilibrium of that strategy.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 6 / 81


Let’s define

  V(D, G) = E_{X∼µ}[ log D(X) ] + E_{X∼µ_G}[ log(1 − D(X)) ]

which is high if D is doing a good job (low cross-entropy), and low if G fools D.

Our ultimate goal is a G* that fools any D, so

  G* = argmin_G max_D V(D, G).

If we define the optimal discriminator for a given generator

  D*_G = argmax_D V(D, G),

our objective becomes

  G* = argmin_G V(D*_G, G).

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 7 / 81

We have

  V(D, G) = E_{X∼µ}[ log D(X) ] + E_{X∼µ_G}[ log(1 − D(X)) ]
          = ∫_x µ(x) log D(x) + µ_G(x) log(1 − D(x)) dx.

Since

  argmax_d µ(x) log d + µ_G(x) log(1 − d) = µ(x) / (µ(x) + µ_G(x)),

and

  D*_G = argmax_D V(D, G),

if there is no regularization on D, we get

  ∀x, D*_G(x) = µ(x) / (µ(x) + µ_G(x)).
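To see why the pointwise maximizer has this form (a short check, not on the original slide): for a fixed x, setting the derivative of d ↦ µ(x) log d + µ_G(x) log(1 − d) to zero gives

  µ(x)/d − µ_G(x)/(1 − d) = 0   ⇔   d = µ(x) / (µ(x) + µ_G(x)),

and the second derivative is negative on (0, 1), so this stationary point is indeed the maximum.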

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 8 / 81


So, since

  ∀x, D*_G(x) = µ(x) / (µ(x) + µ_G(x)),

we get

  V(D*_G, G) = E_{X∼µ}[ log D*_G(X) ] + E_{X∼µ_G}[ log(1 − D*_G(X)) ]
             = E_{X∼µ}[ log µ(X) / (µ(X) + µ_G(X)) ] + E_{X∼µ_G}[ log µ_G(X) / (µ(X) + µ_G(X)) ]
             = D_KL( µ ‖ (µ + µ_G)/2 ) + D_KL( µ_G ‖ (µ + µ_G)/2 ) − log 4
             = 2 D_JS(µ, µ_G) − log 4,

where D_JS is the Jensen–Shannon divergence, a standard dissimilarity measure between distributions.
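For reference, the two divergences used above are the standard ones (these definitions are not restated on the slide):

  D_KL(p ‖ q) = E_{X∼p}[ log p(X)/q(X) ],
  D_JS(p, q) = (1/2) D_KL(p ‖ (p + q)/2) + (1/2) D_KL(q ‖ (p + q)/2),

and the −log 4 constant appears because log µ/(µ + µ_G) = log µ/((µ + µ_G)/2) − log 2 in each of the two expectations.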

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 9 / 81

To recap: if there is no capacity limitation for D, and if we define

  V(D, G) = E_{X∼µ}[ log D(X) ] + E_{X∼µ_G}[ log(1 − D(X)) ],

computing

  G* = argmin_G max_D V(D, G)

amounts to computing

  G* = argmin_G D_JS(µ, µ_G),

where D_JS is a reasonable dissimilarity measure between distributions.

⚠ Although this derivation provides a nice formal framework, in practice D is not “fully” optimized to [come close to] D*_G when optimizing G.

In our minimal example, we alternate gradient steps to improve G and D.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 10 / 81


import torch
from torch import nn, optim
from torch.autograd import Variable  # PyTorch 0.3-style API, as in the original course code

# nb_epochs and real_samples (an N x 2 tensor of training points) are assumed
# to be defined elsewhere in the course code.

z_dim, nb_hidden = 8, 100

model_G = nn.Sequential(nn.Linear(z_dim, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 2))

model_D = nn.Sequential(nn.Linear(2, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 1),
                        nn.Sigmoid())

batch_size, lr = 10, 1e-3

optimizer_G = optim.Adam(model_G.parameters(), lr = lr)
optimizer_D = optim.Adam(model_D.parameters(), lr = lr)

for e in range(nb_epochs):
    for t, real_batch in enumerate(real_samples.split(batch_size)):
        z = Variable(real_batch.new(real_batch.size(0), z_dim).normal_())
        fake_batch = model_G(z)
        real_batch = Variable(real_batch)

        D_scores_on_fake = model_D(fake_batch)
        D_scores_on_real = model_D(real_batch)

        if t%2 == 0:
            # G step: minimize E[log(1 - D(G(Z)))]
            loss = (1 - D_scores_on_fake).log().mean()
            optimizer_G.zero_grad()
            loss.backward()
            optimizer_G.step()
        else:
            # D step: minimize the binary cross-entropy
            loss = - (D_scores_on_real.log().mean() + (1 - D_scores_on_fake).log().mean())
            optimizer_D.zero_grad()
            loss.backward()
            optimizer_D.step()

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 11 / 81

[Figure: scatter plots of real (“Real”) and generated (“Synth”) samples for the configurations labelled 2d, 8d, and 32d, at several stages of training; x-axis from −6 to 6, y-axis from −4 to 4.]

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 12 / 81


In more realistic settings, the fake samples may be initially so “unrealistic” that the response of D saturates. That causes the loss for G

  E_{X∼µ_G}[ log(1 − D(X)) ]

to be far in the exponential tail of D’s sigmoid, and to have zero gradient, since log(1 + ε) ≃ ε does not correct it in any way.

Goodfellow et al. suggest replacing this term with a non-saturating cost

  −E_{X∼µ_G}[ log D(X) ]

so that the log fixes D’s exponential behavior. The resulting optimization problem has the same optima as the original one.

⚠ The loss for D remains unchanged.
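In the notation of the earlier minimal example, the difference between the two generator costs would look as follows (a sketch, not part of the original slides; D_scores_on_fake stands for D(G(Z)) as before):

# Saturating cost (original formulation): minimize E[log(1 - D(G(Z)))].
loss_G_saturating = (1 - D_scores_on_fake).log().mean()

# Non-saturating cost suggested by Goodfellow et al.: minimize -E[log D(G(Z))],
# which keeps a useful gradient even when D(G(Z)) is close to 0.
loss_G_non_saturating = - D_scores_on_fake.log().mean()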

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 13 / 81

Model              MNIST       TFD
DBN [3]            138 ± 2     1909 ± 66
Stacked CAE [3]    121 ± 1.6   2110 ± 50
Deep GSN [6]       214 ± 1.1   1890 ± 29
Adversarial nets   225 ± 2     2057 ± 26

Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross-validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.

of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.

In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.

Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator)

(Goodfellow et al., 2014)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 14 / 81


Training a standard GAN often results in two pathological behaviors:

• Oscillations without convergence. Contrary to standard loss minimization, we have no guarantee here that it will actually decrease.

• The infamous “mode collapse”, when G models very well a small sub-population, concentrating on a few modes.

Additionally, performance is hard to assess and often boils down to a “beauty contest”.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 15 / 81

Deep Convolutional GAN

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 16 / 81


“We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.”

(Radford et al., 2015)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 17 / 81

Radford et al. converged to the following rules:

• Replace pooling layers with strided convolutions in D and strided transposed convolutions in G,

• use batchnorm in both D and G,

• remove fully connected hidden layers,

• use ReLU in G except for the output, which uses Tanh,

• use LeakyReLU activation in D for all layers.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 18 / 81


Under review as a conference paper at ICLR 2016

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.

suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped stabilize training.

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.

4.1.1 DEDUPLICATION

To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.

4.2 FACES

We scraped images containing human faces from random web image queries of peoples names. The people names were acquired from dbpedia, with a criterion that they were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.


(Radford et al., 2015)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 19 / 81

We can have a look at the reference implementation provided in

https://github.com/pytorch/examples.git

# default nz = 100, ngf = 64

class _netG(nn.Module):
    def __init__(self, ngpu):
        super(_netG, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf * 8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf * 4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf * 2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 20 / 81


# default nz = 100, ndf = 64

class _netD(nn.Module):
    def __init__(self, ngpu):
        super(_netD, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf * 2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf * 4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf * 8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 21 / 81

# custom weights initialization called on netG and netD
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

criterion = nn.BCELoss()

input = torch.FloatTensor(opt.batchSize, 3, opt.imageSize, opt.imageSize)
noise = torch.FloatTensor(opt.batchSize, nz, 1, 1)
fixed_noise = torch.FloatTensor(opt.batchSize, nz, 1, 1).normal_(0, 1)
label = torch.FloatTensor(opt.batchSize)
real_label = 1
fake_label = 0

fixed_noise = Variable(fixed_noise)

# setup optimizer
optimizerD = optim.Adam(netD.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))

for epoch in range(opt.niter):
    for i, data in enumerate(dataloader, 0):


        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))

        # train with real
        netD.zero_grad()
        real_cpu, _ = data
        batch_size = real_cpu.size(0)
        if opt.cuda:
            real_cpu = real_cpu.cuda()
        input.resize_as_(real_cpu).copy_(real_cpu)
        label.resize_(batch_size).fill_(real_label)
        inputv = Variable(input)
        labelv = Variable(label)

        output = netD(inputv)
        errD_real = criterion(output, labelv)
        errD_real.backward()
        D_x = output.data.mean()

        # train with fake
        noise.resize_(batch_size, nz, 1, 1).normal_(0, 1)
        noisev = Variable(noise)
        fake = netG(noisev)
        labelv = Variable(label.fill_(fake_label))
        output = netD(fake.detach())  # detach so that this pass does not backpropagate into G
        errD_fake = criterion(output, labelv)
        errD_fake.backward()
        D_G_z1 = output.data.mean()
        errD = errD_real + errD_fake
        optimizerD.step()

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 23 / 81

        # (2) Update G network: maximize log(D(G(z)))

        netG.zero_grad()
        # fake labels are real for generator cost
        labelv = Variable(label.fill_(real_label))
        output = netD(fake)
        errG = criterion(output, labelv)
        errG.backward()
        D_G_z2 = output.data.mean()
        optimizerG.step()

This update of G minimizes the loss with inverted labels instead of maximizing it for the correct ones, and hence implements the non-saturating loss.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 24 / 81


Real images from LSUN’s “bedroom” class.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 25 / 81

Fake images after 1 epoch (3M images)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 26 / 81


Fake images after 20 epochs

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 27 / 81

Wasserstein GAN

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 28 / 81


Arjovsky et al. (2017) point out that D_JS does not account [much] for the metric structure of the space.

[Figure: two example distributions µ and µ′ whose supports are offset by x, with δ the width of the displaced part, together with the closed-form value of D_JS(µ, µ′), which stops varying with x as soon as |x| exceeds δ.]

Hence all |x| greater than δ are seen the same.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 29 / 81

An alternative choice is the “earth moving distance”, which intuitively is the minimum mass displacement to transform one distribution into the other.

[Figure: the two densities below, plotted over the axis 1, 2, . . . , 10, with the three displaced chunks of mass labelled 4 × (1/4), 2 × (1/4), and 3 × (1/2).]

  µ = (1/4) 1_{[1,2]} + (1/4) 1_{[3,4]} + (1/2) 1_{[9,10]},    µ′ = (1/2) 1_{[5,7]}

  W(µ, µ′) = 4 × (1/4) + 2 × (1/4) + 3 × (1/2) = 3

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 30 / 81


This distance is also known as the Wasserstein distance, defined as

  W(µ, µ′) = min_{q ∈ Π(µ, µ′)} E_{(X, X′)∼q}[ ‖X − X′‖ ],

where Π(µ, µ′) is the set of distributions over X² whose marginals are µ and µ′.
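In one dimension, W(µ, µ′) equals the L1 distance between the two cumulative distribution functions, so the toy example above can be checked numerically. This is a small sketch (not from the slides), assuming NumPy is available:

import numpy as np

t = np.linspace(0, 11, 100001)  # fine grid over the support
dt = t[1] - t[0]

# Densities of the toy example: µ and µ′ as defined above.
mu  = 0.25 * ((t >= 1) & (t <= 2)) + 0.25 * ((t >= 3) & (t <= 4)) + 0.5 * ((t >= 9) & (t <= 10))
mup = 0.5  * ((t >= 5) & (t <= 7))

# In 1D, W(µ, µ′) is the integral of |CDF_µ(t) - CDF_µ′(t)|.
cdf_mu, cdf_mup = np.cumsum(mu) * dt, np.cumsum(mup) * dt
print(np.sum(np.abs(cdf_mu - cdf_mup)) * dt)  # ≈ 3.0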

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 31 / 81

Intuitively, it increases monotonically with the distance between modes

[Figure: the same two distributions µ and µ′ as before, offset by x.]

  W(µ, µ′) = (1/2) |x|

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 32 / 81


So it would make a lot of sense to look for a generator matching the density for this metric, that is

  G* = argmin_G W(µ, µ_G).

Unfortunately, the definition of W does not provide an operational way of estimating it.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 33 / 81

A duality theorem from Kantorovich and Rubinstein implies

  W(µ, µ′) = max_{‖f‖_L ≤ 1} E_{X∼µ}[ f(X) ] − E_{X∼µ′}[ f(X) ]

where

  ‖f‖_L = max_{x, x′} ‖f(x) − f(x′)‖ / ‖x − x′‖

is the Lipschitz seminorm.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 34 / 81


[Figure: a 1-Lipschitz function f, with values between −1 and 3, plotted over the axis 1, 2, . . . , 10 on top of the two densities below.]

  µ = (1/4) 1_{[1,2]} + (1/4) 1_{[3,4]} + (1/2) 1_{[9,10]},    µ′ = (1/2) 1_{[5,7]}

  W(µ, µ′) = ( 3 × (1/4) + 1 × (1/4) + 2 × (1/2) ) − ( −1 × (1/2) − 1 × (1/2) ) = 3,

where the first parenthesis is E_{X∼µ}[ f(X) ] and the second is E_{X∼µ′}[ f(X) ].

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 35 / 81

Using this result, we are looking for a generator

  G* = argmin_G W(µ, µ_G)
     = argmin_G max_{‖D‖_L ≤ 1} ( E_{X∼µ}[ D(X) ] − E_{X∼µ_G}[ D(X) ] ),

where the max is now over an optimized predictor D.

This is very similar to the original GAN formulation, except that the value of D is not interpreted through a log-loss, and there is a strong regularization on D.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 36 / 81


The main issue in this formulation is to optimize the network D under a constraint on its Lipschitz seminorm

‖D‖L ≤ 1.

Arjovsky et al. achieve this by clipping D’s weights.
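In the style of the earlier minimal example (a sketch under these assumptions, not Arjovsky et al.'s reference code), one critic / generator update could look as follows; here model_D has no final sigmoid and plays the role of the critic, and clip_value is a hypothetical constant (e.g. 0.01):

# Enforce the Lipschitz constraint by clipping the critic's weights.
for p in model_D.parameters():
    p.data.clamp_(-clip_value, clip_value)

D_scores_on_real = model_D(real_batch)
D_scores_on_fake = model_D(model_G(z))

# Critic: maximize E[D(X)] - E[D(G(Z))], i.e. minimize the opposite.
loss_D = - (D_scores_on_real.mean() - D_scores_on_fake.mean())

# Generator: minimize -E[D(G(Z))].
loss_G = - model_D(model_G(z)).mean()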

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 37 / 81

The two main benefits observed by Arjovsky et al. are

• A greater stability of the learning process, both in principle and in their experiments: they do not witness “mode collapse”.

• A greater interpretability of the loss, which is a better indicator of the quality of the samples.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 38 / 81


Figure 2: Optimal discriminator and critic when learning to differentiate two Gaussians. As we can see, the traditional GAN discriminator saturates and results in vanishing gradients. Our WGAN critic provides very clean gradients on all parts of the space.

4 Empirical Results

We run experiments on image generation using our Wasserstein-GAN algorithm and show that there are significant practical benefits to using it over the formulation used in standard GANs.

We claim two main benefits:

• a meaningful loss metric that correlates with the generator’s convergence and sample quality

• improved stability of the optimization process

4.1 Experimental Procedure

We run experiments on image generation. The target distribution to learn is the LSUN-Bedrooms dataset [24] – a collection of natural images of indoor bedrooms. Our baseline comparison is DCGAN [18], a GAN with a convolutional architecture trained with the standard GAN procedure using the − log D trick [4]. The generated samples are 3-channel images of 64x64 pixels in size. We use the hyper-parameters specified in Algorithm 1 for all of our experiments.

(Arjovsky et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 39 / 81

Figure 4: JS estimates for an MLP generator (upper left) and a DCGAN generator (upper right) trained with the standard GAN procedure. Both had a DCGAN discriminator. Both curves have increasing error. Samples get better for the DCGAN but the JS estimate increases or stays constant, pointing towards no significant correlation between sample quality and loss. Bottom: MLP with both generator and discriminator. The curve goes up and down regardless of sample quality. All training curves were passed through the same median filter as in Figure 3.

to stare at the generated samples to figure out failure modes and to gain information on which models are doing better over others.

However, we do not claim that this is a new method to quantitatively evaluate generative models yet. The constant scaling factor that depends on the critic’s architecture means it’s hard to compare models with different critics. Even more, in practice the fact that the critic doesn’t have infinite capacity makes it hard to know just how close to the EM distance our estimate really is. This being said, we have successfully used the loss metric to validate our experiments repeatedly and without failure, and we see this as a huge improvement in training GANs which previously had no such facility.

In contrast, Figure 4 plots the evolution of the GAN estimate of the JS distance during GAN training. More precisely, during GAN training, the discriminator is trained to maximize

  L(D, gθ) = E_{x∼Pr}[log D(x)] + E_{x∼Pθ}[log(1 − D(x))]

which is a lower bound of 2 JS(Pr, Pθ) − 2 log 2. In the figure, we plot the quantity (1/2) L(D, gθ) + log 2, which is a lower bound of the JS distance.

This quantity clearly correlates poorly with the sample quality. Note also that the

(Arjovsky et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 40 / 81


Figure 3: Training curves and samples at different stages of training. We can see a clear correlation between lower error and better sample quality. Upper left: the generator is an MLP with 4 hidden layers and 512 units at each layer. The loss decreases consistently as training progresses and sample quality increases. Upper right: the generator is a standard DCGAN. The loss decreases quickly and sample quality increases as well. In both upper plots the critic is a DCGAN without the sigmoid so losses can be subjected to comparison. Lower half: both the generator and the discriminator are MLPs with substantially high learning rates (so training failed). Loss is constant and samples are constant as well. The training curves were passed through a median filter for visualization purposes.

4.2 Meaningful loss metric

Because the WGAN algorithm attempts to train the critic f (lines 2–8 in Algorithm 1) relatively well before each generator update (line 10 in Algorithm 1), the loss function at this point is an estimate of the EM distance, up to constant factors related to the way we constrain the Lipschitz constant of f.

Our first experiment illustrates how this estimate correlates well with the quality of the generated samples. Besides the convolutional DCGAN architecture, we also ran experiments where we replace the generator or both the generator and the critic by 4-layer ReLU-MLP with 512 hidden units.

Figure 3 plots the evolution of the WGAN estimate (3) of the EM distance during WGAN training for all three architectures. The plots clearly show that these curves correlate well with the visual quality of the generated samples.

To our knowledge, this is the first time in GAN literature that such a property is shown, where the loss of the GAN shows properties of convergence. This property is extremely useful when doing research in adversarial networks as one does not need

(Arjovsky et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 41 / 81

However, as Arjovsky et al. wrote:

“Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, or batch normalization is not used (such as in RNNs).”

(Arjovsky et al., 2017)

In some way, the resulting Wasserstein GAN (WGAN) trades the difficulty to train G for the difficulty to train D.

In practice, this weakness results in extremely long convergence time.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 42 / 81


Gulrajani et al. (2017) proposed the improved Wasserstein GAN in which the constraint on the Lipschitz seminorm is replaced with a smooth penalty term.

They state that if

  D* = argmax_{‖D‖_L ≤ 1} ( E_{X∼µ}[ D(X) ] − E_{X∼µ_G}[ D(X) ] )

then, with probability one under µ and µ_G,

  ‖∇D*(X)‖ = 1.

This implies that adding a regularization that pushes the gradient norm to one should not exclude [any of] the optimal discriminator[s].

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 43 / 81

So instead of looking for

  argmax_{‖D‖_L ≤ 1} E_{X∼µ}[ D(X) ] − E_{X∼µ_G}[ D(X) ],

Gulrajani et al. propose to solve

  argmax_D E_{X∼µ}[ D(X) ] − E_{X∼µ_G}[ D(X) ] − λ E_{X∼µ_p}[ (‖∇D(X)‖ − 1)² ]

where µ_p is the distribution of a point B sampled uniformly between a real sample X and a fake sample G(Z), that is B = U X + (1 − U) X′ where X ∼ µ, X′ ∼ µ_G, and U ∼ U[0, 1].

Note that this loss involves second-order derivatives.

Experiments show that this scheme is more stable than WGAN under many different conditions.
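A common way to implement this penalty with automatic differentiation (a sketch, not Gulrajani et al.'s reference code; model_D, real_batch and fake_batch are assumed to be as in the earlier examples, and lambda_gp is a hypothetical weighting) relies on torch.autograd.grad with create_graph=True, which is what makes the second-order derivatives mentioned above available during the backward pass:

import torch

def gradient_penalty(model_D, real_batch, fake_batch, lambda_gp = 10.0):
    # Sample B = U X + (1 - U) X' with one U per pair, broadcast over the remaining dimensions.
    u = torch.rand(real_batch.size(0), *([1] * (real_batch.dim() - 1)), device = real_batch.device)
    b = (u * real_batch + (1 - u) * fake_batch).detach().requires_grad_(True)

    scores = model_D(b)
    grads, = torch.autograd.grad(outputs = scores.sum(), inputs = b, create_graph = True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim = 1)

    # Push the gradient norm toward one.
    return lambda_gp * ((grad_norm - 1) ** 2).mean()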

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 44 / 81


Conditional GAN

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 45 / 81

All the models we have seen so far model a density in high dimension and provide means to sample according to it, which is useful for synthesis only.

However, most of the practical applications require the ability to sample a conditional distribution. E.g.:

• next frame prediction,

• “in-painting”,

• segmentation,

• style transfer.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 46 / 81


The Conditional GAN proposed by Mirza and Osindero (2014) consists of parameterizing both G and D by a conditioning quantity Y:

  V(D, G) = E_{(X,Y)∼µ}[ log D(X, Y) ] + E_{Z∼N(0,I), Y∼µ_Y}[ log(1 − D(G(Z, Y), Y)) ].
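A simple way to implement this conditioning (a sketch with illustrative layer sizes, not Mirza and Osindero's exact architecture) is to concatenate the conditioning quantity, e.g. a one-hot class vector, to the inputs of both networks:

import torch
from torch import nn

nb_classes, z_dim = 10, 100

class CondG(nn.Module):
    def __init__(self):
        super(CondG, self).__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + nb_classes, 256), nn.ReLU(),
                                 nn.Linear(256, 784), nn.Tanh())
    def forward(self, z, y_onehot):
        # G(Z, Y): the conditioning Y is simply appended to the latent Z.
        return self.net(torch.cat((z, y_onehot), dim = 1))

class CondD(nn.Module):
    def __init__(self):
        super(CondD, self).__init__()
        self.net = nn.Sequential(nn.Linear(784 + nb_classes, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x, y_onehot):
        # D(X, Y): the same conditioning is appended to the sample to classify.
        return self.net(torch.cat((x, y_onehot), dim = 1))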

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 47 / 81

To generate MNIST characters, with

  Z ∼ U([0, 1]^100),

and conditioned on the class y, encoded as a one-hot vector of dimension 10, they propose:

[Diagram: the generator maps z (100d) through a fully connected layer of 200 units and y (10d) through a fully connected layer of 1000 units, combines them into a fully connected layer of 1200 units, and produces x through a fully connected layer of 784 units; the discriminator maps x through a maxout layer of 240 units and y through a maxout layer of 50 units, combines them into a maxout layer of 240 units, and produces the scalar decision δ through a fully connected layer of 1 unit.]

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 48 / 81


Model                          MNIST
DBN [1]                        138 ± 2
Stacked CAE [1]                121 ± 1.6
Deep GSN [2]                   214 ± 1.1
Adversarial nets               225 ± 2
Conditional adversarial nets   132 ± 1.8

Table 1: Parzen window-based log-likelihood estimates for MNIST. We followed the same procedure as [8] for computing these values.

The discriminator maps x to a maxout [6] layer with 240 units and 5 pieces, and y to a maxout layer with 50 units and 5 pieces. Both of the hidden layers mapped to a joint maxout layer with 240 units and 4 pieces before being fed to the sigmoid layer. (The precise architecture of the discriminator is not critical as long as it has sufficient power; we have found that maxout units are typically well suited to the task.)

The model was trained using stochastic gradient descent with mini-batches of size 100 and initial learning rate of 0.1 which was exponentially decreased down to .000001 with decay factor of 1.00004. Also momentum was used with initial value of .5 which was increased up to 0.7. Dropout [9] with probability of 0.5 was applied to both the generator and discriminator. And best estimate of log-likelihood on the validation set was used as stopping point.

Table 1 shows Gaussian Parzen window log-likelihood estimates for the MNIST dataset test data. 1000 samples were drawn from each of the 10 classes and a Gaussian Parzen window was fitted to these samples. We then estimate the log-likelihood of the test set using the Parzen window distribution. (See [8] for more details of how this estimate is constructed.)

The conditional adversarial net results that we present are comparable with some other network based approaches, but are outperformed by several other approaches – including non-conditional adversarial nets. We present these results more as a proof-of-concept than as demonstration of efficacy, and believe that with further exploration of hyper-parameter space and architecture the conditional model should match or exceed the non-conditional results.

Fig 2 shows some of the generated samples. Each row is conditioned on one label and each column is a different generated sample.

Figure 2: Generated MNIST digits, each row conditioned on one label

4.2 Multimodal

Photo sites such as Flickr are a rich source of labeled data in the form of images and their associated user-generated metadata (UGM) — in particular user-tags.

(Mirza and Osindero, 2014)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 49 / 81

Image-to-Image translations

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 50 / 81


The main issue in generating a realistic signal is that the value X to predict may remain non-deterministic given the conditioning quantity Y.

For a loss function such as the MSE, the best fit is E(X | Y = y), which can be pretty different from the MAP, or from any reasonable sample from µ_{X|Y=y}.

In practice, for images, there often remains a location indeterminacy that results in a blurry prediction.

Sampling according to µ_{X|Y=y} is the proper way to address the problem.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 51 / 81

Isola et al. (2016) use conditional GANs to address this issue for the “translation” of images with pixel-to-pixel correspondence:

• edges to realistic photos,

• semantic segmentation,

• gray-scales to colors, etc.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 52 / 81


[Figure 2 panels: positive examples (real pairs) and negative examples (synthesized pairs) are presented to D with the question “Real or fake pair?”; G tries to synthesize fake images that fool D, and D tries to identify the fakes.]

Figure 2: Training a conditional GAN to predict aerial photos from maps. The discriminator, D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe an input image.

where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. G* = argmin_G max_D L_cGAN(G, D).

To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe x:

  L_GAN(G, D) = E_{y∼p_data(y)}[log D(y)] + E_{x∼p_data(x), z∼p_z(z)}[log(1 − D(G(x, z)))].   (2)

Previous approaches to conditional GANs have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [29]. The discriminator’s job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring:

  L_L1(G) = E_{x,y∼p_data(x,y), z∼p_z(z)}[ ‖y − G(x, z)‖_1 ].   (3)

Our final objective is

  G* = argmin_G max_D L_cGAN(G, D) + λ L_L1(G).   (4)

Without z, the net could still learn a mapping from x to y, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x (e.g., [39]). In initial experiments, we did not find

Figure 3: Two choices for the architecture of the generator (panels: Encoder-decoder, U-Net). The “U-Net” [34] is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.

this strategy effective – the generator simply learned to ignore the noise – which is consistent with Mathieu et al. [27]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe very minor stochasticity in the output of our nets. Designing conditional GANs that produce stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.

2.2. Network architectures

We adapt our generator and discriminator architectures from those in [30]. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU [18]. Details of the architecture are provided in the appendix, with key features discussed below.

2.2.1 Generator with skips

A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems we consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. We design the generator architecture around these considerations.

Many previous solutions [29, 39, 19, 48, 43] to problems in this area have used an encoder-decoder network [16]. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed (Figure 3). Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 53 / 81

They define

  V(D, G) = E_{(X,Y)∼µ}[ log D(Y, X) ] + E_{Z∼µ_Z, X∼µ_X}[ log(1 − D(G(Z, X), X)) ],

  L_L1(G) = E_{(X,Y)∼µ, Z∼N(0,I)}[ ‖Y − G(Z, X)‖_1 ],

and

  G* = argmin_G max_D V(D, G) + λ L_L1(G).

The term L_L1 pushes toward proper pixel-wise prediction, and V makes the generator prefer realistic images to better fitting pixel-wise.

⚠ Note that Isola et al. switch the meaning of X and Y w.r.t. Mirza and Osindero. Here X is the conditioning quantity and Y the signal to generate.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 54 / 81


For G, they start with Radford et al. (2015)’s DCGAN architecture and add skip connections from layer i to layer D − i that concatenate channels.
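As a minimal illustration (not the actual pix2pix code) of what “concatenate channels” means for such a skip connection, a decoder stage receives both the up-sampled feature map and the encoder feature map of the same resolution:

# x_up: decoder feature map; x_skip: encoder feature map from the mirrored layer (same spatial size).
# The two are stacked along the channel dimension before the next convolution.
x = torch.cat((x_up, x_skip), dim = 1)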


Randomness Z is provided through dropout, and not as an additional input.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 55 / 81

The discriminator D is a regular convnet which scores overlapping patches of size N × N and averages the scores to get the final one.

This controls the network’s complexity, while allowing it to detect any inconsistency of the generated image (e.g. blurriness).
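A minimal sketch of such a patch-based discriminator (sizes are illustrative, not Isola et al.'s exact 70×70 PatchGAN): a fully convolutional network produces one score per overlapping receptive field, and the scores are averaged:

from torch import nn

class PatchDiscriminator(nn.Module):
    def __init__(self, nc = 3, ndf = 64):
        super(PatchDiscriminator, self).__init__()
        # Each output unit of this fully convolutional stack scores one overlapping patch.
        self.features = nn.Sequential(
            nn.Conv2d(nc, ndf, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, 1, 4, 1, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        # Average the per-patch scores into a single score per image.
        return self.features(x).view(x.size(0), -1).mean(1)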

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 56 / 81


Figure 4: Different losses induce different quality of results. Each column (Input, Ground truth, L1, cGAN, L1 + cGAN) shows results trained under a different loss. Please see https://phillipi.github.io/pix2pix/ for additional examples.

Figure 6: Patch size variations (L1, 1x1, 16x16, 70x70, 256x256). Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1x1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16x16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70x70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 256x256 ImageGAN produces results that are visually similar to the 70x70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 2). Please see https://phillipi.github.io/pix2pix/ for additional examples.

put label maps. Combining all terms, L1+cGAN, performs similarly well.

Colorfulness  A striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map. One might imagine cGANs have a similar effect on “sharpening” in the spectral dimension – i.e. making images more colorful. Just as L1 will incentivize a blur when it is uncertain where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of several plausible color values a pixel should take on. Specially, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. An adversarial loss, on the other hand, can in principle become aware that grayish outputs are unrealistic, and encourage matching the true color distribution [14]. In Figure 7, we investigate if our cGANs actually achieve this effect on the Cityscapes dataset. The plots show the marginal distributions over output color values in Lab color space. The ground truth distributions are shown with a dotted line. It is apparent that L1 leads to a narrower distribution than the ground truth, confirming the hypothesis that L1 encourages average, grayish colors. Using a cGAN, on the other hand, pushes the output distribution closer to the ground truth.

3.3. Analysis of the generator architecture

A U-Net architecture allows low-level information to shortcut across the network. Does this lead to better results? Figure 5 compares the U-Net against an encoder-decoder on cityscape generation. The encoder-decoder is created simply by severing the skip connections in the U-Net. The encoder-decoder is unable to learn to generate realistic images in our experiments, and indeed collapses to producing nearly identical results for each input label map. The advantages of the U-Net appear not to be specific to conditional GANs: when both U-Net and encoder-decoder are trained

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 57 / 81



Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256x256 resolution, and run convolutionally on the larger images at test time). Contrast adjusted for clarity. Panel titles: “Map to aerial photo” and “Aerial photo to map” (columns: input, output).

Figure 11: Example results of our method on Cityscapes labels→photo, compared to ground truth (columns: Input, Ground truth, Output).

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 59 / 81



Figure 12: Example results of our method on facades labels→photo, compared to ground truth (columns: Input, Ground truth, Output).

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 61 / 81

Figure 13: Example results of our method on day→night, compared to ground truth (columns: Input, Ground truth, Output).

Figure 14: Example results of our method on automatically detected edges→handbags, compared to ground truth (columns: Input, Ground truth, Output).

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 62 / 81



Figure 15: Example results of our method on automatically detected edges→shoes, compared to ground truth (columns: Input, Ground truth, Output).

Figure 16: Example results of the edges→photo models applied to human-drawn sketches from [10]. Note that the models were trained on automatically detected edges, but generalize to human drawings (columns: Input, Output).

(Isola et al., 2016)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 64 / 81


The main drawback of this technique is that it requires pairs of samples with pixel-to-pixel correspondence.

In many cases, one only has at one's disposal examples from two densities, and wants to translate a sample from the first (“images of apples”) into a sample likely under the second (“images of oranges”).

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 65 / 81

We consider X, a r.v. on X, to be a sample from the first data-set, and Y, a r.v. on Y, a sample from the second data-set. Zhu et al. (2017) propose to train at the same time two mappings

G : X → Y

F : Y → X

such that

G(X) ∼ µY, and F ◦ G(X) ≃ X,

where the matching in density is characterized with a discriminator DY, and the reconstruction with the L1 loss. They also do this both ways, symmetrically.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 66 / 81


Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.

… image cannot be distinguished from images in the target domain.

Image-to-Image Translation. The idea of image-to-image translation goes back at least to Hertzmann et al.'s Image Analogies [18], who employ a nonparametric texture model [9] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs, e.g. [31]. Our approach builds on the “pix2pix” framework of Isola et al. [21], which uses a conditional generative adversarial network [15] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [43] or from attribute and semantic layouts [23]. However, unlike these prior works, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation. Several other methods also tackle the unpaired setting, where the goal is to relate two data domains, X and Y. Rosales et al. [40] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image, and a likelihood term obtained from multiple style images. More recently, CoGAN [30] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [29] extends this framework with a combination of variational autoencoders [25] and generative adversarial networks. Another line of concurrent work [45, 48, 2] encourages the input and output to share certain “content” features even though they may differ in “style”. They also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [45], and image feature space [48].

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency. The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [47]. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [50]), as well as by machines [16]. More recently, higher-order cycle consistency has been used in structure from motion [60], 3D shape matching [20], co-segmentation [54], dense semantic alignment [63, 64], and depth estimation [13]. Of these, Zhou et al. [64] and Godard et al. [13] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [58] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [16].

Neural Style Transfer [12, 22, 51, 11] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our main focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as …

(Zhu et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 67 / 81

The generator is from Johnson et al. (2016), an updated version of the one from Radford et al. (2015)'s DCGAN.

The loss optimized alternately is

V∗(G, F, DX, DY) = V(G, DY, X, Y) + V(F, DX, Y, X)
                 + λ ( E[ ‖F(G(X)) − X‖1 ] + E[ ‖G(F(Y)) − Y‖1 ] )

where V is a quadratic loss, instead of the usual log (Mao et al., 2016):

V(G, DY, X, Y) = E[ (DY(Y) − 1)^2 ] + E[ DY(G(X))^2 ].
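To make the objective concrete, here is a minimal PyTorch sketch of these two terms for one mini-batch. The module names g_xy, g_yx (generators) and d_x, d_y (discriminators), as well as the weight lambda_cycle, are illustrative assumptions, not the authors' code.

import torch
from torch import nn

mse = nn.MSELoss()   # quadratic adversarial loss (Mao et al., 2016)
l1 = nn.L1Loss()     # cycle-consistency reconstruction loss
lambda_cycle = 10.0  # weight of the cycle-consistency term (assumed value)

def generator_objective(x, y, g_xy, g_yx, d_x, d_y):
    # Translate in both directions
    fake_y, fake_x = g_xy(x), g_yx(y)
    # Adversarial terms: generated samples should be scored as real (1)
    out_y, out_x = d_y(fake_y), d_x(fake_x)
    adversarial = mse(out_y, torch.ones_like(out_y)) \
                + mse(out_x, torch.ones_like(out_x))
    # Cycle-consistency terms: F(G(X)) ~ X and G(F(Y)) ~ Y, in L1 norm
    cycle = l1(g_yx(fake_y), x) + l1(g_xy(fake_x), y)
    return adversarial + lambda_cycle * cycle

def discriminator_objective(x, y, g_xy, g_yx, d_x, d_y):
    # Real samples scored toward 1, generated ones (detached) toward 0
    out_real_y, out_fake_y = d_y(y), d_y(g_xy(x).detach())
    out_real_x, out_fake_x = d_x(x), d_x(g_yx(y).detach())
    return mse(out_real_y, torch.ones_like(out_real_y)) \
         + mse(out_fake_y, torch.zeros_like(out_fake_y)) \
         + mse(out_real_x, torch.ones_like(out_real_x)) \
         + mse(out_fake_x, torch.zeros_like(out_fake_x))

These two objectives would be minimized alternately, updating the generators and the discriminators with their respective optimizers.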

As always, there are plenty of specific technical details in the models and the training, e.g. using a history of generated images (Shrivastava et al., 2016).

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 68 / 81


Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Jun-Yan Zhu∗, Taesung Park∗, Phillip Isola, Alexei A. Efros
Berkeley AI Research (BAIR) laboratory, UC Berkeley

Monet ↔ Photos · Zebras ↔ Horses · Summer ↔ Winter · Photograph → Monet / Van Gogh / Cezanne / Ukiyo-e

Figure 1: Given any two unordered image collections X and Y, our algorithm learns to automatically “translate” an image from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of famous artists, our method learns to render natural photographs into the respective styles.

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these …

* indicates equal contribution



(Zhu et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 69 / 81

horse → zebra, zebra → horse · summer Yosemite → winter Yosemite, winter Yosemite → summer Yosemite · apple → orange, orange → apple

Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results – please see our website for more comprehensive and random results. In the top two rows, we show results on object transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra class in ImageNet [41]. The middle two rows show results on season transfer, trained on winter and summer photos of Yosemite from Flickr. In the bottom two rows, we train our method on 996 apple images and 1020 navel orange images from ImageNet.

(Zhu et al., 2017)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 70 / 81



While GANs are often used for their [theoretical] ability to model a distribution fully and accurately, generating consistent samples is enough for image-to-image translation.

In particular, this application does not suffer much from mode collapse, as long as the generated images “look nice”.

The key aspect of the GAN here is the “perceptual loss” that the discriminator implements, more than the theoretical convergence to the true distribution.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 72 / 81


Model persistence and checkpoints in PyTorch

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 73 / 81

Saving and loading models is key to re-using previously trained models.

It also allows the implementation of checkpoints, which keep track of the state during training, and make it possible either to restart after an interruption, or to modulate meta-parameters manually.

The underlying operation is serialization, that is the transcription of an arbitrary object into a sequence of bytes saved on disk.

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 74 / 81


The main PyTorch methods for serializing are torch.save(obj, filename) and torch.load(filename).

>>> x = 34
>>> torch.save(x, 'x.pth')
>>> y = torch.load('x.pth')
>>> y
34

>>> z = { 'a': torch.LongTensor(2, 3).random_(10),
...       'b': nn.Linear(10, 20) }
>>> torch.save(z, 'z.pth')
>>> w = torch.load('z.pth')
>>> w
{'a':
 2  2  3
 8  9  8
[torch.LongTensor of size 2x3]
, 'b': Linear(in_features=10, out_features=20)}

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 75 / 81

One can directly save a full model like this, including arbitrary fields:

>>> x = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1))
>>> x.blah = 14
>>> torch.save(x, 'model.pth')
>>>
>>> z = torch.load('model.pth')
>>> z(Variable(Tensor(2, 3).normal_()))
Variable containing:
 0.2408
 0.0929
[torch.FloatTensor of size 2x1]
>>> z.blah
14

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 76 / 81


Saving a full model with torch.save() binds the saved quantities to the specific class implementation, and may break after changes in the code.

The suggested policy is to save the state dictionary alone, as provided by Module.state_dict(), which encompasses Parameters and buffers such as batchnorm running estimates, etc.
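For instance, a minimal sketch of this policy, assuming a model class Net as in the rest of this lecture and an illustrative file name:

# Save only the parameters and buffers
torch.save(model.state_dict(), 'model_state.pth')

# Re-create the model from the code, then restore its state
model = Net()
model.load_state_dict(torch.load('model_state.pth'))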

Additionally:

• Tensors are saved with their locations (CPU, or GPU), and will be loaded in the same configuration,

• in your Modules, buffers have to be identified with register_buffer (see the sketch after this list),

• loaded models are in train mode by default,

• optimizers have a state too (e.g. momentum for SGD, moment estimates for Adam).
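Regarding the second point, a minimal sketch of a module that registers buffers so that they are included in its state dictionary (the module and tensor names are illustrative):

from torch import nn

class Normalizer(nn.Module):
    def __init__(self, mu, std):
        super(Normalizer, self).__init__()
        # mu and std are part of the module's state but are not Parameters.
        # Registering them as buffers makes them appear in state_dict(),
        # so they are saved, restored, and moved to the GPU with the module.
        self.register_buffer('mu', mu)
        self.register_buffer('std', std)

    def forward(self, x):
        return (x - self.mu) / self.std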

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 77 / 81

A checkpoint is a persistent object that keeps the global state of the training: model and optimizer. In the following example, (1) we load it at start-up if it exists, and (2) we save it at every epoch.

criterion = nn.CrossEntropyLoss()

nb_epochs_finished = 0

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)

checkpoint_name = 'checkpoint.pth'

try:
    checkpoint = torch.load(checkpoint_name)
    nb_epochs_finished = checkpoint['nb_epochs_finished']
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    print('Checkpoint loaded with {:d} epochs finished.'.format(nb_epochs_finished))

except FileNotFoundError:
    print('Starting from scratch.')

except:
    print('Error when loading the checkpoint.')
    exit(1)

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    model.cuda()
    criterion.cuda()

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 78 / 81


for k in range(nb_epochs_finished, nb_epochs):

    acc_loss = 0

    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input.narrow(0, b, batch_size))
        loss = criterion(output, train_target.narrow(0, b, batch_size))
        acc_loss += loss.data[0]

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(k, acc_loss)

    checkpoint = {
        'nb_epochs_finished': k + 1,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict()
    }

    torch.save(checkpoint, checkpoint_name)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 79 / 81

If we killall python during training

fleuret@elk:/tmp ./tinywithcheckpoint.py

Starting from scratch.

0 155.7866949379677

1 34.80593343087821

2 23.501393611499225

Terminated

and re-start

fleuret@elk:/tmp ./tinywithcheckpoint.py

Checkpoint loaded with 3 epochs finished.

3 17.466753122906084

4 13.512543070963147

5 10.474066113200024

6 8.01903374180074

7 6.152274705537366

8 4.789176231754482

9 3.5722521024140406

test_error 0.97% (97/10000)

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 80 / 81


Since a model is saved with information about the CPU/GPUs where each Storage is located, there may be issues if the model is loaded on a different hardware configuration.

For instance, if we save a model located on a GPU:

>>> import torch
>>> from torch import nn
>>> x = nn.Linear(10, 4)
>>> x.cuda()
Linear(in_features=10, out_features=4, bias=True)
>>> torch.save(x, 'x.pth')

And load it on a machine without GPU:

>>> import torch
>>> from torch import nn
>>> x = torch.load('x.pth')
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp ...

This can be fixed by specifying at load time how to relocate storages:

>>> x = torch.load('x.pth', map_location = lambda storage, loc: storage)
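In more recent versions of PyTorch, map_location also accepts a device name directly, which makes the relocation shorter:

>>> x = torch.load('x.pth', map_location='cpu')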

Francois Fleuret EE-559 – Deep learning / 10. Generative Adversarial Networks 81 / 81

References

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.

J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), 2016.

X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, and S. Smolley. Least squares generative adversarial networks. CoRR, abs/1611.04076, 2016.

M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. CoRR, abs/1612.07828, 2016.

J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.