
  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    Martin Arjovsky1, Soumith Chintala2, Leon Bottou1,2

    1 Courant Institute of Mathematical Sciences, 2 Facebook AI Research

    Presented by Chunyuan Li

    1 / 16

  • Preliminaries WGAN Experiments Related works

    Preliminaries: “vanilla” GAN

    Real data distribution P_r; generator's distribution P_g, implemented as x = G(z), z \sim p(z).

    The minimax game is \min_G \max_D V(D, G).

    Discriminator loss:

        -\mathbb{E}_{x\sim P_r}[\log D(x)] - \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (1)

    D(x): the probability that x comes from the real data rather than from the generator.

    Generator loss, two variants:

        \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (GAN0)    (2)

        \mathbb{E}_{x\sim P_g}[-\log D(x)]    (GAN1)    (3)
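    Not in the original slides: a minimal PyTorch-style sketch of the two generator losses (2) and (3), assuming d_fake holds the discriminator's sigmoid outputs D(G(z)) on a batch of generated samples; all names are illustrative.

        import torch

        def generator_loss_gan0(d_fake: torch.Tensor) -> torch.Tensor:
            # GAN0, eq. (2): E_{x~Pg}[log(1 - D(x))]; the generator minimizes this.
            return torch.log(1.0 - d_fake).mean()

        def generator_loss_gan1(d_fake: torch.Tensor) -> torch.Tensor:
            # GAN1, eq. (3): E_{x~Pg}[-log D(x)], the "log D" trick.
            return -torch.log(d_fake).mean()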

    Problems [Goodfellow et al., 2014]:

    P1: “In practice, GAN0 may not provide sufficient gradient for G to learn well”, so GAN1 is used instead (the log D trick).

    P2: “G collapses too many values of z to the same value of x” (mode collapse, observed with GAN1).

    What is the principled interpretation? [Arjovsky and Bottou, 2017]

    2 / 16

  • Preliminaries WGAN Experiments Related works

    P1 on GAN0

    In GAN0, a better discriminator leads to more severely vanishing gradients for the generator.

    Q: Why is GAN difficult to train?

    A: Either our updates to the discriminator will be inaccurate, or they will vanish. This leaves it up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training hard.

    3 / 16

  • Preliminaries WGAN Experiments Related works

    P1 on GAN0: Proof Sketch

    1. Minimizing the generator loss amounts to minimizing the JS divergence when the discriminator is optimal. For a given x, the optimal discriminator is

           D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}    (4)

       The generator loss (after adding a term independent of P_g) is

           L = \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (5)

       Plugging (4) into (5) yields

           2\, JS(P_r \| P_g) - 2\log 2    (6)

    2. If the supports (underlying low-dimensional manifolds) of P_r and P_g have (almost) no overlap, then JS(P_r \| P_g) = \log 2 (Theorem 2.3), and thus the gradient of (5) w.r.t. P_g vanishes (Theorem 2.4 and Corollary 2.1).

    3. The probability that the supports of P_r and P_g have (almost) zero overlap is 1 (Lemma 2, Lemma 3 and Theorem 2.2).
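    Not spelled out on the slide: a short worked step showing how (6) follows once D = D^* from (4) is substituted into (5):

        \mathbb{E}_{x\sim P_r}\Big[\log \tfrac{P_r(x)}{P_r(x)+P_g(x)}\Big] + \mathbb{E}_{x\sim P_g}\Big[\log \tfrac{P_g(x)}{P_r(x)+P_g(x)}\Big]
            = KL\big(P_r \,\big\|\, \tfrac{P_r+P_g}{2}\big) + KL\big(P_g \,\big\|\, \tfrac{P_r+P_g}{2}\big) - 2\log 2
            = 2\, JS(P_r \| P_g) - 2\log 2.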

    4 / 16

  • Preliminaries WGAN Experiments Related works

    P2 on GAN1

    GAN1 optimizes a conflicting/asymmetric objective, which leads to (1) unstable gradients and (2) mode collapse.

    5 / 16

  • Preliminaries WGAN Experiments Related works

    P2 on GAN1

    Proof Sketch:

    1. (Theorem 2.5) Optimizing GAN1 is equivalent to optimizing

           KL(P_g \| P_r) - 2\, JS(P_g \| P_r)    (7)

    2. The KL and JS terms enter with opposite signs. (Theorem 2.6: instability of generator gradient updates.)

    3. Note it is KL(P_g \| P_r), NOT KL(P_r \| P_g):
       KL(P_g \| P_r) assigns a high cost to generating fake-looking samples and a low cost to mode dropping;
       KL(P_r \| P_g) assigns a high cost to not covering parts of the data and a low cost to generating fake-looking samples.
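    Not on the original slide: a sketch of where (7) comes from, assuming the optimal discriminator D^* from (4) and the identity \log\tfrac{1 - D^*(x)}{D^*(x)} = \log\tfrac{P_g(x)}{P_r(x)}:

        KL(P_g \| P_r) = \mathbb{E}_{x\sim P_g}\big[\log \tfrac{1 - D^*(x)}{D^*(x)}\big]
                       = \mathbb{E}_{x\sim P_g}[\log(1 - D^*(x))] + \mathbb{E}_{x\sim P_g}[-\log D^*(x)],

        and, by (6), \mathbb{E}_{x\sim P_r}[\log D^*(x)] + \mathbb{E}_{x\sim P_g}[\log(1 - D^*(x))] = 2\, JS(P_r \| P_g) - 2\log 2, so

        \mathbb{E}_{x\sim P_g}[-\log D^*(x)] = KL(P_g \| P_r) - 2\, JS(P_g \| P_r) + 2\log 2 + \mathbb{E}_{x\sim P_r}[\log D^*(x)],

    where the last two terms are constant for the generator update (D^* is held fixed), giving (7).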

    6 / 16

  • Preliminaries WGAN Experiments Related works

    Preliminaries: distance measures for distributions

    1. KL divergence:

           KL(P \| Q) = \mathbb{E}_{x\sim P}\Big[\log \tfrac{P(x)}{Q(x)}\Big]

    2. JS divergence:

           JS(P \| Q) = \tfrac{1}{2} KL\big(P \,\big\|\, \tfrac{P+Q}{2}\big) + \tfrac{1}{2} KL\big(Q \,\big\|\, \tfrac{P+Q}{2}\big)

    3. Wasserstein distance:

           W(P, Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]

       \Pi(P,Q) denotes the set of all joint distributions \gamma(x, y) whose marginals are P and Q, respectively.
       \gamma(x, y) indicates a plan to transport “mass” from x to y when deforming P into Q.
       The Wasserstein (or Earth-Mover) distance is then the “cost” of the optimal transport plan.
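    Not in the original slides: a minimal numerical sketch of the three quantities on toy examples, assuming SciPy is available; the distributions, smoothing constant and sample sizes are illustrative choices.

        import numpy as np
        from scipy.stats import wasserstein_distance

        def kl(p, q, eps=1e-12):
            # KL(P || Q) for discrete distributions given as probability vectors.
            p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
            return float(np.sum(p * np.log((p + eps) / (q + eps))))

        def js(p, q):
            # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
            m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
            return 0.5 * kl(p, m) + 0.5 * kl(q, m)

        # Two discrete distributions with partially disjoint support.
        p = [0.5, 0.5, 0.0]
        q = [0.0, 0.5, 0.5]
        print(kl(p, q), js(p, q))          # KL is huge where Q has no mass; JS stays <= log 2.

        # Empirical 1-D Wasserstein-1 distance between two sample sets.
        x = np.random.normal(0.0, 1.0, size=10000)
        y = np.random.normal(2.0, 1.0, size=10000)
        print(wasserstein_distance(x, y))  # approx. 2.0, the mean shift between the Gaussians.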

    7 / 16

  • Preliminaries WGAN Experiments Related works

    Examples

    P_0: distribution of (0, Z), where Z \sim U[0, 1]
    P_\theta: distribution of (\theta, Z), where \theta is a single real parameter

        KL(P_0 \| P_\theta) = KL(P_\theta \| P_0) = \begin{cases} +\infty & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0 \end{cases}

        JS(P_0 \| P_\theta) = \begin{cases} \log 2 & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0 \end{cases}

        W(P_0, P_\theta) = |\theta|
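    A one-line justification not spelled out on the slide: the coupling \gamma that pairs (0, z) with (\theta, z) for the same z has cost exactly |\theta|, and no transport plan can do better because every point in the support of P_0 is at distance at least |\theta| from the support of P_\theta; hence W(P_0, P_\theta) = |\theta|.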

    [Figure (a): Distributions. The supports of P_0 and P_\theta are two parallel vertical segments at horizontal positions 0 and \theta, with Z \in [0, 1] on the vertical axis.]

    8 / 16

  • Preliminaries WGAN Experiments Related works

    Insight

    [Figure (b): Output of W and JS for P_0 versus P_\theta as a function of \theta.]

    1. Observations
       When the distributions are supported on low-dimensional manifolds (as P_r and P_g are in GANs):
       KL and JS are essentially binary and give no meaningful gradient;
       W is continuous and differentiable, hence always gives a sensible gradient.

    2. Theoretical support
       Theorem 1 supports the statement above.
       Corollary 1 says Theorem 1 holds when the mapping is a neural network.
       Theorem 2 implies the TV distance has the same problem as KL and JS.

    9 / 16

  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    The infimum in W(P_r, P_g) is highly intractable. The Wasserstein distance has a dual (Kantorovich-Rubinstein) form:

        W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (8)

                    = \frac{1}{K} \sup_{\|f\|_L \le K} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (9)

    where the supremum in (9) is over all K-Lipschitz functions.

    Consider a w-parameterized family of functions \{f_w\}_{w \in \mathcal{W}} that are all K-Lipschitz. Then, up to the multiplicative constant K,

        W(P_r, P_g) = \max_{w \in \mathcal{W}} \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_g}[f_w(x)]    (10)

    For example, \mathcal{W} = [-c, c]^l, i.e., every weight is clipped to [-c, c].

    10 / 16

  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    The loss for the discriminator/critic (maximized over w):

        \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_g}[f_w(x)]    (11)

    The loss for the generator (minimized over \theta):

        -\mathbb{E}_{x\sim P_g}[f_w(x)] = -\mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))]    (12)
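    Not in the original slides: a minimal PyTorch-style sketch of (11) and (12), assuming critic plays the role of f_w and fake is a batch g_\theta(z); names are illustrative.

        import torch

        def critic_objective(critic, real, fake):
            # Eq. (11): E_{x~Pr}[f_w(x)] - E_{x~Pg}[f_w(x)]; the critic ascends on this.
            return critic(real).mean() - critic(fake).mean()

        def generator_loss(critic, fake):
            # Eq. (12): -E_{z~p(z)}[f_w(g_theta(z))]; the generator descends on this.
            return -critic(fake).mean()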

    11 / 16

  • Preliminaries WGAN Experiments Related works

    Algorithm

    Main differences from vanilla GAN:

    Remove the sigmoid from the last layer of D.
    Remove the log in the losses of D and G.
    Clip the parameters of D to an interval centered at 0.
    Momentum-based optimization is not allowed (the paper uses RMSProp).
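    Not in the original slides: a self-contained toy training loop illustrating the four changes above, assuming PyTorch; the model sizes and toy data are illustrative, while n_critic = 5, c = 0.01 and lr = 5e-5 follow the defaults reported in the paper.

        import torch
        import torch.nn as nn

        z_dim, x_dim = 64, 2
        generator = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
        critic = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # no sigmoid

        n_critic, clip_value, lr = 5, 0.01, 5e-5
        opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)     # momentum-free optimizer
        opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)

        for step in range(1000):
            real = torch.randn(256, x_dim) + 3.0        # toy "real" data: a shifted Gaussian
            for _ in range(n_critic):                   # several critic steps per generator step
                fake = generator(torch.randn(256, z_dim)).detach()
                loss_c = -(critic(real).mean() - critic(fake).mean())   # ascend on eq. (11)
                opt_c.zero_grad(); loss_c.backward(); opt_c.step()
                for p in critic.parameters():           # clipping keeps f_w (roughly) K-Lipschitz
                    p.data.clamp_(-clip_value, clip_value)

            fake = generator(torch.randn(256, z_dim))
            loss_g = -critic(fake).mean()               # minimize eq. (12)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()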

    12 / 16

  • Preliminaries WGAN Experiments Related works

    Meaningful loss metric

    A loss metric that correlates with the generator's convergence and with sample quality. The WGAN algorithm attempts to train the critic relatively well before each generator update, so the critic loss at that point is an estimate of the EM distance. This is not meant as a way to quantitatively evaluate generative models.

    Top: DCGAN discriminator; Bottom: MLP discriminator

    13 / 16

  • Preliminaries WGAN Experiments Related works

    Improved stability

    WGAN allows the critic to be trained to optimality, so there is no longer a need to carefully balance the capacities of the generator and the discriminator.

    A generator without batch normalization in DCGAN

    In no experiment did the authors see evidence of mode collapse.

    A generator constructed with an MLP

    14 / 16

  • Preliminaries WGAN Experiments Related works

    Integral Probability Metrics (IPMs)

        d_{\mathcal{F}}(P_r, P_g) = \sup_{f \in \mathcal{F}} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (13)

    1. Wasserstein distance: \mathcal{F} is the set of K-Lipschitz functions.

    2. Total variation distance: \mathcal{F} is the set of all measurable functions bounded between -1 and 1.

    3. Energy-based GANs: a generative approach to the total variation distance.

    4. Maximum Mean Discrepancy (MMD): \mathcal{F} = \{ f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1 \} for some RKHS \mathcal{H} [Sutherland et al., 2016] (a sample-based estimator is sketched after this list).

    5. Kernelized Stein Discrepancy: a special case of MMD with “Steinalized” kernels depending on P_g, i.e., \kappa(x, x') = \mathcal{T}^{x}_{P_g}\big(\mathcal{T}^{x'}_{P_g} \otimes k(x, x')\big), where \mathcal{T}_{P_g} is the Stein operator of P_g [Wang and Liu, 2016].
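    Not in the original slides: a minimal sketch of a (biased) sample-based MMD estimator with a Gaussian RBF kernel, roughly in the spirit of [Sutherland et al., 2016]; the bandwidth and sample sizes are illustrative assumptions.

        import numpy as np

        def rbf_kernel(x, y, bandwidth=1.0):
            # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated for all pairs.
            d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * bandwidth ** 2))

        def mmd2_biased(x, y, bandwidth=1.0):
            # Biased estimate of MMD^2(P, Q) from samples x ~ P, y ~ Q:
            #   E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
            kxx = rbf_kernel(x, x, bandwidth).mean()
            kyy = rbf_kernel(y, y, bandwidth).mean()
            kxy = rbf_kernel(x, y, bandwidth).mean()
            return kxx + kyy - 2.0 * kxy

        x = np.random.normal(0.0, 1.0, size=(500, 2))
        y = np.random.normal(1.0, 1.0, size=(500, 2))
        print(mmd2_biased(x, y))   # larger when the two sample sets differ more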

    15 / 16

  • Preliminaries WGAN Experiments Related works

    References I

    [Arjovsky and Bottou, 2017] Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.

    [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

    [Sutherland et al., 2016] Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2016). Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488.

    [Wang and Liu, 2016] Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.

    16 / 16
