
  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    Martin Arjovsky1, Soumith Chintala2, Leon Bottou1,2

    1 Courant Institute of Mathematical Sciences, 2 Facebook AI Research

    Presented by Chunyuan Li

    1 / 16

  • Preliminaries WGAN Experiments Related works

    Preliminaries: “vanilla” GAN

    Real data distribution P_r; generator's distribution P_g, implemented as x = G(z), z \sim p(z).

    The minimax game is \min_G \max_D V(D, G).

    Discriminator loss:

        -\mathbb{E}_{x\sim P_r}[\log D(x)] - \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (1)

    D(x): the probability that x comes from the real data rather than from the generator.

    Generator loss, two variants:

        \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (GAN0)    (2)

        \mathbb{E}_{x\sim P_g}[-\log D(x)]    (GAN1)    (3)
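    Not in the original slides: a minimal PyTorch-style sketch of the two generator losses (2) and (3), assuming d_fake holds the discriminator's sigmoid outputs D(G(z)) on a batch of generated samples; all names are illustrative.

        import torch

        def generator_loss_gan0(d_fake: torch.Tensor) -> torch.Tensor:
            # GAN0, eq. (2): E_{x~Pg}[log(1 - D(x))]; the generator minimizes this.
            return torch.log(1.0 - d_fake).mean()

        def generator_loss_gan1(d_fake: torch.Tensor) -> torch.Tensor:
            # GAN1, eq. (3): E_{x~Pg}[-log D(x)], the "log D" trick.
            return -torch.log(d_fake).mean()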

    Problems [Goodfellow et al., 2014]:

    P1: “In practice, GAN0 may not provide sufficient gradient for G to learn well”, so GAN1 is used instead (the log D trick).

    P2: “G collapses too many values of z to the same value of x” (mode collapse, observed with GAN1).

    What is the principled interpretation? [Arjovsky and Bottou, 2017]

    2 / 16

  • Preliminaries WGAN Experiments Related works

    P1 on GAN0

    In GAN0, a better discriminator leads to more severely vanishing gradients for the generator.

    Q: Why is GAN difficult to train?

    A: Either our updates to the discriminator will be inaccurate, or they will vanish. This leaves it up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training hard.

    3 / 16

  • Preliminaries WGAN Experiments Related works

    P1 on GAN0: Proof Sketch

    1. Minimizing the generator loss amounts to minimizing the JS divergence when the discriminator is optimal. For a given x, the optimal discriminator is

           D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}    (4)

       The generator loss (after adding a term independent of P_g) is

           L = \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]    (5)

       Plugging (4) into (5) yields

           2\, JS(P_r \| P_g) - 2\log 2    (6)

    2. If the supports (underlying low-dimensional manifolds) of P_r and P_g have (almost) no overlap, then JS(P_r \| P_g) = \log 2 (Theorem 2.3), and thus the gradient of (5) w.r.t. P_g vanishes (Theorem 2.4 and Corollary 2.1).

    3. The probability that the supports of P_r and P_g have (almost) zero overlap is 1 (Lemma 2, Lemma 3 and Theorem 2.2).
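    Not spelled out on the slide: a short worked step showing how (6) follows once D = D^* from (4) is substituted into (5):

        \mathbb{E}_{x\sim P_r}\Big[\log \tfrac{P_r(x)}{P_r(x)+P_g(x)}\Big] + \mathbb{E}_{x\sim P_g}\Big[\log \tfrac{P_g(x)}{P_r(x)+P_g(x)}\Big]
            = KL\big(P_r \,\big\|\, \tfrac{P_r+P_g}{2}\big) + KL\big(P_g \,\big\|\, \tfrac{P_r+P_g}{2}\big) - 2\log 2
            = 2\, JS(P_r \| P_g) - 2\log 2.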

    4 / 16

  • Preliminaries WGAN Experiments Related works

    P2 on GAN1

    GAN1 optimizes a conflicting/asymmetric objective, which leads to (1) unstable gradients and (2) mode collapse.

    5 / 16

  • Preliminaries WGAN Experiments Related works

    P2 on GAN1

    Proof Sketch:

    1. (Theorem 2.5) Optimizing GAN1 is equivalent to optimizing

           KL(P_g \| P_r) - 2\, JS(P_g \| P_r)    (7)

    2. The KL and JS terms enter with opposite signs. (Theorem 2.6: instability of generator gradient updates.)

    3. Note it is KL(P_g \| P_r), NOT KL(P_r \| P_g):
       KL(P_g \| P_r) assigns a high cost to generating fake-looking samples and a low cost to mode dropping;
       KL(P_r \| P_g) assigns a high cost to not covering parts of the data and a low cost to generating fake-looking samples.
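    Not on the original slide: a sketch of where (7) comes from, assuming the optimal discriminator D^* from (4) and the identity \log\tfrac{1 - D^*(x)}{D^*(x)} = \log\tfrac{P_g(x)}{P_r(x)}:

        KL(P_g \| P_r) = \mathbb{E}_{x\sim P_g}\big[\log \tfrac{1 - D^*(x)}{D^*(x)}\big]
                       = \mathbb{E}_{x\sim P_g}[\log(1 - D^*(x))] + \mathbb{E}_{x\sim P_g}[-\log D^*(x)],

        and, by (6), \mathbb{E}_{x\sim P_r}[\log D^*(x)] + \mathbb{E}_{x\sim P_g}[\log(1 - D^*(x))] = 2\, JS(P_r \| P_g) - 2\log 2, so

        \mathbb{E}_{x\sim P_g}[-\log D^*(x)] = KL(P_g \| P_r) - 2\, JS(P_g \| P_r) + 2\log 2 + \mathbb{E}_{x\sim P_r}[\log D^*(x)],

    where the last two terms are constant for the generator update (D^* is held fixed), giving (7).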

    6 / 16

  • Preliminaries WGAN Experiments Related works

    Preliminaries: distance measures for distributions

    1. KL divergence:

           KL(P \| Q) = \mathbb{E}_{x\sim P}\Big[\log \tfrac{P(x)}{Q(x)}\Big]

    2. JS divergence:

           JS(P \| Q) = \tfrac{1}{2} KL\big(P \,\big\|\, \tfrac{P+Q}{2}\big) + \tfrac{1}{2} KL\big(Q \,\big\|\, \tfrac{P+Q}{2}\big)

    3. Wasserstein distance:

           W(P, Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big]

       \Pi(P,Q) denotes the set of all joint distributions \gamma(x, y) whose marginals are P and Q, respectively.
       \gamma(x, y) indicates a plan to transport “mass” from x to y when deforming P into Q.
       The Wasserstein (or Earth-Mover) distance is then the “cost” of the optimal transport plan.
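    Not in the original slides: a minimal numerical sketch of the three quantities on toy examples, assuming SciPy is available; the distributions, smoothing constant and sample sizes are illustrative choices.

        import numpy as np
        from scipy.stats import wasserstein_distance

        def kl(p, q, eps=1e-12):
            # KL(P || Q) for discrete distributions given as probability vectors.
            p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
            return float(np.sum(p * np.log((p + eps) / (q + eps))))

        def js(p, q):
            # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
            m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
            return 0.5 * kl(p, m) + 0.5 * kl(q, m)

        # Two discrete distributions with partially disjoint support.
        p = [0.5, 0.5, 0.0]
        q = [0.0, 0.5, 0.5]
        print(kl(p, q), js(p, q))          # KL is huge where Q has no mass; JS stays <= log 2.

        # Empirical 1-D Wasserstein-1 distance between two sample sets.
        x = np.random.normal(0.0, 1.0, size=10000)
        y = np.random.normal(2.0, 1.0, size=10000)
        print(wasserstein_distance(x, y))  # approx. 2.0, the mean shift between the Gaussians.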

    7 / 16

  • Preliminaries WGAN Experiments Related works

    Examples

    P_0: distribution of (0, Z), where Z \sim U[0, 1]
    P_\theta: distribution of (\theta, Z), where \theta is a single real parameter

        KL(P_0 \| P_\theta) = KL(P_\theta \| P_0) = \begin{cases} +\infty & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0 \end{cases}

        JS(P_0 \| P_\theta) = \begin{cases} \log 2 & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0 \end{cases}

        W(P_0, P_\theta) = |\theta|
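    A one-line justification not spelled out on the slide: the coupling \gamma that pairs (0, z) with (\theta, z) for the same z has cost exactly |\theta|, and no transport plan can do better because every point in the support of P_0 is at distance at least |\theta| from the support of P_\theta; hence W(P_0, P_\theta) = |\theta|.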

    [Figure (a): Distributions. The supports of P_0 and P_\theta are two parallel vertical segments at horizontal positions 0 and \theta, with Z \in [0, 1] on the vertical axis.]

    8 / 16

  • Preliminaries WGAN Experiments Related works

    Insight

    [Figure (b): Output of W and JS for P_0 versus P_\theta as a function of \theta.]

    1. Observations
       When the distributions are supported on low-dimensional manifolds (as P_r and P_g are in GANs):
       KL and JS are essentially binary and give no meaningful gradient;
       W is continuous and differentiable, hence always gives a sensible gradient.

    2. Theoretical support
       Theorem 1 supports the statement above.
       Corollary 1 says Theorem 1 holds when the mapping is a neural network.
       Theorem 2 implies the TV distance has the same problem as KL and JS.

    9 / 16

  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    The infimum in W(P_r, P_g) is highly intractable. The Wasserstein distance has a dual (Kantorovich-Rubinstein) form:

        W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (8)

                    = \frac{1}{K} \sup_{\|f\|_L \le K} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (9)

    where the supremum in (9) is over all K-Lipschitz functions.

    Consider a w-parameterized family of functions \{f_w\}_{w \in \mathcal{W}} that are all K-Lipschitz. Then, up to the multiplicative constant K,

        W(P_r, P_g) = \max_{w \in \mathcal{W}} \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_g}[f_w(x)]    (10)

    For example, \mathcal{W} = [-c, c]^l, i.e., every weight is clipped to [-c, c].

    10 / 16

  • Preliminaries WGAN Experiments Related works

    Wasserstein GAN

    The loss for the discriminator/critic (maximized over w):

        \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_g}[f_w(x)]    (11)

    The loss for the generator (minimized over \theta):

        -\mathbb{E}_{x\sim P_g}[f_w(x)] = -\mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))]    (12)
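    Not in the original slides: a minimal PyTorch-style sketch of (11) and (12), assuming critic plays the role of f_w and fake is a batch g_\theta(z); names are illustrative.

        import torch

        def critic_objective(critic, real, fake):
            # Eq. (11): E_{x~Pr}[f_w(x)] - E_{x~Pg}[f_w(x)]; the critic ascends on this.
            return critic(real).mean() - critic(fake).mean()

        def generator_loss(critic, fake):
            # Eq. (12): -E_{z~p(z)}[f_w(g_theta(z))]; the generator descends on this.
            return -critic(fake).mean()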

    11 / 16

  • Preliminaries WGAN Experiments Related works

    Algorithm

    Main differences from vanilla GAN:

    Remove the sigmoid from the last layer of D.
    Remove the log in the losses of D and G.
    Clip the parameters of D to an interval centered at 0.
    Momentum-based optimization is not allowed (the paper uses RMSProp).
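    Not in the original slides: a self-contained toy training loop illustrating the four changes above, assuming PyTorch; the model sizes and toy data are illustrative, while n_critic = 5, c = 0.01 and lr = 5e-5 follow the defaults reported in the paper.

        import torch
        import torch.nn as nn

        z_dim, x_dim = 64, 2
        generator = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
        critic = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # no sigmoid

        n_critic, clip_value, lr = 5, 0.01, 5e-5
        opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)     # momentum-free optimizer
        opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)

        for step in range(1000):
            real = torch.randn(256, x_dim) + 3.0        # toy "real" data: a shifted Gaussian
            for _ in range(n_critic):                   # several critic steps per generator step
                fake = generator(torch.randn(256, z_dim)).detach()
                loss_c = -(critic(real).mean() - critic(fake).mean())   # ascend on eq. (11)
                opt_c.zero_grad(); loss_c.backward(); opt_c.step()
                for p in critic.parameters():           # clipping keeps f_w (roughly) K-Lipschitz
                    p.data.clamp_(-clip_value, clip_value)

            fake = generator(torch.randn(256, z_dim))
            loss_g = -critic(fake).mean()               # minimize eq. (12)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()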

    12 / 16

  • Preliminaries WGAN Experiments Related works

    Meaningful loss metric

    A loss metric that correlates with the generator's convergence and with sample quality. The WGAN algorithm attempts to train the critic relatively well before each generator update, so the critic loss at that point is an estimate of the EM distance. This is not meant as a way to quantitatively evaluate generative models.

    Top: DCGAN discriminator; Bottom: MLP discriminator

    13 / 16

  • Preliminaries WGAN Experiments Related works

    Improved stability

    WGAN allows the critic to be trained to optimality, so there is no longer a need to carefully balance the capacities of the generator and the discriminator.

    A generator without batch normalization in DCGAN

    In no experiment did the authors see evidence of mode collapse.

    A generator constructed with an MLP

    14 / 16

  • Preliminaries WGAN Experiments Related works

    Integral Probability Metrics (IPMs)

        d_{\mathcal{F}}(P_r, P_g) = \sup_{f \in \mathcal{F}} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]    (13)

    1. Wasserstein distance: \mathcal{F} is the set of K-Lipschitz functions.

    2. Total variation distance: \mathcal{F} is the set of all measurable functions bounded between -1 and 1.

    3. Energy-based GANs: a generative approach to the total variation distance.

    4. Maximum Mean Discrepancy (MMD): \mathcal{F} = \{ f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1 \} for some RKHS \mathcal{H} [Sutherland et al., 2016] (a sample-based estimator is sketched after this list).

    5. Kernelized Stein Discrepancy: a special case of MMD with “Steinalized” kernels depending on P_g, i.e., \kappa(x, x') = \mathcal{T}^{x}_{P_g}\big(\mathcal{T}^{x'}_{P_g} \otimes k(x, x')\big), where \mathcal{T}_{P_g} is the Stein operator of P_g [Wang and Liu, 2016].
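    Not in the original slides: a minimal sketch of a (biased) sample-based MMD estimator with a Gaussian RBF kernel, roughly in the spirit of [Sutherland et al., 2016]; the bandwidth and sample sizes are illustrative assumptions.

        import numpy as np

        def rbf_kernel(x, y, bandwidth=1.0):
            # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated for all pairs.
            d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * bandwidth ** 2))

        def mmd2_biased(x, y, bandwidth=1.0):
            # Biased estimate of MMD^2(P, Q) from samples x ~ P, y ~ Q:
            #   E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
            kxx = rbf_kernel(x, x, bandwidth).mean()
            kyy = rbf_kernel(y, y, bandwidth).mean()
            kxy = rbf_kernel(x, y, bandwidth).mean()
            return kxx + kyy - 2.0 * kxy

        x = np.random.normal(0.0, 1.0, size=(500, 2))
        y = np.random.normal(1.0, 1.0, size=(500, 2))
        print(mmd2_biased(x, y))   # larger when the two sample sets differ more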

    15 / 16

  • Preliminaries WGAN Experiments Related works

    References I

    [Arjovsky and Bottou, 2017] Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.

    [Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

    [Sutherland et al., 2016] Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2016). Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488.

    [Wang and Liu, 2016] Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.

    16 / 16
