
THOTH · Created 10/23/2019 5:15:28 PM



Adaptive Density Estimation for Generative Models
Thomas Lucas∗1, Konstantin Shmelkov∗†2, Cordelia Schmid1, Karteek Alahari1, Jakob Verbeek1

∗ Equal contribution · 1 Inria · 2 Noah’s Ark Lab, Huawei · † Work done while at Inria

Goals

• GANs produce high-quality samples but suffer from mode dropping

• Maximum-likelihood estimation (MLE) covers the full training support but over-generalises

• Joint training is richer but non-trivial due to conditional independence assumptions

Ideas:

• Use flow transformations to go beyond conditional independence in image space

• Use adversarial training together with MLE to control where the off-dataset mass goes

Complementarity and conflicts of joint training

• MLE evaluates coverage of training points; adversarial training evaluates sample quality

[Figure: schematic of model densities vs. the data. The maximum-likelihood fit over-generalises into low-density regions (strongly penalised by the GAN loss); the adversarial fit drops modes of the data (strongly penalised by MLE); the hybrid fit covers the data.]

If pθ is assumed less flexible than p∗:

– MLE pushes pθ to go through regions of low density under p∗

– Adversarial training leads pθ to miss modes to avoid low-density regions

If training jointly, the independence and Gaussianity assumptions create a conflict

Adaptive Density Estimation

Amortized variational inference is used in feature space, and the output density is reshaped with a flow model; MLE and adversarial training are used jointly to optimise support coverage and penalise over-generalisation

• A deep invertible transformation fψ is used to avoid factored Gaussian assumptions in pθ(·|z) and learn the shape of the output density

• The flow model fψ is trained to maximize the density of training points under the model pθ while also maximizing their volume in feature space

• Sample quality is evaluated with Inception Score and Fréchet Inception Distance; support coverage is evaluated with bits per dimension (BPD)
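For reference, bits per dimension is the model's negative log-likelihood expressed in base 2 and normalised by the data dimensionality. A minimal conversion helper (an illustrative sketch, not code from the paper):

```python
import numpy as np

def bits_per_dim(nll_nats, num_dims):
    """Convert a negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0))

# A 32x32x3 CIFAR-10 image has 3072 dimensions; an NLL around 8300 nats
# works out to roughly 3.9 BPD, the order of magnitude in the results below.
print(bits_per_dim(8300.0, 32 * 32 * 3))
```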

Joint training with Adaptive Density Estimation

• Amortized variational inference in feature space: Train a model pθ with feature targets fψ(x), using an inference network qφ

$$\mathcal{L}^{\psi}_{\mathrm{ELBO}}(x,\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}\!\left[\ln p_\theta(f_\psi(x)\mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) \le \ln p_\theta(f_\psi(x)).$$
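The two ELBO terms can be sketched numerically. The snippet below assumes a diagonal-Gaussian qφ, a standard-normal prior pθ(z), and a unit-variance Gaussian decoder around a mean function; these parameterisation choices are illustrative, not necessarily those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_log_prob(x, mean, log_var):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (x - mean) ** 2 / np.exp(log_var), axis=-1)

def elbo(feat, q_mean, q_log_var, decoder_mean):
    """One-sample Monte Carlo ELBO for a feature target feat = f_psi(x)."""
    eps = rng.standard_normal(q_mean.shape)
    z = q_mean + np.exp(0.5 * q_log_var) * eps      # reparameterised z ~ q(z|x)
    recon = gaussian_log_prob(feat, decoder_mean(z), np.zeros_like(feat))
    # Closed-form KL(q || N(0, I)) for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(q_log_var) + q_mean ** 2 - 1.0 - q_log_var, axis=-1)
    return recon - kl                               # lower-bounds ln p(feat)

# Toy check with an identity "decoder" on 4-dimensional features.
feat = rng.standard_normal(4)
print(elbo(feat, q_mean=np.zeros(4), q_log_var=np.zeros(4),
           decoder_mean=lambda z: z))
```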

• Maximum-likelihood estimation: Train fψ, pθ, and qφ together, using the change-of-variables formula

$$\mathcal{L}_C(\theta,\phi,\psi) = \mathbb{E}_{x\sim p^*}\!\left[-\mathcal{L}^{\psi}_{\mathrm{ELBO}}(x,\theta,\phi) - \ln\left|\det\frac{\partial f_\psi}{\partial x}\right|\right] \ge -\mathbb{E}_{x\sim p^*}\!\left[\ln p_{\theta,\psi}(x)\right].$$
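The change-of-variables term can be verified on a toy invertible map. The sketch below stands in for the deep flow fψ with an element-wise affine map f(x) = a·x + b, and checks that ln p_Z(f(x)) + ln|det ∂f/∂x| reproduces the known density of x when z is standard normal:

```python
import numpy as np

def std_normal_log_prob(z):
    """Log-density of a standard normal, summed over dimensions."""
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))

def log_prob_via_flow(x, a, b):
    """ln p(x) = ln p_Z(f(x)) + ln|det df/dx| for f(x) = a * x + b."""
    z = a * x + b
    log_det = np.sum(np.log(np.abs(a)))  # Jacobian of an element-wise affine map
    return std_normal_log_prob(z) + log_det

# If z = a*x + b is standard normal, then x ~ N(-b/a, 1/a^2); the density
# computed through the flow must agree with the direct Gaussian density.
a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])
x = np.array([0.3, 0.7])
direct = np.sum(-0.5 * ((x + b / a) ** 2 * a ** 2 + np.log(2 * np.pi / a ** 2)))
assert np.isclose(log_prob_via_flow(x, a, b), direct)
```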

• Adversarial training with Adaptive Density Estimation: Sample from pθ in feature space, obtain image samples with fψ⁻¹, and evaluate with a discriminator D

$$\mathcal{L}_Q(p_{\theta,\psi}) = -\mathbb{E}_{p_\theta(z)}\!\left[\ln D\!\left(f_\psi^{-1}(\mu_\theta(z))\right) - \ln\!\left(1 - D\!\left(f_\psi^{-1}(\mu_\theta(z))\right)\right)\right].$$
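In this form the generator minimises a combined non-saturating objective: it is rewarded when D assigns high probability to its samples and penalised when D rejects them. A minimal sketch of the loss on precomputed discriminator outputs (the sampling pipeline z → μθ(z) → fψ⁻¹ is abstracted away here):

```python
import numpy as np

def generator_loss(d_on_samples, eps=1e-7):
    """L_Q = -E[ln D(x_gen) - ln(1 - D(x_gen))] over generated samples."""
    d = np.clip(d_on_samples, eps, 1.0 - eps)  # guard the logs numerically
    return -np.mean(np.log(d) - np.log1p(-d))

# The loss falls as the discriminator is increasingly fooled.
print(generator_loss(np.array([0.1, 0.2])))  # D thinks samples are fake
print(generator_loss(np.array([0.8, 0.9])))  # D is mostly fooled
```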

• Ideal loss: Under the discriminator-optimality assumption, the combined objective optimises a lower bound on the symmetric KL divergence

$$\mathcal{L}_C(p_{\theta,\psi}) + \mathcal{L}^*_Q(p_{\theta,\psi}) \ge D_{\mathrm{KL}}(p^*\,\|\,p_{\theta,\psi}) + D_{\mathrm{KL}}(p_{\theta,\psi}\,\|\,p^*) + H(p^*)$$
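A sketch of why the bound holds, assuming the standard optimal discriminator D∗(x) = p∗(x)/(p∗(x) + pθ,ψ(x)) and identifying the generator samples fψ⁻¹(μθ(z)) with draws from pθ,ψ (this derivation is filled in here; the poster states only the result):

```latex
% With D = D^*, the log-ratio inside L_Q simplifies:
%   \ln D^*(x) - \ln(1 - D^*(x)) = \ln p^*(x) - \ln p_{\theta,\psi}(x),
% so the adversarial term becomes a reverse KL:
\mathcal{L}^*_Q(p_{\theta,\psi})
  = -\mathbb{E}_{x\sim p_{\theta,\psi}}\!\left[\ln p^*(x) - \ln p_{\theta,\psi}(x)\right]
  = D_{\mathrm{KL}}(p_{\theta,\psi}\,\|\,p^*).
% The MLE term upper-bounds the forward KL plus the data entropy:
\mathcal{L}_C(\theta,\phi,\psi)
  \ge -\mathbb{E}_{x\sim p^*}\!\left[\ln p_{\theta,\psi}(x)\right]
  = D_{\mathrm{KL}}(p^*\,\|\,p_{\theta,\psi}) + H(p^*).
% Summing the two lines yields the stated lower bound on the symmetric KL.
```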

Results

        fψ    Adv.   MLE    BPD ↓   IS ↑   FID ↓
GAN     ✗     ✓      ✗      [7.0]   6.8    31.4
VAE     ✗     ✗      ✓      4.4     2.0    171.0
Ours†   ✓     ✗      ✓      3.5     3.0    112.0
Ours    ✗     ✓      ✓      4.4     5.1    58.6
Ours†   ✓     ✓      ✓      3.9     7.1    28.0

Ablation study on CIFAR-10: fψ improves IS/FID and BPD and is beneficial to joint training. †: compensated for extra parameters in fψ

Hybrid (MLE)     BPD ↓   IS ↑   FID ↓
Ours (wg, rd)    3.8     8.2    17.2
Ours (iaf, rd)   3.7     8.1    18.6
Ours (S2)        3.5     6.9    28.9
FlowGan(A)       8.5     5.8
FlowGan(H)       4.2     3.9

Hybrid (Adv)   BPD ↓   IS ↑   FID ↓
AGE                    5.9
ALI                    5.3
SVAE                   6.8
α-GAN                  6.8
SVAE-r                 7.0

Adversarial    BPD ↓   IS ↑   FID ↓
mmd-GAN                7.3    25.0
SNGan                  7.4    29.3
BatchGAN               7.5    23.7
WGAN-GP                7.9
SNGAN(R,H)             8.2    21.7

MLE        BPD ↓   IS ↑   FID ↓
NVP        3.5     4.5†   56.8†
VAE-IAF    3.1     3.8†   73.5†
Pixcnn++   2.9     5.5
Flow++     3.1
Glow       3.4     5.5‡   46.8‡

SOTA comparison on CIFAR-10

Ours @ 3.8 BPD (CIFAR-10, 32 × 32)

Ours @ 4.3 BPD (LSUN churches, 64 × 64)

Glow @ 2.67 BPD (LSUN churches, 64 × 64)