Deep Generative Models

STAT G8201: Deep Generative Models, Columbia University


Part IV.2: Disentangling, geometry, and interpreting VAE latents

The Disentangling Hypothesis

DiCarlo & Cox (2007). “Untangling Invariant Object Recognition”. Trends in Cognitive Sciences.


Part I: Disentanglement

Disentanglement

What is meant by disentanglement here?

A disentangled representation can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. (Bengio et al., 2013)

Disentanglement involves two points:

- a sparse mapping between latent units and generative factors
- mutual information between the latent code and X

However, mutual information between the latent code and X is not guaranteed. In fact, powerful decoders might decouple the latent code from X.


Part II: InfoGAN


InfoGAN: the idea

Aim: unsupervised learning of disentangled representations in the GAN framework.

Recap GAN:

min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼noise}[log(1 − D(G(z)))]

Decompose the input noise vector into two parts:

- z: treated as incompressible noise
- c: the ‘latent code’, prestructured and factorized, which will target the salient semantic features in the data

Extend the GAN objective with the mutual information between the generated distribution G(z, c) and the latent code c:

min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))
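To make the two-player objective concrete, here is a minimal PyTorch-style sketch of one step of V(D, G); the networks D and G, the real data batch, and the samplers for z and c are placeholders, and D is assumed to output probabilities in (0, 1).

    import torch

    def gan_losses(D, G, real_batch, z, c):
        # Generator input is the concatenation of noise z and latent code c.
        fake_batch = G(torch.cat([z, c], dim=1))
        # Discriminator ascends V(D, G): maximize log D(x) + log(1 - D(G(z, c))).
        d_loss = -(torch.log(D(real_batch)).mean()
                   + torch.log(1 - D(fake_batch.detach())).mean())
        # Generator descends V(D, G): minimize log(1 - D(G(z, c))).
        g_loss = torch.log(1 - D(fake_batch)).mean()
        return d_loss, g_loss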


InfoGAN: mutual information

InfoGAN optimizes a variational lower bound on the mutual information (aka Variational Information Maximization):

I(c; G(z, c)) = H(c) − H(c | G(z, c))
              = E_{x∼G(z,c)}[E_{c′∼P(c|x)}[log P(c′|x)]] + H(c)
              = E_{x∼G(z,c)}[D_KL(P(·|x) ‖ Q(·|x)) + E_{c′∼P(c|x)}[log Q(c′|x)]] + H(c)
              ≥ E_{x∼G(z,c)}[E_{c′∼P(c|x)}[log Q(c′|x)]] + H(c)
              = E_{c∼P(c), x∼G(z,c)}[log Q(c|x)] + H(c)
              = L_I(G, Q)

Hence, the objective becomes:

min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q)

Note:

- The bound becomes tight as E_x[D_KL(P(·|x) ‖ Q(·|x))] → 0.
- Q is parametrized as a neural network. Here it shares all convolutional layers with D, with one final fully connected layer outputting the parameters of Q(c|x).
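As a sketch of how L_I(G, Q) is estimated in practice for a categorical code (dropping the constant H(c)): Q is assumed to return logits over the categories, and G, Q, z, and the one-hot code are placeholders.

    import torch
    import torch.nn.functional as F

    def mi_lower_bound(G, Q, z, c_onehot):
        # Monte Carlo estimate of E_{c ~ P(c), x ~ G(z,c)}[log Q(c|x)].
        x_fake = G(torch.cat([z, c_onehot], dim=1))
        log_q = F.log_softmax(Q(x_fake), dim=1)       # log Q(c|x) over categories
        return (c_onehot * log_q).sum(dim=1).mean()   # pick out the code used

    # The generator and Q then jointly minimize  g_loss - lam * mi_lower_bound(...),
    # implementing V(D, G) - λ L_I(G, Q).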


InfoGAN: MNIST example

Setup:

- latent code c: c1 ∼ Cat(K = 10, p = 0.1) (categorical); c2, c3 ∼ Unif(−1, 1)
- ‘incompressible noise’: z_1, …, z_62 ∼ N(0, 1)
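A sketch of the corresponding input sampler (dimensions as stated above; the 74-dimensional concatenation order is an assumption):

    import torch
    import torch.nn.functional as F

    def sample_infogan_input(batch_size):
        c1 = F.one_hot(torch.randint(0, 10, (batch_size,)), num_classes=10).float()
        c23 = 2 * torch.rand(batch_size, 2) - 1   # c2, c3 ~ Unif(-1, 1)
        z = torch.randn(batch_size, 62)           # incompressible noise
        return torch.cat([z, c1, c23], dim=1)     # 62 + 10 + 2 = 74 dims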


Part III: β-VAE


β-VAE: Idea

Assumption: data are generated by a ‘true world simulator’ p(x|v, w) = Sim(v, w) with

- conditionally independent factors v ∈ R^K
- conditionally dependent factors w ∈ R^H

Aim: unsupervised learning of disentangled representations in the VAE framework.

Learn a joint distribution on the data x and generative latent factors z (z ∈ R^M, M ≥ K) such that p(x|z) ≈ p(x|v, w).

VAE standard procedure: infer the posterior configuration of the latent factors z via q_φ(z|x), which should match the prior p(z) = N(0, I).

Idea: control the degree of match of q_φ(z|x) to p(z). Thereby:

- control the capacity of the latent information bottleneck z
- put implicit independence pressure on the learnt posterior


β-VAE: Idea

Specifically, a constrained optimisation problem:

max_{φ,θ} E_{x∼D}[E_{q_φ(z|x)}[log p_θ(x|z)]]  subject to  D_KL(q_φ(z|x) ‖ p(z)) ≤ ε

Rewriting as a Lagrangian under the KKT conditions yields the modified ELBO objective

L(θ, φ; x, z, β) = E_{q_φ(z|x)}[log p_θ(x|z)] − β D_KL(q_φ(z|x) ‖ p(z))

Note:

- β = 1 yields the original VAE objective.
- Hypothesis: β > 1 puts a stronger constraint on the latent bottleneck and yields learning of disentangled representations of the conditionally independent data generative factors.
- β > 1 will reduce reconstruction fidelity.
- Hypothesis: disentangled representations emerge with the right balance between information preservation (reconstruction cost as regularization) and latent channel capacity restriction → an optimal β.
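A minimal sketch of this modified ELBO as a training loss, assuming a diagonal-Gaussian encoder, the N(0, I) prior, and (as an illustrative choice) a Bernoulli decoder over pixels:

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(x, encoder, decoder, beta=4.0):
        mu, logvar = encoder(x)                                   # q_phi(z|x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
        recon = F.binary_cross_entropy_with_logits(
            decoder(z), x, reduction="sum") / x.shape[0]          # -E_q[log p_theta(x|z)]
        kl = -0.5 * torch.sum(
            1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]   # D_KL(q || N(0, I))
        return recon + beta * kl                                  # beta = 1: vanilla VAE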


β-VAE: qualitative assessment of disentanglement

If no ground-truth generative factors are known (or labeled), the optimal β is found by visual inspection. Here, β = 5 for the 3D chairs dataset.

β-VAE: a controlled generative world

Simulated data (images with resolution 64 × 64 × 1), five data generative factors:

- binary 2D shapes (heart, oval, and square)
- varying over x- and y-position (32 × 32 values), scale (6 values), and rotation (40 values)


β-VAE: disentanglement score

If the ground-truth generative factors are known or have been labeled, disentanglement can be assessed quantitatively.

Idea:

- A (true) data generative factor should map to a single latent factor in z.
- This would enable robust classification even with a very simple linear classifier.

Algorithm: For a dataset D = {X, V, W} (V = R^K) and x ∼ Sim(v, w):

1. Choose a factor y ∼ Unif[1…K] (e.g., y corresponds to scale).
2. For a batch of L sample pairs:
   - sample latent factors v_{1,l} and v_{2,l} with [v_{1,l}]_k = [v_{2,l}]_k if k = y
   - simulate x_{1,l} ∼ Sim(v_{1,l}), x_{2,l} ∼ Sim(v_{2,l}) and infer z_{1,l} = µ(x_{1,l}), z_{2,l} = µ(x_{2,l}) using the encoder q(z|x) = N(µ(x), σ(x))
   - compute the L1 distance z_diff^l = |z_{1,l} − z_{2,l}|
3. Use the batch average z_diff^b = (1/L) Σ_{l=1}^L z_diff^l to predict p(y | z_diff^b).
4. The accuracy across batches is the disentanglement score.
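A sketch of the metric computation; Sim (the ground-truth simulator), encode_mu (the encoder mean µ(x)), and sample_factors are hypothetical stand-ins for the paper's components.

    import numpy as np

    def z_diff_batch(sim, encode_mu, sample_factors, K, L):
        y = np.random.randint(K)                # step 1: pick a generative factor
        v1, v2 = sample_factors(L), sample_factors(L)
        v2[:, y] = v1[:, y]                     # step 2: pin factor y within each pair
        z1, z2 = encode_mu(sim(v1)), encode_mu(sim(v2))
        z_diff = np.abs(z1 - z2).mean(axis=0)   # steps 2-3: L1 distances, batch average
        return z_diff, y
    # A simple linear classifier is trained to predict y from z_diff;
    # its accuracy across batches is the disentanglement score.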



β-VAE: disentanglement

- β can be conceived of as a mixing coefficient balancing the magnitudes of the gradients from the reconstruction and prior-matching terms; hence, normalize it for better comparison across different input and latent sizes.
- β_norm = βM/N, where M is the latent code size and N is the input size (e.g., N = 64 × 64 = 4096 for the images above).

Part IV: Fixing a broken ELBO

Fixing a broken ELBO

https://www.inference.vc/goals-and-principles-of-representation-learning/

“Fixing a broken ELBO” touches on deep questions like:

- Why are we using VAEs and ELBOs?
- Why do they seem to learn useful representations sometimes and not others?

It unifies the information-theoretic approach of InfoGAN with β-VAE.


X-rays showing the broken ELBO

In a vanilla VAE, you can maximize the ELBO and get a model that produces nice samples and reproduces the input distribution overall, but ignores individual inputs.

- In this case, it also has a useless (and entangled) latent representation.
- True data generated by: Z* = {z0, z1} ∼ Ber(0.7).
- Next, sample into 30 discrete bins to get a one-hot categorical value x.
- z is a one-hot encoding of the latent categorical value.
- Gaussian noise and discretization are added so that I(x; z*) = 0.5 nats.

R = 0.0002 nats (“auto-decoder”); R = 0.4999 nats (“perfect”)
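A rough sketch of that toy generative process (the bin layout and the noise scale, which the paper calibrates so that I(x; z*) = 0.5 nats, are assumptions here):

    import numpy as np

    def sample_toy(n, noise_scale=0.5, n_bins=30):
        z_star = (np.random.rand(n) < 0.7).astype(float)   # z* ~ Ber(0.7)
        u = z_star + noise_scale * np.random.randn(n)      # add Gaussian noise
        edges = np.linspace(u.min(), u.max(), n_bins + 1)  # 30 discrete bins
        x = np.digitize(u, edges[1:-1])                    # bin index in 0..29
        return np.eye(n_bins)[x], z_star                   # one-hot x, true latent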


Connection to rate-distortion theory

The “lossy compression problem”: how do we store and/or transmit data over a limited-capacity channel and retain as much information as possible?

- Two competing costs are traded off: rate and distortion.

Rate = the entropy of the compressed code, or representation (in the discretized case, this is usually measured in bits)

Distortion = the reconstruction error

Ballé, Laparra, Simoncelli (2017, ICLR). (Here, λ = 1/β.)


Linking rate-distortion theory to β-VAE

The rate-distortion objective is to jointly minimize the rate and distortion, while maximizing the mutual information (I_e) between the latent code (Z) and the input (X):

I_e(X; Z) = ∫∫ dx dz p_θ(x, z) log [ p_θ(x, z) / (p_data(x) p_θ(z)) ]

The mutual information is bounded by the data entropy (H), distortion (D), and rate (R):

H − D ≤ I_e(X; Z) ≤ R

where

H = −∫ dx p_data(x) log p_data(x)
D = −∫ dx p_data(x) ∫ dz q_φ(z|x) log p_θ(x|z)
R = ∫ dx p_data(x) ∫ dz q_φ(z|x) log [ q_φ(z|x) / m(z) ]

Note the new (to our class) object: m(z), a variational approximation to the marginal posterior p_θ(z).
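A sketch of per-batch Monte Carlo estimates of D and R, assuming a diagonal-Gaussian encoder; log_p_x_given_z and log_m are hypothetical callables returning log densities of the decoder and the learned marginal m(z):

    import math
    import torch

    def rate_and_distortion(x, encoder, log_p_x_given_z, log_m):
        mu, logvar = encoder(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)               # z ~ q_phi(z|x)
        log_q = (-0.5 * ((z - mu) / std) ** 2
                 - torch.log(std) - 0.5 * math.log(2 * math.pi)).sum(dim=1)
        distortion = -log_p_x_given_z(x, z).mean()         # D = -E[log p_theta(x|z)]
        rate = (log_q - log_m(z)).mean()                   # R = E[log q_phi(z|x)/m(z)]
        return rate, distortion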


Linking rate-distortion theory to β-VAE

β-VAE approach: look for the best R and D by assuming a fixed β = ∂D/∂R and optimizing:

min_{q_φ(z|x), m(z), p_θ(x|z)} D + βR

Substituting D and R, we get

min_{q_φ(z|x), m(z), p_θ(x|z)} ∫ dx p_data(x) ∫ dz q_φ(z|x) [ −log p_θ(x|z) + β log(q_φ(z|x)/m(z)) ]

where the first term is the reconstruction loss and the second is the (β-weighted) KL divergence.

As in β-VAE, when β = 1 this is the vanilla VAE.

A sling for a broken ELBO? What they actually do is target a specific rate σ and optimize:

min_{q_φ(z|x), m(z), p_θ(x|z)} D + |σ − R|
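The two objectives side by side, as a trivial sketch (rate and distortion as estimated in the sketch above):

    def rd_objective(rate, distortion, beta=1.0):
        return distortion + beta * rate        # D + beta*R; beta = 1 is the vanilla VAE

    def target_rate_objective(rate, distortion, sigma):
        return distortion + abs(sigma - rate)  # D + |sigma - R|: pin the rate near sigma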


Targeting specific rates

- The theoretically feasible region is bounded by β lines, which correspond to a tight bound under auto-decoding (β = 0), auto-encoding (β = ∞), or both (β = 1).
- High R, low D: good reconstruction, larger code (auto-encoders).
- Low R, high D: good synthesis, smaller code (auto-decoders).


Model results with different βs

“Syntactic encoder”

- Prioritizes input-domain info. Better at reconstruction and still a good generator.

“Semantic encoder”

- Prioritizes semantic info. Good reconstruction and a better generator.


Rate-distortion results for models on MNIST

They tried a range of VAE variants:

- (+/−, +/−, +/−/v) denoting the (encoder, decoder, marginal) choices
- the best (R, D) values for each model trace out a frontier
- top right: a sweep through different β for two different encoders


Fixing ELBO: Conclusions

Showed how expressive decoders can ignore the inputs/latents.

Targeting a specific rate regularizes the ELBO towards the rate-distortion tradeoff most appropriate for the task you want the model to solve.

- This allows a more explicit statement, in the objective function, of what the model does.

The authors propose that instead of comparing ELBOs, we compare (R, D) values.

- They showed that current approaches don't do well in low-distortion, high-rate scenarios.


Tying it all together

Our VAE-style models don't always learn useful representations.

Disentanglement is desired for “useful” representations.

InfoGAN imposes a mutual-information constraint between the factorized latent code and the data that yields more disentangled representations.

β-VAE disentangles by controlling the capacity of the encoder and the match to a factorized prior.

The β-VAE paper quantifies disentanglement with a new metric.

Fixing a broken ELBO unifies the information-theoretic conceptualization of the ELBO from InfoGAN and β-VAE.

It proposes a path forward for targeting more unique solutions, avoiding useless latent representations, and comparing models.
