Deep Generative Models
Part IV.2: Disentangling, geometry, and interpreting VAE latents
The Disentangling Hypothesis
DiCarlo & Cox (2007). “Untangling Invariant Object Recognition.” Trends in Cognitive Sciences.
Part I: Disentanglement
Disentanglement
What is meant by disentanglement here?
A disentangled representation can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. (Bengio et al., 2013)
Disentanglement involves two points:
- Sparse mapping between latent units and generative factors
- Mutual information between the latent code and X

However, mutual information between the latent code and X is not guaranteed. In fact, powerful decoders might decouple the latent code from X.
Part II: InfoGAN
InfoGAN: the idea
Aim: Unsupervised learning of disentangled representations in the GAN framework.
Recap GAN:
\[
\min_G \max_D \; V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\mathrm{noise}}}[\log(1 - D(G(z)))]
\]
Decompose the input noise vector into two parts:
- z: treated as incompressible noise
- c: the ‘latent code’, which targets the salient semantic features in the data; prestructured and factorized

Extend the GAN objective with the mutual information between the generated data distribution G(z, c) and the latent code c:
\[
\min_G \max_D \; V_I(D,G) = V(D,G) - \lambda\, I(c; G(z,c))
\]
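As a minimal, hedged sketch (not the paper's code), the two expectations in V(D, G) can be estimated on a minibatch; `D`, `G`, `x_real`, and `z` below are hypothetical module/tensor names:

```python
import torch

def gan_value_estimate(D, G, x_real, z):
    """Monte Carlo estimate of V(D, G) on one minibatch.

    Assumes D maps inputs to a single logit per example and G maps noise z
    to samples (illustrative interfaces, not InfoGAN's exact code).
    The discriminator ascends this value; the generator descends it.
    """
    eps = 1e-8
    d_real = torch.sigmoid(D(x_real))                    # D(x), x ~ p_data
    d_fake = torch.sigmoid(D(G(z)))                      # D(G(z)), z ~ noise
    return (torch.log(d_real + eps).mean()               # E_x[log D(x)]
            + torch.log(1.0 - d_fake + eps).mean())      # E_z[log(1 - D(G(z)))]
```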
InfoGAN: mutual information
InfoGAN optimizes a variational lower bound on the mutual information (a.k.a. Variational Information Maximization).
\[
\begin{aligned}
I(c; G(z,c)) &= H(c) - H(c \mid G(z,c)) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)}[\log P(c'|x)]\big] + H(c) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[D_{KL}\big(P(\cdot|x)\,\|\,Q(\cdot|x)\big) + \mathbb{E}_{c' \sim P(c|x)}[\log Q(c'|x)]\big] + H(c) \\
&\ge \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)}[\log Q(c'|x)]\big] + H(c) \\
&= \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c|x)] + H(c) \;=\; L_I(G,Q)
\end{aligned}
\]
Hence, the objective becomes:
\[
\min_{G,Q} \max_D \; V_{\mathrm{InfoGAN}}(D,G,Q) = V(D,G) - \lambda\, L_I(G,Q)
\]
Note:
- The bound becomes tight as $\mathbb{E}_x[D_{KL}(P(\cdot|x)\,\|\,Q(\cdot|x))] \to 0$
- Q is parametrized as a neural network. Here it shares all convolutional layers with D, and one final fully connected layer outputs the parameters of Q(c|x)
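For a categorical code c, the lower bound L_I(G, Q) reduces to a cross-entropy-style term; a hedged sketch assuming Q returns logits over the K categories (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_lower_bound(Q, x_fake, c_onehot):
    """Estimate of E_{c~P(c), x~G(z,c)}[log Q(c|x)] for a categorical c.

    Q(x_fake) is assumed to return unnormalized logits of shape (B, K);
    c_onehot holds the codes used to generate x_fake. H(c) is constant
    w.r.t. G and Q, so it is dropped from the optimized quantity.
    """
    log_q = F.log_softmax(Q(x_fake), dim=1)        # log Q(c|x)
    return (c_onehot * log_q).sum(dim=1).mean()

# The full objective then subtracts lambda * info_lower_bound(...) from V(D, G),
# as in V_InfoGAN(D, G, Q) = V(D, G) - lambda * L_I(G, Q).
```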
InfoGAN: MNIST example
Setup:
- latent code c: $c_1 \sim \mathrm{Cat}(K=10,\ p=0.1)$ (categorical), $c_2, c_3 \sim \mathrm{Unif}(-1, 1)$
- ‘incompressible noise’: $z_1, \dots, z_{62} \sim \mathcal{N}(0, 1)$
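A small sketch of this latent layout, assuming the generator simply consumes the concatenation of z and c (the 74-dimensional input follows from 62 + 10 + 2):

```python
import torch
import torch.nn.functional as F

def sample_mnist_latents(batch_size=64):
    """Sample c1 ~ Cat(K=10, p=0.1), c2, c3 ~ Unif(-1, 1), z ~ N(0, I)^62
    and concatenate them into one generator input (illustrative sketch)."""
    c1 = F.one_hot(torch.randint(0, 10, (batch_size,)), num_classes=10).float()
    c23 = torch.empty(batch_size, 2).uniform_(-1.0, 1.0)
    z = torch.randn(batch_size, 62)
    return torch.cat([z, c1, c23], dim=1)          # shape: (batch_size, 74)
```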
Part III: β-VAE
β-VAE: Idea
Assumption: Data is generated by a ‘true world simulator’ $p(x|\mathbf{v},\mathbf{w}) = \mathrm{Sim}(\mathbf{v},\mathbf{w})$ with
- conditionally independent factors $\mathbf{v} \in \mathbb{R}^K$
- conditionally dependent factors $\mathbf{w} \in \mathbb{R}^H$

Aim: unsupervised learning of disentangled representations in the VAE framework

Learn a joint distribution on data $x$ and generative latent factors $z$ ($z \in \mathbb{R}^M$, $M \ge K$) such that $p(x|z) \approx p(x|\mathbf{v},\mathbf{w})$.

VAE standard procedure: infer the posterior configuration of latent factors $z$ via $q_\phi(z|x)$, which should match the prior $p(z) = \mathcal{N}(0, I)$

Idea: control the degree of match of $q_\phi(z|x)$ to $p(z)$. Thereby:
- control the capacity of the latent information bottleneck $z$
- put implicit independence pressure on the learnt posterior
β-VAE: Idea
Specifically, a constrained optimisation problem.
\[
\max_{\phi,\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] \quad \text{subject to} \quad D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \le \epsilon
\]
Rewriting this as a Lagrangian under the KKT conditions yields the modified ELBO objective
\[
\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)
\]
Note:
- β = 1 yields the original VAE objective
- Hypothesis: β > 1 puts a stronger constraint on the latent bottleneck and leads to learning disentangled representations of the conditionally independent data generative factors
- β > 1 will reduce reconstruction fidelity
- Hypothesis: disentangled representations emerge with the right balance between information preservation (reconstruction cost as regularization) and latent channel capacity restriction → optimal β.
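A minimal sketch of this objective for a diagonal-Gaussian encoder and a Bernoulli decoder (a standard choice assumed here for illustration, not necessarily the paper's exact likelihood):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    """Negative of the modified ELBO L(theta, phi; x, z, beta), to be minimized.

    mu, logvar parametrize q_phi(z|x) = N(mu, diag(exp(logvar))); the KL
    against the N(0, I) prior has a closed form. beta = 1 recovers the
    vanilla VAE objective.
    """
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```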
β-VAE: qualitative assessment of disentanglement

If no ground-truth generative factors are known (or labeled), the optimal β is found by visual inspection. Here, β = 5 for the 3D chairs dataset.
β-VAE: a controlled generative world
Simulated data (images with resolution 64 × 64 × 1), five data generative factors:
- binary 2D shapes (heart, oval, and square)
- varying over x- and y-position (32 × 32 values), scale (6 values), and rotation (40 values)
β-VAE: disentanglement score
If ground-truth generative factors are known or have been labeled, disentanglement can be assessed quantitatively.
Idea:
- a (true) data generative factor should map to a single latent factor in z
- this would enable robust classification even using a very simple linear classifier
Algorithm: For a dataset $\mathcal{D} = \{X, V, W\}$ ($V = \mathbb{R}^K$) and $x \sim \mathrm{Sim}(\mathbf{v}, \mathbf{w})$:

1. Choose a factor $y \sim \mathrm{Unif}[1 \dots K]$ (e.g., $y$ corresponds to scale)
2. For a batch of $L$ samples:
   - sample latents $\mathbf{v}_{1,l}$ and $\mathbf{v}_{2,l}$ with $[\mathbf{v}_{1,l}]_k = [\mathbf{v}_{2,l}]_k$ if $k = y$
   - simulate $x_{1,l} \sim \mathrm{Sim}(\mathbf{v}_{1,l})$, $x_{2,l} \sim \mathrm{Sim}(\mathbf{v}_{2,l})$ and infer $z_{1,l} = \mu(x_{1,l})$, $z_{2,l} = \mu(x_{2,l})$ using the encoder $q(z|x) = \mathcal{N}(\mu(x), \sigma(x))$
   - compute the $L_1$ distance $z^l_{\mathrm{diff}} = |z_{1,l} - z_{2,l}|$
3. Use the average $z^b_{\mathrm{diff}} = \frac{1}{L}\sum_{l=1}^{L} z^l_{\mathrm{diff}}$ to predict $p(y \mid z^b_{\mathrm{diff}})$
4. Accuracy across batches := disentanglement score
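A minimal sketch of one batch of this procedure; `sim(v)` and `encode_mu(x)` are hypothetical stand-ins for the ground-truth simulator and the encoder mean, and the uniform sampling of v is an illustrative simplification:

```python
import numpy as np

def z_diff_feature(sim, encode_mu, y, K, L=64):
    """Build one (z_diff, y) training example for the disentanglement score.

    Each pair of latents shares the value of factor y; the averaged L1
    difference of their encodings is later fed to a simple linear
    classifier that predicts y. Classifier accuracy := the score.
    """
    diffs = []
    for _ in range(L):
        v1, v2 = np.random.rand(K), np.random.rand(K)
        v2[y] = v1[y]                         # clamp the chosen factor y
        z1, z2 = encode_mu(sim(v1)), encode_mu(sim(v2))
        diffs.append(np.abs(z1 - z2))         # per-dimension L1 distance
    return np.mean(diffs, axis=0)             # z_diff^b for this batch
```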
β-VAE: disentanglement
- β can be conceived of as a mixing coefficient that balances the magnitude of gradients from the reconstruction and prior-matching terms; hence it is normalized for better comparison across different input and latent sizes
- $\beta_{\mathrm{norm}} = \frac{\beta M}{N}$, where $M$ is the latent code size and $N$ the input size
Part IV: Fixing a broken ELBO
Fixing a broken ELBO
https://www.inference.vc/goals-and-principles-of-representation-learning/
“Fixing a broken ELBO” touches on deep questions like:
- Why are we using VAEs and ELBOs?
- Why do they seem to learn useful representations sometimes and not others?

It unifies the information theoretic approach in InfoGAN with β-VAE.
X-rays showing the broken ELBO
In a vanilla VAE, you can maximize the ELBO and get a model that produces nice samples and reproduces the input distribution overall, but ignores individual inputs.

- In this case, it also has a useless (and entangled) latent representation.
- True data generated by: $z^\star = \{z_0, z_1\} \sim \mathrm{Ber}(0.7)$
- Next, sample into 30 discrete bins to get a one-hot categorical value $x$
- $z$ is a one-hot encoding of the latent categorical value
- Gaussian noise and discretization are added so that $I(x; z^\star) = 0.5$ nats

Figure panels: $R = 0.0002$ nats (“auto-decoder”) and $R = 0.4999$ nats (“perfect”)
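A hedged sketch of this toy generative process; the noise scale and binning below are illustrative assumptions (the paper tunes them so that I(x; z*) = 0.5 nats):

```python
import numpy as np

def sample_toy_data(n=1000, n_bins=30, noise_scale=0.5, seed=0):
    """Bernoulli(0.7) latent z*, plus Gaussian noise, discretized into
    30 bins and represented as a one-hot categorical x (illustrative)."""
    rng = np.random.default_rng(seed)
    z_star = rng.binomial(1, 0.7, size=n)                  # latent in {z0, z1}
    noisy = z_star + noise_scale * rng.standard_normal(n)  # add Gaussian noise
    edges = np.linspace(noisy.min(), noisy.max(), n_bins + 1)
    bins = np.clip(np.digitize(noisy, edges) - 1, 0, n_bins - 1)
    x = np.eye(n_bins)[bins]                               # one-hot observation
    return x, z_star
```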
Connection to rate-distortion theory
The “lossy compression problem”: how can we store and/or transmit data over a limited-capacity channel and retain as much information as possible?
- Two competing costs are traded off: rate and distortion

Rate = the entropy of the compressed code, or representation (in the discretized case, this is usually measured in bits)

Distortion = the reconstruction error

Ballé, Laparra, Simoncelli (2017, ICLR). (Here, $\lambda = \frac{1}{\beta}$.)
Linking rate-distortion theory to β-VAE
The rate-distortion objective is to jointly minimize the rate and distortion, while maximizing the mutual information ($I_e$) between the latent code ($Z$) and the input ($X$):
\[
I_e(X;Z) = \iint dx\, dz\; p_\theta(x,z)\, \log \frac{p_\theta(x,z)}{p_{\mathrm{data}}(x)\, p_\theta(z)}
\]
The mutual information is bounded by the data entropy ($H$), distortion ($D$), and rate ($R$):
\[
H - D \;\le\; I_e(X;Z) \;\le\; R
\]
Where:
\[
H = -\int dx\; p_{\mathrm{data}}(x) \log p_{\mathrm{data}}(x)
\]
\[
D = -\int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \log p_\theta(x|z)
\]
\[
R = \int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \log \frac{q_\phi(z|x)}{m(z)}
\]
Note the new (to our class) object: $m(z)$, a variational approximation to the marginal posterior $p_\theta(z)$.
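Under the common assumptions of a diagonal-Gaussian $q_\phi(z|x)$, a Bernoulli decoder, and $m(z) = \mathcal{N}(0, I)$ (all illustrative choices, not the paper's richer learned marginals), the per-batch distortion and rate can be estimated as:

```python
import torch
import torch.nn.functional as F

def rate_and_distortion(x, x_logits, mu, logvar):
    """Monte Carlo estimates (in nats) of
    D = -E[log p_theta(x|z)] and R = E[KL(q_phi(z|x) || m(z))]."""
    distortion = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction='none').sum(dim=1).mean()
    rate = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=1).mean()
    return rate, distortion
```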
Linking rate-distortion theory to β-VAE
β-VAE approach: look for the best R and D by assuming a fixed $\beta = \frac{\partial D}{\partial R}$ and optimize:
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \; D + \beta R
\]
Substituting D and R, we get
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \Big[ \underbrace{-\log p_\theta(x|z)}_{\text{reconstruction loss}} \;+\; \beta \underbrace{\log \frac{q_\phi(z|x)}{m(z)}}_{\text{KL divergence}} \Big]
\]
As in β-VAE, when β = 1 this is the vanilla VAE objective.

A sling for a broken ELBO? What they actually do is target a specific rate σ and optimize:
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \; D + |\sigma - R|
\]
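The targeted-rate objective is then a one-line change over the β-weighted loss; `sigma` is the desired rate in nats and a free, task-dependent choice (sketch only):

```python
def targeted_rate_loss(distortion, rate, sigma=10.0):
    """D + |sigma - R|: penalize deviation of the achieved rate from the
    target sigma instead of weighting R by a fixed beta."""
    return distortion + abs(sigma - rate)
```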
Targeting specific rates
The theoretically feasible region is bounded by β lines, which correspond to a tight bound under auto-decoding (β = 0), auto-encoding (β = ∞), or both (β = 1)

High R, low D: good reconstruction, larger code (auto-encoders)

Low R, high D: good synthesis, smaller code (auto-decoders)
Model results with different βs
“Syntactic encoder”
- Prioritizes input-domain info. Better at reconstruction and still a good generator

“Semantic encoder”
- Prioritizes semantic info. Good reconstruction and a better generator
Rate distortion results for models on MNIST
Tried a bunch of VAE variants
- (+/−, +/−, +/−/v) (encoder, decoder, marginal)
- the best (R, D) values for each model trace out a frontier
- top right: sweep through different β for two different encoders
Fixing ELBO Conclusions
Showed how expressive decoders can ignore the inputs/latents
Targeting a specific rate regularizes the ELBO towards the rate-distortion tradeoff most appropriate for the task you want the model to solve
- Allows a more explicit statement, in the objective function, of what the model does

The authors propose comparing (R, D) values instead of comparing ELBOs
- Showed that current approaches don't do well in low-distortion, high-rate scenarios
Tying it all together
Our VAE-style models don’t always learn useful representations
Disentanglement is desired for “useful” representations
InfoGAN imposes a mutual information constraint between a factorized latent code and the data, which yields more disentangled representations

β-VAE disentangles by controlling the capacity of the encoder and the match to a factorized prior

The β-VAE paper quantifies disentanglement with a new metric

Fixing a broken ELBO unifies the information theoretic conceptualization of the ELBO from InfoGAN and β-VAE

It proposes a path forward for targeting more unique solutions, avoiding useless latent representations, and comparing models