Deep Generative Models
Part IV.2: Disentangling, geometry, and interpreting VAE latents
The Disentangling Hypothesis
DiCarlo & Cox (2007). “Untangling Invariant Object Recognition.” Trends in Cognitive Sciences.
Part I: Disentanglement
Disentanglement
What is meant by disentanglement here?
A disentangled representation can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. (Bengio et al., 2013)
Disentanglement involves two points:
- Sparse mapping between latent units and generative factors
- Mutual information between the latent code and X

However, mutual information between the latent code and X is not guaranteed. In fact, powerful decoders might decouple the latent code from X.
Part II: InfoGAN
InfoGAN: the idea
Aim: Unsupervised learning of disentangled representations in the GAN framework.
Recap GAN:
\[
\min_G \max_D \; V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\mathrm{noise}}}[\log(1 - D(G(z)))]
\]
Decompose the input noise vector into two parts:
- z: treated as incompressible noise
- c: the ‘latent code’, which targets the salient semantic features in the data; prestructured and factorized

Extend the GAN objective with the mutual information between the generated data distribution G(z, c) and the latent code c:
\[
\min_G \max_D \; V_I(D,G) = V(D,G) - \lambda\, I(c; G(z,c))
\]
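As a minimal, hedged sketch (not the paper's code), the two expectations in V(D, G) can be estimated on a minibatch; `D`, `G`, `x_real`, and `z` below are hypothetical module/tensor names:

```python
import torch

def gan_value_estimate(D, G, x_real, z):
    """Monte Carlo estimate of V(D, G) on one minibatch.

    Assumes D maps inputs to a single logit per example and G maps noise z
    to samples (illustrative interfaces, not InfoGAN's exact code).
    The discriminator ascends this value; the generator descends it.
    """
    eps = 1e-8
    d_real = torch.sigmoid(D(x_real))                    # D(x), x ~ p_data
    d_fake = torch.sigmoid(D(G(z)))                      # D(G(z)), z ~ noise
    return (torch.log(d_real + eps).mean()               # E_x[log D(x)]
            + torch.log(1.0 - d_fake + eps).mean())      # E_z[log(1 - D(G(z)))]
```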
InfoGAN: mutual information
InfoGAN optimizes a variational lower bound on the mutual information (a.k.a. Variational Information Maximization).
\[
\begin{aligned}
I(c; G(z,c)) &= H(c) - H(c \mid G(z,c)) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)}[\log P(c'|x)]\big] + H(c) \\
&= \mathbb{E}_{x \sim G(z,c)}\big[D_{KL}\big(P(\cdot|x)\,\|\,Q(\cdot|x)\big) + \mathbb{E}_{c' \sim P(c|x)}[\log Q(c'|x)]\big] + H(c) \\
&\ge \mathbb{E}_{x \sim G(z,c)}\big[\mathbb{E}_{c' \sim P(c|x)}[\log Q(c'|x)]\big] + H(c) \\
&= \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c|x)] + H(c) \;=\; L_I(G,Q)
\end{aligned}
\]
Hence, the objective becomes:
\[
\min_{G,Q} \max_D \; V_{\mathrm{InfoGAN}}(D,G,Q) = V(D,G) - \lambda\, L_I(G,Q)
\]
Note:
- The bound becomes tight as $\mathbb{E}_x[D_{KL}(P(\cdot|x)\,\|\,Q(\cdot|x))] \to 0$
- Q is parametrized as a neural network. Here it shares all convolutional layers with D, and one final fully connected layer outputs the parameters of Q(c|x)
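For a categorical code c, the lower bound L_I(G, Q) reduces to a cross-entropy-style term; a hedged sketch assuming Q returns logits over the K categories (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_lower_bound(Q, x_fake, c_onehot):
    """Estimate of E_{c~P(c), x~G(z,c)}[log Q(c|x)] for a categorical c.

    Q(x_fake) is assumed to return unnormalized logits of shape (B, K);
    c_onehot holds the codes used to generate x_fake. H(c) is constant
    w.r.t. G and Q, so it is dropped from the optimized quantity.
    """
    log_q = F.log_softmax(Q(x_fake), dim=1)        # log Q(c|x)
    return (c_onehot * log_q).sum(dim=1).mean()

# The full objective then subtracts lambda * info_lower_bound(...) from V(D, G),
# as in V_InfoGAN(D, G, Q) = V(D, G) - lambda * L_I(G, Q).
```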
InfoGAN: MNIST example
Setup:
- latent code c: $c_1 \sim \mathrm{Cat}(K=10,\ p=0.1)$ (categorical), $c_2, c_3 \sim \mathrm{Unif}(-1, 1)$
- ‘incompressible noise’: $z_1, \dots, z_{62} \sim \mathcal{N}(0, 1)$
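A small sketch of this latent layout, assuming the generator simply consumes the concatenation of z and c (the 74-dimensional input follows from 62 + 10 + 2):

```python
import torch
import torch.nn.functional as F

def sample_mnist_latents(batch_size=64):
    """Sample c1 ~ Cat(K=10, p=0.1), c2, c3 ~ Unif(-1, 1), z ~ N(0, I)^62
    and concatenate them into one generator input (illustrative sketch)."""
    c1 = F.one_hot(torch.randint(0, 10, (batch_size,)), num_classes=10).float()
    c23 = torch.empty(batch_size, 2).uniform_(-1.0, 1.0)
    z = torch.randn(batch_size, 62)
    return torch.cat([z, c1, c23], dim=1)          # shape: (batch_size, 74)
```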
Part III: β-VAE
β-VAE: Idea
Assumption: Data is generated by a ‘true world simulator’ $p(x|\mathbf{v},\mathbf{w}) = \mathrm{Sim}(\mathbf{v},\mathbf{w})$ with
- conditionally independent factors $\mathbf{v} \in \mathbb{R}^K$
- conditionally dependent factors $\mathbf{w} \in \mathbb{R}^H$

Aim: unsupervised learning of disentangled representations in the VAE framework

Learn a joint distribution on data $x$ and generative latent factors $z$ ($z \in \mathbb{R}^M$, $M \ge K$) such that $p(x|z) \approx p(x|\mathbf{v},\mathbf{w})$.

VAE standard procedure: infer the posterior configuration of latent factors $z$ via $q_\phi(z|x)$, which should match the prior $p(z) = \mathcal{N}(0, I)$

Idea: control the degree of match of $q_\phi(z|x)$ to $p(z)$. Thereby:
- control the capacity of the latent information bottleneck $z$
- put implicit independence pressure on the learnt posterior
β-VAE: Idea
Specifically, a constrained optimisation problem.
\[
\max_{\phi,\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] \quad \text{subject to} \quad D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \le \epsilon
\]
Rewriting this as a Lagrangian under the KKT conditions yields the modified ELBO objective
\[
\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)
\]
Note:
- β = 1 yields the original VAE objective
- Hypothesis: β > 1 puts a stronger constraint on the latent bottleneck and leads to learning disentangled representations of the conditionally independent data generative factors
- β > 1 will reduce reconstruction fidelity
- Hypothesis: disentangled representations emerge with the right balance between information preservation (reconstruction cost as regularization) and latent channel capacity restriction → optimal β.
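A minimal sketch of this objective for a diagonal-Gaussian encoder and a Bernoulli decoder (a standard choice assumed here for illustration, not necessarily the paper's exact likelihood):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=4.0):
    """Negative of the modified ELBO L(theta, phi; x, z, beta), to be minimized.

    mu, logvar parametrize q_phi(z|x) = N(mu, diag(exp(logvar))); the KL
    against the N(0, I) prior has a closed form. beta = 1 recovers the
    vanilla VAE objective.
    """
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```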
β-VAE: qualitative assessment of disentanglement

If no ground-truth generative factors are known (or labeled), the optimal β is found by visual inspection. Here, β = 5 for the 3D chairs dataset.
β-VAE: a controlled generative world
Simulated data (images with resolution 64 × 64 × 1), five data generative factors:
- binary 2D shapes (heart, oval, and square)
- varying over x- and y-position (32 × 32 values), scale (6 values), and rotation (40 values)
β-VAE: disentanglement score
If ground-truth generative factors are known or have been labeled, disentanglement can be assessed quantitatively.
Idea:
- a (true) data generative factor should map to a single latent factor in z
- this would enable robust classification even using a very simple linear classifier
Algorithm: For a dataset $\mathcal{D} = \{X, V, W\}$ ($V = \mathbb{R}^K$) and $x \sim \mathrm{Sim}(\mathbf{v}, \mathbf{w})$:

1. Choose a factor $y \sim \mathrm{Unif}[1 \dots K]$ (e.g., $y$ corresponds to scale)
2. For a batch of $L$ samples:
   - sample latents $\mathbf{v}_{1,l}$ and $\mathbf{v}_{2,l}$ with $[\mathbf{v}_{1,l}]_k = [\mathbf{v}_{2,l}]_k$ if $k = y$
   - simulate $x_{1,l} \sim \mathrm{Sim}(\mathbf{v}_{1,l})$, $x_{2,l} \sim \mathrm{Sim}(\mathbf{v}_{2,l})$ and infer $z_{1,l} = \mu(x_{1,l})$, $z_{2,l} = \mu(x_{2,l})$ using the encoder $q(z|x) = \mathcal{N}(\mu(x), \sigma(x))$
   - compute the $L_1$ distance $z^l_{\mathrm{diff}} = |z_{1,l} - z_{2,l}|$
3. Use the average $z^b_{\mathrm{diff}} = \frac{1}{L}\sum_{l=1}^{L} z^l_{\mathrm{diff}}$ to predict $p(y \mid z^b_{\mathrm{diff}})$
4. Accuracy across batches := disentanglement score
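A minimal sketch of one batch of this procedure; `sim(v)` and `encode_mu(x)` are hypothetical stand-ins for the ground-truth simulator and the encoder mean, and the uniform sampling of v is an illustrative simplification:

```python
import numpy as np

def z_diff_feature(sim, encode_mu, y, K, L=64):
    """Build one (z_diff, y) training example for the disentanglement score.

    Each pair of latents shares the value of factor y; the averaged L1
    difference of their encodings is later fed to a simple linear
    classifier that predicts y. Classifier accuracy := the score.
    """
    diffs = []
    for _ in range(L):
        v1, v2 = np.random.rand(K), np.random.rand(K)
        v2[y] = v1[y]                         # clamp the chosen factor y
        z1, z2 = encode_mu(sim(v1)), encode_mu(sim(v2))
        diffs.append(np.abs(z1 - z2))         # per-dimension L1 distance
    return np.mean(diffs, axis=0)             # z_diff^b for this batch
```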
β-VAE: disentanglement
- β can be conceived of as a mixing coefficient that balances the magnitude of gradients from the reconstruction and prior-matching terms; hence it is normalized for better comparison across different input and latent sizes
- $\beta_{\mathrm{norm}} = \frac{\beta M}{N}$, where $M$ is the latent code size and $N$ the input size
Part IV: Fixing a broken ELBO
Fixing a broken ELBO
https://www.inference.vc/goals-and-principles-of-representation-learning/
“Fixing a broken ELBO” touches on deep questions like:
- Why are we using VAEs and ELBOs?
- Why do they seem to learn useful representations sometimes and not others?

It unifies the information theoretic approach in InfoGAN with β-VAE.
X-rays showing the broken ELBO
In a vanilla VAE, you can maximize the ELBO and get a model that produces nice samples and reproduces the input distribution overall, but ignores individual inputs.

- In this case, it also has a useless (and entangled) latent representation.
- True data generated by: $z^\star = \{z_0, z_1\} \sim \mathrm{Ber}(0.7)$
- Next, sample into 30 discrete bins to get a one-hot categorical value $x$
- $z$ is a one-hot encoding of the latent categorical value
- Gaussian noise and discretization are added so that $I(x; z^\star) = 0.5$ nats

Figure panels: $R = 0.0002$ nats (“auto-decoder”) and $R = 0.4999$ nats (“perfect”)
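A hedged sketch of this toy generative process; the noise scale and binning below are illustrative assumptions (the paper tunes them so that I(x; z*) = 0.5 nats):

```python
import numpy as np

def sample_toy_data(n=1000, n_bins=30, noise_scale=0.5, seed=0):
    """Bernoulli(0.7) latent z*, plus Gaussian noise, discretized into
    30 bins and represented as a one-hot categorical x (illustrative)."""
    rng = np.random.default_rng(seed)
    z_star = rng.binomial(1, 0.7, size=n)                  # latent in {z0, z1}
    noisy = z_star + noise_scale * rng.standard_normal(n)  # add Gaussian noise
    edges = np.linspace(noisy.min(), noisy.max(), n_bins + 1)
    bins = np.clip(np.digitize(noisy, edges) - 1, 0, n_bins - 1)
    x = np.eye(n_bins)[bins]                               # one-hot observation
    return x, z_star
```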
Connection to rate-distortion theory
The “lossy compression problem”: how can we store and/or transmit data over a limited-capacity channel and retain as much information as possible?
- Two competing costs are traded off: rate and distortion

Rate = the entropy of the compressed code, or representation (in the discretized case, this is usually measured in bits)

Distortion = the reconstruction error

Ballé, Laparra, Simoncelli (2017, ICLR). (Here, $\lambda = \frac{1}{\beta}$.)
Linking rate-distortion theory to β-VAE
The rate-distortion objective is to jointly minimize the rate and distortion, while maximizing the mutual information ($I_e$) between the latent code ($Z$) and the input ($X$):
\[
I_e(X;Z) = \iint dx\, dz\; p_\theta(x,z)\, \log \frac{p_\theta(x,z)}{p_{\mathrm{data}}(x)\, p_\theta(z)}
\]
The mutual information is bounded by the data entropy ($H$), distortion ($D$), and rate ($R$):
\[
H - D \;\le\; I_e(X;Z) \;\le\; R
\]
Where:
\[
H = -\int dx\; p_{\mathrm{data}}(x) \log p_{\mathrm{data}}(x)
\]
\[
D = -\int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \log p_\theta(x|z)
\]
\[
R = \int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \log \frac{q_\phi(z|x)}{m(z)}
\]
Note the new (to our class) object: $m(z)$, a variational approximation to the marginal posterior $p_\theta(z)$.
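Under the common assumptions of a diagonal-Gaussian $q_\phi(z|x)$, a Bernoulli decoder, and $m(z) = \mathcal{N}(0, I)$ (all illustrative choices, not the paper's richer learned marginals), the per-batch distortion and rate can be estimated as:

```python
import torch
import torch.nn.functional as F

def rate_and_distortion(x, x_logits, mu, logvar):
    """Monte Carlo estimates (in nats) of
    D = -E[log p_theta(x|z)] and R = E[KL(q_phi(z|x) || m(z))]."""
    distortion = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction='none').sum(dim=1).mean()
    rate = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=1).mean()
    return rate, distortion
```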
Linking rate-distortion theory to β-VAE
β-VAE approach: look for the best R and D by assuming a fixed $\beta = \frac{\partial D}{\partial R}$ and optimize:
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \; D + \beta R
\]
Substituting D and R, we get
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \int dx\; p_{\mathrm{data}}(x) \int dz\; q_\phi(z|x) \Big[ \underbrace{-\log p_\theta(x|z)}_{\text{reconstruction loss}} \;+\; \beta \underbrace{\log \frac{q_\phi(z|x)}{m(z)}}_{\text{KL divergence}} \Big]
\]
As in β-VAE, when β = 1 this is the vanilla VAE objective.

A sling for a broken ELBO? What they actually do is target a specific rate σ and optimize:
\[
\min_{q_\phi(z|x),\, m(z),\, p_\theta(x|z)} \; D + |\sigma - R|
\]
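The targeted-rate objective is then a one-line change over the β-weighted loss; `sigma` is the desired rate in nats and a free, task-dependent choice (sketch only):

```python
def targeted_rate_loss(distortion, rate, sigma=10.0):
    """D + |sigma - R|: penalize deviation of the achieved rate from the
    target sigma instead of weighting R by a fixed beta."""
    return distortion + abs(sigma - rate)
```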
Targeting specific rates
The theoretically feasible region is bounded by β lines, which correspond to a tight bound under auto-decoding (β = 0), auto-encoding (β = ∞), or both (β = 1)

High R, low D: good reconstruction, larger code (auto-encoders)

Low R, high D: good synthesis, smaller code (auto-decoders)
Model results with different βs
“Syntactic encoder”
- Prioritizes input-domain info. Better at reconstruction and still a good generator

“Semantic encoder”
- Prioritizes semantic info. Good reconstruction and a better generator
Rate distortion results for models on MNIST
Tried a bunch of VAE variants
- (+/−, +/−, +/−/v) (encoder, decoder, marginal)
- the best (R, D) values for each model trace out a frontier
- top right: sweep through different β for two different encoders
Fixing ELBO Conclusions
Showed how expressive decoders can ignore the inputs/latents
Targeting a specific rate regularizes the ELBO towards the rate-distortion tradeoff most appropriate for the task you want the model to solve
- Allows a more explicit statement, in the objective function, of what the model does

The authors propose comparing (R, D) values instead of comparing ELBOs
- Showed that current approaches don't do well in low-distortion, high-rate scenarios
Tying it all together
Our VAE-style models don’t always learn useful representations
Disentanglement is desired for “useful” representations
InfoGAN imposes a mutual information constraint between a factorized latent code and the data, which yields more disentangled representations

β-VAE disentangles by controlling the capacity of the encoder and the match to a factorized prior

The β-VAE paper quantifies disentanglement with a new metric

Fixing a broken ELBO unifies the information theoretic conceptualization of the ELBO from InfoGAN and β-VAE

It proposes a path forward for targeting more unique solutions, avoiding useless latent representations, and comparing models