Deep Generative Models with Stick-Breaking Priorsenalisni/nalisnick_uciAIML_talk.pdfnot necessary but learns faster (Nalisnick & Smyth, 2017) Stick-Breaking VAE Truncation level of

Deep Generative Models with Stick-Breaking Priors

Eric NalisnickUniversity of California, Irvine

Overview of Deep Generative Models.1

Outline

DefinitionApplicationsTraining

Overview of Deep Generative Models.1

Outline

The Dirichlet Process.2

Applications

Stochastic Gradient Variational Bayes for the DP

Definition

Definition

Training

Overview of Deep Generative Models.

Deep Generative Models with Stick-Breaking Priors.

1

3

Outline

Stick-Breaking Variational Autoencoders

The Dirichlet Process.2

Applications

Dirichlet Process Variational AutoencodersStick-Breaking for Adaptive Computation

Stochastic Gradient Variational Bayes for the DP

Definition

Definition

Training

Overview of Deep Generative Models

“Generative modeling…denotes modeling tasks in which a density over all the observable quantities is constructed.”

David MacKayInformation Theory, Inference, and Learning Algorithms

p( y | x ) p( x, y )p( x )

Deep Generative Models

Density Network (MacKay & Gibbs, 1997)

XZ

One or More Neural Network Layers

Observed Data

Stochastic Latent Variable

Deep Generative Models

Density Network (MacKay & Gibbs, 1997)

XZ

One or More Neural Network Layers

Observed Data

Stochastic Latent Variable

Z d

X N

θ λ

Sampling from a Deep Generative Model

XZ^~


XZ^~

ImageNet


XZ^~

ImageNet Samples from a DGM

XZ^~


Samples from a DGM

(Bowman et al., 2016)


1.1 Why Generative Modeling?

Why Generative Modeling?

Cognitive Motivations.1


2 Semi-Supervised Learning: Regularizing discriminative models with unlabeled data.

} p( y | x )


2

} p( x, y )

Semi-Supervised Learning: Regularizing discriminative models with unlabeled data.


3 Simulating Environments for Reinforcement Learning.


1.2 How to Train a Deep Generative Model

Learning a Deep Generative Model

1 Importance Sampling

2 Variational Inference

3 Adversarial Training

Soph

istic

atio

n


1 Importance Sampling(MacKay & Gibbs, 1997)


1 Importance Sampling(MacKay & Gibbs, 1997)


1 Importance Sampling

Works only in low dimensions

(MacKay & Gibbs, 1997)




2 Variational InferencePosterior Approximation




Data Reconstruction Regularization



Problem: For neural network likelihoods, these integrals are analytically intractable

Data Reconstruction Regularization



Step #1: Use Monte Carlo Approximations




Problem: Still can’t compute gradient w.r.t. variational parameters




Problem: Still can’t compute gradient w.r.t. variational parameters

Step #2: Sample via Differentiable Non-Centered Parametrizations


(Kingma & Welling, 2014), (Rezende et al., 2014)

The Variational Autoencoder (VAE)

}


Classifier predicting if x is generated or observed data

3 Adversarial Training(Goodfellow et al., 2014)


Classifier predicting if x is generated or observed data

1 Minimize with respect to

Two step training:

2 Maximize with respect to

3 Adversarial Training(Goodfellow et al., 2014)

Generative Adversarial Networks (GANs)(Goodfellow et al., 2014)

Variational Bayes vs Adversarial Training

VAE GAN

Deep Generative Models Summary

XZ

(Original) Density Network

Variational Autoencoder

Generative Adversarial Network

Training Method

Importance Sampled Estimate of Marginal

Likelihood

Stochastic Gradient Variational Bayes

Adversarial Training

No auxiliary model

Backend auxiliary model

Frontend auxiliary model

Deep Generative Models Summary

XZ

(Original) Density Network

Variational Autoencoder

Generative Adversarial Network

Training Method

Importance Sampled Estimate of Marginal

Likelihood

Stochastic Gradient Variational Bayes

Adversarial Training

No auxiliary model

Backend auxiliary model

Frontend auxiliary model

The Dirichlet Process



1


1 2


1 2

Draw from GEM distribution


1 2



1 2



1 2



1 2


SGVB for The Dirichlet Process

Two Obstacles:1. Gradients through 2. Gradients thought samples from the mixture

Obstacle: The Beta distribution does not have a non-centered parametrization (except in special cases)

Differentiating Through πk

Obstacle: The Beta distribution does not have a non-centered parametrization (except in special cases)

Kumaraswamy Distribution: A Beta-like distribution with a closed-form inverse CDF. Use as variational posterior.

Poondi Kumaraswamy (1930-1988)

Differentiating Through πk

Obstacle: Not obvious how to use the reparametrization trick for samples from a mixture distribution.

Stochastic Backpropagation through Mixtures

Obstacle: Not obvious how to use the reparametrization trick for samples from a mixture distribution.

Stochastic Backpropagation through Mixtures

Brute Force Solution: Summing over all components results in a tractable ELBO but requires O(T) decoder propagations.


Why Combine the DP with DGMs?

XZ

Latent variable drawn from

stochastic process


1 Dynamic Latent Representations (Width)

XZ


stochastic process



2 Neural Networks with Adaptive Computation (Depth)

XZ


stochastic process



2

Automatic and Principled Model Selection (Based on Marginal Likelihood)

3

Neural Networks with Adaptive Computation (Depth)

XZ


stochastic process

3.1 The Stick-Breaking Variational Autoencoder

In collaboration with

Padhraic Smyth


MODEL #1: Stick-Breaking Variational Autoencoder

Kumaraswamy Samples

Truncated posterior; not necessary but learns faster

(Nalisnick & Smyth, 2017)

Stick-Breaking VAE

Truncation level of 50 dimensions, Beta(1,5) Prior

Samples from Generative Model (Nalisnick & Smyth, 2017)

Stick-Breaking VAE Gaussian VAE

50 dimensions, N(0,1) PriorTruncation level of 50 dimensions, Beta(1,5) Prior

Samples from Generative Model (Nalisnick & Smyth, 2017)

Uns

uper

vise

dSe

mi-

Supe

rvis

ed

MNIST: Dirichlet Latent Space (t-SNE)

MNIST: Gaussian Latent Space (t-SNE)

MNIST: kNN Classifier on Latent Space

Quantitative Results for SB-VAE (Nalisnick & Smyth, 2017)

Uns

uper

vise

dSe

mi-

Supe

rvis

ed





(Estimated) Marginal Likelihoods

Uns

uper

vise

dSe

mi-

Supe

rvis

ed




Nonparametric version of (Kingma et al., NIPS 2014)’s M2 model



3.2 The Dirichlet Process Variational Autoencoder

Lars Hertel


Padhraic Smyth


MODEL #2: Dirichlet Process Variational Autoencoder (Nalisnick et al., 2016)

MNIST Samples from Component #1

MNIST Samples from Component #5

(Nalisnick et al., 2016)

Samples from Generative Model

MNIST: Dirichlet Process Latent Space (t-SNE)



Quantitative Results for DP-VAE (Nalisnick et al., 2016)

MNIST: Dirichlet Process Latent Space (t-SNE)



Quantitative Results for DP-VAE (Nalisnick et al., 2016)


3.3 Adaptive Computation Time via Stick-Breaking

Marc-Alexandre Cote


Hugo Larochelle


MODEL #3: Stick-Breaking Adaptive Computation

…

yky2y1

X

π1 π2 πk

Remaining stick is negligibly small

MODEL #3: Stick-Breaking Adaptive Computation

…

yky2y1

X

π1 π2 πk

Remaining stick is negligibly small

Quantitative Results for SB-RNN: Image Modeling

Quantitative Results for SB-RNN: Text Modeling

Quantitative Results for SB-RNN: Text Modeling

Superior Latent Spaces with NP priors: Seem to preserve class structure well, resulting in better discriminative properties. Their dynamic capacity naturally encodes factors of variation.

1

Conclusions

Adaptive Computation: Fusing NNs with nonparametric priors endows NNs with natural model selection properties.

2

Thank you. Questions?

Appendix

Documents

Deep Generative Models with Stick-Breaking Priorsenalisni/nalisnick_uciAIML_talk.pdfnot necessary but learns faster (Nalisnick & Smyth, 2017) Stick-Breaking VAE Truncation level of