Visual AI, CPSC 532R/533R 2019/2020 Term 2, Lecture 8. Representation learning II (probabilistic models), Helge Rhodin


Page 1:

Visual AI, CPSC 532R/533R – 2019/2020 Term 2

Lecture 8. Representation learning II (probabilistic models)

Helge Rhodin

Page 2:

Recap: Transposed convolution

Example: 2D transposed convolution with 3x3 kernel

Transposed what?

• Express classical convolution as a linear matrix and transpose it

• a special case of a linear/fully-connected layer

[https://d2l.ai/chapter_computer-vision/tranposed-conv.html]

Page 3:

Convolution as matrix multiplication

Convolution with a 2x2 kernel, written as a linear mapping on the flattened 3x3 input image.

Input (3x3):
1, 2, 3
4, 5, 6
7, 8, 9

Kernel (2x2):
0, 1
2, 3

Weight matrix (one row per output position, acting on the flattened input):
0, 1, 0, 2, 3, 0, 0, 0, 0
0, 0, 1, 0, 2, 3, 0, 0, 0
0, 0, 0, 0, 1, 0, 2, 3, 0
0, 0, 0, 0, 0, 1, 0, 2, 3

Multiplying this matrix with the flattened input (1, 2, …, 9)^T gives
0∙1 + 1∙2 + 2∙4 + 3∙5 = 25
0∙2 + 1∙3 + 2∙5 + 3∙6 = 31
0∙4 + 1∙5 + 2∙7 + 3∙8 = 43
0∙5 + 1∙6 + 2∙8 + 3∙9 = 49

i.e., the same output as the convolution operator:
Input * Kernel = 25, 31
                 43, 49
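A minimal numpy sketch of this example (variable names are mine): it builds the 4x9 weight matrix above and checks that the matrix-vector product reproduces the 2x2 convolution output.

import numpy as np

x = np.arange(1, 10).reshape(3, 3)            # the 3x3 input image
k = np.array([[0, 1], [2, 3]])                # the 2x2 kernel

# Build the 4x9 weight matrix: one row per output position.
W = np.zeros((4, 9))
for i in range(2):                            # output row
    for j in range(2):                        # output column
        patch = np.zeros((3, 3))
        patch[i:i + 2, j:j + 2] = k           # place the kernel at position (i, j)
        W[i * 2 + j] = patch.reshape(-1)

out = (W @ x.reshape(-1)).reshape(2, 2)
print(out)                                    # [[25. 31.] [43. 49.]]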

Page 4:

Convolution as matrix multiplication (details)

The same weight matrix as before, now annotated (the "here" annotations refer to positions in this matrix):
0, 1, 0, 2, 3, 0, 0, 0, 0
0, 0, 1, 0, 2, 3, 0, 0, 0
0, 0, 0, 0, 1, 0, 2, 3, 0
0, 0, 0, 0, 0, 1, 0, 2, 3

Input (3x3): 1…9 as before. Kernel (2x2): 0, 1 / 2, 3.

• Larger input width? Insert zeroes here.
• Larger input height? Insert more rows here.
• Larger kernel? Add non-zeroes here (5 additional values for 3x3 instead of 2x2).
• What about horizontal striding? Skip every second row.
• Padding? Insert a row of zeroes.
• What about vertical striding? Skip a block of rows.

Page 5:

Transposed convolution as matrix multiplication

The output of the convolution, flattened to (25, 31, 43, 49)^T, is multiplied with the transposed weight matrix (9x4 instead of 4x9):

0
1∙25 = 25
1∙31 = 31
2∙25 = 50
3∙25 + 2∙31 + 1∙43 = 180
3∙31 + 1∙49 = 142
2∙43 = 86
3∙43 + 2∙49 = 227
3∙49 = 147

Reshaped to 3x3, this is the transposed convolution of the 2x2 input with the 2x2 kernel:

Input       Kernel      Output
25, 31   *  0, 1    =   0,  25,  31
43, 49      2, 3        50, 180, 142
                        86, 227, 147
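Continuing the numpy sketch from the convolution slide (W, k, and out as defined there), the transposed matrix reproduces the 3x3 result; assuming PyTorch is available, F.conv_transpose2d gives the same answer.

up = (W.T @ out.reshape(-1)).reshape(3, 3)
print(up)                                     # [[  0.  25.  31.] [ 50. 180. 142.] [ 86. 227. 147.]]

import torch
import torch.nn.functional as F

y = torch.tensor(out, dtype=torch.float32).view(1, 1, 2, 2)   # conv output as a 1x1x2x2 tensor
w = torch.tensor(k, dtype=torch.float32).view(1, 1, 2, 2)     # the same 2x2 kernel
print(F.conv_transpose2d(y, w))                               # same 3x3 result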

Page 6:

Recap: Feature map size after convolutional kernels

Transformation of input and output by convolutions

• output size = (input size + 2*padding – kernel size + stride)/stride

• e.g., a 3x3 kernel that preserves size: W + 2*1 – 3 + 1 = W

• e.g., a 4x4 kernel that reduces size by factor two: (W + 2*1 – 4 + 2)/2 = W/2

• holds per dimension, i.e., 1D, 2D and 3D convolutions

Transformation of input and output by transposed convolutions (a.k.a. deconvolution)

• output size = input size * stride - stride + kernel size - 2*padding

• it exactly inverts the size change of the corresponding convolution

• e.g., a 3x3 kernel that preserves size: W - 1 + 3 - 2*1 = W

• e.g., a 4x4 kernel that increases size by factor two: W*2 - 2 + 4 - 2*1 = W*2

• e.g., a 3x3 kernel that increases size by two elements: W - 1 + 3 - 2*0 = W + 2
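A quick PyTorch check of these formulas, with layer settings matching the examples above (a sketch; any input width W behaves the same way):

import torch
import torch.nn as nn

x = torch.zeros(1, 1, 16, 16)                                        # W = 16

same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)           # keeps W
down = nn.Conv2d(1, 1, kernel_size=4, stride=2, padding=1)           # W -> W/2
up = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1)    # W -> 2*W

print(same(x).shape)       # torch.Size([1, 1, 16, 16])
print(down(x).shape)       # torch.Size([1, 1, 8, 8])
print(up(down(x)).shape)   # torch.Size([1, 1, 16, 16])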

Page 7:

Assignment 3

Task I will be published tonight.

• Neural rendering

The other tasks are delayed due to unforeseen difficulties.

Page 8:

Paper assignment finished (on Piazza)

Page 9:

Recap: Auto Encoder (AE)

General case

Simple non-linear case

Linear case (similar to PCA)

https://en.wikipedia.org/wiki/Autoencoder

General reconstruction objective

• e.g., MSE loss

[From Principal Subspaces to Principal Components with Linear Autoencoders]
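The general reconstruction objective mentioned above, written out (standard notation; e is the encoder, d the decoder):

\min_{e,\,d}\ \sum_i \mathcal{L}\big(x_i,\, d(e(x_i))\big), \qquad \text{e.g. the MSE loss } \mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2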

Page 10:

Recap: Autoencoder variants

Bottleneck autoencoder:

• hidden dimension smaller than input dimension

• leads to compressed representations

• like dimensionality reduction with PCA

Sparse autoencoder:

• hidden dimension larger than input dimension

• hidden activations enforced to be sparse (= few non-zero activations)

Denoising autoencoder:

• corrupt the input values, e.g. by additive noise

Variational Auto Encoder (VAE)

• a probabilistic model

• ‘adding noise on the hidden variables’

• More on this topic today!

Page 11:

Probability preliminaries

Bayes’ theorem

• links the degree of belief in a proposition before and after accounting for evidence

Prior distribution

• belief in a proposition before accounting for evidence

• here: without knowing event B

Posterior distribution

• belief in a proposition after accounting for evidence

• a conditional probability, here conditioned on B
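Bayes' theorem, written out (standard form, with A the proposition and B the evidence):

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

\text{prior: } P(A), \quad \text{posterior: } P(A \mid B), \quad \text{likelihood: } P(B \mid A), \quad \text{evidence: } P(B)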

Page 12:

Error functions

From lecture 3: Regression revisited

• many loss functions are the negative log of probability distributions

• the Gaussian distribution corresponds to the mean squared error (MSE)

• the Laplace distribution corresponds to the mean absolute error (MAE)
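Written out (a standard identity, not verbatim from the slide): taking the negative log of the density and dropping terms that do not depend on the prediction recovers the familiar losses.

-\log \mathcal{N}(x; \mu, \sigma) = \frac{(x-\mu)^2}{2\sigma^2} + \log\!\big(\sigma\sqrt{2\pi}\big) \ \propto\ (x-\mu)^2 \quad \text{(MSE)}

-\log \mathrm{Laplace}(x; \mu, b) = \frac{|x-\mu|}{b} + \log(2b) \ \propto\ |x-\mu| \quad \text{(MAE)}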

Page 13:

The prior is a regularizer

For instance, an l2 loss on the neural network weights corresponds to a (Gaussian) prior on the weights.

When optimizing a network…

• without a prior, we infer the maximum likelihood (ML) estimate

• with a prior term, we infer the maximum a posteriori (MAP) estimate

• while considering the full distribution of weights, we infer the posterior distribution of networks
  (Bayesian networks, usually via variational inference of parametric distributions)

We usually don’t know the prior probability of X,Y, but we know that it is constant (it does not depend on the network weights).

The objective splits into a data term / log likelihood and a regularizer / prior term.
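The split these labels refer to, in the standard MAP form (my notation; w are the network weights, (X, Y) the training data):

\hat{w}_{\text{MAP}} = \arg\max_w\ \underbrace{\log p(Y \mid X, w)}_{\text{data term / log likelihood}} + \underbrace{\log p(w)}_{\text{regularizer / prior term}}

p(w) = \mathcal{N}(w; 0, \sigma^2 I) \ \Rightarrow\ \log p(w) = -\tfrac{1}{2\sigma^2}\lVert w \rVert_2^2 + \text{const.} \quad \text{(the l2 weight penalty)}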

Page 14:

Variational Autoencoder (VAE) concept

• mapping to a latent variable distribution

• a parametric distribution

• usually a Gaussian

• with variable mean and std parameters

• impose a prior distribution on the latent variables

• usually a Gaussian

• with fixed mean=0 and std=1

• Enables the generation of new samples

• draw a random sample from the prior

• pass it through the decoder

• or draw a sample from the posterior

• pass it through the decoder

https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
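A minimal PyTorch sketch of this concept (layer sizes and names are mine, for illustration only): the encoder maps an input to the mean and standard deviation of a Gaussian over the latent code, and new samples are generated by decoding draws from the fixed N(0, 1) prior.

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=16):
        super().__init__()
        self.h_dim = h_dim
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 2 * h_dim))
        self.dec = nn.Sequential(nn.Linear(h_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def encode(self, x):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)   # mean and log-std of q(h | x)
        return mu, log_sigma.exp()

    def generate(self, n):
        h = torch.randn(n, self.h_dim)                 # draw a random sample from the prior N(0, I)
        return self.dec(h)                             # pass it through the decoder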

Page 15:

VAE examples

Generating unseen faces

https://github.com/yzwxx/vae-celebA

Generating music

[Roberts et al., Hierarchical Variational Autoencoders for Music]

Page 16:

The Variational Autoencoder (VAE)

Kullback–Leibler divergence (relative entropy)

• a dissimilarity measure between distributions

• not symmetric, KL(p,q) != KL(q,p)

• definition for continuous distributions: an expectation over q of the log ratio of the probability densities of q and p

Common parametrization

• Normal distributions

• parametrized by neural networks

• encoder e

• decoder d

VAE objective (general): a data term / log likelihood plus a regularizer / prior term
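The formulas these annotations belong to, written in standard form (my notation; q is the encoder distribution over the latent code h, p(h) the prior, p(x|h) the decoder):

KL\big(q \,\Vert\, p\big) = \int q(h)\,\log\frac{q(h)}{p(h)}\,dh = \mathbb{E}_{h \sim q}\!\left[\log\frac{q(h)}{p(h)}\right]

\mathcal{L}_{\text{VAE}}(x) = \underbrace{\mathbb{E}_{h \sim q(h \mid x)}\big[-\log p(x \mid h)\big]}_{\text{data term / log likelihood}} + \underbrace{KL\big(q(h \mid x)\,\Vert\,p(h)\big)}_{\text{regularizer / prior term}}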

Page 17:

The Variational Autoencoder (VAE), simplified I

VAE Objective (general)

A simple autoencoder reconstruction loss: the squared difference between input and output.

Page 18:

Kullback–Leibler divergence and entropy

Definition

Interpretation

• relative entropy: the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P

• equivalently, the information gain achieved when P is used in place of Q

• Entropy: the expected number of bits needed to code samples from P with the code optimized for P itself
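The standard definitions behind these bullets (not verbatim from the slide):

KL(P \,\Vert\, Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx, \qquad H(P) = -\int p(x)\,\log p(x)\,dx

H(P, Q) = H(P) + KL(P \,\Vert\, Q) \quad \text{(cross-entropy = entropy + relative entropy)}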

Page 19:

KL divergence between Normal distributions (univariate case)

The KL divergence can be split into two parts

For Gaussians it holds

Hence

https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians

Using that m2=0 and s2=1
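The closed-form expressions referred to above (see the linked derivation):

KL\big(\mathcal{N}(\mu_1, \sigma_1^2)\,\Vert\,\mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

\text{With } \mu_2 = 0,\ \sigma_2 = 1: \quad KL = -\log\sigma_1 + \frac{\sigma_1^2 + \mu_1^2}{2} - \frac{1}{2}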

Page 20:

The Variational Autoencoder (VAE), simplified II

Starting point: the general VAE objective from before.

Simplification (for a Gaussian prior p and a Gaussian q with predicted mean and standard deviation): the KL term takes the closed form from the previous slide.

Sampling (here a single sample) replaces the expected value over q.

Resulting terms:

• data term / log likelihood: reconstruct the image

• ‘keep sigma and mu small’

• ‘keep sigma > 0’
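A minimal PyTorch sketch of the resulting loss, continuing the TinyVAE sketch from above (the weighting factor lam is my own placeholder for the tradeoff discussed later; the slide's exact weighting may differ):

def vae_loss(model, x, lam=1.0):
    mu, sigma = model.encode(x)
    eps = torch.randn_like(mu)
    h = mu + sigma * eps                                    # a single reparameterized sample
    x_hat = model.dec(h)
    recon = ((x - x_hat) ** 2).sum(dim=-1)                  # data term: reconstruct the image
    kl = (-torch.log(sigma) + 0.5 * (sigma ** 2 + mu ** 2) - 0.5).sum(dim=-1)  # 'keep sigma and mu small', 'keep sigma > 0'
    return (lam * recon + kl).mean()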

Page 21:

Sampling from a Gaussian

Rejection sampling from a uniform distribution

• intuitive approach

• ignores the tails of the distribution

Better alternative: Box-Muller transform

• requires only two uniform samples

• mathematically correct (not an approximation)

• efficient to compute

[Figure: all candidate samples vs. accepted samples]
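A minimal numpy sketch of the Box-Muller transform (the standard formula; the function name is mine): two uniform samples are mapped to two independent standard-normal samples.

import numpy as np

def box_muller(n, rng=np.random.default_rng()):
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))            # radius from the first uniform sample
    z0 = r * np.cos(2.0 * np.pi * u2)         # two independent N(0, 1) samples
    z1 = r * np.sin(2.0 * np.pi * u2)
    return z0, z1

z0, _ = box_muller(10000)
print(z0.mean(), z0.std())                    # approximately 0 and 1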

Page 22:

The effect of the prior

Create a dense and smooth latent space

• without holes

• all samples will make sense, e.g., will reconstruct to plausible images

[Figure: latent space with prior vs. without prior]

Page 23:

Deriving the posterior via Bayes

Goal: compute the posterior, i.e., a good hidden code h given x.

Problem: the evidence p(x) is intractable to compute; the marginalization requires integration over all possible latent values h.

Attempt: approximate the posterior with a NN (encoder)

• still intractable due to p(x) in the divergence
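The quantities referred to above, written out (standard notation):

p(h \mid x) = \frac{p(x \mid h)\,p(h)}{p(x)}, \qquad p(x) = \int p(x \mid h)\,p(h)\,dh \quad \text{(the intractable evidence)}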

Page 24:

Evidence Lower BOund

Consider the term

Together with the KL divergence from before, we get log p(x) as

• the Kullback-Leibler divergence is always greater than or equal to zero

• minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO

(making one bigger must reduce the other one)

(equal to what we had before)
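The decomposition referred to above, written out in standard form (my notation):

\log p(x) = \underbrace{\mathbb{E}_{q(h \mid x)}\big[\log p(x \mid h)\big] - KL\big(q(h \mid x)\,\Vert\,p(h)\big)}_{\text{ELBO (the negative of the VAE objective from before)}} + \underbrace{KL\big(q(h \mid x)\,\Vert\,p(h \mid x)\big)}_{\ge\, 0}

Since log p(x) does not depend on q, maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.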

Page 25:

Differentiation and sampling

Problem: How to differentiate through the sampling step?

• it’s a random process, only statistically dependent on the mean and standard deviation of the sampling distribution

Solutions:

1. The reparameterization trick: use h = μ + σ∙ε with ε ~ N(0, 1) instead of sampling h ~ N(μ, σ²) directly

2. Monte-Carlo solution

• related to reinforcement learning and importance sampling

• works for discrete and continuous variables

• we will cover it next week
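A minimal PyTorch illustration of the reparameterization trick (my own toy example): because the randomness enters only through eps, gradients flow to mu and sigma.

import torch

mu = torch.tensor([0.0], requires_grad=True)
sigma = torch.tensor([1.0], requires_grad=True)

eps = torch.randn(1)             # random, but outside the computation graph
h = mu + sigma * eps             # reparameterized sample
loss = (h ** 2).sum()
loss.backward()
print(mu.grad, sigma.grad)       # dloss/dmu = 2h, dloss/dsigma = 2h * eps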

Page 26:

Reparametrization trick, visually and mathematically

Equation: h = μ + σ∙ε, with ε ~ N(0, 1)

Influence

• changing mu

• increase -> moves the sample right

• decrease -> moves the sample left

• changing sigma

• increase -> moves the sample away from the center

• decrease -> moves the sample towards the center

Gradient: ∂h/∂μ = 1, ∂h/∂σ = ε

[Figure: current sample, sample after downscaling sigma, gradient direction]

Page 27:

VAE results

Mixed appearance generation

Interpolation

Page 28:

VAE limitations

Generating human pose and appearance

Page 29:

VAE Limitations II

Tradeoff between data and prior term

• high weight on data term (big lambda):

• crisp reconstruction of training data

• but latent code is not Gaussian

• the reconstruction of latent code samples from a Gaussian will be incorrect

• high weight on prior term (small lambda):

• blurry reconstruction

• but latent code follows a Gaussian distribution

• sampling leads to expected outcomes (as good as training samples)

Page 30:

GANs

A min max game

• Effects:

• learning a loss function

• like a VAE, we sample from a Gaussian distribution (some form of a prior assumption)

Annotations on the objective:

• D should be high for real examples (not influenced by G)

• D should be low for fake examples (from the perspective of D)

• D should be high for fake examples (from the perspective of G)

https://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
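A sketch of the min-max game in the commonly used (non-saturating) training form, assuming PyTorch modules G and D that output images and logits respectively (names are mine):

import torch
import torch.nn.functional as F

def discriminator_loss(D, G, real, z):
    fake = G(z).detach()                                  # D's update is not influenced by G
    real_logits, fake_logits = D(real), D(fake)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))     # D high for real
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))) # D low for fake

def generator_loss(D, G, z):
    fake_logits = D(G(z))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))      # D high for fake (G's view)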

Page 31:

GAN training

[Goodfellow et al., Generative Adversarial Networks. 2014]

Chaotic GAN loss behavior (e.g., generator loss going up, not down)

Page 32:

Wasserstein GAN

Diverse measures exist to compare probability distributions

[Arjovsky et al., Wasserstein GAN. 2017]

EM (earth mover's) distance principle

Page 33:

GAN vs. WGAN

Wasserstein distance is even simpler!

[Figure: GAN vs. WGAN losses side by side]
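A sketch of the difference in the critic/discriminator objectives (my own minimal version; WGAN-specific details such as weight clipping or a gradient penalty are omitted):

import torch
import torch.nn.functional as F

def gan_d_loss(D, real, fake):
    # standard GAN: cross-entropy on the discriminator's (logit) outputs
    return (F.binary_cross_entropy_with_logits(D(real), torch.ones_like(D(real)))
            + F.binary_cross_entropy_with_logits(D(fake), torch.zeros_like(D(fake))))

def wgan_critic_loss(C, real, fake):
    # WGAN: no sigmoid, no log; simply maximize the score gap between real and fake
    return -(C(real).mean() - C(fake).mean())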

Page 34:

DCGAN

Convolutional generator architecture
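A DCGAN-style generator sketch in PyTorch (a rough outline; channel counts and output resolution are mine, the original architecture is deeper):

import torch
import torch.nn as nn

# latent vector (B, 100, 1, 1) -> image (B, 3, 32, 32) via transposed convolutions
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0), nn.BatchNorm2d(128), nn.ReLU(),  # 1 -> 4
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),    # 4 -> 8
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),     # 8 -> 16
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Tanh(),                          # 16 -> 32
)

print(generator(torch.randn(1, 100, 1, 1)).shape)   # torch.Size([1, 3, 32, 32])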

Page 35:

PatchGAN

Patch-wise classification into real or fake (instead of globally)

[Li and Wand, Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks]

The output is a feature map of fake/real probabilities.
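A sketch of the idea (my own minimal version, not the exact architecture from the paper): a fully convolutional discriminator whose output is a spatial map of per-patch real/fake logits instead of a single score.

import torch
import torch.nn as nn

patch_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),    # one logit per local patch
)

print(patch_discriminator(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 1, 15, 15])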

Page 36:

Conditional Generative Adversarial Nets

First week of paper reading.

[Isola et al., Image-to-Image Translation with Conditional Adversarial Networks]

Page 37:

Hidden questions