Coulomb Autoencoders
Emanuele Sansone, Hafiz Tiomoko Ali, Sun Jiacheng
Huawei Noah’s Ark Lab (London)
HUAWEI TECHNOLOGIES CO., LTD.
Contents
• Motivation
• Background on Autoencoders
• The Problem of Local Minima
• Generalization Analysis
• Conclusions
Motivation
Deep generative models (GANs, flows, autoregressive models, VAEs) have improved remarkably in recent years, e.g. BigGANs [Brock et al. 2019]. What is still lacking is theoretical understanding of:
1. Training (i.e. convergence guarantees to optimal solutions)
2. Generalization (i.e. quality of solutions with a finite number of samples)
Background on Autoencoders - I

Encoder: x ∼ p_X(x) → f_θ(x) ∼ q_Z(z; θ). Decoder: z ∼ p_Z(z) → g_γ(z) ∼ q_X(x).

Goal: implicitly learn the unknown density p_X(x).

Problem formulation: to ensure that p_X(x) = q_X(x), we need:
1. Left-invertibility on the support of p_X(x): x = g_γ(f_θ(x))
2. Density matching: q_Z(z; θ) = p_Z(z)

Objective: ℒ(θ, γ) = REC(g_γ ∘ f_θ) + λ D(q_Z, p_Z)
Background on Autoencoders - II

ℒ(θ, γ) = REC(g_γ ∘ f_θ) + λ D(q_Z, p_Z)

Properties
1. REC(g_γ ∘ f_θ) is typically the L2 loss, which is convex
2. D(q_Z, p_Z) comes in many forms, all of them non-convex:
• Kullback-Leibler divergence (KL), in Variational Autoencoders [Kingma and Welling 2014] [Rezende et al. 2014]
• Maximum Mean Discrepancy (MMD), in Generative Moment Matching Networks [Li et al. 2015], Wasserstein Autoencoders (WAE) [Tolstikhin et al. 2018], and Coulomb Autoencoders (CouAEs)
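A minimal numerical sketch of this objective (NumPy; the one-layer linear encoder/decoder and the mean-matching stand-in for D(q_Z, p_Z) are illustrative assumptions, not choices made in this work):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, n, lam = 4, 2, 256, 1.0

# Hypothetical one-layer linear encoder f_theta and decoder g_gamma.
W_e = 0.1 * rng.normal(size=(d_z, d_x))
W_d = 0.1 * rng.normal(size=(d_x, d_z))

x = rng.normal(size=(n, d_x))          # batch of samples from p_X

def rec(x, W_e, W_d):
    """REC(g_gamma o f_theta): the (convex) L2 reconstruction loss."""
    x_hat = x @ W_e.T @ W_d.T
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def divergence(z, z_prime):
    """Toy stand-in for D(q_Z, p_Z): squared distance between batch means.
    (Real choices are KL, MMD, or the Wasserstein distance, as listed above.)"""
    return float(np.sum((z.mean(axis=0) - z_prime.mean(axis=0)) ** 2))

z = x @ W_e.T                          # codes f_theta(x) ~ q_Z
z_prime = rng.normal(size=(n, d_z))    # prior samples ~ p_Z
loss = rec(x, W_e, W_d) + lam * divergence(z, z_prime)
```

Both terms are estimated from finite batches; the non-convexity discussed on this slide enters through the divergence term once a real D and deep networks are used.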
Background on Autoencoders - III

Why should MMD be preferred over KL?
1. The KL term is not a proper metric, while MMD is an integral probability metric
2. KL is not always defined (e.g. for empirical distributions, namely superpositions of Dirac impulses)
3. Local minima (the problem of reconstruction/posterior collapse, due to the stochastic encoder)

(Figure: inputs x₁, x₂ with posteriors q(z|x₁), q(z|x₂) and reconstructions g_γ(z₁), g_γ(z₂); overlapping posteriors lead to collapse ✗, separated posteriors avoid it ✓)
Properties of MMD

Given samples {z_i}_{i=1}^N ∼ q_Z and {z′_i}_{i=1}^N ∼ p_Z and a kernel k(z, z′):

MMD({z_i}_{i=1}^N, {z′_i}_{i=1}^N) = 1/(N(N−1)) ∑_{i=1}^N ∑_{j≠i} k(z′_i, z′_j) + 1/(N(N−1)) ∑_{i=1}^N ∑_{j≠i} k(z_i, z_j) − 2/N² ∑_{i=1}^N ∑_{j=1}^N k(z′_i, z_j)

(the first two terms measure intra-similarity, the last one inter-similarity)

Minimization of MMD wrt {z_i}_{i=1}^N amounts to maximizing the inter-similarity and minimizing the intra-similarities.
Used for density matching and two-sample testing [Gretton et al. 2012]; recently used in autoencoders [Tolstikhin et al. 2018].
The choice of kernel function is related to the problem of local minima.
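The estimator can be implemented directly. A short NumPy sketch (the Gaussian kernel and the sample sizes are illustrative assumptions):

```python
import numpy as np

def mmd_unbiased(z, z_prime, kernel):
    """Unbiased MMD estimate between {z_i} ~ q_Z and {z'_i} ~ p_Z:
    two intra-similarity terms (j != i) minus twice the inter-similarity term."""
    n = z.shape[0]
    K_qq = kernel(z, z)                   # k(z_i, z_j)
    K_pp = kernel(z_prime, z_prime)       # k(z'_i, z'_j)
    K_pq = kernel(z_prime, z)             # k(z'_i, z_j)
    intra_q = (K_qq.sum() - np.trace(K_qq)) / (n * (n - 1))
    intra_p = (K_pp.sum() - np.trace(K_pp)) / (n * (n - 1))
    return intra_p + intra_q - 2.0 * K_pq.sum() / n ** 2

def gaussian_kernel(a, b, sigma=1.0):
    """Illustrative kernel choice; the next slide turns to the Coulomb kernel."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
z_ref = rng.normal(size=(500, 2))         # {z'_i} ~ p_Z
z_same = rng.normal(size=(500, 2))        # q_Z = p_Z: estimate close to zero
z_far = rng.normal(size=(500, 2)) + 3.0   # q_Z far from p_Z: clearly positive
```

When both samples come from the same density the unbiased estimate fluctuates around zero; when the densities differ it is clearly positive, which is what makes it usable as a training signal and as a two-sample test statistic.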
Properties of MMD and Coulomb kernel

Coulomb kernel: k(z, z′) = 1 / ‖z − z′‖^(h−2), with N > h > 2

Theorem. For the minimization of MMD wrt {z_i}_{i=1}^N:
1. All local extrema are global
2. The set of saddle points has measure zero

Remark: convergence to the global minimum when optimized through local-search methods!

(Figure: latent samples for the ground truth, the Gaussian kernel, and the Coulomb kernel)
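The remark can be probed empirically with local search. The sketch below (NumPy) is an assumption-laden toy: h = 3, the ε-regularized kernel from the generalization theorem later in the deck, the initialization, and the step size are all illustrative choices. It runs plain gradient descent on the Coulomb-kernel MMD with respect to the particles {z_i}:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim, h, eps = 64, 2, 3, 1e-2        # h = 3 satisfies N > h > 2 (assumed choice)
lr, steps = 0.1, 400                   # plain gradient descent (local search)

z_prior = rng.normal(size=(N, dim))    # fixed prior samples {z'_i} ~ p_Z
z = rng.normal(size=(N, dim)) + 4.0    # particles {z_i}, initialized far from p_Z

def dist(a, b):
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1) + 1e-12)

def coulomb_mmd(z):
    """MMD with the eps-regularized Coulomb kernel k = 1/(||z - z'||^(h-2) + eps)."""
    K_qq = 1.0 / (dist(z, z) ** (h - 2) + eps)
    K_pp = 1.0 / (dist(z_prior, z_prior) ** (h - 2) + eps)
    K_pq = 1.0 / (dist(z_prior, z) ** (h - 2) + eps)
    return ((K_pp.sum() - np.trace(K_pp)) / (N * (N - 1))
            + (K_qq.sum() - np.trace(K_qq)) / (N * (N - 1))
            - 2.0 * K_pq.sum() / N ** 2)

def grad(z):
    """d MMD / d z_m; for h = 3, dk/da = -(a - b) / (r (r + eps)^2)."""
    r_qq, r_pq = dist(z, z), dist(z_prior, z)
    np.fill_diagonal(r_qq, np.inf)                 # drop self-pairs (j != i)
    c_qq = -1.0 / (r_qq * (r_qq + eps) ** 2)       # zero on the diagonal
    c_pq = -1.0 / (r_pq * (r_pq + eps) ** 2)
    d_qq = z[:, None, :] - z[None, :, :]           # z_m - z_j
    d_pq = z[None, :, :] - z_prior[:, None, :]     # z_m - z'_i, indexed [i, m]
    return (2.0 / (N * (N - 1)) * np.einsum('mj,mjd->md', c_qq, d_qq)
            - 2.0 / N ** 2 * np.einsum('im,imd->md', c_pq, d_pq))

mmd_start = coulomb_mmd(z)
for _ in range(steps):
    z = z - lr * grad(z)
mmd_end = coulomb_mmd(z)               # local search drives the MMD down
```

The intra term acts as a repulsive "electrostatic" force among the particles and the inter term attracts them toward the prior samples; per the theorem, such descent dynamics are not expected to stall at spurious local minima of this objective.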
Experiments

(Figures: generated samples for the ground truth vs. VAE (KL), WAE (MMD), and CouAE (MMD + Coulomb kernel))
Generalization Analysis

ℒ̂ = R̂EC + λ M̂MD (finite number of samples)
ℒ = REC + λ MMD (infinite number of samples)

Theorem. For any kernel 0 ≤ k(z, z′) = 1/(‖z − z′‖^(h−2) + ε) ≤ K, any bounded reconstruction error 0 ≤ REC ≤ ξ, and any s, t > 0:

Pr{ ℒ̂ − ℒ > t + λs } ≤ 2 exp(−2Nt² / ξ²) + 6 exp(−2⌊N/2⌋s² / (9K²))

The first term on the right is the contribution of the reconstruction error; the second is the contribution of MMD.

How can we make ξ small?
1. Estimation of ξ → maximum reconstruction error on both training and validation data
2. Minimization of ξ → finding a proper network architecture (e.g. layer width, network depth, residual connections)
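The bound is cheap to evaluate numerically. A small sketch (Python; the values of ξ, t, s, and ε are illustrative assumptions, and K = 1/ε uses the fact that the regularized kernel attains its maximum at z = z′):

```python
import math

def generalization_bound(N, t, s, xi, K):
    """Right-hand side of the theorem:
    Pr{L_hat - L > t + lam*s} <= 2 exp(-2 N t^2 / xi^2)
                               + 6 exp(-2 floor(N/2) s^2 / (9 K^2)).
    (lam only shifts the deviation threshold, so it does not appear here.)"""
    rec_term = 2.0 * math.exp(-2.0 * N * t ** 2 / xi ** 2)                # reconstruction part
    mmd_term = 6.0 * math.exp(-2.0 * (N // 2) * s ** 2 / (9.0 * K ** 2))  # MMD part
    return rec_term + mmd_term

eps = 0.1
K = 1.0 / eps       # the regularized Coulomb kernel is bounded by 1/eps (at z = z')
xi = 2.0            # assumed bound on the reconstruction error
bounds = {N: generalization_bound(N, t=0.5, s=0.5, xi=xi, K=K) for N in (100, 1000, 10000)}
# The bound shrinks as the sample size N grows.
```

Plugging in values also shows why ξ matters: the reconstruction part decays like exp(−2Nt²/ξ²), so halving ξ has the same effect on that term as quadrupling N.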
Experiments: controlling ξ by changing the total number of hidden neurons (capacity)

(Figure: results at ×0.25, ×0.5, and ×1 of the reference capacity)

Remarks
1. The network architecture is fundamental to controlling generalization
2. Increasing capacity (the number of hidden neurons) leads to better generalization (as long as ξ is decreased)
3. Other architectural choices (e.g. depth, residual connections) may further decrease ξ

Open question: what is the optimal network architecture (or architectures) minimizing ξ?
Conclusions
1. The problem of local minima: MMD with the Coulomb kernel behaves similarly to a convex functional
2. Generalization analysis: a probabilistic bound giving insights on possible directions for improving autoencoders in a principled manner
24th May 2019
www.huawei.com
Thank You