Coulomb Autoencoders
Emanuele Sansone, Hafiz Tiomoko Ali, Sun Jiacheng
Huawei Noah’s Ark Lab (London)
HUAWEI TECHNOLOGIES CO., LTD.
Contents
• Motivation
• Background on Autoencoders
• The Problem of Local Minima
• Generalization Analysis
• Conclusions
Motivation
Deep generative models (GANs, flows, autoregressive models, VAEs) have improved remarkably in recent years, e.g. BigGANs [Brock et al. 2019]. What is still lacking is theoretical understanding of:
1. Training (i.e. convergence guarantees to optimal solutions)
2. Generalization (i.e. quality of solutions with a finite number of samples)
Background on Autoencoders - I

Encoder: x ∼ p_X(x) → f_θ(x) ∼ q_Z(z; θ). Decoder: z ∼ p_Z(z) → g_γ(z) ∼ q_X(x).

Goal: implicitly learn the unknown density p_X(x).

Problem formulation: to ensure that p_X(x) = q_X(x), we need:
1. Left-invertibility on the support of p_X(x): x = g_γ(f_θ(x))
2. Density matching: q_Z(z; θ) = p_Z(z)

Objective: ℒ(θ, γ) = REC(g_γ ∘ f_θ) + λ D(q_Z, p_Z)
Background on Autoencoders - II

ℒ(θ, γ) = REC(g_γ ∘ f_θ) + λ D(q_Z, p_Z)

Properties
1. REC(g_γ ∘ f_θ) is typically the L2 loss, which is convex
2. D(q_Z, p_Z) comes in many forms, all of them non-convex:
• Kullback-Leibler divergence (KL), in Variational Autoencoders [Kingma and Welling 2014] [Rezende et al. 2014]
• Maximum Mean Discrepancy (MMD), in Generative Moment Matching Networks [Li et al. 2015], Wasserstein Autoencoders (WAE) [Tolstikhin et al. 2018], and Coulomb Autoencoders (CouAEs)
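A minimal numerical sketch of this objective (NumPy; the one-layer linear encoder/decoder and the mean-matching stand-in for D(q_Z, p_Z) are illustrative assumptions, not choices made in this work):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, n, lam = 4, 2, 256, 1.0

# Hypothetical one-layer linear encoder f_theta and decoder g_gamma.
W_e = 0.1 * rng.normal(size=(d_z, d_x))
W_d = 0.1 * rng.normal(size=(d_x, d_z))

x = rng.normal(size=(n, d_x))          # batch of samples from p_X

def rec(x, W_e, W_d):
    """REC(g_gamma o f_theta): the (convex) L2 reconstruction loss."""
    x_hat = x @ W_e.T @ W_d.T
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def divergence(z, z_prime):
    """Toy stand-in for D(q_Z, p_Z): squared distance between batch means.
    (Real choices are KL, MMD, or the Wasserstein distance, as listed above.)"""
    return float(np.sum((z.mean(axis=0) - z_prime.mean(axis=0)) ** 2))

z = x @ W_e.T                          # codes f_theta(x) ~ q_Z
z_prime = rng.normal(size=(n, d_z))    # prior samples ~ p_Z
loss = rec(x, W_e, W_d) + lam * divergence(z, z_prime)
```

Both terms are estimated from finite batches; the non-convexity discussed on this slide enters through the divergence term once a real D and deep networks are used.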
Background on Autoencoders - III

Why should MMD be preferred over KL?
1. The KL term is not a proper metric, while MMD is an integral probability metric
2. KL is not always defined (e.g. for empirical distributions, namely superpositions of Dirac impulses)
3. Local minima (the problem of reconstruction/posterior collapse, due to the stochastic encoder)

(Figure: inputs x₁, x₂ with posteriors q(z|x₁), q(z|x₂) and reconstructions g_γ(z₁), g_γ(z₂); overlapping posteriors lead to collapse ✗, separated posteriors avoid it ✓)
Properties of MMD

Given samples {z_i}_{i=1}^N ∼ q_Z and {z′_i}_{i=1}^N ∼ p_Z and a kernel k(z, z′):

MMD({z_i}_{i=1}^N, {z′_i}_{i=1}^N) = 1/(N(N−1)) ∑_{i=1}^N ∑_{j≠i} k(z′_i, z′_j) + 1/(N(N−1)) ∑_{i=1}^N ∑_{j≠i} k(z_i, z_j) − 2/N² ∑_{i=1}^N ∑_{j=1}^N k(z′_i, z_j)

(the first two terms measure intra-similarity, the last one inter-similarity)

Minimization of MMD wrt {z_i}_{i=1}^N amounts to maximizing the inter-similarity and minimizing the intra-similarities.
Used for density matching and two-sample testing [Gretton et al. 2012]; recently used in autoencoders [Tolstikhin et al. 2018].
The choice of kernel function is related to the problem of local minima.
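The estimator can be implemented directly. A short NumPy sketch (the Gaussian kernel and the sample sizes are illustrative assumptions):

```python
import numpy as np

def mmd_unbiased(z, z_prime, kernel):
    """Unbiased MMD estimate between {z_i} ~ q_Z and {z'_i} ~ p_Z:
    two intra-similarity terms (j != i) minus twice the inter-similarity term."""
    n = z.shape[0]
    K_qq = kernel(z, z)                   # k(z_i, z_j)
    K_pp = kernel(z_prime, z_prime)       # k(z'_i, z'_j)
    K_pq = kernel(z_prime, z)             # k(z'_i, z_j)
    intra_q = (K_qq.sum() - np.trace(K_qq)) / (n * (n - 1))
    intra_p = (K_pp.sum() - np.trace(K_pp)) / (n * (n - 1))
    return intra_p + intra_q - 2.0 * K_pq.sum() / n ** 2

def gaussian_kernel(a, b, sigma=1.0):
    """Illustrative kernel choice; the next slide turns to the Coulomb kernel."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
z_ref = rng.normal(size=(500, 2))         # {z'_i} ~ p_Z
z_same = rng.normal(size=(500, 2))        # q_Z = p_Z: estimate close to zero
z_far = rng.normal(size=(500, 2)) + 3.0   # q_Z far from p_Z: clearly positive
```

When both samples come from the same density the unbiased estimate fluctuates around zero; when the densities differ it is clearly positive, which is what makes it usable as a training signal and as a two-sample test statistic.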
Properties of MMD and Coulomb kernel

Coulomb kernel: k(z, z′) = 1 / ‖z − z′‖^(h−2), with N > h > 2

Theorem. For the minimization of MMD wrt {z_i}_{i=1}^N:
1. All local extrema are global
2. The set of saddle points has measure zero

Remark: convergence to the global minimum when optimized through local-search methods!

(Figure: latent samples for the ground truth, the Gaussian kernel, and the Coulomb kernel)
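The remark can be probed empirically with local search. The sketch below (NumPy) is an assumption-laden toy: h = 3, the ε-regularized kernel from the generalization theorem later in the deck, the initialization, and the step size are all illustrative choices. It runs plain gradient descent on the Coulomb-kernel MMD with respect to the particles {z_i}:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim, h, eps = 64, 2, 3, 1e-2        # h = 3 satisfies N > h > 2 (assumed choice)
lr, steps = 0.1, 400                   # plain gradient descent (local search)

z_prior = rng.normal(size=(N, dim))    # fixed prior samples {z'_i} ~ p_Z
z = rng.normal(size=(N, dim)) + 4.0    # particles {z_i}, initialized far from p_Z

def dist(a, b):
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1) + 1e-12)

def coulomb_mmd(z):
    """MMD with the eps-regularized Coulomb kernel k = 1/(||z - z'||^(h-2) + eps)."""
    K_qq = 1.0 / (dist(z, z) ** (h - 2) + eps)
    K_pp = 1.0 / (dist(z_prior, z_prior) ** (h - 2) + eps)
    K_pq = 1.0 / (dist(z_prior, z) ** (h - 2) + eps)
    return ((K_pp.sum() - np.trace(K_pp)) / (N * (N - 1))
            + (K_qq.sum() - np.trace(K_qq)) / (N * (N - 1))
            - 2.0 * K_pq.sum() / N ** 2)

def grad(z):
    """d MMD / d z_m; for h = 3, dk/da = -(a - b) / (r (r + eps)^2)."""
    r_qq, r_pq = dist(z, z), dist(z_prior, z)
    np.fill_diagonal(r_qq, np.inf)                 # drop self-pairs (j != i)
    c_qq = -1.0 / (r_qq * (r_qq + eps) ** 2)       # zero on the diagonal
    c_pq = -1.0 / (r_pq * (r_pq + eps) ** 2)
    d_qq = z[:, None, :] - z[None, :, :]           # z_m - z_j
    d_pq = z[None, :, :] - z_prior[:, None, :]     # z_m - z'_i, indexed [i, m]
    return (2.0 / (N * (N - 1)) * np.einsum('mj,mjd->md', c_qq, d_qq)
            - 2.0 / N ** 2 * np.einsum('im,imd->md', c_pq, d_pq))

mmd_start = coulomb_mmd(z)
for _ in range(steps):
    z = z - lr * grad(z)
mmd_end = coulomb_mmd(z)               # local search drives the MMD down
```

The intra term acts as a repulsive "electrostatic" force among the particles and the inter term attracts them toward the prior samples; per the theorem, such descent dynamics are not expected to stall at spurious local minima of this objective.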
Experiments

(Figures: generated samples for the ground truth vs. VAE (KL), WAE (MMD), and CouAE (MMD + Coulomb kernel))
Generalization Analysis

ℒ̂ = R̂EC + λ M̂MD (finite number of samples)
ℒ = REC + λ MMD (infinite number of samples)

Theorem. For any kernel 0 ≤ k(z, z′) = 1/(‖z − z′‖^(h−2) + ε) ≤ K, any bounded reconstruction error 0 ≤ REC ≤ ξ, and any s, t > 0:

Pr{ ℒ̂ − ℒ > t + λs } ≤ 2 exp(−2Nt² / ξ²) + 6 exp(−2⌊N/2⌋s² / (9K²))

The first term on the right is the contribution of the reconstruction error; the second is the contribution of MMD.

How can we make ξ small?
1. Estimation of ξ → maximum reconstruction error on both training and validation data
2. Minimization of ξ → finding a proper network architecture (e.g. layer width, network depth, residual connections)
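The bound is cheap to evaluate numerically. A small sketch (Python; the values of ξ, t, s, and ε are illustrative assumptions, and K = 1/ε uses the fact that the regularized kernel attains its maximum at z = z′):

```python
import math

def generalization_bound(N, t, s, xi, K):
    """Right-hand side of the theorem:
    Pr{L_hat - L > t + lam*s} <= 2 exp(-2 N t^2 / xi^2)
                               + 6 exp(-2 floor(N/2) s^2 / (9 K^2)).
    (lam only shifts the deviation threshold, so it does not appear here.)"""
    rec_term = 2.0 * math.exp(-2.0 * N * t ** 2 / xi ** 2)                # reconstruction part
    mmd_term = 6.0 * math.exp(-2.0 * (N // 2) * s ** 2 / (9.0 * K ** 2))  # MMD part
    return rec_term + mmd_term

eps = 0.1
K = 1.0 / eps       # the regularized Coulomb kernel is bounded by 1/eps (at z = z')
xi = 2.0            # assumed bound on the reconstruction error
bounds = {N: generalization_bound(N, t=0.5, s=0.5, xi=xi, K=K) for N in (100, 1000, 10000)}
# The bound shrinks as the sample size N grows.
```

Plugging in values also shows why ξ matters: the reconstruction part decays like exp(−2Nt²/ξ²), so halving ξ has the same effect on that term as quadrupling N.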
Experiments: controlling ξ by changing the total number of hidden neurons (capacity)

(Figure: results at ×0.25, ×0.5, and ×1 of the reference capacity)

Remarks
1. The network architecture is fundamental to controlling generalization
2. Increasing capacity (the number of hidden neurons) leads to better generalization (as long as ξ is decreased)
3. Other architectural choices (e.g. depth, residual connections) may further decrease ξ

Open question: what is the optimal network architecture (or architectures) minimizing ξ?
Conclusions
1. The problem of local minima: MMD with the Coulomb kernel behaves similarly to a convex functional
2. Generalization analysis: a probabilistic bound giving insights on possible directions for improving autoencoders in a principled manner
24th May 2019
www.huawei.com
Thank You