Variational Dropout and the Local Reparameterization Trick
Diederik Kingma, Tim Salimans, Max Welling
Presented by: Changwei Hu
Jan 29, 2016
Main Idea
When the variance of the gradients is large, stochastic gradient descent may fail
Propose an SGVB estimator whose variance is inversely proportional to the minibatch size
Use the local reparameterization trick to make the estimator computationally efficient
Propose variational dropout under the framework of variational inference
The dropout rate is learned instead of being fixed
Contents
1 Background
2 Local Reparameterization Trick
3 Variational Dropout
4 Experimental Results
Variational Inference
Optimize the variational parameters φ of some parameterized model q_φ(w) such that q_φ(w) is a close approximation to the true posterior p(w | D).
w: parameters (weights) of the model; D: data
In practice, maximize the variational lower bound L(φ) of the marginal likelihood of the data:
L(φ) = −D_KL(q_φ(w) || p(w)) + L_D(φ)    (1)
where L_D(φ) = Σ_{(x,y)∈D} E_{q_φ(w)}[ log p(y | x, w) ]    (2)
L_D(φ): expected log-likelihood; (x, y) ∈ D: observed data tuples
Stochastic Gradient Variational Bayes (SGVB)
SGVB parameterizes the random parameters w ∼ q_φ(w) as w = f(ε, φ).
f(·): a differentiable function; ε ∼ p(ε): a random noise variable
An unbiased minibatch-based Monte Carlo estimator of the expected log-likelihood can be formed:
L_D(φ) ≃ L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} log p(y^i | x^i, w = f(ε, φ))    (3)
where (x^i, y^i)_{i=1}^{M} is a minibatch of M random datapoints and N is the total number of datapoints.
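To make Eq. (3) concrete, here is a minimal NumPy sketch (not the authors' code) of the SGVB minibatch estimator for a toy Bayesian linear-regression model; the dataset, the Gaussian likelihood, and the factorized Gaussian posterior q_φ(w) are all illustrative assumptions.

```python
# Minimal sketch of the SGVB estimator in Eq. (3) for a toy Gaussian-likelihood
# linear model; shapes, data and the Gaussian posterior q_phi(w) are assumptions.
import numpy as np

rng = np.random.default_rng(0)

N, D = 10_000, 5                          # dataset size, input dimension
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

mu = np.zeros(D)                          # variational means
log_sigma = np.full(D, -1.0)              # variational log std devs
noise_std = 0.1                           # assumed observation noise

def sgvb_estimate(x_batch, y_batch):
    """L_D^SGVB(phi) = (N / M) * sum_i log p(y^i | x^i, w), with w = f(eps, phi)."""
    M = len(y_batch)
    eps = rng.normal(size=mu.shape)               # eps ~ p(eps) = N(0, I)
    w = mu + np.exp(log_sigma) * eps              # reparameterization w = f(eps, phi)
    mean = x_batch @ w
    log_p = (-0.5 * ((y_batch - mean) / noise_std) ** 2
             - np.log(noise_std) - 0.5 * np.log(2 * np.pi))
    return (N / M) * log_p.sum()

idx = rng.choice(N, size=100, replace=False)      # minibatch of M = 100 datapoints
print(sgvb_estimate(X[idx], y[idx]))
```

Note that a single noise sample ε is shared by the whole minibatch here, which is exactly the setting whose variance is analyzed on the next slide.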
Variance of SGVB Estimator
Define L_i = log p(y^i | x^i, w = f(ε^i, φ)); then L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} L_i.
The variance of L_D^SGVB(φ) is given by
Var[L_D^SGVB(φ)] = (N²/M²) ( Σ_{i=1}^{M} Var[L_i] + 2 Σ_{i=1}^{M} Σ_{j=i+1}^{M} Cov[L_i, L_j] )    (4)
                 = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )    (5)
The contribution of Var[L_i] to the total variance is inversely proportional to the minibatch size M
The contribution of the covariances does not decrease with M, so the variance of L_D^SGVB(φ) can be dominated by the covariances even for moderately large M
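A small simulation (toy data and shapes assumed, not from the paper) illustrates the consequence of Eq. (5): with one shared weight sample per minibatch, the covariance term keeps the estimator variance from shrinking as M grows, while with a separate sample per datapoint it decays roughly as 1/M.

```python
# Minimal sketch: empirical variance of the minibatch estimator when the noise
# sample is shared across the minibatch versus drawn per example.
import numpy as np

rng = np.random.default_rng(0)
N, D, noise_std = 10_000, 5, 1.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
mu, sigma = np.zeros(D), 0.5 * np.ones(D)          # assumed posterior parameters

def gauss_loglik(y_obs, mean):
    return (-0.5 * ((y_obs - mean) / noise_std) ** 2
            - np.log(noise_std) - 0.5 * np.log(2 * np.pi))

def estimator_variance(M, per_example_noise, n_repeats=2000):
    vals = np.empty(n_repeats)
    for r in range(n_repeats):
        idx = rng.choice(N, size=M, replace=False)
        if per_example_noise:                          # Cov[L_i, L_j] = 0
            W = mu + sigma * rng.normal(size=(M, D))   # separate sample per datapoint
            mean = np.einsum('md,md->m', X[idx], W)
        else:                                          # shared sample: positive covariances
            w = mu + sigma * rng.normal(size=D)
            mean = X[idx] @ w
        vals[r] = (N / M) * gauss_loglik(y[idx], mean).sum()
    return vals.var()

for M in (10, 100, 1000):
    print(M, estimator_variance(M, per_example_noise=False),
             estimator_variance(M, per_example_noise=True))
```

The shared-noise variance should flatten out as M grows, while the per-example-noise variance keeps dropping.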
Local Reparameterization Trick
Recall that Var[L_D^SGVB(φ)] = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )
What this paper does:
propose an estimator for which Cov[L_i, L_j] = 0, so that the variance scales as 1/M
make the estimator computationally efficient by not sampling ε directly, but only sampling the intermediate variables f(ε) through which ε influences L_D^SGVB(φ)
Local Reparameterization Trick
Example:
A standard fully connected neural network containing a hidden layer of 1000 neurons.
The hidden layer receives an M × 1000 input feature matrix A, which is multiplied by a 1000 × 1000 weight matrix W, i.e. B = AW (before the nonlinearity is applied).
Specify the posterior on W to be Gaussian, q_φ(w_{i,j}) = N(μ_{i,j}, σ²_{i,j}), i.e. w_{i,j} = μ_{i,j} + σ_{i,j} ε_{i,j} with ε_{i,j} ∼ N(0, 1).
To ensure Cov[L_i, L_j] = 0:
sample a separate weight matrix W for each example in the minibatch
not computationally efficient: M × 1000 × 1000 = M million random numbers would have to be sampled
Local Reparameterization Trick
Solution: the local reparameterization trick, ε → f(ε)
The weights (and therefore ε) influence the expected log-likelihood only through the neuron activations B
Sample B directly, instead of sampling W or ε
Example: for a factorized Gaussian posterior on the weights, the posterior on the activations is also a factorized Gaussian:
q_φ(w_{i,j}) = N(μ_{i,j}, σ²_{i,j}) ∀ w_{i,j} ∈ W  ⇒  q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),
with γ_{m,j} = Σ_{i=1}^{1000} a_{m,i} μ_{i,j},  δ_{m,j} = Σ_{i=1}^{1000} a²_{m,i} σ²_{i,j}    (6)
Computational cost: only M × 1000 random numbers, a thousand-fold saving, because b_{m,j} = γ_{m,j} + √(δ_{m,j}) ζ_{m,j} with ζ_{m,j} ∼ N(0, 1)
The local reparameterization trick also leads to an estimator with lower variance
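A minimal sketch of this trick (shapes and initialization are assumptions, not the authors' code): for a factorized Gaussian posterior on W, the pre-activations B are sampled directly from the Gaussian of Eq. (6), requiring M × 1000 noise samples instead of M × 1000 × 1000.

```python
# Minimal sketch of the local reparameterization trick (assumed toy sizes).
import numpy as np

rng = np.random.default_rng(0)
M, K = 128, 1000                         # minibatch size, layer width
A = rng.normal(size=(M, K))              # input activations for this layer
mu = 0.01 * rng.normal(size=(K, K))      # posterior means of W
sigma = 0.1 * np.ones((K, K))            # posterior std devs of W

def sample_B_naive(A):
    # One weight matrix per example: M * K * K random numbers (not called here;
    # for M = 128, K = 1000 this would allocate ~128 million samples).
    eps = rng.normal(size=(A.shape[0], K, K))
    W = mu + sigma * eps
    return np.einsum('mk,mkl->ml', A, W)

def sample_B_local(A):
    # Eq. (6): sample the activations directly, only M * K random numbers.
    gamma = A @ mu                        # means gamma_{m,j}
    delta = (A ** 2) @ (sigma ** 2)       # variances delta_{m,j}
    zeta = rng.normal(size=gamma.shape)   # zeta_{m,j} ~ N(0, 1)
    return gamma + np.sqrt(delta) * zeta

B = sample_B_local(A)
print(B.shape)                            # (128, 1000)
```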
Dropout
For a fully connected neural network, dropout corresponds to
B = (A ∘ ξ)θ, with ξ_{i,j} ∼ p(ξ_{i,j})
A: M × K matrix of input features
θ: K × L weight matrix
B: M × L output matrix for the current layer (before the nonlinearity)
ξ: M × K matrix of independent noise variables
ξ can be Bernoulli distributed
Gaussian distribution N(1, α) for ξ works as well or better
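As a quick illustration (toy layer sizes assumed), Gaussian dropout amounts to multiplying each input element by noise drawn from N(1, α) before the matrix product:

```python
# Minimal sketch of Gaussian dropout, B = (A ∘ ξ)θ with ξ_{i,j} ~ N(1, α).
import numpy as np

rng = np.random.default_rng(0)
M, K, L = 32, 100, 50                      # assumed layer sizes
A = rng.normal(size=(M, K))                # input features
theta = 0.1 * rng.normal(size=(K, L))      # weight matrix
alpha = 0.25                               # noise variance ("dropout rate")

xi = rng.normal(1.0, np.sqrt(alpha), size=(M, K))   # multiplicative Gaussian noise
B = (A * xi) @ theta                       # pre-activations under dropout
print(B.shape)                             # (32, 50)
```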
Variational Dropout
With independent weight noise: if the elements of ξ are drawn independently from N(1, α), then b_{m,j} ∈ B is Gaussian:
q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),
with γ_{m,j} = Σ_{i=1}^{K} a_{m,i} θ_{i,j}  and  δ_{m,j} = α Σ_{i=1}^{K} a²_{m,i} θ²_{i,j}    (7)
Equation (7) can be interpreted as B = AW, where q_φ(w_{i,j}) = N(θ_{i,j}, α θ²_{i,j}).
With correlated weight noise:
B = (A ∘ ξ)θ, ξ_{i,j} ∼ N(1, α)  ⇔  b_m = a_m W,
with W = (w′_1, w′_2, . . . , w′_K)′,  w_i = s_i θ_i,  q_φ(s_i) = N(1, α)    (8)
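The independent-weight-noise case can be checked numerically; the sketch below (toy sizes assumed, not from the paper) compares Monte Carlo moments of B = (A ∘ ξ)θ against the analytic mean γ and variance δ from Eq. (7).

```python
# Minimal sketch: verify the marginal moments of Eq. (7) by simulation.
import numpy as np

rng = np.random.default_rng(0)
M, K, L, alpha = 4, 20, 3, 0.5
A = rng.normal(size=(M, K))
theta = rng.normal(size=(K, L))

gamma = A @ theta                          # analytic means from Eq. (7)
delta = alpha * (A ** 2) @ (theta ** 2)    # analytic variances from Eq. (7)

samples = np.stack([
    (A * rng.normal(1.0, np.sqrt(alpha), size=(M, K))) @ theta
    for _ in range(20_000)
])
print(np.allclose(samples.mean(axis=0), gamma, atol=0.1))   # True up to MC error
print(np.allclose(samples.var(axis=0), delta, rtol=0.1))    # True up to MC error
```

The matching marginal moments are what justify reading Eq. (7) as B = AW with q_φ(w_{i,j}) = N(θ_{i,j}, α θ²_{i,j}).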
Scale-invariant Prior and Variational Objective
Scale-invariant Prior:
In dropout training, θ is adapted to maximize the expected log-likelihood.
To be consistent with optimization of the variational lower bound, choose the prior p(w) such that D_KL(q_φ(w) || p(w)) does not depend on θ:
p(log |w_{i,j}|) ∝ c
This is called the scale-invariant log-uniform prior.
Variational Objective:
q_φ(W) can be decomposed into the parameter θ, which captures the mean, and a multiplicative noise term determined by the parameter α.
Dropout training maximizes the following variational lower bound:
E_{q_α}[L_D(θ)] − D_KL(q_α(w) || p(w))    (9)
Adaptive Dropout Rate
−D_KL(q_φ(w) || p(w)) is not analytically tractable, but can be approximated by
−D_KL(q_φ(w_i) || p(w_i)) ≈ constant + 0.5 log(α) + c₁α + c₂α² + c₃α³
Maximize the variational lower bound with respect to α, instead of fixing α by hand.
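A small sketch of this approximation as a function of α; the polynomial coefficients below are the fitted constants I believe the paper reports, so treat their exact values as an assumption.

```python
# Cubic approximation of the per-weight negative KL term (additive constant dropped).
import numpy as np

C1, C2, C3 = 1.16145124, -1.50204118, 0.58629921   # assumed fitted constants

def neg_kl_approx(alpha):
    """Approximates -D_KL(q_phi(w_i) || p(w_i)) + const for the log-uniform prior."""
    return 0.5 * np.log(alpha) + C1 * alpha + C2 * alpha ** 2 + C3 * alpha ** 3

for a in (0.01, 0.1, 0.5, 1.0):
    print(f"alpha = {a:4}:  -KL approx = {neg_kl_approx(a):+.4f}")
```

Summing this term over the weights and adding it to the SGVB estimate of L_D gives the lower bound of Eq. (9), which can then be maximized jointly over θ and α.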
Experiments
Datasets:
MNIST
CIFAR-10
Compare with three methods:
standard binary dropout
Gaussian dropout type A: correlated weight noise
Gaussian dropout type B: independent weight noise
Experiments
Variance of gradient
Table 1: Average empirical variance of minibatch stochastic gradient estimates (1000 examples) for a fully connected neural network, regularized by variational dropout with independent weight noise.
Speed:
Without the local reparameterization trick: 1635 seconds per epoch
With the local reparameterization trick: 7.4 seconds per epoch
Experiments
Figure 1: (a) Comparison of various dropout methods when applied to fully connected neural networks for classification. Shown is the classification error of networks with 3 hidden layers, averaged over 5 runs. The variational versions of Gaussian dropout perform equal to or better than their non-adaptive counterparts. (b) Comparison of dropout methods when applied to a convolutional net for different settings of the network size k. The network has two convolutional layers with 32k and 64k feature maps respectively, followed by two fully connected layers with 128k hidden units each.