
Tighter Variational Bounds are Not Necessarily Better

Tom Rainforth 1, Adam R. Kosiorek 1 2, Tuan Anh Le 2, Chris J. Maddison 1, Maximilian Igl 2, Frank Wood 2, Yee Whye Teh 1

Abstract

We provide theoretical and empirical evidence that using tighter evidence lower bounds (ELBOs) can be detrimental to the process of learning an inference network by reducing the signal-to-noise ratio of the gradient estimator. Our results call into question common implicit assumptions that tighter ELBOs are better variational objectives for simultaneous model learning and inference amortization schemes. Based on our insights, we introduce three new algorithms: the partially importance weighted auto-encoder (PIWAE), the multiply importance weighted auto-encoder (MIWAE), and the combination importance weighted auto-encoder (CIWAE), each of which includes the standard importance weighted auto-encoder (IWAE) as a special case. We show that each can deliver improvements over IWAE, even when performance is measured by the IWAE target itself. Moreover, PIWAE can simultaneously deliver improvements in both the quality of the inference network and generative network, relative to IWAE.

1. Introduction

Variational bounds provide tractable and state-of-the-art objectives for training deep generative models (Kingma and Welling, 2014; Rezende et al., 2014). Typically taking the form of a lower bound on the intractable model evidence, they provide surrogate targets that are more amenable to optimization. In general, this optimization requires the generation of approximate posterior samples during the model training, and so a number of methods look to simultaneously learn an inference network alongside the target generative network. This assists in the training process and provides an amortized inference artifact, the inference network, which can be used at test time (Kingma and Welling, 2014; Rezende et al., 2014).

The performance of variational bounds depends upon the

1 Department of Statistics, University of Oxford. 2 Department of Engineering, University of Oxford. Correspondence to: Tom Rainforth <[email protected]>.

choice of evidence lower bound (ELBO) and the formulation of the inference network, with the two often intricately linked to one another; if the inference network formulation is not sufficiently expressive, this can have a knock-on effect on the generative network (Burda et al., 2016). In choosing the ELBO, it is often implicitly assumed in the literature that using tighter ELBOs is universally beneficial.

In this work we question this implicit assumption by demonstrating that, although using a tighter ELBO is typically beneficial to gradient updates of the generative network, it can be detrimental to updates of the inference network. Specifically, we present theoretical and empirical evidence that increasing the number of importance sampling particles, T, to tighten the bound in the importance-weighted auto-encoder (IWAE) (Burda et al., 2016) degrades the signal-to-noise ratio (SNR) of the gradient estimates for the inference network, inevitably degrading the overall learning process. Our results suggest that it may be best to use distinct objectives for learning the generative and inference networks, or that, when using the same target, it should take into account the needs of both networks: tighter bounds may be better for training the generative network, while looser bounds are often preferable for training the inference network.

Based on these insights, we introduce three new algorithms: the partially importance-weighted auto-encoder (PIWAE), the multiply importance-weighted auto-encoder (MIWAE), and the combination importance-weighted auto-encoder (CIWAE). Each of these includes IWAE as a special case and uses the same T importance weights, but uses these weights in different ways to ensure a higher SNR for the inference network.

We demonstrate that the optimal setting for each of these algorithms improves in a variety of ways over IWAE. In particular, our new algorithms produce inference networks more closely matched to the true posterior. Remarkably, when treating the IWAE objective as the measure of performance, all our algorithms outperform IWAE. Finally, in the settings we considered, PIWAE improves uniformly over IWAE: it achieves improved final marginal likelihood scores, produces better inference networks, and trains faster.

arXiv:1802.04537v1 [stat.ML] 13 Feb 2018


2. Background and Notation

Let x be an X-valued random variable defined via a process involving an unobserved Z-valued random variable z with joint density pθ(x, z). Direct maximum likelihood estimation of θ is generally intractable if pθ(x, z) is a deep generative model, due to the marginalization of z. A common strategy is to optimize instead a variational lower bound on log pθ(x), defined via an auxiliary inference model qφ(z|x):

ELBOVAE(θ, φ, x) := ∫ qφ(z|x) log [pθ(x, z) / qφ(z|x)] dz        (1)
                  = log pθ(x) − KL(qφ(z|x) || pθ(z|x)).        (2)

Typically, qφ is parameterized by a neural network, for which the approach is known as the variational auto-encoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014). Optimization is performed with stochastic gradient ascent (SGA) using unbiased estimators of ∇θ,φ ELBOVAE(θ, φ, x). If qφ is reparameterizable (Kingma and Welling, 2014; Rezende et al., 2014), then given a reparameterized sample z ∼ qφ(z|x), the gradients ∇θ,φ (log pθ(x, z) − log qφ(z|x)) can be used for the optimization.

The VAE objective places a harsh penalty on mismatch between qφ(z|x) and pθ(z|x); optimizing jointly in θ, φ can confound improvements in log pθ(x) with reductions in the KL (Turner and Sahani, 2011). Thus, research has looked to develop bounds that separate the tightness of the bound from the expressiveness of the class of qφ. For example, the IWAE objectives (Burda et al., 2016), which we denote as ELBOIS(θ, φ, x), are a family of bounds defined by

QIS(z1:K |x) := ∏_{k=1}^{K} qφ(zk|x),        (3)

ẐIS(z1:K, x) := (1/K) ∑_{k=1}^{K} pθ(x, zk) / qφ(zk|x),        (4)

ELBOIS(θ, φ, x) := ∫ QIS(z1:K |x) log ẐIS(z1:K, x) dz1:K ≤ log pθ(x).        (5)

The IWAE objectives generalize the VAE objective (K = 1 corresponds to the VAE), and the bounds become strictly tighter as K increases (Burda et al., 2016). When the family of qφ contains the true posteriors, the global optimum parameters {θ∗, φ∗} are independent of K; see, for example, Le et al. (2017). Still, except for the most trivial models, it is not usually the case that qφ contains the true posteriors, and Burda et al. (2016) provide strong empirical evidence that setting K > 1 leads to significant empirical gains over the VAE in terms of learning the generative model.

Optimizing tighter bounds is usually empirically associated with better models pθ in terms of marginal likelihood on held-out data. Other related approaches extend this to sequential Monte Carlo (SMC) (Maddison et al., 2017; Le et al., 2017; Naesseth et al., 2017) or change the lower bound that is optimized to reduce the bias (Li and Turner, 2016; Bamler et al., 2017). A second, unrelated, approach is to tighten the bound by improving the expressiveness of qφ (Salimans et al., 2015; Tran et al., 2015; Rezende and Mohamed, 2015; Kingma et al., 2016; Maaløe et al., 2016; Ranganath et al., 2016). In this work we focus on the former, algorithmic, approaches to tighter bounds.
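To make the objects above concrete, the following is a minimal sketch (our illustration, not the authors' code) of a single-sample Monte Carlo estimate of ELBOIS from (5); K = 1 recovers ELBOVAE from (1). The names log_joint and proposal are hypothetical stand-ins for pθ(x, z) and qφ(z|x), here instantiated with the linear-Gaussian toy model used later in Section 4.

```python
# Minimal sketch (illustration only): one-sample estimate of ELBO_IS from (5).
# `log_joint` and `proposal` are hypothetical stand-ins for p_theta(x, z) and
# q_phi(z | x); K = 1 recovers the VAE bound of (1).
import math
import torch

def elbo_is(log_joint, proposal, x, K):
    q = proposal(x)                          # q_phi(z | x), reparameterizable
    z = q.rsample((K,))                      # K particles, gradients flow through them
    log_w = log_joint(x, z) - q.log_prob(z)  # log w_k = log p(x, z_k) - log q(z_k | x)
    # log ZHat_IS = log (1/K) sum_k w_k, computed stably in log space
    return torch.logsumexp(log_w, dim=0) - math.log(K)

# Toy linear-Gaussian instantiation (assumed, for illustration):
D = 2
theta = torch.zeros(D, requires_grad=True)               # generative mean mu
A = torch.eye(D, requires_grad=True)                     # inference network phi = (A, b)
b = torch.zeros(D, requires_grad=True)

def log_joint(x, z):                                     # log N(z; mu, I) + log N(x; z, I)
    pz = torch.distributions.Independent(torch.distributions.Normal(theta, 1.0), 1)
    px = torch.distributions.Independent(torch.distributions.Normal(z, 1.0), 1)
    return pz.log_prob(z) + px.log_prob(x)

def proposal(x):                                         # q_phi(z | x) = N(z; Ax + b, (2/3) I)
    return torch.distributions.Independent(
        torch.distributions.Normal(x @ A.T + b, math.sqrt(2.0 / 3.0)), 1)

elbo = elbo_is(log_joint, proposal, torch.randn(D), K=8)
elbo.backward()                                          # unbiased gradient estimates for theta, A, b
```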

3. Assessing the Signal-to-Noise Ratio of the Gradient Estimators

Because it is not feasible to analytically optimize any ELBO in complex models, the effectiveness of any particular choice of ELBO is linked to our ability to numerically solve the resulting optimization problem. This motivates us to examine the effect K has on the variance and magnitude of the gradient estimates of IWAE for the two networks. More generally, we study IWAE gradient estimators constructed as the average of M estimates, each built from K independent particles. We present a result characterizing the asymptotic signal-to-noise ratio in M and K. For the standard case of M = 1, our result shows that the signal-to-noise ratio of the reparameterization gradients of the inference network for the IWAE decreases at rate O(1/√K).

Specifically, as estimating the ELBO requires a Monte Carlo estimation of an expectation over z, we have two sample sizes to tune for the estimate: the number of samples M used for Monte Carlo estimation of the ELBO and the number of importance samples K used in the bound construction. Here M does not change the true value of ∇θ,φ ELBO, only our variance in estimating it, while changing K changes the ELBO itself, with larger K leading to tighter bounds (Burda et al., 2016). Presuming that reparameterization is possible, we can express our gradient estimate in the general form

∆M,K := (1/M) ∑_{m=1}^{M} ∇θ,φ log Ẑm,K,        (6)

where Ẑm,K = (1/K) ∑_{k=1}^{K} wm,k,   wm,k = pθ(zm,k, x) / qφ(zm,k|x),        (7)

and each zm,k i.i.d. ∼ qφ(zm,k|x). Thus, for a fixed budget of T = MK particle weights, we have a family of estimators with the case K = 1 corresponding to the VAE objective and the case M = 1 corresponding to IWAE. We will use ∆M,K(θ) to refer to gradient estimates with respect to θ and ∆M,K(φ) for the same with respect to φ.

Variance is not a good enough barometer for the effectiveness of a gradient estimation scheme; estimators with small expected values need proportionally smaller variances to estimate accurately. In the case of IWAE, when changes in M and K simultaneously affect both the variance and the expected value, the quality of the estimator for learning can worsen even as the variance decreases. To see why, consider the marginal likelihood estimates Ẑm,K. Because these become exact (and thus independent of the proposal) as K → ∞, it must be the case that lim_{K→∞} ∆M,K(φ) = 0. Thus, as K becomes large, the expected value of the gradient must decrease along with its variance, such that the variance relative to the problem scaling need not actually improve.

Specifically, we introduce the signal-to-noise ratio, defined as the absolute value of the expected gradient scaled by its standard deviation:

SNRM,K(θ) = | E[∆M,K(θ)] / σ[∆M,K(θ)] |        (8)

where σ[·] denotes the standard deviation of a random variable. SNRM,K(φ) is defined analogously, and the SNR is defined separately on each dimension of the parameter vector. The SNR provides a measure of the relative accuracy of the gradient estimates. Though a high SNR does not always indicate a good SGA scheme (as the target objective itself might be poorly chosen), a low SNR is always problematic because it indicates that the gradient estimates are dominated by noise: if SNR → 0 then the estimates become completely random. We are now ready to state our main theoretical result, which at a high level is that SNRM,K(θ) = O(√(MK)) and SNRM,K(φ) = O(√(M/K)).
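Note that the SNR of (8) is straightforward to estimate empirically by repeatedly sampling a gradient estimator; a minimal sketch of such a helper (ours, not from the paper) is:

```python
# Sketch of an empirical SNR estimate per (8): |mean| / std, computed
# separately for each dimension of the parameter vector.
import torch

def empirical_snr(sample_gradient, num_repeats=10_000):
    """`sample_gradient()` (assumed) returns one draw of Delta_{M,K} as a 1-D tensor."""
    grads = torch.stack([sample_gradient() for _ in range(num_repeats)])
    return grads.mean(dim=0).abs() / grads.std(dim=0)
```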

Theorem 1. Assume that when M = K = 1, the expected gradients, the variances of the gradients, and the first four moments of w1,1, ∇θw1,1, and ∇φw1,1 are all finite, and that the variances are also non-zero. Then the signal-to-noise ratios of the gradient estimates converge at the following rates:

SNRM,K(θ) = √M | ( √K ∇θZ − (1/(2Z√K)) ∇θ(Var[w1,1]/Z²) + O(1/K^{3/2}) ) / ( √(E[w1,1² (∇θ log w1,1 − ∇θ log Z)²]) + O(1/K) ) |        (9)

SNRM,K(φ) = √M | ( ∇φVar[w1,1] + O(1/K) ) / ( 2Z√K σ[∇φw1,1] + O(1/√K) ) |        (10)

where Z := pθ(x) is the true marginal likelihood.

Proof. We give only an intuitive demonstration of the high-level result here and provide a formal proof in Appendix A. The effect of M on the SNR follows from applying the law of large numbers to the random variable ∇θ,φ log Ẑm,K: the overall expectation is independent of M and the variance reduces at a rate O(1/M). The effect of K is more complicated but is perhaps most easily seen by noting (as shown by Burda et al. (2016)) that

∇θ,φ log Ẑm,K = ∑_{k=1}^{K} [ wm,k / ∑_{ℓ=1}^{K} wm,ℓ ] ∇θ,φ log wm,k,

such that ∇θ,φ log Ẑm,K can be interpreted as a self-normalized importance sampling estimate. We can, therefore, invoke the known result (see e.g. Hesterberg (1988)) that the bias of a self-normalized importance sampler converges at a rate O(1/K) and the standard deviation at a rate O(1/√K). We thus see that the SNR converges at a rate O((1/K)/(1/√K)) = O(1/√K) if the asymptotic gradient is 0, and O(1/(1/√K)) = O(√K) otherwise, giving the convergence rates in the φ and θ cases respectively.

The implication of these convergence rates is that increasing M is monotonically beneficial to the SNR for both θ and φ, but that increasing K is beneficial to the former and detrimental to the latter.

An important point of note is that, for large K, the direction of the expected inference network gradient is independent of K. Namely, because we have as an intermediary result from deriving the SNRs that

E[∆M,K(φ)] = −∇φVar[w1,1] / (2KZ²) + O(1/K²),        (11)

we see that the expected gradient points in the direction of −∇φVar[w1,1] as K → ∞. This direction is rather interesting: it implies that as K → ∞, the optimal φ is that which minimizes the variance of the weights. This is well known to be the optimal importance sampling distribution in terms of approximating the posterior (Owen, 2013). Though it is not also necessarily the optimal proposal in terms of estimating the ELBO,¹ this is nonetheless an interesting equivalence that complements the results of Cremer et al. (2017). It suggests that increasing K may provide a preferable target in terms of the direction of the true inference network gradients, creating a trade-off with the fact that it also diminishes the SNR, reducing the estimates to pure noise if K is set too high. In the absence of other factors, there may thus be a "sweet spot" for setting K.

Typically, when training deep generative models, one does not optimize a single ELBO but instead its average over multiple data points, i.e.

J(θ, φ) := (1/N) ∑_{n=1}^{N} ELBOIS(θ, φ, x^(n)).        (12)

¹ The optimum importance sampling proposal for calculating expectations of a particular function is distinct from that which best approximates the posterior (Owen, 2013; Ruiz et al., 2016).


(a) IWAE inference network gradient estimates   (b) IWAE generative network gradient estimates

Figure 1: Histograms of gradient estimates ∆M,K for the inference network and the generative network using the IWAE (M = 1) objective with different values of K.

Our results extend to this setting because the z are drawn independently for each x^(n), so

E[ (1/N) ∑_{n=1}^{N} ∆^(n)_{M,K} ] = (1/N) ∑_{n=1}^{N} E[∆^(n)_{M,K}],        (13)

Var[ (1/N) ∑_{n=1}^{N} ∆^(n)_{M,K} ] = (1/N²) ∑_{n=1}^{N} Var[∆^(n)_{M,K}].        (14)

We thus also see that if we are using mini-batches, such that N is a chosen parameter and the x^(n) are drawn from the empirical data distribution, then the SNRs of ∆N,M,K := (1/N) ∑_{n=1}^{N} ∆^(n)_{M,K} scale as √N, i.e. SNRN,M,K(θ) = O(√(NMK)) and SNRN,M,K(φ) = O(√(NM/K)). Therefore, increasing N has the same ubiquitous benefit as increasing M. In the rest of the paper, we will implicitly be considering the SNRs for ∆N,M,K, but will omit the dependency on N to simplify the notation.
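As a concrete illustration of the estimator family in (6)–(7), the sketch below (ours, assuming the log-weights have already been computed from reparameterized samples) splits a fixed budget of T = MK log-weights into M groups of K; this same construction reappears as the MIWAE objective in Section 5.

```python
# Sketch of the Delta_{M,K} family of (6)-(7): differentiate the average of M
# independent K-particle log marginal likelihood estimates. `log_weights` is
# assumed to be a (T,)-shaped tensor of log w built from reparameterized samples.
import math
import torch

def delta_mk_objective(log_weights, M, K):
    """(1/M) sum_m log ZHat_{m,K}; M = T, K = 1 is the VAE, M = 1, K = T is IWAE."""
    assert log_weights.numel() == M * K, "budget must satisfy T = M * K"
    log_z_hat = torch.logsumexp(log_weights.reshape(M, K), dim=1) - math.log(K)
    return log_z_hat.mean()   # calling .backward() on this yields Delta_{M,K}
```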

4. Empirical Confirmation

Our convergence results hold exactly in relation to M (and N) but are only asymptotic in K due to the higher-order terms. Therefore their applicability should be viewed with a healthy degree of skepticism in the small-K regime. With this in mind, we now present empirical support for our theoretical results, and test how well they hold in the small-K regime, using a simple Gaussian model for which we can analytically calculate the ground truth. Consider a family of generative models with R^D-valued latent variable z and observed variable x:

z ∼ N(z; µ, I),   x|z ∼ N(x; z, I),        (15)

which is parameterized by θ := µ. Let the inference network be parameterized by φ = (A, b), A ∈ R^{D×D}, b ∈ R^D, where qφ(z|x) = N(z; Ax + b, (2/3)I). Given a dataset (x^(n))_{n=1}^{N}, we can analytically calculate the optimum of our target J(θ, φ) as explained in Appendix B, giving θ∗ := µ∗ = (1/N) ∑_{n=1}^{N} x^(n) and φ∗ := (A∗, b∗), where A∗ = I/2 and b∗ = µ∗/2. For this particular problem, the optimal proposal is independent of K. This will not be the case in general unless the family of possible qφ contains the true posteriors pθ(z|x^(n)). Further, even for this problem, the expected gradients for the inference network still change with K.

To conduct our investigation, we randomly generated a synthetic dataset from the model with D = 20 dimensions, N = 1024 data points, and a true model parameter value µtrue that was itself randomly generated from a unit Gaussian, i.e. µtrue ∼ N(µtrue; 0, I). We then considered the gradient at a random point in the parameter space close to the optimum,² namely with each dimension of each parameter randomly offset from its optimum value using a zero-mean Gaussian with standard deviation 0.01. We then calculated empirical estimates of the ELBO gradients for IWAE, where M = 1 is held fixed and we consider increasing K, and for VAE, where K = 1 is held fixed and we consider increasing M. In all cases we calculated 10⁴ such estimates and used these samples to provide empirical estimates of, amongst other things, the mean and standard deviation of the estimator, and thereby an empirical estimate of the SNR. For the inference network, we predominantly focused on investigating the gradients of b.

We start by examining the qualitative behavior of the different gradient estimators as K increases, as shown in Figure 1, which gives histograms of the gradient estimators for a single parameter of the inference network (left) and the generative network (right). We first see in Figure 1a that as K increases, both the magnitude and the standard deviation of the estimator decrease for the inference network, with the former decreasing faster. This matches the qualitative behavior of our theoretical result, with the SNR diminishing as K increases.

² We consider the behavior for points far away from the optimum in Appendix C.3.


(a) Convergence of SNR for inference network   (b) Convergence of SNR for generative network

Figure 2: Convergence of signal-to-noise ratios of gradient estimates with total budget MK. Different lines correspond to different dimensions of the parameter vectors. Shown in blue is IWAE, where we keep M = 1 fixed and increase K. Shown in red is VAE, where K = 1 is fixed and we increase M. The black and green dashed lines show the expected convergence rates from our theoretical results, representing gradients of 1/2 and −1/2 respectively.

In particular, the probability of the gradient being positive or negative becomes roughly even for the larger values of K, meaning the optimizer is equally likely to increase as to decrease the inference network parameters at the next iteration. By contrast, for the generative network, IWAE converges towards a non-zero gradient, such that, even though the SNR initially decreases with K, it then rises again, with a very clear gradient signal for K = 1000.

To provide a more rigorous analysis, we next directly examine the convergence of the SNR. Figure 2 shows the convergence of the estimators with increasing M and K. The observed rates for the inference network (Figure 2a) correspond to our theoretical results, with the suggested rates observed all the way back to K = M = 1. As expected, we see that as M increases, so does SNRM,K(b), but as K increases, SNRM,K(b) reduces.

In Figure 2b, we see that the theoretical convergence for SNRM,K(µ) is again observed exactly for variations in M, but a more unusual behavior is seen for variations in K, where the SNR initially decreases before starting to increase again for large enough K, eventually exhibiting behavior consistent with the theoretical result. The driving factor for this appears to be that, at least for this model, E[∆M,∞(µ)] typically has a smaller magnitude than (and often the opposite sign to) E[∆M,1(µ)] (see Figure 1b). If we think of the estimators for all values of K as biased estimates of E[∆M,∞(µ)], we see from our theoretical results that this bias decreases faster than the standard deviation. Consequently, if reducing this bias causes the magnitude of the expected gradient to diminish, increasing K can initially cause the SNR to reduce for the generative network.

Note that this does not mean that the estimates are getting worse for the generative network. As we increase K, our bound is getting tighter and our estimates closer to the true gradient for the target that we actually want to optimize, i.e. ∇µ log Z. See Appendix C.2 for more details.
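For reference, a condensed sketch of this measurement (our reconstruction under the setup of (15), not the authors' code), estimating the per-dimension SNR of the IWAE (M = 1) gradients of b for increasing K:

```python
# Sketch reproducing the Section 4 measurement: empirical SNR_{1,K}(b) for the
# Gaussian model of (15), with phi perturbed around the analytic optimum.
import math
import torch

torch.manual_seed(0)
D, N = 20, 1024
mu_true = torch.randn(D)
X = mu_true + torch.randn(N, D)                      # synthetic dataset
mu_star = X.mean(dim=0)                              # theta* (Appendix B)
A = 0.5 * torch.eye(D) + 0.01 * torch.randn(D, D)    # phi offset from (A*, b*)
b = (0.5 * mu_star + 0.01 * torch.randn(D)).requires_grad_()

def grad_b_sample(x, K):
    """One draw of the K-particle reparameterized gradient of log ZHat w.r.t. b."""
    b.grad = None
    q = torch.distributions.Independent(
        torch.distributions.Normal(x @ A.T + b, math.sqrt(2.0 / 3.0)), 1)
    z = q.rsample((K,))
    log_pz = torch.distributions.Independent(
        torch.distributions.Normal(mu_star, 1.0), 1).log_prob(z)
    log_px = torch.distributions.Independent(
        torch.distributions.Normal(z, 1.0), 1).log_prob(x)
    log_w = log_pz + log_px - q.log_prob(z)
    (torch.logsumexp(log_w, dim=0) - math.log(K)).backward()
    return b.grad.clone()

x = X[0]
for K in (1, 10, 100, 1000):
    grads = torch.stack([grad_b_sample(x, K) for _ in range(1000)])
    print(K, (grads.mean(0).abs() / grads.std(0)).mean().item())
```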

It is also the case that increasing K could be beneficial for the inference network, even if it reduces the SNR, by improving the direction of the expected gradient. However, we will return to a metric that examines this direction in Figure 4, where we will see that the SNR seems to be the dominant effect for the inference network.

As a reassurance that our chosen definition of the SNR is appropriate for the problem at hand, and to examine the effect of multiple dimensions explicitly, we now also consider an alternative definition of the SNR that is similar (though not identical) to that used by Roberts and Tedrake (2009). We refer to this as the "directional" SNR (DSNR). At a high level, we define the DSNR by splitting each gradient estimate into two component vectors, one parallel to the true gradient and one perpendicular, then taking the expectation of the ratio of their magnitudes. More precisely, we define u = E[∆M,K] / ‖E[∆M,K]‖₂ as the true normalized gradient direction, and then the DSNR as

DSNRM,K = E[ ‖∆∥‖₂ / ‖∆⊥‖₂ ],  where  ∆∥ = (∆M,K^T u) u  and  ∆⊥ = ∆M,K − ∆∥.        (16)
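A small sketch of this computation (our helper, not from the paper), given a batch of gradient samples:

```python
# Sketch of the DSNR of (16): split each gradient sample into components
# parallel and perpendicular to the estimated true direction u, then average
# the ratio of their norms. `grads` is a (num_samples, dim) tensor.
import torch

def empirical_dsnr(grads):
    u = grads.mean(dim=0)
    u = u / u.norm()                           # estimated true normalized direction
    par = (grads @ u).unsqueeze(1) * u         # Delta_parallel
    perp = grads - par                         # Delta_perp
    return (par.norm(dim=1) / perp.norm(dim=1)).mean()
```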

The DSNR thus provides a measure of the expected proportion of the gradient that points in the true direction. For perfect estimates of the gradients, DSNR → ∞, but, unlike the SNR, arbitrarily bad estimates do not have DSNR = 0, because even random vectors will have a component of their gradient in the true direction.

The convergence of the DSNR is shown in Figure 3, for which the true normalized gradient u has been estimated empirically, noting that this varies with K. We see a similar qualitative behavior to the SNR, with the gradients of IWAE for the inference network degrading to having the same directional accuracy as a randomly drawn vector.


(a) Convergence of DSNR for inference network   (b) Convergence of DSNR for generative network

Figure 3: Convergence of the directional signal-to-noise ratio of gradient estimates with total budget MK. The solid lines show the estimated DSNR and the shaded regions the interquartile range of the individual ratios. Also shown for reference is the DSNR for a randomly generated vector where each component is drawn from a unit Gaussian.

(a) Convergence of DSNR for inference network   (b) Convergence of DSNR for generative network

Figure 4: Convergence of the directional signal-to-noise ratio of gradient estimates where the true gradient is taken as E[∆1,1000]. Figure conventions as per Figure 3.

Interestingly, the DSNR seems to be following the same asymptotic convergence behavior as the SNR for the generative network, and for the inference network in M (as shown by the dashed lines), even though we have no theoretical result to suggest this should occur.

As our theoretical results suggest that the direction of the true gradients corresponds to targeting an improved objective as K increases, we now examine whether this or the change in the SNR is the dominant effect. To this end, we repeat our calculations for the DSNR but take u as the true direction of the gradient for K = 1000. This provides a measure of how varying M and K affects the quality of the gradient directions as biased estimators of E[∆1,1000] / ‖E[∆1,1000]‖₂. As shown in Figure 4, increasing K is still detrimental for the inference network by this metric, even though it brings the expected gradient estimate closer to the true gradient. By contrast, increasing K is now monotonically beneficial for the generative network. Increasing M with K = 1 leads to initial improvements for the inference network before plateauing due to the bias of the estimator. For the generative network, increasing M has little impact, with the bias being the dominant factor throughout. Though this metric is not an absolute measure of performance of the SGA scheme, e.g. because high bias may be more detrimental than high variance, it is nonetheless a powerful result in suggesting that increasing K can be detrimental to learning the inference network.

5. New Estimators

Based on our previous findings, we now introduce three new algorithms that address the issue of diminishing SNR for the inference network. Our first, MIWAE, is exactly equivalent to the general formulation given in (6), the distinction coming from the fact that we use both M > 1 and K > 1, which has not been previously considered in the literature. The motivation is that, because the inference network SNR increases as O(√(M/K)), we should be able to mitigate the issues increasing K has on the SNR by also increasing M. For fairness, we keep the overall budget T = MK fixed, but we will show that, given this budget, the optimal value for K is often not 1.

Our second algorithm, CIWAE, is simply a linear combination


(a) IWAE-64   (b) log p(x)   (c) −KL(Qφ(z|x)||Pθ(z|x))

Figure 5: Convergence of the evaluation metrics IWAE-64, log p(x), and −KL(Qφ(z|x)||Pθ(z|x)) during training, for IWAE, CIWAE (β = 0.5), PIWAE (8/8), MIWAE (8/8), and VAE. All lines show mean ± standard deviation over 4 runs.

(a) Comparing MIWAE and IWAE   (b) Comparing CIWAE and IWAE   (c) Comparing PIWAE and IWAE

Figure 6: Performance of MIWAE, CIWAE, and PIWAE relative to IWAE in terms of the metrics ELBOIWAE with K = 64 (top row), log p(x) (middle row), and −KL(Qφ(z|x)||Pθ(z|x)) (bottom row). All dots are the difference in the metric between a model trained on IWAE-64 and the experimental condition; the dotted line is the IWAE baseline. Note that in all cases, the far left of each plot corresponds to settings equivalent to IWAE.

of IWAE and VAE, namely

ELBOCIWAE = β ELBOVAE + (1 − β) ELBOIWAE,        (17)

where β ∈ [0, 1] is a combination parameter. It is trivial to see that ELBOCIWAE is a lower bound on the log marginal that is tighter than the VAE bound but looser than the IWAE bound. We then employ the following estimator:

∆C_{K,β} = ∇θ,φ ( β (1/K) ∑_{k=1}^{K} log wk + (1 − β) log ( (1/K) ∑_{k=1}^{K} wk ) ),        (18)

where we use the same wk for both terms. The motivation for CIWAE is that, if we set β to a relatively small value, the objective will behave mostly like IWAE, except when the expected IWAE gradient becomes very small. When this happens, the VAE component should "take over" and alleviate SNR issues: the asymptotic SNR of ∆C_{K,β} for φ is O(√(MK)), because the VAE component has a non-zero expectation in the limit K → ∞.

Our results suggest that what is good for the generative network, in terms of setting K, is often detrimental for the inference network. It is therefore natural to question

whether it is sensible to always use the same target for both the inference and generative networks. Motivated by this, our third method, PIWAE, uses the IWAE target when training the generative network, but the MIWAE target for training the inference network. We thus have

∆P_K(θ) = ∇θ log (1/K) ∑_{k=1}^{K} wk,        (19a)

∆P_{M,L}(φ) = (1/M) ∑_{m=1}^{M} ∇φ log (1/L) ∑_{ℓ=1}^{L} wm,ℓ,        (19b)

where we will generally set K = ML so that the same weights can be used for both gradients.
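The following sketch (ours) summarizes how the CIWAE objective of (18) and the PIWAE pair of (19a)–(19b) can be computed from a shared set of reparameterized log-weights; in practice the two PIWAE objectives would be backpropagated into the generative and inference parameters separately (e.g. via separate optimizers), a detail we omit here.

```python
# Sketch of the CIWAE objective (18) and the PIWAE objective pair (19a)-(19b),
# both computed from one shared (T,)-shaped tensor of reparameterized log-weights.
import math
import torch

def ciwae_objective(log_weights, beta):
    """beta * ELBO_VAE + (1 - beta) * ELBO_IWAE, using the same weights for both."""
    K = log_weights.numel()
    vae_term = log_weights.mean()                                  # (1/K) sum_k log w_k
    iwae_term = torch.logsumexp(log_weights, dim=0) - math.log(K)  # log (1/K) sum_k w_k
    return beta * vae_term + (1.0 - beta) * iwae_term

def piwae_objectives(log_weights, M, L):
    """Returns (objective for theta, objective for phi), with K = M * L."""
    K = M * L
    assert log_weights.numel() == K
    theta_obj = torch.logsumexp(log_weights, dim=0) - math.log(K)   # IWAE, all K particles
    phi_obj = (torch.logsumexp(log_weights.reshape(M, L), dim=1)
               - math.log(L)).mean()                                # MIWAE over M groups of L
    return theta_obj, phi_obj
```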

6. Experiments

We now consider using our new estimators to train deep variational auto-encoders for the MNIST digits dataset (LeCun et al., 1998). For this, we duplicated the architecture and training schedule outlined in Burda et al. (2016). In particular, all networks were trained and evaluated using the stochastic binarization of Burda et al. (2016). For all methods, we set a budget of T = 64 weights in the target estimate for each datapoint in the minibatch.


Figure 7: Violin plots of ESS estimates for qφ for each image of MNIST, normalized by the number of samples drawn, for IWAE, CIWAE, PIWAE, MIWAE, and VAE. A violin plot is similar to a box plot with a kernel density plot on each side; thicker means more MNIST images whose qφ achieves that ESS.

To assess different aspects of the training performance, we consider three different metrics: ELBOIWAE with K = 64, ELBOIWAE with K = 5000, and the latter of these minus the former. All reported metrics are evaluated on the test data.

The motivation for the ELBOIWAE with K = 64 metric, denoted IWAE-64, is that this is the target used for training IWAE, so if another method does better on this metric than IWAE, this is a clear indicator that SNR issues have degraded the performance of the IWAE estimator. In fact, this would demonstrate that, from a practical perspective, using the IWAE estimator is sub-optimal, even if our explicit aim is to optimize the IWAE bound.

The second metric, ELBOIWAE with K = 5000, denoted log p(x), is used as a surrogate for estimating the log marginal likelihood and thus provides an indicator of the fidelity of the learned generative model.

The third metric is an estimator for the divergence implicitly targeted by IWAE. Namely, as shown by Le et al. (2017), ELBOIWAE can be interpreted as

ELBOIWAE = log pθ(x) − KL(Qφ(z|x) || Pθ(z|x)),        (20)

where Qφ(z|x) := ∏_{k=1}^{K} qφ(zk|x),   and        (21)

Pθ(z|x) := (1/K) ∑_{k=1}^{K} [ ∏_{ℓ=1}^{K} qφ(zℓ|x) / qφ(zk|x) ] pθ(zk|x).        (22)

Thus, we can estimate KL(Qφ(z|x)||Pθ(z|x)) using log p(x) − IWAE-64, to provide a metric for the divergence between the inference network and the generative network. We use this instead of simply KL(qφ(z|x)||pθ(z|x)) because the latter can be a deceptive metric for inference network fidelity (Cremer et al., 2017). For example, it tends to prefer qφ(z|x) that covers one of the posterior modes, rather than encompassing all of them.
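A sketch of how these three metrics relate (our helper; sample_log_weights is a hypothetical routine returning K reparameterized log-weights for one test datapoint):

```python
# Sketch of the three evaluation metrics: IWAE-64, the log p(x) surrogate
# (IWAE with K = 5000), and the KL estimate log p(x) - IWAE-64 from (20).
import math
import torch

def iwae_bound(sample_log_weights, K):
    log_w = sample_log_weights(K)                       # (K,) tensor of log w_k
    return torch.logsumexp(log_w, dim=0) - math.log(K)

def evaluation_metrics(sample_log_weights):
    iwae_64 = iwae_bound(sample_log_weights, 64)
    log_px = iwae_bound(sample_log_weights, 5000)       # surrogate for log p(x)
    kl_q_p = log_px - iwae_64                           # estimate of KL(Q_phi || P_theta)
    return iwae_64, log_px, kl_q_p
```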

Figure 8: SNR of inference network weights during training, for IWAE, CIWAE (β = 0.5), PIWAE (8/8), MIWAE (8/8), and VAE. All lines are mean ± standard deviation over 20 randomly chosen weights per layer.

Figure 5 shows the convergence of these metrics for each algorithm. Here we have considered the middle value for each of the parameters, namely K = M = 8 for PIWAE and MIWAE, and β = 0.5 for CIWAE. We see that PIWAE and MIWAE both comfortably outperformed, and CIWAE slightly outperformed, IWAE in terms of the IWAE-64 metric, despite IWAE being directly trained on this target. In terms of log p(x), PIWAE gave the best performance, followed by IWAE. In terms of the KL, we see that the VAE performed best, followed by MIWAE, with IWAE performing the worst. These results imply that, while IWAE is able to learn a generative network whose performance is similar to that of PIWAE, the inference network it learns is substantially worse than those of all the competing methods.

We next considered tuning the parameters for each of our algorithms, as shown in Figure 6, for which we look at the final metric values after training. For MIWAE, we see that as we increase M, the log p(x) metric gets worse, while the KL gets better. The IWAE-64 metric initially increases, before reducing again from M = 16 to M = 64, suggesting that intermediate values of M (i.e. M ≠ 1, K ≠ 1) give a better trade-off. For PIWAE, similar behavior to MIWAE is seen for the IWAE-64 and KL metrics. However, unlike for MIWAE, we see that log p(x) initially increases with M, such that PIWAE provides a uniform improvement over IWAE for the M = 2, 4, 8, and 16 cases. For CIWAE, increasing β exhibits similar behavior to increasing M for MIWAE, but there appears to be a larger degree of noise in the evaluations, while the optimal value of β, though non-zero, seems to be closer to IWAE than for the other algorithms.

As an additional measure of the performance of the inference network that is distinct from any of the training targets, we also considered the effective sample size (ESS) (Owen, 2013) for the fully trained networks, defined as

ESS = ( ∑_{k=1}^{K} wk )² / ∑_{k=1}^{K} wk².        (23)
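Computed from log-weights, the ESS can be evaluated stably as in the following sketch (ours):

```python
# Sketch of the ESS of (23), computed in log space for numerical stability.
import torch

def effective_sample_size(log_weights):
    """ESS = (sum_k w_k)^2 / sum_k w_k^2 for a (K,) tensor of log-weights."""
    log_numerator = 2.0 * torch.logsumexp(log_weights, dim=0)
    log_denominator = torch.logsumexp(2.0 * log_weights, dim=0)
    return torch.exp(log_numerator - log_denominator)
```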


The ESS is a measure of how many unweighted samples would be equivalent to the weighted sample set. A low ESS indicates that the inference network is struggling to perform effective inference for the generative network. The results, given in Figure 7, show that the ESSs for CIWAE, MIWAE, and the VAE were all significantly larger than for IWAE and PIWAE, with IWAE giving a particularly poor ESS.

Our final experiment looks at the SNR values for the inference networks. Here we took a number of different neural network gradient weights at different layers of the network and calculated empirical estimates of their SNRs at various points during training. We then averaged these estimates over the different network weights, the results of which are given in Figure 8. This clearly shows the low SNR exhibited by the IWAE inference network, suggesting that our results from the simple Gaussian experiments carry over to the more complex neural network domain.

7. Conclusions

We have provided theoretical and empirical evidence that algorithmic approaches to increasing the tightness of the ELBO independently of the expressiveness of the inference network can be detrimental to learning by reducing the signal-to-noise ratio of the inference network gradients. Experiments on a simple latent variable model confirmed our theoretical findings. We then exploited these insights to introduce three new estimators, MIWAE, PIWAE, and CIWAE, and showed that each can deliver improvements over IWAE, even when the metric used for this assessment is the IWAE target itself. In particular, PIWAE delivered simultaneous improvements in both the inference network and the generative network compared to IWAE.


References

R. Bamler, C. Zhang, M. Opper, and S. Mandt. Perturbative black box variational inference. arXiv preprint arXiv:1709.07433, 2017.

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.

C. Cremer, Q. Morris, and D. Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.

G. Fort, E. Gobet, and E. Moulines. MCMC design-based non-parametric regression for rare event. Application to nested risk computations. Monte Carlo Methods and Applications, 23(1):21–42, 2017.

T. C. Hesterberg. Advances in importance sampling. PhD thesis, Stanford University, 1988.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

T. A. Le, M. Igl, T. Jin, T. Rainforth, and F. Wood. Auto-encoding sequential Monte Carlo. arXiv preprint arXiv:1705.10306, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.

L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

C. J. Maddison, D. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. W. Teh. Filtering variational objectives. arXiv preprint arXiv:1705.09279, 2017.

C. A. Naesseth, S. W. Linderman, R. Ranganath, and D. M. Blei. Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.

A. B. Owen. Monte Carlo theory, methods and examples. 2013.

T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On the opportunities and pitfalls of nesting Monte Carlo estimators. arXiv preprint arXiv:1709.06181, 2017.

R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

J. W. Roberts and R. Tedrake. Signal-to-noise ratio analysis of policy gradient algorithms. In Advances in Neural Information Processing Systems, pages 1361–1368, 2009.

F. J. Ruiz, M. K. Titsias, and D. M. Blei. Overdispersed black-box variational inference. arXiv preprint arXiv:1603.01140, 2016.

T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218–1226, 2015.

D. Tran, R. Ranganath, and D. M. Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.

R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. Bayesian Time Series Models, pages 115–138, 2011.


A. Proof of SNR Convergence Rates

Theorem 1. Assume that when M = K = 1, the expected gradients, the variances of the gradients, and the first four moments of w1,1, ∇θw1,1, and ∇φw1,1 are all finite, and that the variances are also non-zero. Then the signal-to-noise ratios of the gradient estimates converge at the following rates:

SNRM,K(θ) = √M | ( √K ∇θZ − (1/(2Z√K)) ∇θ(Var[w1,1]/Z²) + O(1/K^{3/2}) ) / ( √(E[w1,1² (∇θ log w1,1 − ∇θ log Z)²]) + O(1/K) ) |        (9)

SNRM,K(φ) = √M | ( ∇φVar[w1,1] + O(1/K) ) / ( 2Z√K σ[∇φw1,1] + O(1/√K) ) |        (10)

where Z := pθ(x) is the true marginal likelihood.

Proof. We start by considering the variance of the estimators. We will first exploit the fact that each Ẑm,K is independent and identically distributed, and then apply Taylor's theorem³ to log Ẑ1,K about Z, using R1(·) to indicate the remainder term, as follows:

M · Var[∆M,K] = Var[∆1,K] = Var[ ∇θ,φ ( log Z + (Ẑ1,K − Z)/Z + R1(Ẑ1,K) ) ]

= Var[ ∇θ,φ ( (Ẑ1,K − Z)/Z + R1(Ẑ1,K) ) ]

= E[ ( ∇θ,φ ( (Ẑ1,K − Z)/Z + R1(Ẑ1,K) ) )² ] − ( E[ ∇θ,φ ( (Ẑ1,K − Z)/Z + R1(Ẑ1,K) ) ] )²

= E[ ( (1/K) ∑_{k=1}^{K} (Z ∇θ,φ w1,k − w1,k ∇θ,φ Z)/Z² + ∇θ,φ R1(Ẑ1,K) )² ] − ( ∇θ,φ E[(Ẑ1,K − Z)/Z] + E[∇θ,φ R1(Ẑ1,K)] )²,   where ∇θ,φ E[(Ẑ1,K − Z)/Z] = 0,

= (1/(KZ⁴)) E[ (Z ∇θ,φ w1,1 − w1,1 ∇θ,φ Z)² ] + Var[∇θ,φ R1(Ẑ1,K)] + 2 E[ (∇θ,φ R1(Ẑ1,K)) ( (1/K) ∑_{k=1}^{K} (Z ∇θ,φ w1,k − w1,k ∇θ,φ Z)/Z² ) ].

Now, by the mean-value form of the remainder, we have that for some Z̄ between Z and Ẑ1,K,

R1(Ẑ1,K) = −(Ẑ1,K − Z)² / (2Z̄²),   and therefore

∇θ,φ R1(Ẑ1,K) = −[ Z̄ (Ẑ1,K − Z) ∇θ,φ(Ẑ1,K − Z) − (Ẑ1,K − Z)² ∇θ,φ Z̄ ] / Z̄³.

³ This approach follows similar lines to the derivation of nested Monte Carlo convergence bounds in Rainforth et al. (2017) and Fort et al. (2017), and the derivation of the mean squared error for self-normalized importance sampling; see e.g. Hesterberg (1988).


It follows that the ∇θ,φ R1(Ẑ1,K) terms are dominated, as each of (Ẑ1,K − Z) ∇θ,φ(Ẑ1,K − Z) and (Ẑ1,K − Z)² varies with the square of the estimator error, whereas the other comparable terms vary only with the unsquared difference. The assumptions on the moments of the weights and their derivatives further guarantee that these terms are finite. More precisely, we have Z̄ = Z + α(Ẑ1,K − Z) for some 0 < α < 1, where ∇θ,φ α must be bounded with probability 1 as K → ∞ to maintain our assumptions. It follows that ∇θ,φ R1(Ẑ1,K) = O((Ẑ1,K − Z)²) and thus that

Var[∆M,K] = (1/(MKZ⁴)) E[ (Z ∇θ,φ w1,1 − w1,1 ∇θ,φ Z)² ] + (1/M) O(1/K²),        (24)

using the fact that the third and fourth order moments of a Monte Carlo estimator both decrease at a rate O(1/K²).

Considering now the expected gradient estimate, and again using Taylor's theorem, this time to a higher number of terms, we have

E[∆M,K] = E[∆1,K] = E[∆1,K − ∇θ,φ log Z] + ∇θ,φ log Z

= ∇θ,φ E[ log Z + (Ẑ1,K − Z)/Z − (Ẑ1,K − Z)²/(2Z²) + R2(Ẑ1,K) − log Z ] + ∇θ,φ log Z

= −(1/2) ∇θ,φ E[ ( (Ẑ1,K − Z)/Z )² ] + ∇θ,φ E[R2(Ẑ1,K)] + ∇θ,φ log Z

= −(1/2) ∇θ,φ ( Var[Ẑ1,K] / Z² ) + ∇θ,φ E[R2(Ẑ1,K)] + ∇θ,φ log Z

= −(1/(2K)) ∇θ,φ ( Var[w1,1] / Z² ) + ∇θ,φ E[R2(Ẑ1,K)] + ∇θ,φ log Z.        (25)

Using a similar process as in the variance case, it is now straightforward to show that ∇θ,φ E[R2(Ẑ1,K)] = O(1/K²), which is thus similarly dominated (this also gives us (11)).

Finally, by combining (24) and (25), and noting that √(A/K + B/K²) = √A/√K + B/(2√A K^{3/2}) + O(1/K^{5/2}), we have

SNRM,K(θ) = | ( ∇θ log Z − (1/(2K)) ∇θ(Var[w1,1]/Z²) + O(1/K²) ) / √( (1/(MKZ⁴)) E[(Z∇θw1,1 − w1,1∇θZ)²] + (1/M) O(1/K²) ) |        (26)

= √M | ( Z²√K ( ∇θ log Z − (1/(2K)) ∇θ(Var[w1,1]/Z²) ) + O(1/K^{3/2}) ) / ( √(E[(Z∇θw1,1 − w1,1∇θZ)²]) + O(1/K) ) |        (27)

= √M | ( √K ∇θZ − (1/(2Z√K)) ∇θ(Var[w1,1]/Z²) + O(1/K^{3/2}) ) / ( √(E[w1,1² (∇θ log w1,1 − ∇θ log Z)²]) + O(1/K) ) | = O(√(MK)).        (28)

For φ, because ∇φZ = 0, we instead have

SNRM,K(φ) = √M | ( ∇φVar[w1,1] + O(1/K) ) / ( 2Z√K σ[∇φw1,1] + O(1/√K) ) | = O(√(M/K)),        (29)

and we are done.


(a) VAE inference network gradient estimates   (b) VAE generative network gradient estimates

Figure 9: Histograms of gradient estimates ∆M,K for the inference network and the generative network using the VAE (K = 1) objective with different values of M.

B. Derivation of Optimal Parameters for Gaussian Experiment

To derive the optimal parameters for the Gaussian experiment, we first note that

J(θ, φ) = (1/N) log ∏_{n=1}^{N} pθ(x^(n)) − (1/N) ∑_{n=1}^{N} KL( Qφ(z1:K |x^(n)) || Pθ(z1:K |x^(n)) ),  where

Pθ(z1:K |x^(n)) = (1/K) ∑_{k=1}^{K} qφ(z1|x^(n)) ⋯ qφ(z_{k−1}|x^(n)) pθ(zk|x^(n)) qφ(z_{k+1}|x^(n)) ⋯ qφ(zK |x^(n)),

Qφ(z1:K |x^(n)) is as per (3), and the form of the Kullback–Leibler (KL) divergence is taken from Le et al. (2017). Next, we note that φ only controls the mean of the proposal, so, while it is not possible to drive the KL to zero, it will be minimized for any particular θ when the means of qφ(z|x^(n)) and pθ(z|x^(n)) are the same. Furthermore, the corresponding minimum possible value of the KL is independent of θ, and so we can calculate the optimum pair (θ∗, φ∗) by first optimizing for θ and then choosing the matching φ. The optimal θ maximizes log ∏_{n=1}^{N} pθ(x^(n)), giving θ∗ := µ∗ = (1/N) ∑_{n=1}^{N} x^(n). As we straightforwardly have pθ(z|x^(n)) = N(z; (x^(n) + µ)/2, I/2), the KL is then minimized when A = I/2 and b = µ/2, giving φ∗ := (A∗, b∗), where A∗ = I/2 and b∗ = µ∗/2.
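For completeness, a short sketch (ours) computing these optima from a dataset X of shape (N, D):

```python
# Sketch of the Appendix B optimum: theta* = mu* = (1/N) sum_n x^(n),
# phi* = (A*, b*) with A* = I/2 and b* = mu*/2.
import torch

def gaussian_optimum(X):
    mu_star = X.mean(dim=0)
    A_star = 0.5 * torch.eye(X.shape[1])
    b_star = 0.5 * mu_star
    return mu_star, (A_star, b_star)
```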

C. Additional Empirical Analysis of SNR

C.1. Histograms for VAE

To complete the picture of the effect of M and K on the distribution of the gradients, we generated histograms for the K = 1 (i.e. VAE) gradients as M is varied. As shown in Figure 9a, we see the expected effect from the law of large numbers: the variance of the estimates decreases with M, but not the expected value.

C.2. Convergence of RMSE for Generative Network

As explained in the main paper, the SNR is not an entirely appropriate metric for the generative network: a low SNR is still highly problematic, but a high SNR does not necessarily indicate good performance. It is thus perhaps better to measure the quality of the gradient estimates for the generative network by looking at the root mean squared error (RMSE) to ∇µ log Z, i.e. √(E[ ‖∆M,K − ∇µ log Z‖₂² ]). The convergence of this RMSE is shown in Figure 10, where the solid lines are the RMSE estimates using 10⁴ runs and the shaded regions show the interquartile range of the individual estimates. We see that increasing M in the VAE reduces the variance of the estimates but has negligible effect on the RMSE due to the fixed bias. On the other hand, we see that increasing K leads to a monotonic improvement, initially improving at a rate O(1/K) (because the bias is the dominating term in this region), before settling to the standard Monte Carlo convergence rate of O(1/√K) (shown by the dashed lines).

C.3. Experimental Results for High Variance Regime

We now present empirical results for a case where our weights have higher variance. Instead of choosing a point close to the optimum by offsetting parameters with a standard deviation of 0.01, we instead offset using a standard deviation of 0.5.


Figure 10: RMSE of the µ gradient estimate with respect to ∇µ log Z.

We further increased the proposal covariance to I to make it more diffuse. This is now a scenario where the model is far from its optimum and the proposal is a very poor match for the model, giving very high variance weights.

We see that the behavior is the same for variation in M, but somewhat distinct for variation in K. In particular, the SNR and DSNR only decrease slowly with K for the inference network, while increasing K no longer has much benefit for the SNR of the generative network. It is clear that, for this setup, the problem is very far from the asymptotic regime in K, such that our theoretical results no longer directly apply. Nonetheless, the high-level effect observed is still that the SNR of the inference network deteriorates, albeit slowly, as K increases.

(a) IWAE inference network gradient estimates   (b) VAE inference network gradient estimates   (c) IWAE generative network gradient estimates   (d) VAE generative network gradient estimates

Figure 11: Histograms of gradient estimates, as per Figure 1.


(a) Convergence of SNR for inference network   (b) Convergence of SNR for generative network

Figure 12: Convergence of signal-to-noise ratios of gradient estimates, as per Figure 2.

(a) Convergence of DSNR for inference network   (b) Convergence of DSNR for generative network

Figure 13: Convergence of the directional signal-to-noise ratio of gradient estimates, as per Figure 3.

(a) Convergence of DSNR for inference network   (b) Convergence of DSNR for generative network

Figure 14: Convergence of the directional signal-to-noise ratio of gradient estimates where the true gradient is taken as E[∆1,1000], as per Figure 4.


Figure 15: Convergence of the optimization for different values of K and M, with (K, M) ∈ {(1, 1), (1, 1000), (10, 100), (100, 10), (1000, 1)}. (Top left) ELBOIS during training (note that this represents a different metric for different K). (Top right) L2 distance of the generative network parameters from the true maximizer. (Bottom) L2 distance of the inference network parameters from the true maximizer. Plots show means over 3 repeats with ±1 standard deviation. Optimization is performed using the Adam algorithm with all parameters initialized by sampling from the uniform distribution on [1.5, 2.5].

D. Convergence of Toy Gaussian Problem

We now assess the effect of the outlined changes in the quality of the gradient estimates on the final optimization for our toy Gaussian problem. Figure 15 shows the convergence of running Adam (Kingma and Ba, 2014) to optimize µ, A, and b. This suggests that the effects observed predominantly transfer to the overall optimization problem. Interestingly, setting K = 1 and M = 1000 gave the best performance on learning not only the inference network parameters, but also the generative network parameters.
