
Lecture 5: Advanced Bayesian Inference Methods

Advanced Topics in Machine Learning

Dr. Tom Rainforth

January 31st, 2020

[email protected]


Recap: Bayesian Inference is Hard!

Image Credit: Metro

1


Recap: Monte Carlo Sampling

• Sampling is a lot like repeated simulation

• Predicting the weather, basketball games, …

• Basic idea

• Draw N samples from a sampling distribution S

• Compute an approximate posterior probability

• Show this converges to the true probability P

• Why sample?

• Reinforcement Learning: Can approximate (q-)values even when you don’t know the transition function

• Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
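As a minimal illustration of this recipe (a sketch, not from the slides; the Beta distribution and the threshold are arbitrary choices), we can draw N samples from a sampling distribution S and average an indicator to approximate a probability:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Sampling distribution S: a Beta(5, 3) posterior over a coin's bias
    # (an illustrative choice; anything we can sample from works).
    a, b = 5.0, 3.0

    # Draw N samples from S and compute an approximate probability.
    N = 100_000
    theta = rng.beta(a, b, size=N)
    approx = np.mean(theta > 0.7)        # Monte Carlo estimate of P(theta > 0.7)

    exact = stats.beta.sf(0.7, a, b)     # true probability, for comparison
    print(f"Monte Carlo: {approx:.4f}, exact: {exact:.4f}")
    # As N grows the estimate converges to the true probability (law of large numbers).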

Image Credit: Pieter Abbeel

2


Recap: Rejection Sampling

3


Recap: Importance Sampling

4


Lecture 5 Outline

In this lecture we will show how the foundational methods introduced in the last section are not sufficient for inference in high dimensions

Particular topics:

• The Curse of Dimensionality

• Markov Chain Monte Carlo (MCMC)

• Variational Inference

5


The Curse of Dimensionality

5


The Curse of Dimensionality

• The curse of dimensionality is a tendency of modeling and

numerical procedures to get substantially harder as the

dimensionality increases, often at an exponential rate.

• If not managed properly, it can cripple the performance of

inference methods

• It is the main reason the two procedures discussed so far,

rejection sampling and importance sampling, are in practice

only used for very low dimensional problems

• At its core, it stems from an increase of the size (in an

informal sense) of a problem as the dimensionality increases

6


The Curse of Dimensionality (2)

• Imagine we are calculating an expectation over a discrete distribution of dimension D, where each dimension has K possible values

• The cost of enumerating all the possible combinations scales as K^D and thus increases exponentially with D; even for modest values of K and D this will be prohibitively large (see the small calculation below)

• The same problem occurs in continuous spaces: think about splitting the space into blocks; we have to reason about all of the blocks to reason about the problem
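To put rough numbers on this (an illustrative calculation, not from the slides):

    K, D = 10, 20            # modest values of K and D
    print(f"{K ** D:.1e}")   # 1.0e+20 combinations to enumerate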

7


The Curse of Dimensionality (3)

[Embedded excerpt from Bishop, Section 1.4: Figure 1.20 illustrates a simplistic classification approach in which the input space is divided into cells and a new test point is assigned to the class with the most training points in its cell. Figure 1.21 illustrates the curse of dimensionality: the number of regions of a regular grid grows exponentially with the dimensionality D of the space (shown for D = 1, 2, 3), so an exponentially large quantity of training data would be needed to ensure the cells are not empty.]

Image Credit: Bishop, Section 1.4

8


Example: Rejection Sampling From a Sphere

Consider rejection sampling from a D-dimensional hypersphere

with radius r using the tightest possible enclosing box:

Image Credit: Adrian Baez-Ortega

9


Example: Rejection Sampling From a Sphere (2)

Here the acceptance rate is equal to the ratio of the two volumes.

For even values of D this is given by

P(Accept) = Vs/Vc = (π^(D/2) r^D / (D/2)!) / (2r)^D = (√π/2)^D · 1/(D/2)!

This now decreases super-exponentially in D (noting that (D/2)! > (D/6)^(D/2))

D = 2, 10, 20, and 100 respectively gives P(Accept) values of 0.79, 2.5 × 10^(−3), 2.5 × 10^(−8), and 1.9 × 10^(−70)

Sampling this way was perfect in one dimension, but quickly becomes completely infeasible in higher dimensions
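We can check this calculation empirically with a quick sketch (not from the slides; the sample budget is arbitrary): propose uniformly from the enclosing box [−r, r]^D and count how often a proposal lands inside the hypersphere.

    import numpy as np
    from math import pi, factorial

    rng = np.random.default_rng(0)
    r, n_proposals = 1.0, 200_000

    for D in (2, 10, 20):
        # Rejection sampling: propose uniformly from the box [-r, r]^D and
        # accept points that fall inside the radius-r hypersphere.
        proposals = rng.uniform(-r, r, size=(n_proposals, D))
        accept_rate = np.mean(np.sum(proposals ** 2, axis=1) <= r ** 2)

        # Analytic acceptance rate for even D: (sqrt(pi)/2)^D / (D/2)!
        analytic = (pi ** 0.5 / 2) ** D / factorial(D // 2)
        print(f"D={D:3d}  empirical={accept_rate:.2e}  analytic={analytic:.2e}")
    # Already at D = 20 essentially no proposals are accepted (analytic rate
    # ~2.5e-8), and at D = 100 the rate (~1.9e-70) could never be observed.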

10


Curse of Dimensionality for Importance/Rejection Sampling

• For both importance sampling and rejection sampling we use

a proposal q(θ)

• This proposal is an approximation of the target p(θ|D)

• As the dimension increases, it quickly becomes much harder

to find good approximations

• The performance of both methods typically diminishes

exponentially as the dimension increases

11


Typical Sets

Another consequence of the curse of dimensionality is that most of

the posterior mass becomes concentrated away from the mode.

Consider representing an isotropic Gaussian in polar coordinates.

The marginal density of the radius changes with dimension:

Image Credit: Bishop, Section 1.4 12


Typical Sets

In high dimensions, the posterior mass concentrates in a thin strip away from the mode known as the typical set

[Figure 1 from Nalisnick et al. (2019)¹, "Typical Sets": (a) a Gaussian example with its mean located at the high-dimensional all-gray image, whose samples never resemble that image, despite it having the highest density, because they are drawn from an annulus of radius σ√d (the Gaussian Annulus Theorem); (b) an illustration of how the typical set arises from the interplay between density and volume in high-dimensional integration; (c) a simulation of their typicality test on a Gaussian.]

This means that not only is the mass concentrated in a small proportion of the space in high dimensions, but the geometry of this space can also be quite complicated
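A quick way to see this concentration numerically (an illustrative sketch, not from the slides): sample from a standard Gaussian in increasing dimension and look at the distances of the samples from the mode.

    import numpy as np

    rng = np.random.default_rng(0)

    for D in (1, 10, 100, 1000):
        # Distances from the mode (the origin) for samples from N(0, I_D).
        radii = np.linalg.norm(rng.standard_normal((10_000, D)), axis=1)
        print(f"D={D:5d}  mean radius={radii.mean():7.2f}  "
              f"std={radii.std():.2f}  sqrt(D)={np.sqrt(D):7.2f}")
    # The samples concentrate in a thin shell of radius ~sqrt(D) around the
    # mode, even though the density is highest at the mode itself.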

¹ E. Nalisnick et al. "Detecting out-of-distribution inputs to deep generative models using a test for typicality". arXiv preprint arXiv:1906.02994 (2019).

13


How Can We Overcome The Curse of Dimensionality?

• As we showed with the typical sets, the region of significant posterior mass is usually only a small proportion of the overall space

• To overcome the curse, we thus need to use methods which

exploit structure of the posterior to only search this small

subset of the overall space

• All successful inference algorithms make some implicit assumptions about the structure of the posterior and then try to exploit this

• MCMC methods exploit local moves to try and stick within the

typical set (thereby also implicitly assuming there are not

multiple modes)

• Variational methods assume independences between different

dimensions that allow large problems to be broken into

multiple smaller problems

14


Markov Chain Monte Carlo

14


Markov Chains

http://setosa.io/ev/markov-chains/

15


The Markov Property

In a Markovian system each state is independent of all the previous

states given the last state, i.e.

p(θn|θ1, . . . , θn−1) = p(θn|θn−1)

The system transitions based only on its current state.

Here the series of random variables produced by the system (i.e. Θ1, . . . , Θn, . . . ) is known as a Markov chain.

16


Defining a Markov Chain

• All the Markov chains we will deal with are homogeneous

• This means that each transition has the same distribution:

p(Θn+1 = θ′|Θn = θ) = p(Θn = θ′|Θn−1 = θ),

• In such situations, p(Θn+1 = θn+1|Θn = θn) is typically

known as a transition kernel T (θn+1 ← θn)

• The distribution of any homogeneous Markov chain is fully

defined by a combination of an initial distribution p(θ1) and

the transition kernel T (θn+1 ← θn), e.g.

p(θm) = ∫ T(θm ← θm−1) T(θm−1 ← θm−2) · · · T(θ2 ← θ1) p(θ1) dθ1:m−1
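For a discrete state space the integral above becomes a matrix product, so the marginal at step m can be computed by repeatedly applying the transition matrix to the initial distribution. A small sketch (the three-state chain is made up for illustration):

    import numpy as np

    # Transition kernel T[i, j] = p(next state j | current state i) for a
    # made-up three-state homogeneous Markov chain (rows sum to one).
    T = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])
    p1 = np.array([1.0, 0.0, 0.0])   # initial distribution p(theta_1)

    # p(theta_m) = p(theta_1) T^(m-1): apply the kernel m - 1 times.
    p = p1.copy()
    for _ in range(49):
        p = p @ T
    print("p(theta_50) =", np.round(p, 4))
    # After many steps the marginal barely changes between iterations: the
    # chain is settling towards a fixed (stationary) distribution.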

17


Random walks: Continuous Markov Chains

• Markov chains do not have to be in discrete spaces

• In continuous spaces we can informally think of them as

guided random walks through the space with finite sized steps

https://youtu.be/7A83lXbs6Ik?t=114

18


Markov Chain Monte Carlo (MCMC)

• Markov chain Monte Carlo (MCMC) methods are one of the

most ubiquitous approaches for Bayesian inference and

sampling from target distributions more generally

• The key idea is to construct a valid Markov chain that produces samples from the target distribution

• They only require the target distribution to be known up to a normalization constant

• They circumvent the curse of dimensionality by exploiting local moves

• They have a hill–climbing effect until they reach the typical set

• They then move around the typical set using local moves

• They tend to fail spectacularly in the presence of

multi–modality

19


Convergence of a Markov Chain

• To use a Markov chain for consistent inference, we need it to

be able to produce an infinite series of samples that converge

to our posterior:

(1/N) ∑_{n=M}^{N} δ_{Θn}(θ) → p(θ|D) as N → ∞ (convergence in distribution)   (1)

where M is a number of burn–in samples that we discard

from the start of the chain

• In most cases, a core condition for this to hold is that the distribution of the individual samples converges to the target for all possible starting points:

lim_{N→∞} p(ΘN = θ|Θ1 = θ1) = p(θ|D)  ∀θ1   (2)

20


Convergence of a Markov Chain

Ensuring that the chain converges to the target distribution for all

possible initializations has two requirements

1. p(θ|D) must be the stationary distribution of the chain,

such that if p(Θn = θ) = p(θ|D) then

p(Θn+1 = θ) = p(θ|D). This is satisfied if:

∫T (θ′ ← θ)p(θ|D)dθ = p(θ′|D) (3)

where we see that the target is invariant to the application of

the transition kernel.

2. The Markov chain must be ergodic. This means that all

possible starting points converge to this distribution.

21


Ergodicity

Ergodicity itself has two requirements, the chain must be:

1. Irreducible, i.e. all points with non-zero probability can be

reached in a finite number of steps

2. Aperiodic, i.e. no states can only be reached at certain

periods of time

The conditions needed for these to be satisfied are very mild for commonly used Markov chains, but are beyond the scope of the course

22


Stationary Distribution

Optional homework: figure out how we can get the stationary

distribution from the transition kernel when θ is discrete

Hint: start by defining the transition kernel as a matrix

23


Detailed Balance

A sufficient condition used for constructing valid Markov chains is

to ensure that the chain satisfies detailed balance:

p(θ|D)T (θ′ ← θ) = p(θ′|D)T (θ ← θ′). (4)

Chains that satisfy detailed balance are known as reversible.

[Image: the balance condition T(θ′ ← θ)π(θ) = R(θ ← θ′)π(θ′) implies that π(θ) is left invariant, since ∫ T(θ′ ← θ)π(θ) dθ = π(θ′) ∫ R(θ ← θ′) dθ = π(θ′).]

Image Credit: Iain Murray

24


Detailed Balance (2)

It is straightforward to see that Markov chains satisfying detailed balance will admit p(θ|D) as a stationary distribution by noting that

∫ T(θ′ ← θ) p(θ|D) dθ = ∫ T(θ ← θ′) p(θ′|D) dθ = p(θ′|D).   (5)

We can thus construct valid MCMC samplers by using detailed

balance to construct a valid transition kernel for our target,

sampling a starting point, then repeatedly applying the transition

kernel
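As a numerical sanity check of this argument (an illustrative sketch, not from the slides), we can build a discrete kernel that satisfies detailed balance with respect to a made-up target and verify that the target is left invariant:

    import numpy as np

    # Made-up discrete target distribution over four states.
    p = np.array([0.1, 0.4, 0.3, 0.2])
    n = len(p)

    # Build a reversible kernel via a Metropolis-style construction with a
    # symmetric uniform proposal Q, so detailed balance holds by design.
    Q = np.full((n, n), 1.0 / n)
    T = Q * np.minimum(1.0, p[None, :] / p[:, None])   # accepted moves
    np.fill_diagonal(T, 0.0)
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))           # rejected mass stays put

    # Detailed balance: p_i T_ij = p_j T_ji for all i, j ...
    assert np.allclose(p[:, None] * T, (p[:, None] * T).T)
    # ... which implies invariance: sum_i p_i T_ij = p_j.
    assert np.allclose(p @ T, p)
    print("detailed balance and invariance both hold")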

25


Metropolis Hastings

One of the simplest and most widely used MCMC methods is

Metropolis Hastings (MH).

Given an unnormalized target p(θ,D), a starting point θ1, and a

proposal q(θ′|θ), the MH algorithm repeatedly applies the

following steps ad infinitum

1. Propose a new point θ′ ∼ q(θ′|θ = θn)

2. Accept the new sample with probability

   P(Accept) = min(1, p(θ′,D) q(θn|θ′) / (p(θn,D) q(θ′|θn)))   (6)

   in which case we set θn+1 ← θ′

3. If the sample is rejected, set θn+1 ← θn

4. Go back to step 1
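A minimal random-walk implementation of these steps (a sketch under illustrative assumptions: the unnormalized target, the Gaussian proposal, the step size, and the burn-in length are all made up):

    import numpy as np

    rng = np.random.default_rng(0)

    def log_joint(theta):
        # Unnormalized log target log p(theta, D); here an arbitrary correlated
        # 2D Gaussian standing in for a real model.
        cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
        return -0.5 * theta @ cov_inv @ theta

    def metropolis_hastings(log_joint, theta1, n_steps, step_size=0.5):
        theta, samples = theta1, [theta1]
        for _ in range(n_steps):
            # 1. Propose from a symmetric Gaussian random walk q(theta'|theta),
            #    so the q terms cancel in the acceptance ratio (6).
            proposal = theta + step_size * rng.standard_normal(theta.shape)
            # 2. Accept with probability min(1, p(theta', D) / p(theta, D)).
            if np.log(rng.uniform()) < log_joint(proposal) - log_joint(theta):
                theta = proposal
            # 3. On rejection the current sample is simply repeated.
            samples.append(theta)
        return np.array(samples)

    samples = metropolis_hastings(log_joint, np.array([3.0, -3.0]), n_steps=5000)
    burned_in = samples[1000:]          # discard M burn-in samples
    print("posterior mean estimate:", burned_in.mean(axis=0))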

26


Metropolis Hastings (2)

• This produces an infinite sequence of samples

θ1, θ2, . . . , θn, . . . that converge to p(θ|D) and from which we

can construct a Monte Carlo estimator

p(θ|D) ≈ (1/N) ∑_{n=M}^{N} δ_{θn}(θ)   (7)

where we start with sample M to burn-in the chain

• Note that MH only requires the unnormalized target p(θ,D)

• Unlike rejection/importance sampling, the samples are

correlated and produce biased estimates for finite N

• The key though is that the proposal q(θ′|θ) depends on the

current position allowing us to make local moves

27


MCMC Methods Demo

https://chi-feng.github.io/mcmc-demo/app.html?algorithm=RandomWalkMH&target=banana

28


More Advanced MCMC Methods

There are loads of more advanced MCMC methods.

Two particularly prominent ones that you should be able to quickly pick up, given what you have already learned, are:

• Gibbs sampling (see the notes)

• Hamiltonian Monte Carlo: https://arxiv.org/pdf/1206.1901.pdf?fname=cm&font=TypeI

29


Hamiltonian Monte Carlo Demo

https://chi-feng.github.io/mcmc-demo/app.html?algorithm=HamiltonianMC&target=donut

30


Pros and Cons of MCMC Methods

Pros

• Able to work in high dimensions due to making local moves

• No requirement to have normalized target

• Consistent in the limit of running the chain for an infinitely

long time

• Do not require as finely tuned proposals as importance

sampling or rejection sampling

• Surprisingly effective for a huge range of problems

31


Pros and Cons of MCMC Methods (2)

Cons

• Produce biased estimates for finite sample sizes due to

correlation between samples

• Diagnostics can be very difficult

• Typically struggle to deal with multiple modes

• Proposal still quite important: chain can mix very slowly if

the proposal is not good

• Can be difficult to parallelize

• Deriving theoretical results is more difficult than previous

approaches

• Produces no marginal likelihood estimate

• Typically far slower to converge than the variational methods we introduce next

32


Background for Variational Inference: Divergences

32


Divergences

• How do we quantitatively assess how similar two distributions

p(x) and q(x) are to one another?

• Similarity between distributions is much more subjective than

you might expect, particularly for continuous variables

• Measures of the “distance” between two distributions are known as divergences and are typically expressed in the form D(p(x)||q(x))

• Note that most divergences are not symmetric

33


Divergences (2)

Which is the best fitting Gaussian to our target blue distribution?

Either can be the best depending on how we define our divergence

Image Credit: Barber, Section 28.3.4 34


The Kullback–Leibler (KL) Divergence

The KL divergence is one of the most commonly used divergences

due to its simplicity, useful computational properties, and the fact

that it naturally arises in a number of scenarios

KL(q(x) ‖ p(x)) = ∫ q(x) log(q(x)/p(x)) dx = E_{q(x)}[log(q(x)/p(x))]   (8)

Important properties:

• KL(q(x) ‖ p(x)) ≥ 0 for all q(x) and p(x)

• KL(q(x) ‖ p(x)) = 0 if and only if p(x) = q(x) ∀x

• In general, KL(q(x) ‖ p(x)) ≠ KL(p(x) ‖ q(x))
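As a concrete illustration of these properties (a sketch; the two Gaussians are arbitrary choices), we can estimate KL(q ‖ p) with Monte Carlo samples from q and compare against the closed form for univariate Gaussians:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Two arbitrary univariate Gaussians.
    mu_q, s_q = 0.0, 1.0     # q(x)
    mu_p, s_p = 1.0, 2.0     # p(x)

    # Monte Carlo estimate of KL(q || p) = E_q[log q(x) - log p(x)].
    x = rng.normal(mu_q, s_q, size=200_000)
    kl_qp_mc = np.mean(stats.norm.logpdf(x, mu_q, s_q) - stats.norm.logpdf(x, mu_p, s_p))

    # Closed form for univariate Gaussians, for comparison.
    kl_qp_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

    # The reverse direction gives a different number: the KL is not symmetric.
    y = rng.normal(mu_p, s_p, size=200_000)
    kl_pq_mc = np.mean(stats.norm.logpdf(y, mu_p, s_p) - stats.norm.logpdf(y, mu_q, s_q))

    print(f"KL(q||p): MC {kl_qp_mc:.3f}, exact {kl_qp_exact:.3f}; "
          f"KL(p||q): MC {kl_pq_mc:.3f}")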

35


Asymmetry of KL Divergence

Blue: target p(x)

Green: Gaussian q(x) that minimizes KL(q(x) ‖ p(x))

Red: Gaussian q(x) that minimizes KL(p(x) ‖ q(x))

Image Credit: Barber, Section 28.3.4 36


Mode Covering KL

The “forward” KL, KL(p(x) ‖ q(x)), is mode covering: q(x) must place mass everywhere that p(x) does

Image Credit: Eric Jang

37


Mode Seeking KL

The “reverse” KL, KL(q(x) ‖ p(x)), is mode seeking: q(x) must

not place mass anywhere p(x) does not

Image Credit: Eric Jang

38


Tail Behavior

• We can get insights into why this happens by considering the cases q(x) → 0 and p(x) → 0, noting that lim_{x→0} x log x = 0

• If q(x) = 0 when p(x) > 0, then q(x) log(q(x)/p(x)) = 0 and p(x) log(p(x)/q(x)) = ∞

• KL(p(x) ‖ q(x)) = ∞ if q(x) = 0 anywhere that p(x) > 0

• KL(q(x) ‖ p(x)) is still fine when this happens

• By symmetry, KL(q(x) ‖ p(x)) is problematic if q(x) > 0 anywhere that p(x) = 0

39


Variational Inference

39


Variational Inference

• Variational inference (VI) methods are another class of

ubiquitously used approaches for Bayesian inference wherein

we try to learn an approximation to p(θ|D)

• Key idea: reformulate the inference problem as an optimization by learning the parameters of a posterior approximation

• We do this by introducing a parameterized variational family qφ(θ), φ ∈ ϕ

• We then find the φ∗ ∈ ϕ that gives the “best” approximation

• Here “best” is based on minimizing KL(q ‖ p):

φ∗ = arg min_{φ∈ϕ} KL(qφ(θ) ‖ p(θ|D))   (9)

40


The Variational Family

[Image from Shakir Mohamed's slides: a recap of the machine learning pipeline (problem, data, model, inference, implement and test, application/production) and of the variational approach, depicting the true posterior p(z|x), an approximation class q(z), and the divergence KL(q(z|x) ‖ p(z|x)) between them, together with the variational lower bound F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)] (reconstruction plus penalty), moving from importance sampling (cf. Bishop Figure 11.8) to variational inference via the variational principle.]

Image Credit: Shakir Mohamed 41


Variational Inference (2)

[Image: the variational family qφ(θ), an initial parameter setting φ0, the optimum φ∗, and the remaining gap KL(qφ∗(θ) ‖ p(θ|D)) to the true posterior p(θ|D).]

Images based on example from David Blei

42


Variational Inference (3)

• We cannot work directly with KL(qφ(θ) ‖ p(θ|D)) because we

don’t know the posterior density

• We can, however, note that the marginal likelihood p(D) is independent of our variational parameters φ, and so work with the joint instead:

φ∗ = arg min_{φ∈ϕ} KL(qφ(θ) ‖ p(θ|D))

   = arg min_{φ∈ϕ} E_{qφ(θ)}[log(qφ(θ)/p(θ|D))]

   = arg min_{φ∈ϕ} E_{qφ(θ)}[log(qφ(θ)/p(θ|D))] − log p(D)

   = arg min_{φ∈ϕ} E_{qφ(θ)}[log(qφ(θ)/p(θ,D))]

• This trick is why we work with KL(qφ(θ) ‖ p(θ|D)) rather than KL(p(θ|D) ‖ qφ(θ)): the latter is doubly intractable

43


The ELBO

We can equivalently think about the optimization problem in VI as

the maximization

φ∗ = arg max_{φ∈ϕ} L(φ)

where L(φ) := E_{qφ(θ)}[log(p(θ,D)/qφ(θ))] = log p(D) − KL(qφ(θ) ‖ p(θ|D))

is known as the evidence lower bound or ELBO for short

L(φ) is also sometimes known as the variational free energy
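The bound is easy to check numerically in a model where log p(D) is available in closed form. The sketch below is illustrative (a Gaussian-mean model with known noise variance; the data, hyperparameters, and variational settings are all made up): it estimates L(φ) by Monte Carlo and compares it with the exact log evidence.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Illustrative conjugate model: mu ~ N(m0, s0^2), x_n | mu ~ N(mu, sigma^2).
    m0, s0, sigma = 0.0, 2.0, 1.0
    x = rng.normal(1.5, sigma, size=20)      # made-up observations
    N = len(x)

    def elbo(m, s, n_samples=100_000):
        # Monte Carlo estimate of L(phi) = E_q[log p(D|mu) + log p(mu) - log q(mu)]
        # for the variational family q_phi(mu) = N(m, s^2).
        mu = rng.normal(m, s, size=n_samples)
        log_lik = np.sum(stats.norm.logpdf(x[None, :], mu[:, None], sigma), axis=1)
        log_prior = stats.norm.logpdf(mu, m0, s0)
        log_q = stats.norm.logpdf(mu, m, s)
        return np.mean(log_lik + log_prior - log_q)

    # Exact log evidence: marginally x ~ N(m0 * 1, sigma^2 I + s0^2 * 1 1^T).
    cov = sigma**2 * np.eye(N) + s0**2 * np.ones((N, N))
    log_evidence = stats.multivariate_normal.logpdf(x, mean=np.full(N, m0), cov=cov)

    # Exact posterior over mu, for which the bound becomes tight.
    post_var = 1.0 / (1.0 / s0**2 + N / sigma**2)
    post_mean = post_var * (m0 / s0**2 + x.sum() / sigma**2)

    print("log p(D)              :", round(float(log_evidence), 3))
    print("ELBO at a poor q      :", round(float(elbo(0.0, 3.0)), 3))
    print("ELBO at the posterior :", round(float(elbo(post_mean, post_var**0.5)), 3))
    # Both ELBO values lie below log p(D); the gap vanishes when q is the posterior.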

44


The ELBO (2)

This name comes from the fact that the ELBO is a lower bound on

the log evidence by Jensen’s inequality using the concavity of log

L(φ) = E_{qφ(θ)}[log(p(θ,D)/qφ(θ))] ≤ log E_{qφ(θ)}[p(θ,D)/qφ(θ)] = log p(D)

log((u1 + u2)/2) ≥ (log u1 + log u2)/2 for any u1, u2 > 0

Image Credit: Michael Gutmann

45


Example: Mixture of Gaussians

Image Credit: Alp Kucukelbir

46


Worked Example—Gaussian with Unknown Mean and Variance

As a simple worked example (taken from Bishop 10.1.3), consider the following model where we are trying to infer the mean µ and precision τ of a Gaussian given a set of observations D = {x_n}_{n=1}^N.

Our full model is given by

p(τ) = Gamma(τ; α, β)

p(µ|τ) = N(µ; µ0, (λ0τ)^(−1))

p(D|µ, τ) = ∏_{n=1}^{N} N(xn; µ, τ^(−1))

47


Worked Example—Gaussian with Unknown Mean and Variance

We care about the posterior p(µ, τ |D) and we are going to try and

approximate this using variational inference

For our variational family we will take

qφ(τ, µ) = qφ(τ) qφ(µ)

qφ(τ) = Gamma(τ; φa, φb)

qφ(µ) = N(µ; φc, φd^(−1))

where we note that this factorization is an assumption: the

posterior itself does not factorize

48


Worked Example—Gaussian with Unknown Mean and Variance

To find the best variational parameters φ∗, we need to optimize

L(φ), for which we can use gradient methods, using

∇φL(φ) = ∇φ ∫∫ qφ(τ) qφ(µ) log( p(D|µ,τ) p(µ|τ) p(τ) / (qφ(τ) qφ(µ)) ) dτ dµ

If we can calculate this gradient, this means we can optimize φ by

performing gradient ascent.

After initializing some φ0, we just repeatedly apply

φn+1 ← φn + εn∇φL(φn)

where εn are our step sizes

49


Gradient Updates of Variational Parameters

[Image: the variational family qφ(θ), the initial parameters φ0, the optimum φ∗, and the gap KL(qφ∗(θ) ‖ p(θ|D)) to the true posterior p(θ|D).]

50


Gradient Updates of Variational Parameters

[Image: as above, now showing the gradient steps φ1, φ2, φ3 produced by repeatedly applying φn+1 ← φn + εn∇φL(φn), moving towards φ∗.]

50


Gradient Updates of Variational Parameters

[Image: as above, with the iterates φ1, φ2, φ3 approaching φ∗.]

50


The Mean Field Assumption

• In this example we chose a factorized variational

approximation: qφ(µ, τ) = qφ(µ)qφ(τ)

• This factorization assumption is actually a common

assumption more generally called the mean field assumption

• Mathematically we can define this as

qφ(θ) = ∏_j q_{j,φ}(θ_j)   (10)

• There are a number of scenarios where this can help make

maximizing the ELBO more tractable

• However, it is a less necessary assumption than it used to be since the rise of AutoDiff and stochastic gradient methods

51


The Mean Field Assumption (2)

Perhaps the biggest upshot of the mean field assumption is that it gives a closed form solution for the optimal distribution of each q_{j,φ}(θ_j) given the other factors q_{i,φ}(θ_i) for i ≠ j (up to a normalization constant and assuming an otherwise unconstrained variational family):

q∗_{j,φ}(θ_j) ∝ exp( E_{∏_{i≠j} q_{i,φ}(θ_i)}[log p(θ,D)] )   (11)

In some cases, this can be calculated analytically giving a

gradientless coordinate ascent approach to optimizing the ELBO

However, it also tends to require restrictive conjugate distribution

assumptions and so it is used much less often in modern

approaches
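For the running worked example, equation (11) yields closed-form coordinate updates, following the standard mean-field derivation for this model (cf. Bishop 10.1.3). The sketch below is illustrative rather than definitive: the data and hyperparameters are made up, and the update variables a_N, b_N, mu_N, lam_N play the roles of φa, φb, φc, φd.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up data and hyperparameters for the worked example.
    x = rng.normal(2.0, 1.5, size=50)
    N, xbar = len(x), x.mean()
    alpha0, beta0, mu0, lam0 = 1.0, 1.0, 0.0, 1.0

    # Coordinate ascent on the mean-field ELBO with q(mu) = N(mu_N, 1/lam_N)
    # and q(tau) = Gamma(a_N, b_N).
    a_N = alpha0 + (N + 1) / 2                      # fixed by the model
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)     # also fixed
    E_tau = 1.0                                     # initial guess for E_q[tau]
    for _ in range(20):
        lam_N = (lam0 + N) * E_tau                  # update q(mu)
        var_mu = 1.0 / lam_N                        # Var_q[mu]
        b_N = beta0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                             + lam0 * ((mu_N - mu0) ** 2 + var_mu))   # update q(tau)
        E_tau = a_N / b_N

    print(f"q(mu)  = N({mu_N:.3f}, {1 / lam_N:.4f})")
    print(f"q(tau) = Gamma({a_N:.1f}, {b_N:.2f}),  E_q[tau] = {E_tau:.3f}")
    print(f"compare with 1 / sample variance = {1 / x.var():.3f}")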

52


Worked Example—Gaussian with Unknown Mean and Variance

Figure 10.4 from Bishop 53


Pros and Cons of Variational Methods

Pros

• Typically more efficient than MCMC approaches, particularly

in high dimensions once we exploit the stochastic variational

approaches introduced in the next lecture

• Can often provide effective inference for models where MCMC methods have impractically slow convergence

• Though it is an approximation for the density, we can also

sample directly from our variational distribution to calculate

Monte Carlo estimates if needed

• Allows simultaneous optimization of model parameters as we

will show in the next lecture

54


Pros and Cons of Variational Methods (2)

Cons

• Produces (potentially very) biased estimates and requires strong structural assumptions to be made about the form of the posterior

• Unlike MCMC methods, this bias stays even in the limit of

large computation

• Often requires substantial tailoring to a particular problem

• Very difficult to estimate how much error there is in the approximation: subsequent estimates can be unreliable, particularly in their uncertainty

• Tends to underestimate the variance of the posterior due to

mode–seeking nature of reverse KL, particularly if using a

mean field assumption

55


Further Reading

• The lecture notes give extra information on the curse of

dimensionality and MCMC methods

• Iain Murray on MCMC

https://www.youtube.com/watch?v=_v4Eb09qp7Q

• Chapters 21, 22, and 23 of K P Murphy. Machine learning: a

probabilistic perspective. 2012

• David M Blei, Alp Kucukelbir, and Jon D McAuliffe.

“Variational inference: A review for statisticians”. In:

Journal of the American statistical Association (2017)

• NeurIPS tutorial on variational inference that accompanies the

previous paper:

https://www.youtube.com/watch?v=ogdv_6dbvVQ

56