Explaining “Explaining Variational Approximation”
Based on the paper
“Explaining Variational Approximation”
JT Ormerod, MP Wand (2010)
Presentation by Wayne Tai Lee
My Goal
● Convert the paper into a short presentation
● Not covering the examples (really helpful!)
● Intuition and motivation only
Why do we want to use variational approximations?
● In statistics, Bayesian solutions always involve the posterior:
p(Θ|data) = p(data | Θ) p(Θ) / p(data)
– p(Θ|data): posterior, belief after updating with data
– p(data | Θ): likelihood, data generation; specified by you
– p(Θ): prior, belief before updating with data; specified by you
– p(data): “normalizing constant” that ensures the posterior is a density function; a nasty integral that cannot be calculated explicitly in general
● Consequence:
– The posterior often has no analytical expression
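Written out (the integral itself is not shown on the slides, but this is the standard form):

\[
p(\Theta \mid \text{data}) = \frac{p(\text{data} \mid \Theta)\, p(\Theta)}{p(\text{data})},
\qquad
p(\text{data}) = \int p(\text{data} \mid \Theta)\, p(\Theta)\, d\Theta .
\]

For a multidimensional Θ this integral rarely has a closed form, which is why the posterior usually has no analytical expression.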
Most popular alternative
● To obtain the posterior or any related statistic:
– Sample from the posterior via MCMC methods
● Pros:
– Can get arbitrarily close to the posterior with enough samples
● Cons:
– Lots of tuning necessary
– Time consuming to run (resource/time intensive)
Variational Approximation
● Intuition:
– Approximate the posterior with a class of functions that are easier to deal with mathematically
– Find the function in this class that minimizes the KL divergence to the posterior
● Pros:
– Suuuuper fast
● Cons:
– No guarantees on closeness
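For reference, the divergence being minimized is the standard Kullback–Leibler divergence (its definition is not spelled out on the slides):

\[
\mathrm{KL}\big(q \,\|\, p(\cdot \mid \text{data})\big) = \int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta \mid \text{data})}\,d\Theta \;\ge\; 0,
\]

with equality exactly when q(Θ) = p(Θ|data).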
Big Picture

Method to get posterior | MCMC | Variational Method
Strategy | Sampling | Optimization
Solution | Asymptotically exact | Approximation with no bounds
Speed | Often slow | Fast
The “catch” | Tuning and convergence assessment require experience | Need tractable mathematical setup
Explaining Variational Approximation
● Change notation: p(y) = p(data)
● Use q(Θ) to approximate p(Θ|y)
● Will assume a family of functions for q(Θ):
– q(Θ) = q1(Θ1)q2(Θ2)...qp(Θp)
– Each qi(Θi) is a density
Max Lower Bound = Min KL-Divergence
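The identity behind this slide title is the standard decomposition of the log marginal likelihood:

\[
\log p(y) =
\underbrace{\int q(\Theta)\,\log\frac{p(y,\Theta)}{q(\Theta)}\,d\Theta}_{\text{lower bound } \mathcal{L}(q)}
\;+\;
\underbrace{\int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta \mid y)}\,d\Theta}_{\mathrm{KL}(q \,\|\, p(\cdot \mid y))} .
\]

Since log p(y) does not depend on q, maximizing the lower bound L(q) is exactly the same as minimizing the KL divergence.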
Sanity Check: Optimal Solution is THE solution
● With no restriction on q, the optimal q(Θ) is p(Θ|y) itself: the KL divergence is zero exactly when q is the posterior
● Important: this holds for arbitrary dependence/distribution of Θ and y
● Product form of q(Θ) allows us to divide and conquer!
Focus on each Θ separately
● The lower bound expands term by term, isolating one qi at a time, so each factor can be optimized separately (see the expansion below)
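Under the product form, the standard expansion isolating q1 (the same works for any qi) is:

\[
\mathcal{L}(q) = \int q_1(\Theta_1)\, E_{q_2 \cdots q_p}\!\left[\log p(y,\Theta)\right] d\Theta_1
\;-\; \int q_1(\Theta_1)\,\log q_1(\Theta_1)\, d\Theta_1 \;+\; \cdots
\]

where the trailing “+ ...” collects the entropy terms of q2, ..., qp, none of which involve q1.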
Our assumptions so far
● Product form of q(Θ) allowed us to optimize each term separately
● Each qi(Θi) being a density allows the other components to integrate out nicely
How do we convert this into an optimization problem that we can solve?
We've only learned one trick...
● Optimal q1(Θ1) is then the density proportional to exp{ E−Θ1[ log p(y, Θ) ] }
● To get a density, just normalize
Unfold our definitions
Focus on Θ1
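Unfolding the definitions and collecting the Θ1 terms gives (standard derivation, reconstructed here):

\[
q_1^*(\Theta_1) \propto \exp\!\left\{ E_{-\Theta_1}\!\left[\log p(y,\Theta)\right]\right\},
\qquad
\mathcal{L}(q) = -\,\mathrm{KL}\!\left(q_1 \,\|\, q_1^*\right) + \text{const},
\]

so the one trick applies again: L(q) is maximized over q1 by taking q1 = q1*, and the constant collects everything not involving q1.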
Repeat for Θi
● General Mean Field Variational Approximation solution: each qi(Θi) is the density proportional to exp{ E−Θi[ log p(y, Θ) ] }
● Each optimal qi depends on the other qj, so in practice the factors are updated one at a time until convergence
Similarity to Full Conditional in Gibbs Sampling
● Optimal qi(Θi) is proportional to exp{ E−Θi[ log p(Θi | Θ−i, y) ] }
● Need to do algebra until this is “tractable”
– i.e. something we recognize as a standard distribution that is easily normalized
– This is where the “setup” becomes important
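This is the Gibbs connection: since log p(y, Θ) = log p(Θi | Θ−i, y) + log p(Θ−i, y) and the second term does not involve Θi, it drops out when we normalize, so the two forms of the solution agree:

\[
q_i^*(\Theta_i) \propto \exp\!\left\{E_{-\Theta_i}\!\left[\log p(y,\Theta)\right]\right\}
\propto \exp\!\left\{E_{-\Theta_i}\!\left[\log p(\Theta_i \mid \Theta_{-i},\, y)\right]\right\} .
\]

The optimal factor is built from the same full conditional that Gibbs sampling draws from.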
For example
● If exp{ E−Θi[ log p(y, Θ) ] } resembles exp(Θi^2 · c1 + Θi · c2) with c1 < 0, then we know this must be the Gaussian density!
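Completing the square makes this precise: if E−Θi[log p(y, Θ)] = c1·Θi² + c2·Θi + const with c1 < 0, then

\[
q_i^*(\Theta_i) \propto \exp\!\left(c_1 \Theta_i^2 + c_2 \Theta_i\right)
\quad\Longrightarrow\quad
q_i^* = N\!\left(-\frac{c_2}{2c_1},\; -\frac{1}{2c_1}\right) .
\]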
Final Solution
● The product of all qi(Θi) is then our approximation to p(Θ|y)
● Naturally doesn't do well when there's strong dependence between the Θi
● You should try the examples in the paper!
First Example
● Data generated as:
– Y | μ, σ^2 ~ N(μ, σ^2)
● Priors:
– μ ~ N(m, s^2)
– σ^2 ~ InvGamma(a, b)
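For this model the two optimal factors work out to a Gaussian q(μ) and an Inverse-Gamma q(σ^2), updated in turn. Below is a minimal sketch of that coordinate-ascent loop, assuming the priors above; the function name, variable names, and default hyperparameters are mine, not the paper's.

```python
import numpy as np

def vb_normal_invgamma(y, m=0.0, s2=100.0, a=0.01, b=0.01, iters=100):
    """Mean-field VB for y_i | mu, sigma2 ~ N(mu, sigma2),
    with priors mu ~ N(m, s2) and sigma2 ~ InvGamma(a, b).
    Returns parameters of q(mu) = N(mu_q, sigma2_q) and
    q(sigma2) = InvGamma(A_q, B_q)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    A_q = a + n / 2.0          # shape of q(sigma2); fixed across iterations
    E_inv_sigma2 = 1.0         # initial guess for E_q[1/sigma2]
    for _ in range(iters):
        # Update q(mu): Gaussian with precision n*E[1/sigma2] + 1/s2
        sigma2_q = 1.0 / (n * E_inv_sigma2 + 1.0 / s2)
        mu_q = sigma2_q * (E_inv_sigma2 * y.sum() + m / s2)
        # Update q(sigma2): InvGamma(A_q, B_q), using E_q[(y_i - mu)^2]
        B_q = b + 0.5 * (np.sum((y - mu_q) ** 2) + n * sigma2_q)
        E_inv_sigma2 = A_q / B_q   # mean of 1/sigma2 under InvGamma(A_q, B_q)
    return mu_q, sigma2_q, A_q, B_q

# Toy usage on simulated data (the values here are arbitrary):
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)
mu_q, sigma2_q, A_q, B_q = vb_normal_invgamma(y)
print(f"q(mu) = N({mu_q:.3f}, {sigma2_q:.4f}); q(sigma2) = InvGamma({A_q:.1f}, {B_q:.2f})")
```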
Gibbs Sampling vs Variational Samples
● [Plots comparing the Gibbs-sampled posterior with the variational approximation, for N = 100 and for N = 20]
Discussion
● Hard to know when the approximation is poor relative to the true posterior...