Explaining “Explaining Variational Approximation”
Based on the paper
“Explaining Variational Approximation”
JT Ormerod, MP Wand (2010)
Presentation by Wayne Tai Lee
My Goal
● Convert the paper into a short presentation
● Not covering the examples (really helpful!)
● Intuition and motivation only
Why do we want to use variational approximations?
● In statistics, Bayesian solutions always involve the posterior:
p(Θ|data) = p(data | Θ) p(Θ) / p(data)
– p(Θ|data): posterior, belief after updating with data
– p(data | Θ): likelihood, data generation; specified by you
– p(Θ): prior, belief before updating with data; specified by you
– p(data): “normalizing constant” that ensures the posterior is a density function; a nasty integral that cannot be calculated explicitly in general
● Consequence:
– The posterior often has no analytical expression
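Written out (the integral itself is not shown on the slides, but this is the standard form):

\[
p(\Theta \mid \text{data}) = \frac{p(\text{data} \mid \Theta)\, p(\Theta)}{p(\text{data})},
\qquad
p(\text{data}) = \int p(\text{data} \mid \Theta)\, p(\Theta)\, d\Theta .
\]

For a multidimensional Θ this integral rarely has a closed form, which is why the posterior usually has no analytical expression.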
Most popular alternative
● To obtain the posterior or any related statistic:
– Sample from the posterior via MCMC methods
● Pros:
– Can get arbitrarily close to the posterior with enough samples
● Cons:
– Lots of tuning necessary
– Time consuming to run (resource/time intensive)
Variational Approximation
● Intuition:
– Approximate the posterior with a class of functions that are easier to deal with mathematically
– Find the function in this class that minimizes the KL divergence to the posterior
● Pros:
– Suuuuper fast
● Cons:
– No guarantees on closeness
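For reference, the divergence being minimized is the standard Kullback–Leibler divergence (its definition is not spelled out on the slides):

\[
\mathrm{KL}\big(q \,\|\, p(\cdot \mid \text{data})\big) = \int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta \mid \text{data})}\,d\Theta \;\ge\; 0,
\]

with equality exactly when q(Θ) = p(Θ|data).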
Big Picture

Method to get posterior | MCMC | Variational Method
Strategy | Sampling | Optimization
Solution | Asymptotically exact | Approximation with no bounds
Speed | Often slow | Fast
The “catch” | Tuning and convergence assessment require experience | Need tractable mathematical setup
Explaining Variational Approximation
● Change notation: p(y) = p(data)
● Use q(Θ) to approximate p(Θ|y)
● Will assume a family of functions for q(Θ):
– q(Θ) = q1(Θ1)q2(Θ2)...qp(Θp)
– Each qi(Θi) is a density
Max Lower Bound = Min KL-Divergence
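The identity behind this slide title is the standard decomposition of the log marginal likelihood:

\[
\log p(y) =
\underbrace{\int q(\Theta)\,\log\frac{p(y,\Theta)}{q(\Theta)}\,d\Theta}_{\text{lower bound } \mathcal{L}(q)}
\;+\;
\underbrace{\int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta \mid y)}\,d\Theta}_{\mathrm{KL}(q \,\|\, p(\cdot \mid y))} .
\]

Since log p(y) does not depend on q, maximizing the lower bound L(q) is exactly the same as minimizing the KL divergence.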
Sanity Check: Optimal Solution is THE solution
● With no restriction on q, the optimal q(Θ) is p(Θ|y) itself: the KL divergence is zero exactly when q is the posterior
● Important: this holds for arbitrary dependence/distribution of Θ and y
● Product form of q(Θ) allows us to divide and conquer!
Focus on each Θ separately
● The lower bound expands term by term, isolating one qi at a time, so each factor can be optimized separately (see the expansion below)
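Under the product form, the standard expansion isolating q1 (the same works for any qi) is:

\[
\mathcal{L}(q) = \int q_1(\Theta_1)\, E_{q_2 \cdots q_p}\!\left[\log p(y,\Theta)\right] d\Theta_1
\;-\; \int q_1(\Theta_1)\,\log q_1(\Theta_1)\, d\Theta_1 \;+\; \cdots
\]

where the trailing “+ ...” collects the entropy terms of q2, ..., qp, none of which involve q1.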
Our assumptions so far
● Product form of q(Θ) allowed us to optimize each term separately
● Each qi(Θi) being a density allows the other components to integrate out nicely
How do we convert this into an optimization problem that we can solve?
We've only learned one trick...
● Optimal q1(Θ1) is then the density proportional to exp{ E−Θ1[ log p(y, Θ) ] }
● To get a density, just normalize
Unfold our definitions
Focus on Θ1
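Unfolding the definitions and collecting the Θ1 terms gives (standard derivation, reconstructed here):

\[
q_1^*(\Theta_1) \propto \exp\!\left\{ E_{-\Theta_1}\!\left[\log p(y,\Theta)\right]\right\},
\qquad
\mathcal{L}(q) = -\,\mathrm{KL}\!\left(q_1 \,\|\, q_1^*\right) + \text{const},
\]

so the one trick applies again: L(q) is maximized over q1 by taking q1 = q1*, and the constant collects everything not involving q1.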
Repeat for Θi
● General Mean Field Variational Approximation solution: each qi(Θi) is the density proportional to exp{ E−Θi[ log p(y, Θ) ] }
● Each optimal qi depends on the other qj, so in practice the factors are updated one at a time until convergence
Similarity to Full Conditional in Gibbs Sampling
● Optimal qi(Θi) is proportional to exp{ E−Θi[ log p(Θi | Θ−i, y) ] }
● Need to do algebra until this is “tractable”
– i.e. something we recognize as a standard distribution that is easily normalized
– This is where the “setup” becomes important
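This is the Gibbs connection: since log p(y, Θ) = log p(Θi | Θ−i, y) + log p(Θ−i, y) and the second term does not involve Θi, it drops out when we normalize, so the two forms of the solution agree:

\[
q_i^*(\Theta_i) \propto \exp\!\left\{E_{-\Theta_i}\!\left[\log p(y,\Theta)\right]\right\}
\propto \exp\!\left\{E_{-\Theta_i}\!\left[\log p(\Theta_i \mid \Theta_{-i},\, y)\right]\right\} .
\]

The optimal factor is built from the same full conditional that Gibbs sampling draws from.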
For example
● If exp{ E−Θi[ log p(y, Θ) ] } resembles exp(Θi^2 · c1 + Θi · c2) with c1 < 0, then we know this must be the Gaussian density!
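Completing the square makes this precise: if E−Θi[log p(y, Θ)] = c1·Θi² + c2·Θi + const with c1 < 0, then

\[
q_i^*(\Theta_i) \propto \exp\!\left(c_1 \Theta_i^2 + c_2 \Theta_i\right)
\quad\Longrightarrow\quad
q_i^* = N\!\left(-\frac{c_2}{2c_1},\; -\frac{1}{2c_1}\right) .
\]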
Final Solution
● The product of all qi(Θi) is then our approximation to p(Θ|y)
● Naturally doesn't do well when there's strong dependence between the Θi
● You should try the examples in the paper!
First Example
● Data generated as:
– Y | μ, σ^2 ~ N(μ, σ^2)
● Priors:
– μ ~ N(m, s^2)
– σ^2 ~ InvGamma(a, b)
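For this model the two optimal factors work out to a Gaussian q(μ) and an Inverse-Gamma q(σ^2), updated in turn. Below is a minimal sketch of that coordinate-ascent loop, assuming the priors above; the function name, variable names, and default hyperparameters are mine, not the paper's.

```python
import numpy as np

def vb_normal_invgamma(y, m=0.0, s2=100.0, a=0.01, b=0.01, iters=100):
    """Mean-field VB for y_i | mu, sigma2 ~ N(mu, sigma2),
    with priors mu ~ N(m, s2) and sigma2 ~ InvGamma(a, b).
    Returns parameters of q(mu) = N(mu_q, sigma2_q) and
    q(sigma2) = InvGamma(A_q, B_q)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    A_q = a + n / 2.0          # shape of q(sigma2); fixed across iterations
    E_inv_sigma2 = 1.0         # initial guess for E_q[1/sigma2]
    for _ in range(iters):
        # Update q(mu): Gaussian with precision n*E[1/sigma2] + 1/s2
        sigma2_q = 1.0 / (n * E_inv_sigma2 + 1.0 / s2)
        mu_q = sigma2_q * (E_inv_sigma2 * y.sum() + m / s2)
        # Update q(sigma2): InvGamma(A_q, B_q), using E_q[(y_i - mu)^2]
        B_q = b + 0.5 * (np.sum((y - mu_q) ** 2) + n * sigma2_q)
        E_inv_sigma2 = A_q / B_q   # mean of 1/sigma2 under InvGamma(A_q, B_q)
    return mu_q, sigma2_q, A_q, B_q

# Toy usage on simulated data (the values here are arbitrary):
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)
mu_q, sigma2_q, A_q, B_q = vb_normal_invgamma(y)
print(f"q(mu) = N({mu_q:.3f}, {sigma2_q:.4f}); q(sigma2) = InvGamma({A_q:.1f}, {B_q:.2f})")
```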
Gibbs Sampling vs Variational Samples
● [Plots comparing the Gibbs-sampled posterior with the variational approximation, for N = 100 and for N = 20]
Discussion
● Hard to know when the approximation is poor relative to the true posterior...