Variational inference
Monte Carlo methods
Bayesian Machine Learning - Lecture 8
Guido Sanguinetti
Institute for Adaptive and Neural Computation, School of Informatics
University of Edinburgh
[email protected]
March 4, 2015
Today’s lecture
1 Variational inference
2 Monte Carlo methods
Free energy revisited
Recall that by postulating the inference model in terms of approximating probability distributions we obtained a general variational inference principle for Bayesian inference
The central task is minimising the free energy
$$G(q) = \langle E(x) \rangle_q - H[q(x)]$$
BP expresses the free energy in terms of families of one-node and two-node beliefs with constraints
These do not necessarily correspond to any probability distribution → problems with convergence
Variational Bayes
As opposed to BP, start directly with free energy minimisation within a certain family of approximating distributions $q$
Common choices are factorised distributions (Variational Bayes EM, VBEM), or parametric families (e.g. Gaussian)
In the factorised case, the free energy then becomes
$$G[q(x)] = G[q_1(x_1), \ldots, q_N(x_N)] = \langle E(x) \rangle_q + \sum_i \sum_{x_i} q_i(x_i) \log q_i(x_i) \qquad (1)$$
This leads to update equations
VBEM update equations
For a finite state space, computing an expectation is the same as forming the dot product $p^T f(x_i)$
The free energy in (1) then depends on $q_i$ as
$$G[q_i] = q_i^T \langle E(x) \rangle_{q_{\setminus i}} + q_i^T \log q_i + \mathrm{const}$$
where the subscript $\setminus i$ indicates all indices except the $i$-th
Taking derivatives and equating to zero, we get
$$q_i(x_i) \propto \exp\left[ -\langle E(x) \rangle_{q_{\setminus i}} \right]$$
In other words, the distribution over the $i$-th variable depends only on the statistics of the other variables
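To make the coordinate update concrete, here is a minimal numerical sketch for a toy model with two discrete variables; the pairwise energy table and state-space sizes are invented purely for illustration:

```python
import numpy as np

# Toy pairwise energy E(x1, x2) on a 3 x 4 state space (made-up numbers,
# purely to illustrate the update q_i(x_i) ∝ exp[-<E(x)>_{q\i}]).
rng = np.random.default_rng(0)
E = rng.normal(size=(3, 4))

# Initialise the factorised approximation q(x) = q1(x1) q2(x2).
q1 = np.full(3, 1.0 / 3)
q2 = np.full(4, 1.0 / 4)

def free_energy(q1, q2):
    # G[q] = <E>_q + sum_i sum_{x_i} q_i log q_i, cf. equation (1)
    avg_E = q1 @ E @ q2
    neg_entropy = np.sum(q1 * np.log(q1)) + np.sum(q2 * np.log(q2))
    return avg_E + neg_entropy

for sweep in range(50):
    # Update q1: expected energy under q2, exponentiate, normalise.
    q1 = np.exp(-(E @ q2))
    q1 /= q1.sum()
    # Update q2: expected energy under q1.
    q2 = np.exp(-(E.T @ q1))
    q2 /= q2.sum()
    # The free energy does not increase from one sweep to the next.

print(free_energy(q1, q2), q1, q2)
```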
Variational GMMs
Suppose we wish to compute the joint posterior over the class variables $z$ AND the model parameters $\Theta$ in a Gaussian mixture model
This is analytically intractable; approximate the posterior as $q(z)q(\Theta)$ and minimise the KL divergence
$$\mathrm{KL}[q \,\|\, p(z, \Theta | x)] = \log Z + \mathrm{KL}[q \,\|\, p(z, \Theta, x)] = \log Z - \mathbb{E}_{q(z)q(\Theta)}[\log p(x, z, \Theta)] - H[q(z)] - H[q(\Theta)] \qquad (2)$$
where $Z = p(x)$ and $H = -\mathbb{E}_q[\log q]$ is the entropy of the approximating distributions
Variational GMMs
The KL divergence in equation (2) can be optimised iteratively by zeroing the functional derivatives w.r.t. the approximating distributions
The resulting update rules are
$$q(z) \propto \exp\left\{ \mathbb{E}_{q(\Theta)}\left[\log p(x, z, \Theta)\right] \right\}, \qquad q(\Theta) \propto \exp\left\{ \mathbb{E}_{q(z)}\left[\log p(x, z, \Theta)\right] \right\}$$
These are iterated until the KL divergence stops decreasing (or the parameters stop changing)
Sample-based approximations
Variational methods are limited by being biased (choice of approximating family) and constrained (you still need to be able to compute expectations)
An alternative is to draw samples from the distribution of interest and compute sample-based estimates
These will be asymptotically exact, with error decreasing as one over the square root of the number of (effective) samples
We do not have the distribution of interest in closed form, but there are methods to draw samples from it
Importance sampling
We always assume we have the distribution of interest $p$ up to an unknown multiplicative scaling factor, $p = \frac{1}{Z_p}\tilde{p}$ (the joint $\tilde{p}$ is known)
We introduce a simple distribution $q$ from which we can draw samples
We then approximate any expectations as
$$\mathbb{E}_p[f] = \frac{\int f(x)\,\tilde{p}(x)\,dx}{\int \tilde{p}(x)\,dx} = \frac{\int f(x)\,\frac{\tilde{p}(x)}{q(x)}\,q(x)\,dx}{\int \frac{\tilde{p}(x)}{q(x)}\,q(x)\,dx} \approx \frac{\sum_{j=1}^M f(x_j)\,w_j}{\sum_{j=1}^M w_j} \qquad (3)$$
where $\{x_j\}$ is a set of $M$ points sampled from $q$ (easy to sample from) and $w_j = \tilde{p}(x_j)/q(x_j)$ are the importance weights
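As an illustration, here is a minimal sketch of self-normalised importance sampling as in equation (3); the unnormalised Gaussian target, the broader Gaussian proposal, and the test function are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target p~(x): a standard Gaussian without its normalising constant.
p_tilde = lambda x: np.exp(-0.5 * x**2)

# Proposal q: a broader Gaussian we can sample from directly.
q_scale = 2.0
q_pdf = lambda x: np.exp(-0.5 * (x / q_scale)**2) / (q_scale * np.sqrt(2 * np.pi))

M = 10_000
x = rng.normal(scale=q_scale, size=M)      # x_j ~ q
w = p_tilde(x) / q_pdf(x)                  # importance weights w_j = p~(x_j) / q(x_j)

f = lambda x: x**2                         # any test function f
estimate = np.sum(f(x) * w) / np.sum(w)    # self-normalised estimate, equation (3)
print(estimate)                            # should be close to E_p[x^2] = 1
```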
Comments on importance sampling
Adjusting by the importance weights ensures that we are effectively sampling from the correct distribution
Ideally, one would want the proposal distribution $q$ to be as close as possible to the target $p$
Discuss possible drawbacks of importance sampling
Markov Chain Monte Carlo
An alternative set of algorithms for Bayesian computation relies on constructing an iterative sampling strategy that explores the latent variables according to the posterior distribution
The trick is to construct a random walk (Markov chain) and then bias it in a way that (after some time) the variables are effectively sampled from the posterior
This is the core idea of Markov Chain Monte Carlo methods
The biasing is usually obtained either through a rejection procedure, or by drawing the samples from distributions that are already very close to what we want
MCMC conditions
MCMC works by building a random walk that converges to the target distribution
A necessary condition is the invariance of the target distribution $p$ under $T(x', x)$, the transition kernel of the random walk
$$p(x) = \int T(x, x')\, p(x')\, dx'$$
The Markov chain must be ergodic (time averages = sample averages); usually one checks for detailed balance (reversibility of transitions, not quite equivalent but sufficient)
The Markov chain must be irreducible and aperiodic
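A small numerical check of invariance and detailed balance, using a made-up three-state target and a Metropolis-Hastings kernel (anticipating the next slide), could look like this:

```python
import numpy as np

# Target distribution over 3 states (made-up numbers, for illustration).
p = np.array([0.2, 0.5, 0.3])
n = len(p)

# Symmetric uniform proposal q(y|x) = 1/n, so the Hastings ratio reduces to p(y)/p(x).
T = np.zeros((n, n))          # T[y, x] = probability of moving to y given the chain is at x
for x in range(n):
    for y in range(n):
        if y != x:
            T[y, x] = (1.0 / n) * min(1.0, p[y] / p[x])  # propose y, accept w.p. min(1, p(y)/p(x))
    T[x, x] = 1.0 - T[:, x].sum()                        # rejected proposals stay at x

# Invariance: p(x) = sum_x' T(x, x') p(x')
print(np.allclose(T @ p, p))                              # True

# Detailed balance: T(x, y) p(y) = T(y, x) p(x) for all x, y
flux = T * p[None, :]
print(np.allclose(flux, flux.T))                          # True
```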
MCMC: the Metropolis-Hastings algorithm
The simplest such strategy is the Metropolis-Hastings algorithm
You need a proposal distribution $q(z|z_0)$ from which you can draw samples
Having drawn a sample $z_1$, you accept it with probability
$$r = \min\left\{1,\; \frac{p(z_1)}{p(z_0)}\,\frac{q(z_0|z_1)}{q(z_1|z_0)}\right\}$$
If accepted, you then carry on using $q(z|z_1)$ as the updated proposal (otherwise you stay at $z_0$). Asymptotically, this converges to the posterior
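A minimal sketch of a random-walk Metropolis-Hastings sampler; the unnormalised two-mode target, step size, and burn-in length are invented for illustration, and since the Gaussian proposal is symmetric the $q$ ratio cancels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target: a two-component 1D Gaussian mixture (illustrative choice).
def p_tilde(z):
    return 0.3 * np.exp(-0.5 * (z + 2)**2) + 0.7 * np.exp(-0.5 * (z - 2)**2)

# Random-walk proposal q(z | z0) = N(z0, step^2); symmetric, so the q-ratio cancels in r.
step = 1.0
z = 0.0
samples = []
for t in range(20_000):
    z_new = z + step * rng.normal()
    r = min(1.0, p_tilde(z_new) / p_tilde(z))  # acceptance probability
    if rng.uniform() < r:
        z = z_new                              # accept; otherwise keep the current state
    samples.append(z)

samples = np.array(samples[5_000:])            # discard burn-in before using the chain
print(samples.mean(), samples.std())
```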
Validity of MH sampler
Irreducibility and aperiodicity depend on the proposal distribution (true unless pathological)
To verify ergodicity we check detailed balance (only needed for the case when the samples are accepted)
$$T(x, y)\,p(y) = \min\left\{1,\; \frac{p(x)}{p(y)}\,\frac{q(y|x)}{q(x|y)}\right\} q(x|y)\,p(y) = \min\{q(x|y)\,p(y),\; p(x)\,q(y|x)\} = T(y, x)\,p(x) \qquad (4)$$
MH therefore gives a generic procedure for asymptotically drawing samples from any distribution
However, the time it takes to reach this asymptotic regime may be very long
MCMC: Gibbs sampling
Often the latent variables are structured in groups $z = (z_1, \ldots, z_k)$
If the conditional posteriors $p(z_i | z_{-i}, x)$ can be computed, then one can use Gibbs sampling
Practically, one samples iteratively from all conditional posteriors, one at a time
Easily shown to be a special case of MH in which the acceptance probability is always 1
Generally slow, since successive samples are highly correlated
Gibbs sampling for a Bayesian GMM
Exercise: work out the Gibbs sampler for the Bayesian Gaussian mixture model with two spherical Gaussian components with fixed variance $\sigma^2$
Place a Beta(1,1) prior over the mixing components $\pi_1, \pi_2$ and a spherical Gaussian prior $\mathcal{N}(0, 10)$ over the component means
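One possible sketch of such a Gibbs sampler (assuming 2D data, reading $\mathcal{N}(0, 10)$ as a zero-mean Gaussian with variance 10 on each coordinate, and using invented synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, prior_var = 1.0, 10.0                  # fixed component variance; prior variance on means

# Illustrative synthetic data: two clusters in 2D.
X = np.vstack([rng.normal([-1, -1], 1.0, (50, 2)),
               rng.normal([2, 2], 1.0, (50, 2))])
N, D = X.shape

pi = np.array([0.5, 0.5])
mu = rng.normal(0, np.sqrt(prior_var), (2, D))

for it in range(1000):
    # 1. Sample assignments z_i | pi, mu:  P(z_i = k) ∝ pi_k N(x_i; mu_k, sigma2 I)
    logp = np.log(pi) - 0.5 * np.sum((X[:, None, :] - mu[None, :, :])**2, axis=2) / sigma2
    prob = np.exp(logp - logp.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    z = (rng.uniform(size=N) < prob[:, 1]).astype(int)

    # 2. Sample mixing weights pi | z:  Beta(1,1) prior gives a Beta(1 + n0, 1 + n1) posterior
    n = np.array([(z == 0).sum(), (z == 1).sum()])
    pi0 = rng.beta(1 + n[0], 1 + n[1])
    pi = np.array([pi0, 1 - pi0])

    # 3. Sample means mu_k | z, x:  conjugate Gaussian posterior per component
    for k in range(2):
        prec = 1.0 / prior_var + n[k] / sigma2
        mean = (X[z == k].sum(axis=0) / sigma2) / prec
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))

print(pi, mu)
```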
Tomorrow’s lab class
Sample 50 2D points each from Gaussians with means (-1,-1), (1,3) and (3,0), all with variance 1
Implement EM for GMMs with $k$ components of fixed variance 1
Run EM on the data for $k = 2, \ldots, 8$; compute and plot the BIC score for these 7 models
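A possible starting point for the data generation and the BIC computation (using the common convention $\mathrm{BIC} = -2 \log L + p \log N$, where smaller is better, and a parameter count that assumes fixed variance, free means and free mixing weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate 50 2D points from each of the three Gaussians (variance 1).
means = np.array([[-1, -1], [1, 3], [3, 0]])
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in means])
N = X.shape[0]

def bic(log_lik, k):
    # With fixed variance, the free parameters are the k 2D means
    # and the k-1 independent mixing weights.
    n_params = 2 * k + (k - 1)
    return -2 * log_lik + n_params * np.log(N)

# After running your EM implementation for each k, pass the maximised
# log-likelihood to bic(); the best model minimises the score.
```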
And thanks to the ERC!