Bayesian Inference
Sudhir Shankar Raman
Translational Neuromodeling Unit, UZH & ETH
The Reverend Thomas Bayes
(1702-1761)
With many thanks for materials to: Klaas Enno Stephan & Kay H. Brodersen
Why do I need to learn about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
Classical and Bayesian statistics
Classical (frequentist) approach:
• Probability of observing the data y, given no effect (θ = 0): p(y | H₀), with H₀: θ = 0.
• p-value: probability of getting the observed data in the effect's absence. If small, reject the null hypothesis that there is no effect.

Problems of the classical approach:
• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.

Bayesian approach:
• Works with the posterior probability of the effect given the data, p(θ | y), computed from the prior p(θ) and the joint density p(y, θ).
• Flexibility in modelling
• Incorporating prior information
• Posterior probability of the effect
• Options for model comparison
Bayesian Inference
Statistical analysis and the illusion of objectivity. James O. Berger, Donald A. Berry
Bayes' Theorem
Reverend Thomas Bayes (1702–1761)
"Bayes' Theorem describes how an ideally rational person processes information." (Wikipedia)
Prior beliefs + observed data → posterior beliefs
Given data y and parameters θ, the conditional probabilities are:

p(y | θ) = p(y, θ) / p(θ)        p(θ | y) = p(y, θ) / p(y)

Eliminating p(y, θ) gives Bayes' rule:

p(θ | y) = p(y | θ) p(θ) / p(y)

Posterior = Likelihood × Prior / Evidence
Bayesian statistics
p(θ | y) ∝ p(y | θ) · p(θ)        (posterior ∝ likelihood · prior)
Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities.
Priors can be of different sorts: empirical, principled, shrinkage, or uninformative.
The “posterior” probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.
Bayes in motion - an animation
Observation of data
Formulation of a generative model
Model Inversion - Update of beliefs based upon observations, given a prior state of knowledge
p(θ | y) ∝ p(y | θ) p(θ)
Principles of Bayesian inference
Model: likelihood function p(y | θ) and prior distribution p(θ)
Data y
Maximum a posteriori (MAP)
Maximum likelihood (ML)
Conjugate priors: prior and posterior have the same form.
p(θ | y) ∝ p(y | θ) p(θ), where the posterior p(θ | y) and the prior p(θ) have the same functional form.
Analytical expressions are available. Conjugate priors exist for all exponential-family likelihoods. Example: a Gaussian likelihood with a Gaussian prior over the mean.
Gaussian model

Likelihood & prior:
p(y | μ) = N(y; μ, λ_e⁻¹)
p(μ) = N(μ; μ_p, λ_p⁻¹)

Posterior: p(μ | y) = N(μ; μ_post, λ_post⁻¹), with

λ_post = λ_e + λ_p
μ_post = (λ_e / λ_post) · y + (λ_p / λ_post) · μ_p

Relative precision weighting: the posterior precision is the sum of the data and prior precisions, and the posterior mean is a precision-weighted average of the data y and the prior mean μ_p.
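As a minimal sketch of this precision-weighted update (the numerical values below are invented for illustration, not from the lecture):

```python
# Precision-weighted Gaussian posterior for a single observation (toy values).
lambda_e = 4.0          # data (likelihood) precision, 1/variance
lambda_p = 1.0          # prior precision
mu_p     = 0.0          # prior mean
y        = 2.5          # observed data point

lambda_post = lambda_e + lambda_p                  # posterior precision
mu_post = (lambda_e / lambda_post) * y \
        + (lambda_p / lambda_post) * mu_p          # precision-weighted mean

print(mu_post, 1.0 / lambda_post)                  # posterior mean and variance
```

The more precise source of information (data or prior) dominates the posterior mean.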
Bayesian regression: univariate case
Relative precision weighting

Univariate linear model: y = xθ + e

Normal densities:
p(θ) = N(θ; η_p, σ_p²)
p(y | θ) = N(y; xθ, σ_e²)
p(θ | y) = N(θ; η_θ|y, σ²_θ|y)

with

1/σ²_θ|y = x²/σ_e² + 1/σ_p²
η_θ|y = σ²_θ|y ( x·y/σ_e² + η_p/σ_p² )

Again, the posterior precision is the sum of the (design-weighted) data precision and the prior precision, and the posterior mean is their precision-weighted combination.
Bayesian GLM: multivariate case
One-step estimation if C_e is known; otherwise define a conjugate prior or perform iterative estimation with EM.
General Linear Model: y = Xθ + e

Normal densities:
p(θ) = N(θ; η_p, C_p)
p(y | θ) = N(y; Xθ, C_e)
p(θ | y) = N(θ; η_θ|y, C_θ|y)

with

C_θ|y⁻¹ = Xᵀ C_e⁻¹ X + C_p⁻¹
η_θ|y = C_θ|y ( Xᵀ C_e⁻¹ y + C_p⁻¹ η_p )
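A small sketch of these two multivariate update equations (the design matrix, noise covariance, and prior below are illustrative placeholders, not values from the lecture):

```python
import numpy as np

# Bayesian GLM posterior under Gaussian assumptions (toy example).
rng = np.random.default_rng(0)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
theta_true = np.array([1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=n)

C_e = 0.09 * np.eye(n)          # observation noise covariance (assumed known)
C_p = 10.0 * np.eye(k)          # prior covariance
eta_p = np.zeros(k)             # prior mean

# C_{theta|y}^{-1} = X' C_e^{-1} X + C_p^{-1}
prec_post = X.T @ np.linalg.inv(C_e) @ X + np.linalg.inv(C_p)
C_post = np.linalg.inv(prec_post)
# eta_{theta|y} = C_{theta|y} (X' C_e^{-1} y + C_p^{-1} eta_p)
eta_post = C_post @ (X.T @ np.linalg.inv(C_e) @ y + np.linalg.inv(C_p) @ eta_p)

print(eta_post)                 # posterior mean of the regression parameters
```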
An intuitive example
[Figure: prior, likelihood, and posterior densities over a two-dimensional parameter space (θ₁, θ₂)]
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
Bayesian model selection (BMS)
Given competing hypotheses about the structure and functional mechanisms of a system, which model is the best?
For which model m does p(y|m) become maximal?
Which model represents the best balance between model fit and model complexity?
Pitt & Myung (2002), TICS
Model evidence: p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ
Kass and Raftery (1995), Penny et al. (2004) NeuroImage
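To make the model evidence integral concrete, the sketch below evaluates it by brute-force numerical integration for a one-parameter Gaussian toy model; the data, noise level, and priors are invented for illustration and are not from the lecture:

```python
import numpy as np
from scipy import stats

# Model evidence p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ for a 1-D toy model.
y = np.array([0.8, 1.2, 1.0, 1.4])
sigma = 0.5                                    # known noise s.d.

def evidence(prior_mean, prior_sd):
    grid = np.linspace(-10, 10, 4001)          # integration grid over θ
    lik = np.array([stats.norm.pdf(y, loc=t, scale=sigma).prod() for t in grid])
    prior = stats.norm.pdf(grid, loc=prior_mean, scale=prior_sd)
    return (lik * prior).sum() * (grid[1] - grid[0])   # Riemann sum

# Two competing models differing only in their prior on θ
print(evidence(0.0, 1.0))    # model 1: tight prior around 0
print(evidence(0.0, 10.0))   # model 2: vague prior (automatically penalized for complexity)
```

The vaguer prior spreads its probability mass over implausible parameter values, so its evidence is lower: the integral embodies the accuracy–complexity trade-off.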
Bayesian model selection (BMS)
Bayes' rule at the level of a given model m:

p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)

The model evidence p(y | m):
• accounts for both accuracy and complexity of the model
• allows for inference about the structure (generalizability) of the model

Model comparison via the Bayes factor:

BF₁₂ = p(y | m₁) / p(y | m₂)

Posterior odds = Bayes factor × prior odds:

p(m₁ | y) / p(m₂ | y) = [ p(y | m₁) / p(y | m₂) ] × [ p(m₁) / p(m₂) ]

Model averaging:

p(θ | y) = Σ_m p(θ | y, m) p(m | y)
BF₁₀          Evidence against H₀
1 to 3        Not worth more than a bare mention
3 to 20       Positive
20 to 150     Strong
> 150         Decisive
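As a small illustrative sketch (the log-evidences below are invented), Bayes factors and posterior model probabilities follow directly from per-model log-evidences, e.g. from the free energies returned by model inversion:

```python
import numpy as np

# Comparing two models from their log-evidences ln p(y|m) (toy numbers).
log_ev = {"m1": -102.3, "m2": -107.1}

log_bf_12 = log_ev["m1"] - log_ev["m2"]      # ln BF_12
print("BF_12 =", np.exp(log_bf_12))          # ~121 -> 'strong' on the scale above

# Posterior model probabilities under a flat prior over models
logs = np.array(list(log_ev.values()))
p_m = np.exp(logs - logs.max())              # subtract max for numerical stability
p_m /= p_m.sum()
print(dict(zip(log_ev, p_m)))
```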
Bayesian model selection (BMS)
Various approximations:
• Akaike Information Criterion (AIC) – Akaike, 1974
• Bayesian Information Criterion (BIC) – Schwarz, 1978
• Negative free energy (F) – a by-product of Variational Bayes
AIC: ln p(D | θ_ML) − M
BIC: ln p(D) ≈ ln p(D | θ_MAP) − ½ M ln(N)
(M: number of parameters; N: number of data points)
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
p(θ | y) = p(y | θ) p(θ) / p(y)
posterior = likelihood × prior / marginal likelihood (model evidence)
Bayesian inference formalizes model inversion, the process of passing from a prior to a posterior in light of data.
Approximate Bayesian inference
In practice, evaluating the posterior is usually difficult because we cannot easily evaluate the model evidence (normalization constant) p(y), especially when:
• the parameter space is high-dimensional or the densities have a complex form
• analytical solutions are not available
• numerical integration is too expensive

p(θ | y) = p(y | θ) p(θ) / ∫ p(y, θ) dθ
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Approximate Bayesian inference
Stochastic approximate inference, in particular sampling:
design an algorithm that draws samples from the posterior p(θ | y)
inspect sample statistics (e.g., histogram, sample quantiles, …)
Deterministic approximate inference, in particular variational Bayes:
find an analytical proxy q(θ) that is maximally similar to the true posterior p(θ | y)
inspect distribution statistics of q(θ) (e.g., mean, quantiles, intervals, …)
There are two approaches to approximate inference. They have complementary strengths and weaknesses.
Sampling: asymptotically exact, but computationally expensive, with tricky engineering concerns.
Variational Bayes: often insightful and fast, but often hard work to derive, and converges to local minima.
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
The Laplace approximation provides a way of approximating a density whose normalization constant we cannot evaluate, by fitting a Normal distribution to its mode.
The Laplace approximation
Pierre-Simon Laplace (1749–1827), French mathematician and astronomer

p(z) = (1/Z) × f(z), where Z is the normalization constant (unknown) and f(z) is the main part of the density (easy to evaluate).

This is exactly the situation we face in Bayesian inference:

p(θ | y) = (1/p(y)) × p(y, θ), where the model evidence p(y) is unknown and the joint density p(y, θ) is easy to evaluate.

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Given a model with parameters θ, the Laplace approximation reduces to a simple three-step procedure:
Applying the Laplace approximation
• Find the mode of the log-joint: θ* = argmax_θ ln p(y, θ)
• Evaluate the curvature of the log-joint at the mode: Λ = −∇∇ ln p(y, θ) |_{θ = θ*}
• We obtain a Gaussian approximation: q(θ) = N(θ; θ*, Λ⁻¹), with mean θ* and covariance Λ⁻¹
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
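A minimal sketch of these three steps for a one-dimensional toy posterior (Gaussian likelihood with known noise, Gaussian prior); the data and hyperparameters are invented, and the curvature is obtained by finite differences rather than an analytical Hessian:

```python
import numpy as np
from scipy import optimize, stats

# Laplace approximation for a 1-D posterior (toy model).
y = np.array([1.1, 0.7, 1.3, 0.9])
sigma, mu0, sigma0 = 0.5, 0.0, 2.0

def neg_log_joint(theta):
    return -(stats.norm.logpdf(y, theta, sigma).sum()
             + stats.norm.logpdf(theta, mu0, sigma0))

# 1) find the mode of the log-joint
theta_map = optimize.minimize_scalar(neg_log_joint).x

# 2) curvature of the negative log-joint at the mode (central finite difference)
h = 1e-4
curv = (neg_log_joint(theta_map + h) - 2 * neg_log_joint(theta_map)
        + neg_log_joint(theta_map - h)) / h**2

# 3) Gaussian approximation: N(theta_map, curv^{-1})
print(theta_map, 1.0 / curv)        # approximate posterior mean and variance
```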
The Laplace approximation
[Figure: a log-joint density and the Gaussian approximation fitted at its mode, shown at successive steps of the procedure]
Limitations of the Laplace approximation
ignores global properties of the posterior
becomes brittle when the posterior is multimodal
only directly applicable to real-valued parameters
The Laplace approximation is often too strong a simplification.
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational Bayesian inference
[Figure: within a hypothesis class of candidate densities, the best proxy q(θ) is the one with the smallest divergence from the true posterior p(θ | y)]
Variational Bayesian (VB) inference generalizes the idea behind the Laplace approximation. In VB, we wish to find an approximate density that is maximally similar to the true posterior.
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational Bayesian inference is based on variational calculus.
Variational calculus
Standard calculus (Newton, Leibniz, and others):
• functions f(x)
• derivatives df/dx
Example: maximize the likelihood p(y | θ) with respect to the parameter θ.
Variational calculus (Euler, Lagrange, and others):
• functionals F[f]
• derivatives dF/df
Example: maximize the entropy H[p] with respect to a probability distribution p(x).
Leonhard Euler (1707–1783), Swiss mathematician, 'Elementa Calculi Variationum'
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational calculus lends itself nicely to approximate Bayesian inference.
Variational calculus and the free energy
ln p(y) = KL[ q(θ) || p(θ | y) ] + F(q, y)
(the divergence between q(θ) and the true posterior, plus the free energy)
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational calculus and the free energy
Maximizing the free energy F(q, y) is equivalent to:
• minimizing the divergence KL[ q || p ]
• tightening F(q, y) as a lower bound on the log model evidence
In summary, the log model evidence can be expressed as:
ln p(y) = KL[ q || p ] + F(q, y)
divergence KL[ q || p ]: unknown; free energy F(q, y): easy to evaluate for a given q
[Figure: as q is optimized from initialization to convergence, the free energy F(q, y) rises towards ln p(y) while the divergence KL[ q || p ] shrinks]
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Computing the free energy
We can decompose the free energy as follows:
F(q, y) = ⟨ ln p(y, θ) ⟩_q + H[q]
(the expected log-joint under q plus the Shannon entropy of q)
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
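This decomposition can be evaluated numerically. The sketch below (a toy Gaussian model and a Gaussian q; all values invented) estimates F by Monte Carlo: the expected log-joint from samples of q plus the analytic entropy of q:

```python
import numpy as np
from scipy import stats

# Free energy F(q,y) = <ln p(y,θ)>_q + H[q], estimated by Monte Carlo (toy model).
y = np.array([1.1, 0.7, 1.3, 0.9])
sigma, mu0, sigma0 = 0.5, 0.0, 2.0

def log_joint(theta):
    return (stats.norm.logpdf(y[:, None], theta, sigma).sum(axis=0)
            + stats.norm.logpdf(theta, mu0, sigma0))

def free_energy(q_mean, q_sd, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(q_mean, q_sd, size=n_samples)     # samples from q(θ)
    expected_log_joint = log_joint(theta).mean()
    entropy = stats.norm.entropy(q_mean, q_sd)            # Shannon entropy of q
    return expected_log_joint + entropy

print(free_energy(1.0, 0.25))   # F lower-bounds ln p(y); VB maximizes it over q
```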
When inverting models with several parameters, a common way of restricting the class of approximate posteriors is to consider those posteriors that factorize into independent partitions, q(θ) = ∏ᵢ q(θᵢ), where q(θᵢ) is the approximate posterior for the i-th subset of parameters.
The mean-field assumption
[Figure: a joint density over θ₁ and θ₂ approximated by the factorized product q(θ₁) q(θ₂)]
Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
In summary: suppose the densities q(θⱼ), j ≠ i, are kept fixed. Then the approximate posterior q(θᵢ) that maximizes the free energy is given by:

ln q*(θᵢ) = ⟨ ln p(y, θ) ⟩_{q(θ∖θᵢ)} + const

Therefore: q*(θᵢ) ∝ exp( ⟨ ln p(y, θ) ⟩_{q(θ∖θᵢ)} ), where the expectation is taken under the approximate posteriors of all the other partitions.
Variational algorithm under the mean-field assumption
This implies a straightforward algorithm for variational inference:
Initialize all approximate posteriors q(θᵢ), e.g., by setting them to their priors.
Cycle over the parameters, revising each given the current estimates of the others.
Loop until convergence.
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Typical strategies in variational inference
• no mean-field assumption, no parametric assumptions: variational inference = exact inference
• no mean-field assumption, parametric assumptions: fixed-form optimization of moments
• mean-field assumption, no parametric assumptions: iterative free-form variational optimization
• mean-field assumption, parametric assumptions: iterative fixed-form variational optimization
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
We are given a univariate dataset y = {y₁, …, yₙ}, which we model with a simple univariate Gaussian distribution. We wish to infer its mean μ and precision τ.

Although in this case a closed-form solution exists*, we shall pretend it does not. Instead, we consider approximations that satisfy the mean-field assumption: q(μ, τ) = q(μ) q(τ).
Example: variational density estimation
[Figure: graphical model with the mean μ and precision τ as parents of the observed data yᵢ, i = 1…n]
* Section 10.1.3, Bishop (2006) PRML. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Recurring expressions in Bayesian inference
• Univariate normal distribution
• Multivariate normal distribution
• Gamma distribution
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational density estimation: mean

ln q*(μ) = ⟨ ln p(y, μ, τ) ⟩_{q(τ)} + const, which by reinstatement of the normalization constant (by inspection) gives q*(μ) = N(μ; μₙ, λₙ⁻¹), with

μₙ = ( n ȳ ⟨τ⟩_{q(τ)} + λ₀ μ₀ ⟨τ⟩_{q(τ)} ) / λₙ = ( λ₀ μ₀ + n ȳ ) / ( λ₀ + n )
λₙ = ( λ₀ + n ) ⟨τ⟩_{q(τ)}
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Variational density estimation: precision

ln q*(τ) = ⟨ ln p(y, μ, τ) ⟩_{q(μ)} + const, which gives q*(τ) = Gamma(τ; aₙ, bₙ), with

aₙ = a₀ + (n + 1)/2
bₙ = b₀ + (λ₀/2) ⟨ (μ − μ₀)² ⟩_{q(μ)} + (1/2) ⟨ Σᵢ (yᵢ − μ)² ⟩_{q(μ)}
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
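These two updates can be cycled until convergence. The sketch below implements the coordinate-wise scheme on synthetic data; the hyperparameters μ₀, λ₀, a₀, b₀ and the number of sweeps are illustrative choices:

```python
import numpy as np

# Mean-field VB for a univariate Gaussian (mean μ, precision τ); cf. Bishop (2006), 10.1.3.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=50)
n, ybar = y.size, y.mean()

mu0, lambda0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3      # priors (illustrative)
a_n = a0 + (n + 1) / 2                            # fixed by the q(τ) update
E_tau = 1.0                                       # initialize <τ>

for _ in range(100):                              # cycle the updates (fixed number of sweeps)
    # update q(μ) = N(μ; mu_n, 1/lambda_n)
    mu_n = (lambda0 * mu0 + n * ybar) / (lambda0 + n)
    lambda_n = (lambda0 + n) * E_tau
    # expectations under q(μ)
    E_dev_prior = (mu_n - mu0) ** 2 + 1.0 / lambda_n
    E_dev_data = np.sum((y - mu_n) ** 2) + n / lambda_n
    # update q(τ) = Gamma(τ; a_n, b_n)
    b_n = b0 + 0.5 * lambda0 * E_dev_prior + 0.5 * E_dev_data
    E_tau = a_n / b_n

print(mu_n, E_tau)     # approximate posterior mean of μ and expected precision τ
```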
Variational density estimation: illustration
Bishop (2006) PRML, p. 472
[Figure: successive factorized approximations q(θ) converging to the optimal q*(θ) around the true posterior p(θ | y)]
Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
Markov Chain Monte Carlo (MCMC) sampling
• A general framework for sampling from a large class of distributions
• Scales well with the dimensionality of the sample space
• Asymptotically convergent
[Figure: a Markov chain of samples θ⁰, θ¹, θ², θ³, …, θᴺ generated from a proposal distribution q(θ); the samples approximate the posterior distribution p(θ | y)]
Markov chain properties
• Transition probabilities – homogeneous/invariant
• Invariance
• Ergodicity
p(θ_{t+1} | θ₁, …, θ_t) = p(θ_{t+1} | θ_t) = T_t(θ_{t+1}, θ_t)
[Diagram: chain properties relevant for MCMC – homogeneous, invariant, ergodic, reversible]
Invariance: p*(θ) = Σ_{θ'} T(θ', θ) p*(θ')
Metropolis-Hastings algorithm
• Initialize θ⁽¹⁾ at step 1 – for example, by sampling from the prior.
• At step t, sample a candidate θ* from the proposal distribution: θ* ∼ q(θ* | θ⁽ᵗ⁾)
• Accept θ* with probability:
  A(θ*, θ⁽ᵗ⁾) = min( 1, [ p(θ* | y) q(θ⁽ᵗ⁾ | θ*) ] / [ p(θ⁽ᵗ⁾ | y) q(θ* | θ⁽ᵗ⁾) ] )
• Metropolis – for a symmetric proposal distribution the proposal terms cancel:
  A(θ*, θ⁽ᵗ⁾) = min( 1, p(θ* | y) / p(θ⁽ᵗ⁾ | y) )
Bishop (2006) PRML, p. 539
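A minimal random-walk Metropolis sketch for a one-dimensional toy posterior (Gaussian likelihood and prior; data, step size, and chain length are invented). The Gaussian random-walk proposal is symmetric, so the q terms cancel, and the unknown normalizer p(y) cancels in the posterior ratio:

```python
import numpy as np
from scipy import stats

y = np.array([1.1, 0.7, 1.3, 0.9])
sigma, mu0, sigma0 = 0.5, 0.0, 2.0

def log_post_unnorm(theta):                       # log p(y|θ) + log p(θ); p(y) cancels
    return (stats.norm.logpdf(y, theta, sigma).sum()
            + stats.norm.logpdf(theta, mu0, sigma0))

rng = np.random.default_rng(1)
theta = 0.0                                       # e.g. a draw from the prior
samples = []
for t in range(5000):
    theta_star = theta + rng.normal(scale=0.5)    # symmetric proposal q(θ*|θ_t)
    log_A = log_post_unnorm(theta_star) - log_post_unnorm(theta)
    if np.log(rng.uniform()) < log_A:             # accept with probability min(1, A)
        theta = theta_star
    samples.append(theta)

print(np.mean(samples[1000:]))                    # posterior mean after burn-in
```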
Gibbs Sampling Algorithm
• Special case of Metropolis-Hastings: the acceptance probability is always 1.
• Target joint density: p(θ) = p(θ₁, θ₂, …, θₙ)
• At step t, sample each parameter from its conditional distribution given the current values of all the others, e.g.
  θ₁⁽ᵗ⁺¹⁾ ∼ p(θ₁ | θ₂⁽ᵗ⁾, …, θₙ⁽ᵗ⁾)
  θ₂⁽ᵗ⁺¹⁾ ∼ p(θ₂ | θ₁⁽ᵗ⁺¹⁾, θ₃⁽ᵗ⁾, …, θₙ⁽ᵗ⁾), and so on.
• Blocked sampling: update groups of parameters jointly, e.g.
  (θ₁⁽ᵗ⁺¹⁾, θ₂⁽ᵗ⁺¹⁾) ∼ p(θ₁, θ₂ | θ₃⁽ᵗ⁾, …, θₙ⁽ᵗ⁾)
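As a sketch, Gibbs sampling for a Gaussian with unknown mean and precision, using the conjugate full conditionals; it assumes a normal-gamma prior in which the prior precision of μ scales with τ, and the data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=0.5, size=50)
n, ybar = y.size, y.mean()
mu0, lambda0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3       # normal-gamma hyperparameters

mu, tau = 0.0, 1.0
chain = []
for t in range(5000):
    # sample μ | τ, y  ~  N( (λ0 μ0 + n ȳ)/(λ0 + n), 1/((λ0 + n) τ) )
    prec = (lambda0 + n) * tau
    mean = (lambda0 * mu0 + n * ybar) / (lambda0 + n)
    mu = rng.normal(mean, 1.0 / np.sqrt(prec))
    # sample τ | μ, y  ~  Gamma(a0 + (n+1)/2, rate = b0 + ½ λ0 (μ-μ0)² + ½ Σ(yᵢ-μ)²)
    a = a0 + (n + 1) / 2
    b = b0 + 0.5 * lambda0 * (mu - mu0) ** 2 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(shape=a, scale=1.0 / b)       # numpy uses scale = 1/rate
    chain.append((mu, tau))

chain = np.array(chain)
print(chain[1000:].mean(axis=0))                  # posterior means of (μ, τ) after burn-in
```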
Posterior analysis from MCMC
Obtain independent samples:
• Generate samples based on MCMC sampling.
• Discard the initial "burn-in" period samples to remove dependence on the initialization.
• Thinning: select every m-th sample to reduce correlation.
• Inspect sample statistics (e.g., histogram, sample quantiles, …).
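A short sketch of this post-processing; the chain here is a synthetic stand-in, and the burn-in length and thinning factor are placeholder choices that would in practice be guided by diagnostics of a real sampler:

```python
import numpy as np

# Post-processing an MCMC chain: discard burn-in, thin, summarize.
rng = np.random.default_rng(3)
chain = np.cumsum(rng.normal(size=6000)) * 0.01 + rng.normal(size=6000)  # toy correlated chain

burn_in, thin = 1000, 5
kept = chain[burn_in::thin]               # drop burn-in, keep every 5th sample

print(kept.mean(), np.percentile(kept, [2.5, 97.5]))   # posterior mean and 95% interval
```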
Summary
Bayesian Inference
Approximate Inference
Variational Bayes
MCMC Sampling
Model Selection
References
• Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
• Barber, D. Bayesian Reasoning and Machine Learning.
• MacKay, D. Information Theory, Inference, and Learning Algorithms.
• Murphy, K.P. Conjugate Bayesian analysis of the Gaussian distribution.
• Videolectures.net – Bayesian inference, MCMC, Variational Bayes.
• Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.
• Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.
• Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.
• Gamerman, D. and Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall/CRC Texts in Statistical Science.
• Statistical Parametric Mapping: The Analysis of Functional Brain Images.
Thank You
http://www.translationalneuromodeling.org/tapas/