Page 1: Bayesian Inference

Bayesian Inference
Sudhir Shankar Raman
Translational Neuromodeling Unit, UZH & ETH

The Reverend Thomas Bayes (1702-1761)

With many thanks for materials to: Klaas Enno Stephan & Kay H. Brodersen

Page 2: Bayesian Inference

Why do I need to learn about Bayesian stats?

Because SPM is getting more and more Bayesian:

• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction

Page 3: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 4: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 5: Bayesian Inference

Classical and Bayesian statistics

• Flexibility in modelling
• Incorporating prior information
• Posterior probability of effect
• Options for model comparison

The classical (frequentist) approach is based on the probability of observing the data y given no effect ($H_0$: $\theta = 0$), i.e. $p(y \mid H_0)$. The Bayesian approach instead works with the posterior $p(\theta \mid y)$, formed from the prior $p(\theta)$ and the joint density $p(y, \theta)$.
p-value: probability of obtaining data at least as extreme as those observed, assuming there is no effect. If small, reject the null hypothesis that there is no effect.

• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.

Bayesian Inference

Statistical analysis and the illusion of objectivity. James O. Berger, Donald A. Berry

Page 6: Bayesian Inference

Bayes‘ Theorem

Reverend Thomas Bayes (1702 – 1761)

"Bayes' Theorem describes how an ideally rational person processes information." – Wikipedia

Observed Data + Prior Beliefs → Posterior Beliefs

Page 7: Bayesian Inference

Bayes' Theorem

Given data y and parameters θ, the conditional probabilities are:

$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)}, \qquad p(y \mid \theta) = \frac{p(y, \theta)}{p(\theta)}$$

Eliminating $p(y, \theta)$ gives Bayes' rule:

$$\underbrace{p(\theta \mid y)}_{\text{Posterior}} = \frac{\overbrace{p(y \mid \theta)}^{\text{Likelihood}}\; \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(y)}_{\text{Evidence}}}$$

Page 8: Bayesian Inference

Bayesian statistics

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} \;\propto\; \underbrace{p(y \mid \theta)}_{\text{likelihood (new data)}} \cdot \underbrace{p(\theta)}_{\text{prior (prior knowledge)}}$$

Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities.

Priors can be of different sorts: empirical, principled or shrinkage priors, uninformative.

The "posterior" probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.
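To make the proportionality concrete, here is a minimal numerical sketch (not part of the original slides) that evaluates posterior ∝ likelihood × prior on a grid of parameter values for a toy Gaussian-mean problem and then normalizes by the evidence; all data and prior settings are made up for illustration.

```python
import numpy as np
from scipy import stats

# Toy problem (illustrative): infer the mean theta of a Gaussian with known sd = 1.
y = np.array([0.8, 1.2, 0.5, 1.9, 1.1])           # observed data
theta = np.linspace(-3.0, 5.0, 2001)              # grid over the parameter

prior = stats.norm.pdf(theta, loc=0.0, scale=2.0)                                # p(theta)
likelihood = np.prod(stats.norm.pdf(y[:, None], loc=theta, scale=1.0), axis=0)   # p(y|theta)

unnormalized = likelihood * prior                        # p(y|theta) * p(theta)
evidence = unnormalized.sum() * (theta[1] - theta[0])    # p(y), Riemann sum over the grid
posterior = unnormalized / evidence                      # p(theta|y)

print("posterior mean:", (theta * posterior).sum() * (theta[1] - theta[0]))
```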

Page 9: Bayesian Inference

Bayes in motion - an animation

Page 10: Bayesian Inference

Principles of Bayesian inference

• Observation of data y
• Formulation of a generative model: likelihood function $p(y \mid \theta)$ and prior distribution $p(\theta)$
• Model inversion - update of beliefs based upon observations, given a prior state of knowledge:

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$

The mode of the posterior is the maximum a posteriori (MAP) estimate; the mode of the likelihood is the maximum likelihood (ML) estimate.

Page 11: Bayesian Inference

Conjugate Priors

Prior and posterior have the same form:

$$\underbrace{p(\theta \mid y)}_{\text{same form!}} = \frac{p(y \mid \theta)\; \overbrace{p(\theta)}^{\text{same form!}}}{p(y)}$$

The posterior then has an analytical expression. Conjugate priors exist for all exponential-family likelihoods. Example – Gaussian likelihood with a Gaussian prior over the mean.

Page 12: Bayesian Inference

Gaussian Model

Likelihood & prior:

$$p(y \mid \mu) = N(y;\, \mu,\, \lambda_e^{-1}) \qquad p(\mu) = N(\mu;\, \mu_p,\, \lambda_p^{-1})$$

Posterior:

$$p(\mu \mid y) = N(\mu;\, \mu_{\text{post}},\, \lambda_{\text{post}}^{-1})$$

with

$$\lambda_{\text{post}} = \lambda_e + \lambda_p, \qquad \mu_{\text{post}} = \frac{\lambda_e}{\lambda_{\text{post}}}\, y + \frac{\lambda_p}{\lambda_{\text{post}}}\, \mu_p$$

Relative precision weighting: the posterior precision is the sum of the prior and data precisions, and the posterior mean is a precision-weighted average of the data y and the prior mean $\mu_p$.
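A minimal numerical sketch of this precision-weighted update (values are illustrative, not from the slides):

```python
# Precision-weighted combination of a Gaussian prior and one Gaussian observation.
lambda_e = 4.0    # data (likelihood) precision
lambda_p = 1.0    # prior precision
mu_p     = 0.0    # prior mean
y        = 2.0    # observed data point

lambda_post = lambda_e + lambda_p                                         # posterior precision
mu_post = (lambda_e / lambda_post) * y + (lambda_p / lambda_post) * mu_p  # posterior mean

print(mu_post, 1.0 / lambda_post)   # 1.6 and 0.2: pulled towards the (more precise) data
```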

Page 13: Bayesian Inference

Bayesian regression: univariate case

Univariate linear model:

$$y = \theta x + e$$

Normal densities:

$$p(\theta) = N(\theta;\, \mu_p,\, \sigma_p^2) \qquad p(y \mid \theta) = N(y;\, \theta x,\, \sigma_e^2) \qquad p(\theta \mid y) = N(\theta;\, \mu_{\theta \mid y},\, \sigma_{\theta \mid y}^2)$$

Relative precision weighting:

$$\frac{1}{\sigma_{\theta \mid y}^2} = \frac{x^2}{\sigma_e^2} + \frac{1}{\sigma_p^2}, \qquad \mu_{\theta \mid y} = \sigma_{\theta \mid y}^2 \left( \frac{x\, y}{\sigma_e^2} + \frac{\mu_p}{\sigma_p^2} \right)$$
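The same idea for the univariate regression model, sketched numerically with made-up values for x, y, the noise variance, and the prior:

```python
# Univariate Bayesian regression y = theta * x + e with known noise variance.
x, y = 1.5, 3.2             # design value and observation (illustrative)
sigma2_e = 0.5              # noise variance
mu_p, sigma2_p = 0.0, 4.0   # prior mean and variance of theta

prec_post = x**2 / sigma2_e + 1.0 / sigma2_p          # posterior precision
sigma2_post = 1.0 / prec_post
mu_post = sigma2_post * (x * y / sigma2_e + mu_p / sigma2_p)

print(mu_post, sigma2_post)
```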

Page 14: Bayesian Inference

Bayesian GLM: multivariate case

General Linear Model:

$$y = X\theta + e$$

Normal densities:

$$p(\theta) = N(\theta;\, \eta_p,\, C_p) \qquad p(y \mid \theta) = N(y;\, X\theta,\, C_e) \qquad p(\theta \mid y) = N(\theta;\, \eta_{\theta \mid y},\, C_{\theta \mid y})$$

$$C_{\theta \mid y} = \left( X^T C_e^{-1} X + C_p^{-1} \right)^{-1}, \qquad \eta_{\theta \mid y} = C_{\theta \mid y} \left( X^T C_e^{-1} y + C_p^{-1} \eta_p \right)$$

One step if $C_e$ is known. Otherwise, define a conjugate prior or perform iterative estimation with EM.
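A sketch of the multivariate posterior update with simulated data; the design, noise, and prior settings are arbitrary illustrations, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated GLM: y = X @ theta + e  (illustrative sizes and values)
n, p = 50, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -0.5, 2.0])
C_e = 0.5 * np.eye(n)                               # known observation noise covariance
y = X @ theta_true + rng.multivariate_normal(np.zeros(n), C_e)

eta_p = np.zeros(p)                                 # prior mean
C_p = 10.0 * np.eye(p)                              # weakly informative prior covariance

# Posterior covariance and mean, as in the formulas above
C_e_inv, C_p_inv = np.linalg.inv(C_e), np.linalg.inv(C_p)
C_post = np.linalg.inv(X.T @ C_e_inv @ X + C_p_inv)
eta_post = C_post @ (X.T @ C_e_inv @ y + C_p_inv @ eta_p)

print(eta_post)   # close to theta_true with this much data
```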

Page 15: Bayesian Inference

An intuitive example

[Figure: prior, likelihood, and posterior densities over the two parameters $\theta_1$ and $\theta_2$.]

Page 16: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 17: Bayesian Inference

Bayesian model selection (BMS)

Given competing hypotheses on structure & functional mechanisms of a system, which model is the best?

For which model m does p(y|m) become maximal?

Which model represents thebest balance between model fit and model complexity?

Pitt & Myung (2002), TICS

Page 18: Bayesian Inference

Model evidence:

$$p(y \mid m) = \int p(y \mid \theta, m)\; p(\theta \mid m)\; \mathrm{d}\theta$$

Kass and Raftery (1995), Penny et al. (2004) NeuroImage

Bayesian model selection (BMS)

Bayes' rule for a given model m:

$$p(\theta \mid y, m) = \frac{p(y \mid \theta, m)\; p(\theta \mid m)}{p(y \mid m)}$$

The model evidence $p(y \mid m)$:

• accounts for both accuracy and complexity of the model
• allows for inference about structure (generalizability) of the model

Model comparison via the Bayes factor:

$$BF = \frac{p(y \mid m_1)}{p(y \mid m_2)}, \qquad \frac{p(m_1 \mid y)}{p(m_2 \mid y)} = \frac{p(y \mid m_1)}{p(y \mid m_2)} \cdot \frac{p(m_1)}{p(m_2)}$$

Model averaging:

$$p(\theta \mid y) = \sum_m p(\theta \mid y, m)\; p(m \mid y)$$

BF10          Evidence against H0
1 to 3        Not worth more than a bare mention
3 to 20       Positive
20 to 150     Strong
> 150         Decisive
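In practice, model evidences are usually available on the log scale (e.g. via the free energy), so Bayes factors and posterior model probabilities are computed from log evidences. A minimal sketch with made-up log-evidence values:

```python
import numpy as np

log_ev = np.array([-215.3, -218.9])       # hypothetical ln p(y|m) for two models

log_BF = log_ev[0] - log_ev[1]            # ln BF_12
print("BF_12 =", np.exp(log_BF))          # ~36.6: 'strong' evidence for model 1

# Posterior model probabilities under a flat model prior (softmax of the log evidences)
post_m = np.exp(log_ev - log_ev.max())
post_m /= post_m.sum()
print(post_m)
```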

Page 19: Bayesian Inference

Bayesian model selection (BMS)

Various approximations:

• Akaike Information Criterion (AIC) – Akaike, 1974

• Bayesian Information Criterion (BIC) – Schwarz, 1978

• Negative free energy (F) – a by-product of Variational Bayes

$$\text{AIC:} \quad \ln p(D) \approx \ln p(D \mid \theta_{ML}) - M$$

$$\text{BIC:} \quad \ln p(D) \approx \ln p(D \mid \theta_{MAP}) - \tfrac{1}{2}\, M \ln N$$

where M is the number of parameters and N the number of data points.
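A minimal sketch of these two log-evidence approximations, assuming the log-likelihood at a single point estimate, the number of parameters M, and the number of data points N are already known (all numbers are illustrative):

```python
import numpy as np

log_lik = -120.4        # log-likelihood at the point estimate (theta_ML / theta_MAP), illustrative
M, N = 5, 200           # number of parameters, number of data points

AIC = log_lik - M                      # AIC approximation to ln p(D)
BIC = log_lik - 0.5 * M * np.log(N)    # BIC approximation to ln p(D)
print(AIC, BIC)
```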

Page 20: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 21: Bayesian Inference

Approximate Bayesian inference

Bayesian inference formalizes model inversion, the process of passing from a prior to a posterior in light of data:

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}}\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\int p(y, \theta)\, \mathrm{d}\theta}_{\text{marginal likelihood (model evidence)}}}$$

In practice, evaluating the posterior is usually difficult because we cannot easily evaluate the model evidence $p(y)$, especially when:

• the problem is high-dimensional or the densities have a complex form
• analytical solutions are not available
• numerical integration is too expensive

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 22: Bayesian Inference

Approximate Bayesian inference

There are two approaches to approximate inference. They have complementary strengths and weaknesses.

Stochastic approximate inference, in particular sampling:
• design an algorithm that draws samples from $p(\theta \mid y)$
• inspect sample statistics (e.g., histogram, sample quantiles, …)
• asymptotically exact, but computationally expensive, with tricky engineering concerns

Deterministic approximate inference, in particular variational Bayes:
• find an analytical proxy $q(\theta)$ that is maximally similar to $p(\theta \mid y)$
• inspect distribution statistics of $q(\theta)$ (e.g., mean, quantiles, intervals, …)
• often insightful and fast, but often hard work to derive, and converges to local minima

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 23: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 24: Bayesian Inference

The Laplace approximation provides a way of approximating a density whose normalization constant we cannot evaluate, by fitting a Normal distribution to its mode.

The Laplace approximation

Pierre-Simon Laplace (1749 – 1827), French mathematician and astronomer

$$p(z) = \frac{1}{Z} \times f(z)$$

where $Z$ is the normalization constant (unknown) and $f(z)$ is the main part of the density (easy to evaluate).

This is exactly the situation we face in Bayesian inference:

$$p(\theta \mid y) = \frac{1}{p(y)} \times p(y, \theta)$$

where $p(y)$ is the model evidence (unknown) and $p(y, \theta)$ is the joint density (easy to evaluate).

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 25: Bayesian Inference

Applying the Laplace approximation

Given a model with parameters $\theta$, the Laplace approximation reduces to a simple three-step procedure:

• Find the mode of the log-joint: $\theta^* = \arg\max_\theta\, \ln p(y, \theta)$
• Evaluate the curvature of the log-joint at the mode: $\Lambda = -\,\nabla\nabla \ln p(y, \theta)\,\big|_{\theta = \theta^*}$
• We obtain a Gaussian approximation $q(\theta) = N(\theta;\, \mu,\, \Sigma)$ with $\mu = \theta^*$ and $\Sigma = \Lambda^{-1}$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
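A sketch of the three-step recipe for a one-dimensional toy problem (a Bernoulli likelihood with a Gaussian prior on the log-odds; the model and all numbers are made up for illustration), using scipy to find the mode and a finite difference for the curvature:

```python
import numpy as np
from scipy import optimize, stats

def log_joint(theta):
    # Toy log-joint ln p(y, theta): 7 successes in 10 Bernoulli trials with a
    # logistic link, plus a Gaussian prior on the log-odds theta.
    k, n = 7, 10
    log_lik = k * theta - n * np.log1p(np.exp(theta))
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=2.0)
    return log_lik + log_prior

# Step 1: find the mode of the log-joint
theta_map = optimize.minimize_scalar(lambda t: -log_joint(t)).x

# Step 2: evaluate the curvature (negative second derivative) at the mode
h = 1e-3
curvature = -(log_joint(theta_map + h) - 2 * log_joint(theta_map) + log_joint(theta_map - h)) / h**2

# Step 3: the Gaussian approximation q(theta) = N(theta; theta_map, 1/curvature)
print("Laplace approximation: mean", theta_map, "variance", 1.0 / curvature)
```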

Page 26: Bayesian Inference

The Laplace approximation

[Figure: four panels showing a log-joint density and its Gaussian (Laplace) approximation.]

Limitations of the Laplace approximation

• ignores global properties of the posterior
• becomes brittle when the posterior is multimodal
• only directly applicable to real-valued parameters

The Laplace approximation is often too strong a simplification.

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 27: Bayesian Inference

Variational Bayesian inference

[Diagram: within a hypothesis class of candidate densities, the best proxy $q(\theta)$ minimizes its divergence from the true posterior $p(\theta \mid y)$.]

Variational Bayesian (VB) inference generalizes the idea behind the Laplace approximation. In VB, we wish to find an approximate density that is maximally similar to the true posterior.

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 28: Bayesian Inference

Variational Bayesian inference is based on variational calculus.

Variational calculus

Standard calculus (Newton, Leibniz, and others):

• functions $f: x \mapsto f(x)$
• derivatives $\frac{\mathrm{d}f}{\mathrm{d}x}$

Example: maximize the likelihood expression $p(y \mid \theta)$ w.r.t. $\theta$

Variational calculus (Euler, Lagrange, and others):

• functionals $F: f \mapsto F[f]$
• derivatives $\frac{\mathrm{d}F}{\mathrm{d}f}$

Example: maximize the entropy $H[p]$ w.r.t. a probability distribution $p(x)$

Leonhard Euler (1707 – 1783), Swiss mathematician, 'Elementa Calculi Variationum'

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 29: Bayesian Inference

Variational calculus lends itself nicely to approximate Bayesian inference.

Variational calculus and the free energy

$$\ln p(y) = \underbrace{\mathrm{KL}\big[\,q(\theta)\,\big\|\,p(\theta \mid y)\,\big]}_{\text{divergence between } q(\theta) \text{ and } p(\theta \mid y)} + \underbrace{F(q, y)}_{\text{free energy}}$$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 30: Bayesian Inference

Variational calculus and the free energy

Maximizing $F(q, y)$ is equivalent to:

• minimizing $\mathrm{KL}\big[q(\theta)\,\|\,p(\theta \mid y)\big]$
• tightening $F(q, y)$ as a lower bound to the log model evidence

In summary, the log model evidence can be expressed as:

$$\ln p(y) = \underbrace{\mathrm{KL}\big[\,q\,\|\,p\,\big]}_{\text{divergence (unknown)}} + \underbrace{F(q, y)}_{\text{free energy (easy to evaluate for a given } q\text{)}}$$

Because the KL divergence is non-negative, $F(q, y) \le \ln p(y)$: from initialization to convergence, the free energy rises towards the log model evidence.

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 31: Bayesian Inference

Computing the free energy

We can decompose the free energy as follows:

$$F(q, y) = \underbrace{\big\langle \ln p(y, \theta) \big\rangle_q}_{\text{expected log-joint}} + \underbrace{H[q]}_{\text{Shannon entropy}}$$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
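This decomposition can be checked numerically on a toy model. In the sketch below (not from the slides), a single observation y ~ N(θ, 1) with prior θ ~ N(0, 1) is used so that the exact log evidence is known; the expected log-joint is estimated by sampling from a Gaussian q, the entropy of q is added, and the result never exceeds ln p(y), with equality when q is the exact posterior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = 1.3                                         # single observation (illustrative)

def log_joint(theta):
    # ln p(y, theta) for the toy model y ~ N(theta, 1), theta ~ N(0, 1)
    return stats.norm.logpdf(y, loc=theta, scale=1.0) + stats.norm.logpdf(theta, loc=0.0, scale=1.0)

def free_energy(q_mean, q_sd, n_samples=200_000):
    """F(q, y) = <ln p(y, theta)>_q + H[q], estimated by Monte Carlo."""
    theta = rng.normal(q_mean, q_sd, size=n_samples)
    return log_joint(theta).mean() + stats.norm.entropy(loc=q_mean, scale=q_sd)

print("ln p(y)             :", stats.norm.logpdf(y, loc=0.0, scale=np.sqrt(2.0)))
print("F at exact posterior:", free_energy(y / 2.0, np.sqrt(0.5)))   # matches ln p(y)
print("F at a poor q       :", free_energy(0.0, 2.0))                # strictly smaller
```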

Page 32: Bayesian Inference

The mean-field assumption

When inverting models with several parameters, a common way of restricting the class of approximate posteriors is to consider those posteriors that factorize into independent partitions,

$$q(\theta) = \prod_i q(\theta_i),$$

where $q(\theta_i)$ is the approximate posterior for the $i$-th subset of parameters.

[Figure: a joint density over $\theta_1$ and $\theta_2$ approximated by the factorized density $q(\theta_1)\, q(\theta_2)$.]

Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 33: Bayesian Inference

Variational algorithm under the mean-field assumption

In summary: suppose the densities $q(\theta_{j \neq i})$ are kept fixed. Then the approximate posterior $q(\theta_i)$ that maximizes the free energy $F(q, y)$ is given by

$$\ln q^*(\theta_i) = \big\langle \ln p(y, \theta) \big\rangle_{q(\theta_{\setminus i})} + \text{const.}$$

Therefore:

$$q^*(\theta_i) \propto \exp\Big( \big\langle \ln p(y, \theta) \big\rangle_{q(\theta_{\setminus i})} \Big)$$

This implies a straightforward algorithm for variational inference:

• Initialize all approximate posteriors $q(\theta_i)$, e.g., by setting them to their priors.
• Cycle over the parameters, revising each $q(\theta_i)$ given the current estimates of the others.
• Loop until convergence.

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 34: Bayesian Inference

Typical strategies in variational inference

• no mean-field assumption, no parametric assumptions: variational inference = exact inference
• no mean-field assumption, parametric assumptions: fixed-form optimization of moments
• mean-field assumption, no parametric assumptions: iterative free-form variational optimization
• mean-field assumption, parametric assumptions: iterative fixed-form variational optimization

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 35: Bayesian Inference

Example: variational density estimation

We are given a univariate dataset $y = \{y_1, \dots, y_n\}$ which we model by a simple univariate Gaussian distribution. We wish to infer on its mean $\mu$ and precision $\tau$:

$$p(y \mid \mu, \tau) = \prod_{i=1}^{n} N(y_i;\, \mu,\, \tau^{-1}), \qquad p(\mu \mid \tau) = N(\mu;\, \mu_0,\, (\lambda_0 \tau)^{-1}), \qquad p(\tau) = \text{Gamma}(\tau;\, a_0,\, b_0)$$

[Graphical model: mean $\mu$ and precision $\tau$ generate the data $y_i$, $i = 1 \dots n$.]

Although in this case a closed-form solution exists*, we shall pretend it does not. Instead, we consider approximations that satisfy the mean-field assumption:

$$q(\mu, \tau) = q(\mu)\, q(\tau)$$

* Section 10.1.3, Bishop (2006) PRML. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 36: Bayesian Inference

Recurring expressions in Bayesian inference

Univariate normal distribution: $N(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$

Multivariate normal distribution: $N(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left( -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$

Gamma distribution: $\text{Gamma}(\tau;\, a, b) = \frac{b^a}{\Gamma(a)}\, \tau^{a-1} e^{-b\tau}$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 37: Bayesian Inference

Variational density estimation: mean

$$\ln q^*(\mu) = \big\langle \ln p(y \mid \mu, \tau) + \ln p(\mu \mid \tau) \big\rangle_{q(\tau)} + \text{const}$$

Reinstating the normalization constant by inspection, $q^*(\mu) = N(\mu;\, \mu_n,\, \lambda_n^{-1})$ with

$$\mu_n = \frac{\lambda_0 \mu_0 + n \bar{y}}{\lambda_0 + n}, \qquad \lambda_n = (\lambda_0 + n)\, \langle \tau \rangle_{q(\tau)}$$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 38: Bayesian Inference

Variational density estimation: precision

$$\ln q^*(\tau) = \big\langle \ln p(y \mid \mu, \tau) + \ln p(\mu \mid \tau) \big\rangle_{q(\mu)} + \ln p(\tau) + \text{const}$$

so that $q^*(\tau) = \text{Gamma}(\tau;\, a_n,\, b_n)$ with

$$a_n = a_0 + \frac{n+1}{2}, \qquad b_n = b_0 + \frac{\lambda_0}{2} \big\langle (\mu - \mu_0)^2 \big\rangle_{q(\mu)} + \frac{1}{2} \Big\langle \sum_i (y_i - \mu)^2 \Big\rangle_{q(\mu)}$$

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
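Putting the two updates together gives a complete coordinate-ascent scheme. The sketch below (simulated data and arbitrary broad priors, loosely following Bishop §10.1.3) alternates the q(μ) and q(τ) updates until the moments stop changing:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=2.0, size=100)     # simulated data (illustrative)
n, ybar = y.size, y.mean()

# Broad priors (illustrative): mu ~ N(mu0, (lambda0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lambda0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3

E_tau = 1.0                                      # initial guess for <tau>
for _ in range(50):
    # Update q(mu) = N(mu_n, 1/lambda_n)
    mu_n = (lambda0 * mu0 + n * ybar) / (lambda0 + n)
    lambda_n = (lambda0 + n) * E_tau
    E_mu, E_mu2 = mu_n, mu_n**2 + 1.0 / lambda_n
    # Update q(tau) = Gamma(a_n, b_n) using the current moments of q(mu)
    a_n = a0 + (n + 1) / 2.0
    b_n = (b0
           + 0.5 * lambda0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2)
           + 0.5 * (np.sum(y**2) - 2 * np.sum(y) * E_mu + n * E_mu2))
    E_tau = a_n / b_n

print("E[mu]    :", mu_n)                        # ~ sample mean
print("E[1/tau] :", b_n / (a_n - 1))             # ~ sample variance
```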

Page 39: Bayesian Inference

Variational density estimation: illustration

[Figure: successive factorized approximations $q(\theta)$ converging towards the optimal $q^*(\theta)$ and the true posterior $p(\theta \mid y)$; Bishop (2006) PRML, p. 472.]

Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

Page 40: Bayesian Inference

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 41: Bayesian Inference

Markov Chain Monte Carlo (MCMC) sampling

• A general framework for sampling from a large class of distributions
• Scales well with dimensionality of sample space
• Asymptotically convergent

[Diagram: a Markov chain $\theta^{(0)}, \theta^{(1)}, \dots, \theta^{(N)}$ generated from a proposal distribution $q(\theta)$ and targeting the posterior distribution $p(\theta \mid y)$.]

Page 42: Bayesian Inference

Markov chain properties

• Transition probabilities – homogeneous/invariant:

$$p(\theta^{(t+1)} \mid \theta^{(1)}, \dots, \theta^{(t)}) = p(\theta^{(t+1)} \mid \theta^{(t)}) = T_t(\theta^{(t+1)}, \theta^{(t)})$$

• Invariance:

$$p^*(\theta) = \sum_{\theta'} T(\theta', \theta)\; p^*(\theta')$$

• Ergodicity

[Diagram: MCMC chain properties – homogeneous, invariant, ergodic, reversible.]

Page 43: Bayesian Inference

Metropolis-Hastings Algorithm

• Initialize $\theta^{(1)}$ at step 1 – for example, sample from the prior.
• At step t, sample a candidate $\theta^*$ from the proposal distribution: $\theta^* \sim q(\theta^* \mid \theta^{(t)})$
• Accept $\theta^*$ with probability:

$$A(\theta^*, \theta^{(t)}) = \min\left(1,\; \frac{p(\theta^* \mid y)\; q(\theta^{(t)} \mid \theta^*)}{p(\theta^{(t)} \mid y)\; q(\theta^* \mid \theta^{(t)})}\right)$$

• Metropolis – symmetric proposal distribution, so the acceptance probability simplifies to:

$$A(\theta^*, \theta^{(t)}) = \min\left(1,\; \frac{p(\theta^* \mid y)}{p(\theta^{(t)} \mid y)}\right)$$

Bishop (2006) PRML, p. 539
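A minimal sketch of the Metropolis special case (symmetric Gaussian proposal) for a one-dimensional toy target; the target density, step size, and chain length are all made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def log_post(theta):
    # Unnormalized log of a toy two-component Gaussian mixture target (illustrative)
    return np.log(0.3 * stats.norm.pdf(theta, -1.0, 1.0) +
                  0.7 * stats.norm.pdf(theta, 2.0, 1.5))

n_steps, step_sd = 20_000, 2.0
theta = 0.0                                          # initialization
samples = np.empty(n_steps)

for t in range(n_steps):
    proposal = theta + step_sd * rng.normal()        # symmetric Gaussian proposal
    log_alpha = log_post(proposal) - log_post(theta) # log acceptance ratio
    if np.log(rng.uniform()) < log_alpha:            # accept with probability min(1, alpha)
        theta = proposal
    samples[t] = theta

print("posterior mean estimate:", samples.mean())    # close to the mixture mean 1.1
```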

Page 44: Bayesian Inference

Gibbs Sampling Algorithm

• Special case of Metropolis-Hastings, for a joint density $p(\boldsymbol{\theta}) = p(\theta_1, \theta_2, \dots, \theta_n)$
• At step t, sample each parameter from its full conditional distribution given the current values of all the others:

$$\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \dots, \theta_n^{(t)}), \qquad \theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \dots, \theta_n^{(t)}), \quad \dots$$

• Acceptance probability = 1
• Blocked sampling – sample groups of parameters jointly, e.g.

$$\theta_1^{(t+1)}, \theta_2^{(t+1)} \sim p(\theta_1, \theta_2 \mid \theta_3^{(t)}, \dots, \theta_n^{(t)})$$
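A sketch of a Gibbs sampler for the same univariate Gaussian model used in the variational example above (mean μ and precision τ, whose full conditionals are Gaussian and Gamma); data, priors, and chain length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=5.0, scale=2.0, size=100)          # simulated data (illustrative)
n, ybar = y.size, y.mean()
mu0, lambda0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3           # broad conjugate priors

n_samples = 5_000
mu, tau = 0.0, 1.0                                    # initialization
chain = np.empty((n_samples, 2))

for t in range(n_samples):
    # Sample mu from its full conditional p(mu | tau, y): Gaussian
    mu = rng.normal((lambda0 * mu0 + n * ybar) / (lambda0 + n),
                    1.0 / np.sqrt((lambda0 + n) * tau))
    # Sample tau from its full conditional p(tau | mu, y): Gamma (rate -> scale = 1/rate)
    a_cond = a0 + (n + 1) / 2.0
    b_cond = b0 + 0.5 * lambda0 * (mu - mu0) ** 2 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(shape=a_cond, scale=1.0 / b_cond)
    chain[t] = mu, tau

print(chain[:, 0].mean(), 1.0 / chain[:, 1].mean())   # ~ sample mean and variance
```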

Page 45: Bayesian Inference

Posterior analysis from MCMC

Obtain independent samples:

• Generate samples based on MCMC sampling.
• Discard initial "burn-in" period samples to remove dependence on initialization.
• Thinning: select every m-th sample to reduce correlation.
• Inspect sample statistics (e.g., histogram, sample quantiles, …).

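Continuing the Gibbs sketch above, a minimal post-processing step that applies this burn-in / thinning / summary recipe (the cut-offs are arbitrary illustrations):

```python
import numpy as np

# `chain` holds the raw MCMC draws from the Gibbs sketch above, one row per step.
burn_in, thin = 1_000, 5
kept = chain[burn_in::thin]                           # drop burn-in, keep every 5th sample

post_mean = kept.mean(axis=0)
ci_95 = np.percentile(kept, [2.5, 97.5], axis=0)      # central 95% credible intervals
print("posterior means       :", post_mean)
print("95% credible intervals:", ci_95)
```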

Page 46: Bayesian Inference

Summary

Bayesian Inference

Approximate Inference

Variational Bayes

MCMC Sampling

Model Selection

Page 47: Bayesian Inference

References

• Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
• Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press.
• MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
• Murphy, K.P. Conjugate Bayesian Analysis of the Gaussian Distribution.
• Videolectures.net – Bayesian inference, MCMC, Variational Bayes.
• Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.
• Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.
• Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.
• Gamerman, D. and Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall/CRC Texts in Statistical Science.
• Statistical Parametric Mapping: The Analysis of Functional Brain Images.

Page 48: Bayesian Inference

Thank You

http://www.translationalneuromodeling.org/tapas/