
Advanced Lectures on Bayesian Analysis

Alan Heavens

Imperial Centre for Inference and Cosmology (ICIC), Imperial College, London

[email protected]

November 21, 2016

Alan Heavens (ICIC, Imperial College) Advanced Topics November 21, 2016 1 / 19


Overview

1 Model Comparison

2 Bayesian Evidence, or Marginal Likelihood
  - Bayesian Information Criterion (BIC)
  - Nested Models: Savage-Dickey Density Ratio
  - Gaussian Example


Model Comparison

A higher-level question than parameter inference, in which one wants to know which theoretical framework ('model') is preferred, given the data (regardless of the parameter values).

The models may be completely different (e.g. compare Big Bang with Steady State, to use an old example),

or variants of the same idea, e.g. comparing a simple cosmological model where the Universe is assumed to be flat with a more general model where curvature is allowed to vary.

The sort of question asked here is essentially 'Do the data favour a more complex model?'

Clearly, in the latter type of comparison the likelihood itself will be of no use: it will always increase if we allow more freedom.


Model Comparison

Assuming uninformative priors for the models (i.e. the same a priori probability), the probability of the models given the data is simply proportional to the Bayesian Evidence.

The Bayesian Evidence, or Marginal Likelihood, is the denominator in Bayes' theorem:

p(θ|x) = p(x|θ) π(θ) / p(x)

It is much more obvious if we include the model dependence as a condition:

p(θ|x, M) = p(x|θ, M) π(θ|M) / p(x|M)

The Bayesian Evidence normalises the posterior, so is

p(x|M) = ∫ dθ p(x|θ, M) π(θ|M)

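As an illustration, the evidence integral above can be evaluated by brute-force quadrature in one dimension. This is a hedged sketch with invented numbers (a single Gaussian datum with unit variance and a uniform prior on θ), not an example from the lectures:

```python
import numpy as np

# Hedged sketch (numbers invented): brute-force quadrature of the evidence
# p(x|M) = ∫ dθ p(x|θ,M) π(θ|M) for a toy 1D model: a single Gaussian datum
# x with unit variance, and a uniform prior on θ over [theta_min, theta_max].

def likelihood(theta, x, sigma=1.0):
    """Gaussian likelihood of a single datum x given mean theta."""
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def evidence(x, theta_min=-10.0, theta_max=10.0, n_grid=100_000):
    """Riemann-sum estimate of ∫ dθ L(θ) π(θ) with a normalised uniform prior."""
    theta = np.linspace(theta_min, theta_max, n_grid)
    prior = 1.0 / (theta_max - theta_min)   # normalised uniform prior density
    dtheta = theta[1] - theta[0]
    return np.sum(likelihood(theta, x) * prior) * dtheta

# The likelihood integrates to 1 over θ, and the prior density is 1/20, so
# the evidence here is ≈ 0.05.
print(evidence(0.0))
```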

Model Comparison

Figure: The Planck power spectrum, with the theoretical model with best-fitting cosmological parameters.


Bayesian Evidence or Marginal Likelihood

We denote two competing models by M and M′.

We denote by x the data vector, and by θ and θ′ the parameter vectors (of length n and n′).

Rule 1: Write down what you want to know.

Here it is p(M|x): the probability of the model, given the data.

Use Bayes' theorem:

p(M|x) = p(x|M) π(M) / p(x)


Bayesian evidence

The Bayesian Evidence is

p(x|M) = ∫ dθ p(x|θ, M) π(θ|M).

If a model has no parameters, then the integral is simply replaced by p(x|M).

The relative probability of two models is

p(M′|x) / p(M|x) = [π(M′) / π(M)] × [∫ dθ′ p(x|θ′, M′) π(θ′|M′)] / [∫ dθ p(x|θ, M) π(θ|M)].

With uninformative priors on the models, π(M′) = π(M), this ratio simplifies to the ratio of evidences, called the Bayes Factor,

B ≡ [∫ dθ′ p(x|θ′, M′) π(θ′|M′)] / [∫ dθ p(x|θ, M) π(θ|M)].

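To make the Bayes factor concrete, here is a minimal sketch with toy models and priors invented for illustration: M has a free Gaussian mean θ under a uniform prior, while the simpler M′ fixes θ = 0 and so has no parameter integral at all:

```python
import numpy as np

# Hedged sketch: the Bayes factor B = p(x|M')/p(x|M) for toy nested models
# of a single datum x (all numbers invented for illustration):
#   M : Gaussian likelihood with free mean theta, uniform prior on [-5, 5]
#   M': the same likelihood with theta fixed to 0 (no free parameters)

def gauss(x, mean, sigma=1.0):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def bayes_factor(x, half_width=5.0, n_grid=100_000):
    """B = p(x|M')/p(x|M); M' has no parameters, so p(x|M') is its likelihood."""
    theta = np.linspace(-half_width, half_width, n_grid)
    dtheta = theta[1] - theta[0]
    prior = 1.0 / (2 * half_width)                    # normalised uniform prior
    ev_M = np.sum(gauss(x, theta) * prior) * dtheta   # ∫ dθ p(x|θ,M) π(θ|M)
    ev_Mprime = gauss(x, 0.0)                         # θ fixed: no integral
    return ev_Mprime / ev_M

# Data near θ = 0 favour the simpler model (B > 1: an Occam penalty on the
# prior volume of M); data far from 0 favour the extra freedom of M (B < 1).
print(bayes_factor(0.0), bayes_factor(4.0))
```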

Nested models

We assume that M′ is a simpler model, which has fewer parameters in it (n′ < n).

We further assume that it is nested in model M, i.e. the n′ parameters of model M′ are common to M, which has p ≡ n − n′ extra parameters in it. These extra parameters are fixed to fiducial values in M′.

Note that the more complicated model M will (if M′ is nested) inevitably lead to a higher likelihood (or at least as high), but the evidence may favour the simpler model if the fit is nearly as good, through the smaller prior volume.


Nested models

We assume uniform (and hence separable) priors in each parameter, over ranges ∆θ (or ∆θ′). Hence π(θ|M) = (∆θ1 . . . ∆θn)⁻¹ and

B = [∫ dθ′ p(x|θ′, M′) / ∫ dθ p(x|θ, M)] × (∆θ1 . . . ∆θn) / (∆θ′1 . . . ∆θ′n′).

Note that if the prior ranges are not large enough to contain essentially all the likelihood, then the position of the boundaries would influence the Bayes factor. In what follows, we will assume the prior range is large enough to encompass all the likelihood.

In the nested case, the ratio of prior hypervolumes simplifies to

(∆θ1 . . . ∆θn) / (∆θ′1 . . . ∆θ′n′) = ∆θn′+1 . . . ∆θn′+p,

where p ≡ n − n′ is the number of extra parameters in the more complicated model.


Bayesian evidence

Challenges: The evidence requires a multidimensional integration over the likelihood and prior, and this may be very expensive to compute. Options include:

Fisher-like approach: assume the likelihood is a multivariate Gaussian (Laplace approximation).

Nested sampling (MultiNest, PolyChord), where one tries to sample the likelihood in an efficient way.

Approximations: e.g. the AIC and BIC may be unreliable, as they are based on the best-fit χ², whereas from a Bayesian perspective we want to know how much of parameter space would give the data with high probability. They also do not include the prior, and are not Bayesian.


Bayesian Information Criterion (BIC)

With very constraining data, the likelihood L will be approximately Gaussian near a narrow peak at θ_max:

L ≃ L_max exp[−(θ − θ_max)_i F_ij (θ − θ_max)_j / 2]

The evidence

p(x) = ∫ dθ p(x|θ) π(θ)

then becomes

p(x) ≃ L_max π_max ∫ dθ exp[−(θ − θ_max)_i F_ij (θ − θ_max)_j / 2]


Bayesian Information Criterion (BIC)

In k parameter dimensions, this is

p(x) ≃ (2π)^(k/2) |F|^(−1/2) L_max π_max

ln p(x) ≃ ln L_max + ln π_max + (k/2) ln 2π − (1/2) ln |F|

Since ln |F| ∼ k ln N (N = number of data points) to leading order,

ln p(x) ≃ ln L_max − (k/2) ln N = −BIC/2.

So maximising the evidence is roughly equivalent to minimising the BIC, except that this holds only asymptotically, and assumes different models have the same π_max.

These assumptions often do not hold in practice.
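A minimal sketch of computing the BIC for a toy problem: fitting a constant mean to simulated Gaussian data with known σ. The data and all numbers are invented for illustration:

```python
import numpy as np

# Hedged sketch: BIC = -2 ln L_max + k ln N for the toy model mu = const
# fitted to N Gaussian data points with known sigma. Data are simulated;
# nothing here is specific to the lecture's examples.

rng = np.random.default_rng(1)
sigma = 1.0
data = rng.normal(0.3, sigma, size=100)

def max_log_like(x, sigma):
    """ln L_max for the model mu = const: the maximum is at the sample mean."""
    mu_hat = x.mean()
    return np.sum(-0.5 * ((x - mu_hat) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma**2))

def bic(x, sigma, k=1):
    """BIC = -2 ln L_max + k ln N, so ln p(x) ≈ -BIC/2 in this approximation."""
    return -2.0 * max_log_like(x, sigma) + k * np.log(len(x))

# Each extra parameter costs k ln N unless ln L_max improves enough to
# compensate — the asymptotic Occam penalty the BIC encodes.
print(bic(data, sigma, k=1))
```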


Bayesian Evidence: Nested Models and the Savage-Dickey Density Ratio

Let M0 and M1 be nested models, such that M0 is a subset of M1,e.g., where one of M1’s parameters is fixed to a particular value.

Some notation. Let the parameters for M0 be ψ, and those of M1 beψ, φ.

M0 has φ = φ0 (fixed).

Assume that all probabilities are continuous, so

lim_(φ→φ0) π1(ψ|φ) = a π0(ψ), i.e. π1(ψ|φ = φ0) = a π0(ψ).

The factor a is needed to ensure that the priors are normalised, which we absolutely have to have. E.g. if M1 has a 2D flat prior with ranges ∆θ1, ∆θ2, then

π1 = 1/(∆θ1 ∆θ2), π0 = 1/∆θ1,

so a = 1/∆θ2.


SDDR

The Bayes factor is

BF01 ≡ p(x|M0) / p(x|M1) = [∫ p0(x|ψ) π0(ψ) dψ] / [∫ p1(x|ψ, φ) π1(ψ, φ) dψ dφ]

With the continuity assumption, we then have

BF01 = [∫ p1(x|ψ, φ = φ0) π1(ψ, φ = φ0) dψ] / [∫ p1(x|ψ, φ) π1(ψ, φ) dψ dφ] = a p1(x|φ = φ0) / p1(x).

Now, using Bayes' theorem,

p1(x|φ = φ0) = p1(φ = φ0|x) p1(x) / π1(φ = φ0)


SDDR

Hence

BF01 = a p1(φ = φ0|x) / π1(φ = φ0).

This is the Savage-Dickey Density Ratio (SDDR). It looks very simple, but we need to think about how to use it, since the numerator is a posterior, not a likelihood.

However, if we have sampled the posterior, we can estimate the SDDR.

The denominator is easy if the prior has a simple functional form.

The numerator may be estimated from samples of the posterior in model 1: e.g. if f(∆φ, φ0) is the fraction of samples within (a small range) ∆φ of φ0, then p1(φ0|x) ≃ f(∆φ, φ0)/(2∆φ).
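The sample-based estimate of the numerator can be sketched as follows. As a check, the "posterior samples" here are drawn from a known Gaussian, so the true density at φ0 is available analytically; all numbers are invented for illustration:

```python
import numpy as np

# Hedged sketch: estimate p1(phi = phi0 | x) from posterior samples by
# counting the fraction f of samples within ±dphi of phi0 and dividing by
# the window width 2*dphi. For the check, the "posterior" samples come from
# a known Gaussian, so the true density at phi0 is analytic.

rng = np.random.default_rng(0)

def density_at(samples, phi0, dphi=0.05):
    """Histogram-style estimate: p1(phi0|x) ≈ f(dphi, phi0) / (2*dphi)."""
    f = np.mean(np.abs(samples - phi0) < dphi)   # fraction inside the window
    return f / (2.0 * dphi)

samples = rng.normal(1.0, 0.5, size=200_000)     # stand-in posterior for φ|x
est = density_at(samples, phi0=0.0)
true = np.exp(-0.5 * (0.0 - 1.0) ** 2 / 0.5**2) / (np.sqrt(2 * np.pi) * 0.5)
print(est, true)  # the estimate should be close to the analytic density
```

Dividing this density estimate by the prior density π1(φ0) (with the factor a where needed) then gives the SDDR estimate of the Bayes factor.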


Gaussian Example

In this Gaussian example, we can evaluate the integrals analytically.

Let M0 be x ∼ N(0, σ²), and M1 be x ∼ N(µ, σ²), where the prior on µ is Gaussian with variance Σ². Let the measurement be x = λσ. Then

p1(x|µ) = [1/(√(2π) σ)] e^(−(x−µ)²/(2σ²))

and

p1(µ|x) = p1(x|µ) π1(µ) / p1(x) = p1(x|µ) π1(µ) / ∫ p1(x|µ) π1(µ) dµ


Gaussian Example

Hence

BF01 = p1(µ = 0|x) / π1(µ = 0) = p1(x|µ = 0) / p1(x),

i.e.

BF01 = [ (1/(√(2π) σ)) e^(−x²/(2σ²)) ] / [ (1/(√(2π) σ)) (1/(√(2π) Σ)) ∫_(−∞)^(∞) e^(−(x−µ)²/(2σ²)) e^(−µ²/(2Σ²)) dµ ],

so

BF01 = √(1 + Σ²/σ²) exp[ −λ² / (2(1 + σ²/Σ²)) ]


Gaussian Example

BF01 = √(1 + Σ²/σ²) exp[ −λ² / (2(1 + σ²/Σ²)) ]

If λ ≫ 1, then BF01 ≪ 1 and M1 is favoured. If λ ≃ 1 and σ ≪ Σ, then M0 is favoured (Occam's razor). If the likelihood is much broader than the prior, σ ≫ Σ, then BF01 ≃ 1 and nothing has been learned.
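The closed-form result can be checked numerically. This sketch (with invented values of λ, σ and Σ) compares the formula against a direct quadrature of the two evidences:

```python
import numpy as np

# Hedged sketch: check the closed-form Bayes factor
#   BF01 = sqrt(1 + Σ²/σ²) exp[-λ² / (2 (1 + σ²/Σ²))]
# against brute-force evidences for M0: x ~ N(0, σ²) and
# M1: x ~ N(µ, σ²) with prior µ ~ N(0, Σ²). Parameter values are invented.

def bf01_analytic(lam, sigma, Sig):
    return np.sqrt(1.0 + (Sig / sigma) ** 2) * np.exp(
        -lam**2 / (2.0 * (1.0 + (sigma / Sig) ** 2)))

def bf01_numeric(lam, sigma, Sig, n_grid=400_000):
    x = lam * sigma
    mu = np.linspace(-20 * max(sigma, Sig), 20 * max(sigma, Sig), n_grid)
    dmu = mu[1] - mu[0]
    like = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    prior = np.exp(-0.5 * (mu / Sig) ** 2) / (np.sqrt(2 * np.pi) * Sig)
    p1 = np.sum(like * prior) * dmu                            # evidence of M1
    p0 = np.exp(-0.5 * lam**2) / (np.sqrt(2 * np.pi) * sigma)  # evidence of M0
    return p0 / p1

# λ = 0 with a wide prior favours M0 (Occam's razor); large λ favours M1.
print(bf01_analytic(2.0, 1.0, 3.0), bf01_numeric(2.0, 1.0, 3.0))
```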

Figure: The Bayes factor for a Gaussian likelihood (variance σ²) and a Gaussian prior (variance Σ²). The x axis is log10(Σ/σ); the y axis is the datum/σ. [Figure not reproduced.]


Summary

The Bayesian formalism is easily generalised to model comparison.

The resulting integrals over parameter space may be challenging to compute (see Fred's lectures).

Approximations such as the BIC may not be accurate.

Evidence ratios are sensitive to the prior, even asymptotically.

The SDDR may be useful for nested models.
