
Bayesian Reasoning: Tempering & Sampling
Lecture 7

A/Prof Geraint F. Lewis
Rm 560: gfl@physics.usyd.edu.au
http://www.physics.usyd.edu.au/~gfl/Lecture


Why does it work?

With the Metropolis-Hastings algorithm, the desired posterior distribution (the stationary distribution of the Markov chain) is recovered for a wide range of proposal distributions. For this, the chain must have three properties:

Irreducibility: Given any starting point, the chain must be able to (eventually) jump to all states in the posterior distribution.

Aperiodicity: The chain must not oscillate between two different states with a regular periodic motion (i.e. it must not get stuck in an oscillation forever).


Positive recurrence: This basically means that the posterior distribution π(X) exists, such that if an initial value X₀ samples π(X), then all subsequent iterations will also sample π(X).

These can be shown to hold for Metropolis-Hastings, which satisfies

  π(X) p(X → Y) = π(Y) p(Y → X)

This is the detailed balance equation: the probability flow from X to Y balances the flow from Y to X, which ensures that π(X) is the stationary distribution of the chain.
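As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sketch in Python; the Gaussian target and the unit step size are invented for illustration and are not the lecture's example.

```python
import numpy as np

def log_posterior(x):
    # Illustrative target: a standard Gaussian, standing in for log p(X|D,I).
    return -0.5 * x**2

def metropolis_hastings(log_post, x0, n_steps=10000, step_size=1.0, seed=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x, logp = x0, log_post(x0)
    for t in range(n_steps):
        # Symmetric proposal, so the Hastings correction cancels and the
        # acceptance probability is min(1, p(x_new)/p(x)).
        x_new = x + step_size * rng.normal()
        logp_new = log_post(x_new)
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = x_new, logp_new
        chain[t] = x
    return chain

chain = metropolis_hastings(log_posterior, x0=0.0)
```

Because each move depends only on the current state, the chain is Markov, and the acceptance rule is exactly what makes detailed balance hold.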


Where can we go wrong?

Our posterior distribution may be multi-modal, with several significant peaks.

Given enough time, our MCMC walk through the probability space will eventually cover the entire volume. However, the walk may stay on one peak for a significant period before moving to the next.

If we have only a certain amount of time (i.e. a three-year PhD), how can we ensure that we have appropriately sampled the space, and that the MCMC chain truly reflects the underlying posterior distribution?

If it does not, any properties you derive from the sample will be biased.


Simulated Tempering

The problem is similar to ensuring you find a global minimum in optimization problems; one approach, simulated annealing, allows a solution to “cool” into the global minimum.

We can take a similar approach with our MCMC, heating up the posterior distribution (to make it flatter) and then cooling it down. When hotter, the MCMC can hop out of local regions of significant probability and explore more of the volume, before cooling down again into regions of interest.

We start with Bayes’ theorem, such that

  p(θ|D,I) = p(θ|I) p(D|θ,I) / p(D|I)


We can construct a flatter distribution through

  p(θ|D,β,I) ∝ p(θ|I) p(D|θ,I)^β,  with 0 < β ≤ 1

Typically, a discrete set of tempering parameters, β, is used, with β = 1 (the “cold sampler”) being the target distribution.

We can “random walk” through the temperature, and consider only those steps taken when β = 1 to represent our target distribution.
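In code, the tempered distribution is a one-line change to the log-posterior. A minimal Python sketch, with an invented bimodal likelihood and flat prior to show the flattening:

```python
import numpy as np

def log_prior(theta):
    # Illustrative flat prior on [-10, 10].
    return 0.0 if -10.0 <= theta <= 10.0 else -np.inf

def log_likelihood(theta):
    # Illustrative bimodal likelihood: two well-separated Gaussian peaks.
    return np.logaddexp(-0.5 * (theta - 4.0)**2, -0.5 * (theta + 4.0)**2)

def log_tempered_posterior(theta, beta):
    # p(theta|D,beta,I) propto p(theta|I) * p(D|theta,I)**beta.
    # beta = 1 is the target ("cold") posterior; as beta -> 0 the peaks
    # flatten, letting a chain hop between modes it would rarely cross.
    return log_prior(theta) + beta * log_likelihood(theta)
```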

However, parallel tempering provides a similar, but more efficient, approach to exploring the posterior distribution.


Parallel Tempering

Parallel tempering uses a series of MCMC explorations of the posterior distribution, each at a different tempering parameter β_i; those at high temperature (small β) will hop all over the space, while those at colder temperatures will take a more sedate walk. Typically, the temperatures are distributed over a ladder β_i = {β_1 = 1, β_2, …, β_n}.

The goal of parallel tempering is to run these chains in parallel and consider swapping their states. Suppose we choose a swap to take place on average once every n_s steps in the chain; the proposal to make a swap can be undertaken by drawing a uniform random number U₁ and considering a swap if U₁ ≤ 1/n_s.

If we choose to swap, two chains are chosen, one at β_i and in state X_{t,i}, and the other at β_{i+1} and in state X_{t,i+1}.


We can then choose to swap with probability

  r = [ p(X_{t,i+1}|D,β_i,I) p(X_{t,i}|D,β_{i+1},I) ] / [ p(X_{t,i}|D,β_i,I) p(X_{t,i+1}|D,β_{i+1},I) ]

by again selecting a uniform random number between 0 and 1 and choosing to swap if U(0,1) ≤ r.

The swaps move information between the parallel chains at different temperatures.

As ever, the choice of the β_i ladder depends on experimentation and experience.
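Putting the pieces together, here is a compact parallel tempering sketch in Python, assuming a one-dimensional state, a ladder ordered with β₁ = 1 first, and the log form of the swap probability r: with the tempered distributions above, the priors cancel in r, leaving (β_i - β_{i+1})(ln L(X_{t,i+1}) - ln L(X_{t,i})).

```python
import numpy as np

def parallel_tempering(log_prior, log_like, betas, n_steps=10000,
                       step_size=1.0, n_swap=50, seed=None):
    """Parallel tempering sketch: one random-walk Metropolis chain per beta,
    with a swap between an adjacent pair proposed on average every n_swap
    steps. Assumes betas[0] = 1 (the cold, target chain)."""
    rng = np.random.default_rng(seed)
    n_chains = len(betas)
    x = np.zeros(n_chains)              # current state of each chain
    cold = np.empty(n_steps)            # samples from the beta = 1 chain
    for t in range(n_steps):
        for i, beta in enumerate(betas):
            # Within-chain Metropolis update on the tempered posterior
            # log p = log prior + beta * log likelihood.
            x_new = x[i] + step_size * rng.normal()
            dlogp = (log_prior(x_new) + beta * log_like(x_new)
                     - log_prior(x[i]) - beta * log_like(x[i]))
            if np.log(rng.uniform()) < dlogp:
                x[i] = x_new
        # Propose a swap with probability 1/n_swap.
        if rng.uniform() <= 1.0 / n_swap:
            i = rng.integers(n_chains - 1)   # adjacent pair (beta_i, beta_{i+1})
            # The priors cancel in r, leaving the log swap probability:
            log_r = (betas[i] - betas[i + 1]) * (log_like(x[i + 1]) - log_like(x[i]))
            if np.log(rng.uniform()) < log_r:
                x[i], x[i + 1] = x[i + 1], x[i]
        cold[t] = x[0]                  # record the beta = 1 state
    return cold
```

With the toy prior and likelihood from the previous sketch, parallel_tempering(log_prior, log_likelihood, betas=[1.0, 0.55, 0.3, 0.1, 0.01]) returns draws from the β = 1 chain, while the hotter chains feed it mode-hopping moves through the swaps.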


An example

Earlier, we examined the comparison between two models for some spectral data. Here, we look at the results of a Metropolis-Hastings and a parallel tempering analysis of this problem.

To match the earlier analysis (see the sketch after this list):

A Jeffreys prior was used for T between 0.1 mK and 100 mK.

A uniform prior was used for the line-centre channel ν between channel 1 and channel 44.

The proposal distribution for both parameters was a Gaussian with σ = 1.
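In Python, these choices might look as follows; only the pieces stated above are sketched (the spectral-line likelihood from the earlier lecture is omitted, and the parameter names T and nu are assumptions):

```python
import numpy as np

rng = np.random.default_rng()

def log_prior(T, nu):
    # Jeffreys prior on the line temperature T (a scale parameter),
    # 0.1 mK to 100 mK; uniform prior on the channel nu, channels 1 to 44.
    if not (0.1 <= T <= 100.0) or not (1.0 <= nu <= 44.0):
        return -np.inf
    return -np.log(T)  # Jeffreys: p(T) proportional to 1/T (unnormalized)

def propose(T, nu, sigma=1.0):
    # Gaussian proposal with sigma = 1 in both parameters, as on the slide.
    return T + sigma * rng.normal(), nu + sigma * rng.normal()
```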


After a distinct burn-in, the chain wanders through the parameter space, but it clearly prefers T ≈ 1 and ν ≈ 38, although significant departures are apparent.


However, it is interesting to compare the marginalized distributions with the numerical integration results obtained earlier.

While the M-H approach has nicely recovered the distribution in T, and has captured the strong peak in ν, the chain has clearly failed to characterize the structure in the posterior at low channel numbers, not spending enough time in regions with ν < 30.


Here is the β = 1 chain for the parallel tempering run (with five values of β evenly spaced between 0.01 and 1, and swaps considered every 50 steps, on average).


The difference is quite apparent in the marginalized distributions.

Again, T and the strong peak in ν are well characterized, but the application of parallel tempering has also sampled the channel numbers with ν < 30 well, better recovering the underlying distribution.


Model Comparison

Remember, to compare models and to deduce which is more probable, we calculate the odds ratio:

  O₁₂ = p(M₁|D,I) / p(M₂|D,I) = [ p(M₁|I) / p(M₂|I) ] × [ p(D|M₁,I) / p(D|M₂,I) ]

where the final term, B₁₂ = p(D|M₁,I) / p(D|M₂,I), is the Bayes factor.

Suppose we have the same two competing models for the spectral line data, one with no free parameters (so its contribution to the Bayes factor can be calculated analytically), and the other which we have analyzed with parallel tempering. How do we calculate the Bayes factor for the latter?


What we want to calculate is the global likelihood:

  p(D|M₁,I) = ∫ p(θ|M₁,I) p(D|θ,M₁,I) dθ

We can combine the information in the parallel tempering chains through the thermodynamic integration relation (read Chap 12.7):

  ln p(D|M₁,I) = ∫₀¹ ⟨ln p(D|θ,M₁,I)⟩_β dβ

where ⟨ln p(D|θ,M₁,I)⟩_β denotes the average of the log-likelihood over the chain run at tempering parameter β.
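A sketch of this thermodynamic-integration estimate in Python, using scipy's CubicSpline where the lecture used a Matlab spline; the input format (one array of post burn-in log-likelihood values per β) is an assumption:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def log_evidence(betas, logL_chains):
    """Estimate ln p(D|M,I) = integral from 0 to 1 of <ln L>_beta d(beta).

    betas       : the tempering parameters of the parallel tempering run
    logL_chains : one array of (post burn-in) log-likelihood values per beta
    """
    betas = np.asarray(betas)
    # <ln L>_beta: the average log-likelihood over each tempered chain.
    mean_logL = np.array([np.mean(lL) for lL in logL_chains])
    order = np.argsort(betas)
    # With only a few beta points, fit a spline through <ln L>_beta and
    # integrate it from beta = 0 to 1 (extrapolating below the smallest beta).
    spline = CubicSpline(betas[order], mean_logL[order])
    return spline.integrate(0.0, 1.0)
```

A trapezoid rule over the raw β points is a cruder alternative; either way, the estimate improves with more β samples, as the next slide notes.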


Here are the results for the analysis of the spectral line model. There are only five points in β, and so we need to interpolate between the points (this is a Matlab spline).

Of course, we would prefer more samples in β.

The result of the integral yields ln[ p(D|M₁,I) ] = -87.3369, with a resultant Bayes factor of B₁₂ = 1.04 (similar to the result obtained earlier from the analytic calculation).


Nested Sampling

Figures from www.inference.phy.cam.ac.uk/bayesys/box/nested.ps

There are other ways to analyze the posterior and the likelihood space (with more efficient and faster approaches). One of these, nested sampling, iteratively re-samples the space, slicing it into nested regions of increasing likelihood; Brendon will discuss this in more detail in his final lecture.