
Computing Entropies with Nested Sampling

Brendon J. Brewer

Department of Statistics, The University of Auckland

https://www.stat.auckland.ac.nz/~brewer/


What is entropy?

Firstly, I’m talking about information theory, not thermodynamics (though the two are connected).


Information Theory

The fundamental theorem of information theory

Theorem. If you take the log of a probability, it seems like you understand profound truths.


Shannon entropy

Consider a discrete probability distribution with probabilities p = {p_i}. The Shannon entropy is

H(p) = -\sum_i p_i \log p_i \qquad (1)

It is a real-valued property of the distribution.
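As a quick check of equation (1), here is a minimal Python sketch (numpy assumed; `shannon_entropy` is our name, not from any particular library):

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # convention: 0 log 0 contributes zero
    return -np.sum(p * np.log(p))

print(shannon_entropy([1/3, 1/3, 1/3]))     # 1.0986 nats = log(3)
print(shannon_entropy([0.98, 0.01, 0.01]))  # ~0.112: nearly certain, low H
```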


Relative entropy

Consider two discrete probability distributions with probabilities p = {p_i} and q = {q_i}. The relative entropy is

H(p;q) = -\sum_i p_i \log\left(\frac{p_i}{q_i}\right) \qquad (2)

Without the minus sign, it’s the ‘Kullback-Leibler divergence’, and is more fundamental than the Shannon entropy. With uniform q, it reduces to the Shannon entropy (up to an additive constant).
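A sketch of the uniform-q reduction (numpy assumed; with q_i = 1/N, equation (2) equals H(p) - log N):

```python
import numpy as np

def relative_entropy(p, q):
    """H(p; q) = -sum_i p_i log(p_i / q_i); minus the KL divergence."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = p > 0
    return -np.sum(p[keep] * np.log(p[keep] / q[keep]))

p = np.array([0.7, 0.2, 0.1])
q_uniform = np.full(3, 1 / 3)

H_p = -np.sum(p * np.log(p))             # Shannon entropy of p
print(relative_entropy(p, q_uniform))    # -0.2968...
print(H_p - np.log(3))                   # same: H(p) minus log N
```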


Entropy quantifies uncertainty

If there are just N equally likely possibilities, i.e., p_i = 1/N, then H = log N.

[Figure: uniform distribution over x ∈ {1, 2, 3}; H = 1.0986 nats.]


[Figure: uniform distribution over x from 5 to 25; H = 3.0445 nats.]


[Figure: a distribution spread over a wider range, x from 10 to 40, with the same entropy: H = 3.0445 nats.]

Heuristic: standard deviation quantifies uncertainty ‘horizontally’, entropy does it ‘vertically’.
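The heuristic is easy to check numerically. A hedged sketch (the two supports below are ours, chosen to mimic the last two figures): both distributions put equal mass on 21 values, so their entropies agree, but their standard deviations differ by a factor of two.

```python
import numpy as np

def H(p):
    p = np.asarray(p, float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

p = np.full(21, 1 / 21)              # equal mass on 21 outcomes
x_narrow = np.arange(5, 26)          # 21 adjacent values, 5..25
x_wide = np.arange(0, 42, 2)         # 21 values spread twice as far

print(H(p), np.log(21))              # both 3.0445 nats ('vertical')
print(x_narrow.std(), x_wide.std())  # 6.06 vs 12.11 ('horizontal')
```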


What about densities?

We get ‘differential entropy’

H = -\int_{\text{all } x} f(x) \log f(x) \, dx \qquad (3)

This generalises log-volume, as defined with respect to dx.

Important: Differential entropy is coordinate-system dependent.
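A Monte Carlo sketch of both points (numpy assumed; f is a standard normal, whose differential entropy is known to be ½ log(2πe) ≈ 1.4189 nats). Averaging -log f over samples from f estimates H; rescaling the coordinate shifts H by the log of the Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)          # samples from f = N(0, 1)

def log_density(z, sigma):
    """log of the N(0, sigma^2) density."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (z / sigma)**2

# H = E[-log f(x)], estimated by a sample average:
print(np.mean(-log_density(x, 1.0)))  # ~1.4189 = 0.5*log(2*pi*e)

# New coordinate y = 2x: same 'information', but H grows by log 2.
y = 2 * x                             # y ~ N(0, 4)
print(np.mean(-log_density(y, 2.0)))  # ~2.112 = 1.4189 + log(2)
```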


Some entropies in Bayesian statistics

Written in terms of parameters θ and data d, for Bayesian purposes.

1. Entropy of the prior for the parameters, H(θ)
2. Entropy of the conditional prior for the data, H(d|θ)
3. Entropy of the posterior, H(θ|d)
4. Entropy of the prior for the data, H(d)


Remark: Conditional entropies, such as items 2 and 3 above, are defined using an expectation over the second argument (the thing conditioned on).


Interpretation of conditional entropies

How uncertain would the question “what’s the value of θ, precisely?” be if the question “what’s the value of d, precisely?” were to be resolved?


Connections

Entropy of the joint prior:

H(\theta, d) = H(\theta) + H(d|\theta) \qquad (4)
             = H(d) + H(\theta|d) \qquad (5)

Mutual information:

I(\theta; d) = H(\theta) - H(\theta|d) \qquad (6)

This quantifies dependence, or more fundamentally, relevance: the potential for learning. There are many other ways of expressing I.
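These identities can be verified on a toy discrete joint distribution (a sketch; the 2×2 table is invented for illustration):

```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel()
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

# Hypothetical joint p(theta, d): rows index theta, columns index d.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

H_joint = H(joint)
H_theta = H(joint.sum(axis=1))       # marginal over d
H_d = H(joint.sum(axis=0))           # marginal over theta

H_d_given_theta = H_joint - H_theta  # rearranging eq. (4)
H_theta_given_d = H_joint - H_d      # rearranging eq. (5)

I = H_theta - H_theta_given_d        # eq. (6): ~0.1927 nats
print(I, H_d - H_d_given_theta)      # mutual information is symmetric
```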


Pre-data considerations

We might want to know how relevant the data is to the parameters, before learning the data. We might want to optimise that quantity for experimental design. But it’s nasty, especially if there are nuisance parameters.


Hard integrals

E.g.

H(\theta|d) = -\int p(d) \int p(\theta|d) \log p(\theta|d) \, d\theta \, dd \qquad (7)
            = -\int p(d) \int p(\theta|d) \log\left[\frac{p(\theta)\, p(d|\theta)}{p(d)}\right] d\theta \, dd \qquad (8)

But p(d), sitting there inside a logarithm, is already supposed to be a hard integral (the marginal likelihood / evidence)...

p(d) = \int p(\theta)\, p(d|\theta) \, d\theta \qquad (9)


Hard integrals

Hard integrals with nuisance parameters η, interesting parameter(s) φ.


Marginal Likelihood Integral

Nested Sampling was invented in order to do this hard integral

p(d) = \int p(\theta)\, p(d|\theta) \, d\theta \qquad (10)

or

Z = \int \pi(\theta) L(\theta) \, d\theta \qquad (11)

where π = prior, L = likelihood. It’s just an expectation. Why is it hard?


Simple Monte Carlo fails

Z = \int \pi(\theta) L(\theta) \, d\theta \qquad (12)
  \approx \frac{1}{N} \sum_{i=1}^{N} L(\theta_i) \qquad (13)

with θ_i ∼ π will probably miss the tiny regions where L is high. Equivalently, π implies a very heavy-tailed distribution of L-values, and simple Monte Carlo fails.


Nested Sampling

Nested Sampling takes the original problem and constructs a 1D problem from it.

Z = \int_0^1 L(X) \, dX \qquad (14)

where

X(\ell) = \int_{L(\theta) > \ell} \pi(\theta) \, d\theta \qquad (15)

The meaning of X

X(ℓ) is the amount of prior mass whose likelihood exceeds ℓ. As ℓ increases, X decreases.


Nested Sampling

Figure from Skilling (2006).

Since X(ℓ) is the upper-tail CDF of L-values implied by π, points θ ∼ π have a uniform distribution over X.


Nested Sampling

The idea is to generate a sequence of points with increasing likelihoods, such that we can estimate their X-values. Since we know their L-values, we can then do the integral numerically.

Z = \int_0^1 L(X) \, dX \qquad (16)


Nested Sampling algorithm

1. Generate N points from π.
2. Find the worst one (lowest likelihood L∗), save it.
3. Estimate its X using a Beta(N, 1) distribution.¹
4. Record the worst particle and its X-value, then discard it.
5. Replace that point with a new one from π, but with likelihood above L∗.
6. Repeat steps 2–5 indefinitely.

¹From order statistics.

A code sketch of this loop appears below.

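A sketch of the loop above (our toy setup: 2D uniform prior, Gaussian likelihood; constrained moves come from a short random-walk MCMC started at a cloned survivor, which is roughly how practical NS implementations do step 5):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, w = 100, 2, 0.01

def logL(th):
    return -0.5 * np.sum(((th - 0.5) / w)**2, axis=-1)

points = rng.uniform(size=(N, d))            # 1. N points from pi
logLs = logL(points)

Z, X_prev = 0.0, 1.0
for i in range(2000):
    worst = np.argmin(logLs)                 # 2. lowest likelihood L*
    Lstar = logLs[worst]
    X = np.exp(-(i + 1) / N)                 # 3. E[ln X] = -(i+1)/N
    Z += np.exp(Lstar) * (X_prev - X)        # 4. weight = L* times dX
    X_prev = X
    j = rng.integers(N)                      # 5. clone a survivor...
    while j == worst:
        j = rng.integers(N)
    th = points[j].copy()
    step = 0.1 * np.sqrt(X)                  # heuristic step size
    for _ in range(20):                      # ...then constrained MCMC
        prop = th + step * rng.standard_normal(d)
        if np.all((prop > 0) & (prop < 1)) and logL(prop) > Lstar:
            th = prop
    points[worst], logLs[worst] = th, logL(th)

print(Z, (w * np.sqrt(2 * np.pi))**d)        # estimate vs true Z ~ 6.28e-4
```

The run is truncated at 2000 iterations; by then X ≈ e^{-20} and the remaining prior mass contributes negligibly to Z.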

The sequence of X-values

The X-values, once transformed, follow a Poisson process with rate N:

-\ln(X_1) \sim \text{Exponential}(N) \qquad (17)
-\ln(X_2) \sim -\ln(X_1) + \text{Exponential}(N) \qquad (18)
-\ln(X_3) \sim -\ln(X_2) + \text{Exponential}(N) \qquad (19)

(forgive the notational abuse)


Poisson process view of NS

The number of NS iterations taken to enter a small region (defined by a likelihood threshold), divided by N, is an unbiased estimator of minus the log-probability of that region!

Also, π(θ) can be any distribution (needn’t be a prior) and L(θ) any scalar function. Opportunities here...
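This property is easy to simulate without any likelihood at all (a sketch): generate the Exponential(mean 1/N) compression steps directly and count how many fall before the target log-volume.

```python
import numpy as np

rng = np.random.default_rng(3)
N, X_star, runs = 100, 1e-6, 2000
t = -np.log(X_star)                  # -ln(prob) of the target region

counts = []
for _ in range(runs):
    # each iteration compresses -ln X by an Exponential step, mean 1/N
    steps = rng.exponential(1 / N, size=3000)
    # iterations spent before the run first enters X < X_star:
    counts.append(np.searchsorted(np.cumsum(steps), t))

print(np.mean(counts) / N, t)        # both ~13.82: count/N is unbiased
```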


My algorithm

To compute H(θ) = −∫ f(θ) log f(θ) dθ when f can be sampled but not evaluated:

1. Generate a ‘reference point’ θref from f.
2. Do a Nested Sampling run with f as “prior” and minus the distance to θref as “likelihood”.
3. Measure how many NS iterations were needed to make the distance to θref really small, and divide by N. That gives an unbiased estimate of the log-prob near θref.
4. Repeat steps 1–3 many times.
5. Average the estimated log-probs, then apply corrections to convert to density.

A code sketch of the whole procedure appears below.


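A condensed sketch of that procedure, with everything checkable (our choices: f is a 2D standard normal, so sampling is easy and the truth is H = log(2πe) ≈ 2.84 nats; the ‘corrections’ step is read here as subtracting the log-volume of the small ball around θref, and the inner NS run reuses the constrained-MCMC trick from earlier; a real application would use InfoNest):

```python
import numpy as np

rng = np.random.default_rng(4)
N, r_small, n_ref = 50, 0.05, 30

def log_f(th):                    # f = N(0, I) in 2D; H = log(2*pi*e)
    return -0.5 * np.sum(th**2, axis=-1) - np.log(2 * np.pi)

log_probs = []
for _ in range(n_ref):
    ref = rng.standard_normal(2)               # 1. reference point from f
    pts = rng.standard_normal((N, 2))          # NS particles, also from f
    dist = np.linalg.norm(pts - ref, axis=1)
    count = 0
    while dist.max() > r_small:                # 2-3. NS until all particles
        worst = np.argmax(dist)                # lie within r_small of ref;
        d_star = dist[worst]                   # "likelihood" = -distance
        count += 1
        j = rng.integers(N)
        while j == worst:
            j = rng.integers(N)
        th = pts[j].copy()                     # clone a survivor, then move
        for _ in range(25):                    # under f, within distance d*
            prop = th + 0.3 * d_star * rng.standard_normal(2)
            if (np.linalg.norm(prop - ref) < d_star
                    and np.log(rng.uniform()) < log_f(prop) - log_f(th)):
                th = prop
        pts[worst], dist[worst] = th, np.linalg.norm(th - ref)
    # count/N estimates -ln(mass of the ball); dividing the mass by the
    # ball's volume converts it to a density estimate near ref:
    log_probs.append(-count / N - np.log(np.pi * r_small**2))

print(-np.mean(log_probs))                     # 4-5. ~2.84 = log(2*pi*e)
```

The Beta(N, 1)/Poisson bookkeeping is hidden in the count/N step; n_ref, r_small, and the MCMC settings here are illustrative, not tuned.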

Movie

Play movie.mkv (It’s also on YouTube)


Toy experimental design example

Two observing strategies: even and uneven. Which is better for measuring a period?

[Figure: the true signal y(t) on t ∈ [0, 1], with the even and uneven data overlaid.]


Specifics

Let τ = log10(period).

I knew H(τ) because I chose the prior. I used the algorithm to estimate H(τ|d), so marginal posteriors were the distributions whose entropies I estimated.²

I then computed the mutual information

I(\tau; d) = H(\tau) - H(\tau|d) \qquad (20)

²If you only care about one parameter, define your distance function in terms of that parameter only!


Result

I_{\text{even}} = 5.441 \pm 0.038 \text{ nats} \qquad (21)
I_{\text{uneven}} = 5.398 \pm 0.038 \text{ nats} \qquad (22)

i.e., any difference is trivial.


In practice...

When a period is short relative to observations, you can get a multimodal posterior pdf³, and ‘learn a lot’ by ruling out most of the space, while still having many peaks.

I did not investigate this aspect of the problem; I assumed long periods.

³e.g., see Larry Bretthorst’s work connecting the posterior pdf to the periodogram.


Paper/software/endorsement

http://www.mdpi.com/1099-4300/19/8/422
https://github.com/eggplantbren/InfoNest


Thanks

Ruth Angus (Flatiron Institute), Ewan Cameron (Oxford), James Curran (Auckland), Tom Elliott (Auckland), David Hogg (NYU), Kevin Knuth (SUNY Albany), Thomas Lumley (Auckland), Iain Murray (Edinburgh), John Skilling (Maximum Entropy Data Consultants), Jared Tobin (jtobin.io).
