Computing Entropies with Nested Sampling
Brendon J. Brewer
Department of Statistics, The University of Auckland
https://www.stat.auckland.ac.nz/~brewer/
Brendon J. Brewer Computing Entropies with Nested Sampling
What is entropy?
Firstly, I’m talking about information theory, not thermodynamics (though the two are connected).
Information Theory
The fundamental theorem of information theory
Theorem: If you take the log of a probability, it seems like you understand profound truths.
Shannon entropy
Consider a discrete probability distribution with probabilities p = {p_i}. The Shannon entropy is

H(p) = −∑_i p_i log p_i    (1)
It is a real-valued property of the distribution.
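Equation (1) is a one-liner to compute. A minimal sketch in Python (the function name is my own):

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    assert abs(sum(p) - 1.0) < 1e-12, "probabilities must sum to 1"
    # Terms with p_i = 0 contribute zero (the limit of p log p as p -> 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Three equally likely outcomes: H = log(3) ≈ 1.0986 nats.
print(shannon_entropy([1/3, 1/3, 1/3]))
```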
Relative entropy
Consider two discrete probability distributions with probabilities p = {p_i} and q = {q_i}. The relative entropy is

H(p; q) = −∑_i p_i log(p_i / q_i)    (2)

Without the minus sign, it’s the ‘Kullback-Leibler divergence’, and is more fundamental than the Shannon entropy. With uniform q, it reduces to the Shannon entropy (up to an additive constant).
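The uniform-q reduction can be checked numerically. A sketch (function names and the example distribution are my own):

```python
import math

def relative_entropy(p, q):
    """H(p; q) = -sum_i p_i log(p_i / q_i); negate it for the KL divergence."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

p = [0.5, 0.3, 0.2]
q_uniform = [1/3, 1/3, 1/3]

# With uniform q over N outcomes, H(p; q) = H(p) - log(N):
# the Shannon entropy up to an additive constant.
print(relative_entropy(p, q_uniform))
print(shannon_entropy(p) - math.log(3))
```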
Entropy quantifies uncertainty
If there are just N equally likely possibilities, i.e., p_i = 1/N, then H = log N.
[Figure: probability vs x for three equally likely outcomes; H = 1.0986 nats]
[Figure: probability vs x for 21 equally likely outcomes; H = 3.0445 nats]
[Figure: probability vs x for 21 equally likely outcomes spread over a wider range; again H = 3.0445 nats]
Heuristic: standard deviation quantifies uncertainty ‘horizontally’, entropy does it ‘vertically’.
What about densities?
We get ‘differential entropy’:

H = −∫ f(x) log f(x) dx    (3)

This generalises log-volume, as defined with respect to dx.

Important: Differential entropy is coordinate-system dependent.
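The coordinate dependence is easy to see numerically: a uniform distribution on [0, 1] has H = 0, but stretch the coordinate by a factor of 2 and H becomes log 2. A minimal sketch (the Riemann-sum integrator is my own illustrative choice):

```python
import math

def differential_entropy(f, a, b, n=100_000):
    """Midpoint Riemann sum for H = -integral of f(x) log f(x) dx on [a, b]."""
    dx = (b - a) / n
    h = 0.0
    for i in range(n):
        fx = f(a + (i + 0.5) * dx)
        if fx > 0.0:
            h -= fx * math.log(fx) * dx
    return h

# Uniform density on [0, 1]: H = 0 nats.
print(differential_entropy(lambda x: 1.0, 0.0, 1.0))
# The same distribution after the change of coordinates x -> 2x:
# density 1/2 on [0, 2], and now H = log 2 ≈ 0.693 nats.
print(differential_entropy(lambda x: 0.5, 0.0, 2.0))
```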
Some entropies in Bayesian statistics
Written in terms of parameters θ and data d, for Bayesian purposes:

1. Entropy of the prior for the parameters, H(θ)
2. Entropy of the conditional prior for the data, H(d|θ)
3. Entropy of the posterior, H(θ|d)
4. Entropy of the prior for the data, H(d)
Some entropies in Bayesian statistics
Remark: Conditional entropies such as (2) and (3) are defined using an expectation over the second argument (the thing conditioned on).
Interpretation of conditional entropies
How uncertain would the question “what’s the value of θ, precisely?” be if the question “what’s the value of d, precisely?” were to be resolved?
Connections
Entropy of the joint prior:
H(θ, d) = H(θ) + H(d|θ)    (4)
        = H(d) + H(θ|d).    (5)

Mutual information:

I(θ; d) = H(θ) − H(θ|d).    (6)

This quantifies dependence — or more fundamentally, relevance, or the potential for learning. There are many other ways of expressing I.
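These identities can be verified on a small joint distribution. A sketch (the 2×2 joint probabilities are my own illustrative numbers):

```python
import math

def H(probs):
    """Shannon entropy of a collection of probabilities, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A toy joint distribution p(theta, d) over 2 x 2 outcomes.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_theta = [sum(v for (t, _), v in joint.items() if t == i) for i in (0, 1)]
p_d     = [sum(v for (_, d), v in joint.items() if d == j) for j in (0, 1)]

H_joint = H(joint.values())
H_theta, H_d = H(p_theta), H(p_d)
# Chain rule, equations (4) and (5): H(θ,d) = H(θ) + H(d|θ) = H(d) + H(θ|d).
H_theta_given_d = H_joint - H_d
H_d_given_theta = H_joint - H_theta

# Two equivalent expressions for the mutual information.
I1 = H_theta - H_theta_given_d          # equation (6)
I2 = H_theta + H_d - H_joint
print(I1, I2)
```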
Pre-data considerations
We might want to know how relevant the data is to the parameters, before learning the data. We might want to optimise that quantity for experimental design. But it’s nasty, especially if there are nuisance parameters.
Hard integrals
E.g.
H(θ|d) = −∫ p(d) ∫ p(θ|d) log p(θ|d) dθ dd    (7)
       = −∫ p(d) ∫ p(θ|d) log[ p(θ) p(d|θ) / p(d) ] dθ dd    (8)

But p(d), sitting there inside a logarithm, is already supposed to be a hard integral (the marginal likelihood / evidence)...

p(d) = ∫ p(θ) p(d|θ) dθ    (9)
Hard integrals
Hard integrals with nuisance parameters η, interesting parameter(s) φ.
Marginal Likelihood Integral
Nested Sampling was invented in order to do this hard integral:

p(d) = ∫ p(θ) p(d|θ) dθ    (10)

or

Z = ∫ π(θ) L(θ) dθ    (11)

where π = prior, L = likelihood. It’s just an expectation. Why is it hard?
Simple Monte Carlo fails
Z = ∫ π(θ) L(θ) dθ    (12)
  ≈ (1/N) ∑_{i=1}^N L(θ_i)    (13)

with θ_i ∼ π will probably miss the tiny regions where L is high. Equivalently, π implies a very heavy-tailed distribution of L-values, and simple Monte Carlo fails.
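The failure mode is easy to reproduce. A sketch with a toy spike likelihood (my own construction, scaled so the true Z is exactly 1):

```python
import random

random.seed(0)

# Prior π: Uniform(0, 1). Likelihood: a huge spike of width 1e-9
# at θ = 0.5, with height 1e9, so the true evidence is Z = 1.
def L(x):
    return 1e9 if abs(x - 0.5) < 5e-10 else 0.0

# Simple Monte Carlo, equation (13): average L over prior draws.
N = 100_000
estimate = sum(L(random.random()) for _ in range(N)) / N
print(estimate)   # almost certainly 0.0: no sample lands in the spike
```

With overwhelming probability no draw hits the spike and the estimate is 0; on the rare run that does hit, the estimate is enormous instead. That is the heavy-tailed distribution of L-values in action.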
Nested Sampling
Nested Sampling takes the original problem and constructs a 1D problem from it:

Z = ∫_0^1 L(X) dX    (14)

where

X(ℓ) = ∫_{L(θ) > ℓ} π(θ) dθ    (15)

The meaning of X: X(ℓ) is the amount of prior mass whose likelihood exceeds ℓ. As ℓ increases, X decreases.
Nested Sampling
Figure from Skilling (2006).
Since X(ℓ) is the CDF of L-values implied by π, points θ ∼ π have a uniform distribution over X.
Nested Sampling
The idea is to generate a sequence of points with increasing likelihoods, such that we can estimate their X values. Since we know their L-values, we can then do the integral numerically:

Z = ∫_0^1 L(X) dX    (16)
Nested Sampling algorithm
1. Generate N points from π.
2. Find the worst one (lowest likelihood L*), save it.
3. Estimate its X using the Beta(N, 1) distribution (from order statistics).
4. Record the worst particle and its X-value, then discard it.
5. Replace that point with a new one from π but with likelihood above L*.
6. Repeat steps 2–5 indefinitely.
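The steps above can be sketched for a toy 1D problem. Everything problem-specific here is my own choice: the sharply peaked Gaussian likelihood with a Uniform(0, 1) prior (true Z ≈ 0.01·√(2π) ≈ 0.02507), the deterministic X_t ≈ exp(−t/N) assignment in place of the Beta(N, 1) draw, and direct sampling of the constrained interval in step 5 (which only works because this L is unimodal):

```python
import math
import random

random.seed(1)

def L(x):
    """Sharply peaked likelihood; with a Uniform(0,1) prior, Z ≈ 0.02507."""
    return math.exp(-0.5 * ((x - 0.5) / 0.01) ** 2)

N = 100                                   # step 1: N points from the prior
live = [random.uniform(0.0, 1.0) for _ in range(N)]

Z = 0.0
X_prev = 1.0
for it in range(1, 1501):
    worst = min(live, key=L)              # step 2: lowest likelihood L*
    Lstar = L(worst)
    X = math.exp(-it / N)                 # step 3 (simplified): E[ln X] = -it/N
    Z += Lstar * (X_prev - X)             # step 4: accumulate the quadrature sum
    X_prev = X
    live.remove(worst)
    # Step 5: a new prior draw with likelihood above L*. Here {L > L*}
    # is an interval around 0.5, so we sample it directly.
    w = 0.01 * math.sqrt(-2.0 * math.log(Lstar)) if Lstar > 0.0 else 0.5
    lo, hi = max(0.0, 0.5 - w), min(1.0, 0.5 + w)
    live.append(random.uniform(lo, hi))

print(Z)   # close to the true value ≈ 0.02507
```

Step 5 is where real implementations differ (MCMC within the constraint, ellipsoidal sampling, etc.); the interval trick is purely a convenience of this toy problem.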
The sequence of X -values
The sequence of X-values, suitably transformed, has a Poisson process distribution with rate N:

− ln(X_1) ∼ Exponential(N)    (17)
− ln(X_2) ∼ − ln(X_1) + Exponential(N)    (18)
− ln(X_3) ∼ − ln(X_2) + Exponential(N)    (19)
(forgive the notational abuse)
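A quick simulation checks the rate: summing t Exponential(N) increments gives −ln(X_t) with mean t/N. (The particular N, t, and repetition count below are my own choices.)

```python
import random

random.seed(2)

N = 100          # number of live points (the Poisson process rate)
t = 50           # iteration number
reps = 20_000

# Simulate -ln(X_t) as a sum of t Exponential(N) increments,
# per equations (17)-(19), and check that its mean is t/N.
total = 0.0
for _ in range(reps):
    total += sum(random.expovariate(N) for _ in range(t))

mean_neg_lnX = total / reps
print(mean_neg_lnX)   # ≈ t/N = 0.5
```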
Poisson process view of NS
The number of NS iterations taken to enter a small region (defined by a likelihood threshold) is an unbiased estimator of the log-probability of that region!

Also, π(θ) can be any distribution (needn’t be a prior) and L(θ) any scalar function. Opportunities here...
My algorithm
To compute H(θ) = −∫ f(θ) log f(θ) dθ when f can be sampled but not evaluated:

1. Generate a ‘reference point’ θref from f.
2. Do a Nested Sampling run with f as “prior” and minus the distance to θref as “likelihood”.
3. Measure how many NS iterations were needed to make the distance to θref really small, and divide by N. That gives an unbiased estimate of the log-prob near θref.
4. Repeat steps 1–3 many times.
5. Average the estimated log-probs, then apply corrections to convert to density.
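A toy sketch of these steps, with everything problem-specific my own choice: the target f = Uniform(0, 2) (so H = ln 2 exactly), the distance threshold r, and direct sampling of the distance-constrained interval (the real algorithm would use constrained MCMC, since f can only be sampled). The ‘correction’ in step 5 is here the log-volume of the small ball, converting log-mass to log-density:

```python
import math
import random

random.seed(3)

# Toy target: f = Uniform(0, 2), whose entropy is H = ln 2 exactly.
# We pretend we can only *sample* f, not evaluate its density.
A, B = 0.0, 2.0
N = 100            # NS live points
r = 1e-3           # "really small" distance threshold (my choice)

def estimate_log_density(theta_ref):
    """One NS run: f as 'prior', minus distance to theta_ref as 'likelihood'.
    Returns an estimate of log f(theta_ref)."""
    live = [random.uniform(A, B) for _ in range(N)]
    iters = 0
    while True:
        worst = max(live, key=lambda x: abs(x - theta_ref))
        dstar = abs(worst - theta_ref)
        if dstar < r:                     # all live points are within r
            break
        iters += 1
        live.remove(worst)
        # New point from f restricted to distance < dstar: an interval here.
        lo, hi = max(A, theta_ref - dstar), min(B, theta_ref + dstar)
        live.append(random.uniform(lo, hi))
    # -iters/N estimates ln P(|theta - theta_ref| < r); subtract the
    # log ball volume to convert prior mass to density.
    ball = min(B, theta_ref + r) - max(A, theta_ref - r)
    return -iters / N - math.log(ball)

runs = 30
H_est = -sum(estimate_log_density(random.uniform(A, B)) for _ in range(runs)) / runs
print(H_est)   # ≈ ln 2 ≈ 0.693
```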
Movie
Play movie.mkv (it’s also on YouTube).
Toy experimental design example
Two observing strategies — even and uneven. Which is better for measuring a period?
[Figure: y vs t, showing the true signal with the ‘even data’ and ‘uneven data’ observing strategies]
Specifics
Let τ = log10(period).

I knew H(τ) because I chose the prior. I used the algorithm to estimate H(τ|d), so marginal posteriors were the distributions whose entropies I estimated. (If you only care about one parameter, define your distance function in terms of that parameter only!)

I then computed the mutual information

I(τ; d) = H(τ) − H(τ|d)    (20)
Result
I_even = 5.441 ± 0.038 nats    (21)
I_uneven = 5.398 ± 0.038 nats    (22)
i.e., any difference is trivial.
In practice...
When a period is short relative to observations, you can get a multimodal posterior pdf (e.g., see Larry Bretthorst’s work connecting the posterior pdf to the periodogram), and ‘learn a lot’ by ruling out most of the space, but still having many peaks.

I did not investigate this aspect of the problem — I assumed long periods.
Paper/software/endorsement
Paper: http://www.mdpi.com/1099-4300/19/8/422
Software: https://github.com/eggplantbren/InfoNest
Thanks
Ruth Angus (Flatiron Institute), Ewan Cameron (Oxford), James Curran (Auckland), Tom Elliott (Auckland), David Hogg (NYU), Kevin Knuth (SUNY Albany), Thomas Lumley (Auckland), Iain Murray (Edinburgh), John Skilling (Maximum Entropy Data Consultants), Jared Tobin (jtobin.io).