Computing Entropies with Nested Sampling
Brendon J. Brewer
Department of Statistics, The University of Auckland
https://www.stat.auckland.ac.nz/~brewer/
Brendon J. Brewer Computing Entropies with Nested Sampling
What is entropy?
Firstly, I’m talking about information theory, not thermodynamics (though the two are connected).
Information Theory
The fundamental theorem of information theory
Theorem: If you take the log of a probability, it seems like you understand profound truths.
Shannon entropy
Consider a discrete probability distribution with probabilities p = {p_i}. The Shannon entropy is

H(p) = −∑_i p_i log p_i    (1)
It is a real-valued property of the distribution.
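Equation (1) is a one-liner to compute. A minimal sketch in Python (the function name is my own):

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    assert abs(sum(p) - 1.0) < 1e-12, "probabilities must sum to 1"
    # Terms with p_i = 0 contribute zero (the limit of p log p as p -> 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Three equally likely outcomes: H = log(3) ≈ 1.0986 nats.
print(shannon_entropy([1/3, 1/3, 1/3]))
```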
Relative entropy
Consider two discrete probability distributions with probabilities p = {p_i} and q = {q_i}. The relative entropy is

H(p; q) = −∑_i p_i log(p_i / q_i)    (2)

Without the minus sign, it’s the ‘Kullback-Leibler divergence’, and is more fundamental than the Shannon entropy. With uniform q, it reduces to the Shannon entropy (up to an additive constant).
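The uniform-q reduction can be checked numerically. A sketch (function names and the example distribution are my own):

```python
import math

def relative_entropy(p, q):
    """H(p; q) = -sum_i p_i log(p_i / q_i); negate it for the KL divergence."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

p = [0.5, 0.3, 0.2]
q_uniform = [1/3, 1/3, 1/3]

# With uniform q over N outcomes, H(p; q) = H(p) - log(N):
# the Shannon entropy up to an additive constant.
print(relative_entropy(p, q_uniform))
print(shannon_entropy(p) - math.log(3))
```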
Entropy quantifies uncertainty
If there are just N equally likely possibilities, i.e., p_i = 1/N, then H = log N.
[Figure: probability vs x for three equally likely outcomes; H = 1.0986 nats]
[Figure: probability vs x for 21 equally likely outcomes; H = 3.0445 nats]
[Figure: probability vs x for 21 equally likely outcomes spread over a wider range; again H = 3.0445 nats]
Heuristic: standard deviation quantifies uncertainty ‘horizontally’, entropy does it ‘vertically’.
What about densities?
We get ‘differential entropy’:

H = −∫ f(x) log f(x) dx    (3)

This generalises log-volume, as defined with respect to dx.

Important: Differential entropy is coordinate-system dependent.
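The coordinate dependence is easy to see numerically: a uniform distribution on [0, 1] has H = 0, but stretch the coordinate by a factor of 2 and H becomes log 2. A minimal sketch (the Riemann-sum integrator is my own illustrative choice):

```python
import math

def differential_entropy(f, a, b, n=100_000):
    """Midpoint Riemann sum for H = -integral of f(x) log f(x) dx on [a, b]."""
    dx = (b - a) / n
    h = 0.0
    for i in range(n):
        fx = f(a + (i + 0.5) * dx)
        if fx > 0.0:
            h -= fx * math.log(fx) * dx
    return h

# Uniform density on [0, 1]: H = 0 nats.
print(differential_entropy(lambda x: 1.0, 0.0, 1.0))
# The same distribution after the change of coordinates x -> 2x:
# density 1/2 on [0, 2], and now H = log 2 ≈ 0.693 nats.
print(differential_entropy(lambda x: 0.5, 0.0, 2.0))
```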
Some entropies in Bayesian statistics
Written in terms of parameters θ and data d, for Bayesian purposes:

1. Entropy of the prior for the parameters, H(θ)
2. Entropy of the conditional prior for the data, H(d|θ)
3. Entropy of the posterior, H(θ|d)
4. Entropy of the prior for the data, H(d)
Some entropies in Bayesian statistics
Remark: Conditional entropies such as (2) and (3) are defined using an expectation over the second argument (the thing conditioned on).
Interpretation of conditional entropies
How uncertain would the question “what’s the value of θ, precisely?” be if the question “what’s the value of d, precisely?” were to be resolved?
Connections
Entropy of the joint prior:
H(θ, d) = H(θ) + H(d|θ)    (4)
        = H(d) + H(θ|d).    (5)

Mutual information:

I(θ; d) = H(θ) − H(θ|d).    (6)

This quantifies dependence — or more fundamentally, relevance, or the potential for learning. There are many other ways of expressing I.
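These identities can be verified on a small joint distribution. A sketch (the 2×2 joint probabilities are my own illustrative numbers):

```python
import math

def H(probs):
    """Shannon entropy of a collection of probabilities, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A toy joint distribution p(theta, d) over 2 x 2 outcomes.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_theta = [sum(v for (t, _), v in joint.items() if t == i) for i in (0, 1)]
p_d     = [sum(v for (_, d), v in joint.items() if d == j) for j in (0, 1)]

H_joint = H(joint.values())
H_theta, H_d = H(p_theta), H(p_d)
# Chain rule, equations (4) and (5): H(θ,d) = H(θ) + H(d|θ) = H(d) + H(θ|d).
H_theta_given_d = H_joint - H_d
H_d_given_theta = H_joint - H_theta

# Two equivalent expressions for the mutual information.
I1 = H_theta - H_theta_given_d          # equation (6)
I2 = H_theta + H_d - H_joint
print(I1, I2)
```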
Pre-data considerations
We might want to know how relevant the data is to the parameters, before learning the data. We might want to optimise that quantity for experimental design. But it’s nasty, especially if there are nuisance parameters.
Hard integrals
E.g.
H(θ|d) = −∫ p(d) ∫ p(θ|d) log p(θ|d) dθ dd    (7)
       = −∫ p(d) ∫ p(θ|d) log[ p(θ) p(d|θ) / p(d) ] dθ dd    (8)

But p(d), sitting there inside a logarithm, is already supposed to be a hard integral (the marginal likelihood / evidence)...

p(d) = ∫ p(θ) p(d|θ) dθ    (9)
Hard integrals
Hard integrals with nuisance parameters η, interesting parameter(s) φ.
Marginal Likelihood Integral
Nested Sampling was invented in order to do this hard integral:

p(d) = ∫ p(θ) p(d|θ) dθ    (10)

or

Z = ∫ π(θ) L(θ) dθ    (11)

where π = prior, L = likelihood. It’s just an expectation. Why is it hard?
Simple Monte Carlo fails
Z = ∫ π(θ) L(θ) dθ    (12)
  ≈ (1/N) ∑_{i=1}^N L(θ_i)    (13)

with θ_i ∼ π will probably miss the tiny regions where L is high. Equivalently, π implies a very heavy-tailed distribution of L-values, and simple Monte Carlo fails.
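The failure mode is easy to reproduce. A sketch with a toy spike likelihood (my own construction, scaled so the true Z is exactly 1):

```python
import random

random.seed(0)

# Prior π: Uniform(0, 1). Likelihood: a huge spike of width 1e-9
# at θ = 0.5, with height 1e9, so the true evidence is Z = 1.
def L(x):
    return 1e9 if abs(x - 0.5) < 5e-10 else 0.0

# Simple Monte Carlo, equation (13): average L over prior draws.
N = 100_000
estimate = sum(L(random.random()) for _ in range(N)) / N
print(estimate)   # almost certainly 0.0: no sample lands in the spike
```

With overwhelming probability no draw hits the spike and the estimate is 0; on the rare run that does hit, the estimate is enormous instead. That is the heavy-tailed distribution of L-values in action.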
Nested Sampling
Nested Sampling takes the original problem and constructs a 1D problem from it:

Z = ∫_0^1 L(X) dX    (14)

where

X(ℓ) = ∫_{L(θ) > ℓ} π(θ) dθ    (15)

The meaning of X: X(ℓ) is the amount of prior mass whose likelihood exceeds ℓ. As ℓ increases, X decreases.
Nested Sampling
Figure from Skilling (2006).
Since X(ℓ) is the CDF of L-values implied by π, points θ ∼ π have a uniform distribution over X.
Nested Sampling
The idea is to generate a sequence of points with increasing likelihoods, such that we can estimate their X values. Since we know their L-values, we can then do the integral numerically:

Z = ∫_0^1 L(X) dX    (16)
Nested Sampling algorithm
1. Generate N points from π.
2. Find the worst one (lowest likelihood L*), save it.
3. Estimate its X using the Beta(N, 1) distribution (from order statistics).
4. Record the worst particle and its X-value, then discard it.
5. Replace that point with a new one from π but with likelihood above L*.
6. Repeat steps 2–5 indefinitely.
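The steps above can be sketched for a toy 1D problem. Everything problem-specific here is my own choice: the sharply peaked Gaussian likelihood with a Uniform(0, 1) prior (true Z ≈ 0.01·√(2π) ≈ 0.02507), the deterministic X_t ≈ exp(−t/N) assignment in place of the Beta(N, 1) draw, and direct sampling of the constrained interval in step 5 (which only works because this L is unimodal):

```python
import math
import random

random.seed(1)

def L(x):
    """Sharply peaked likelihood; with a Uniform(0,1) prior, Z ≈ 0.02507."""
    return math.exp(-0.5 * ((x - 0.5) / 0.01) ** 2)

N = 100                                   # step 1: N points from the prior
live = [random.uniform(0.0, 1.0) for _ in range(N)]

Z = 0.0
X_prev = 1.0
for it in range(1, 1501):
    worst = min(live, key=L)              # step 2: lowest likelihood L*
    Lstar = L(worst)
    X = math.exp(-it / N)                 # step 3 (simplified): E[ln X] = -it/N
    Z += Lstar * (X_prev - X)             # step 4: accumulate the quadrature sum
    X_prev = X
    live.remove(worst)
    # Step 5: a new prior draw with likelihood above L*. Here {L > L*}
    # is an interval around 0.5, so we sample it directly.
    w = 0.01 * math.sqrt(-2.0 * math.log(Lstar)) if Lstar > 0.0 else 0.5
    lo, hi = max(0.0, 0.5 - w), min(1.0, 0.5 + w)
    live.append(random.uniform(lo, hi))

print(Z)   # close to the true value ≈ 0.02507
```

Step 5 is where real implementations differ (MCMC within the constraint, ellipsoidal sampling, etc.); the interval trick is purely a convenience of this toy problem.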
The sequence of X -values
The sequence of X-values, suitably transformed, has a Poisson process distribution with rate N:

− ln(X_1) ∼ Exponential(N)    (17)
− ln(X_2) ∼ − ln(X_1) + Exponential(N)    (18)
− ln(X_3) ∼ − ln(X_2) + Exponential(N)    (19)
(forgive the notational abuse)
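A quick simulation checks the rate: summing t Exponential(N) increments gives −ln(X_t) with mean t/N. (The particular N, t, and repetition count below are my own choices.)

```python
import random

random.seed(2)

N = 100          # number of live points (the Poisson process rate)
t = 50           # iteration number
reps = 20_000

# Simulate -ln(X_t) as a sum of t Exponential(N) increments,
# per equations (17)-(19), and check that its mean is t/N.
total = 0.0
for _ in range(reps):
    total += sum(random.expovariate(N) for _ in range(t))

mean_neg_lnX = total / reps
print(mean_neg_lnX)   # ≈ t/N = 0.5
```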
Poisson process view of NS
The number of NS iterations taken to enter a small region (defined by a likelihood threshold) is an unbiased estimator of the log-probability of that region!

Also, π(θ) can be any distribution (needn’t be a prior) and L(θ) any scalar function. Opportunities here...
My algorithm
To compute H(θ) = −∫ f(θ) log f(θ) dθ when f can be sampled but not evaluated:

1. Generate a ‘reference point’ θref from f.
2. Do a Nested Sampling run with f as “prior” and minus the distance to θref as “likelihood”.
3. Measure how many NS iterations were needed to make the distance to θref really small, and divide by N. That gives an unbiased estimate of the log-prob near θref.
4. Repeat steps 1–3 many times.
5. Average the estimated log-probs, then apply corrections to convert to density.
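A toy sketch of these steps, with everything problem-specific my own choice: the target f = Uniform(0, 2) (so H = ln 2 exactly), the distance threshold r, and direct sampling of the distance-constrained interval (the real algorithm would use constrained MCMC, since f can only be sampled). The ‘correction’ in step 5 is here the log-volume of the small ball, converting log-mass to log-density:

```python
import math
import random

random.seed(3)

# Toy target: f = Uniform(0, 2), whose entropy is H = ln 2 exactly.
# We pretend we can only *sample* f, not evaluate its density.
A, B = 0.0, 2.0
N = 100            # NS live points
r = 1e-3           # "really small" distance threshold (my choice)

def estimate_log_density(theta_ref):
    """One NS run: f as 'prior', minus distance to theta_ref as 'likelihood'.
    Returns an estimate of log f(theta_ref)."""
    live = [random.uniform(A, B) for _ in range(N)]
    iters = 0
    while True:
        worst = max(live, key=lambda x: abs(x - theta_ref))
        dstar = abs(worst - theta_ref)
        if dstar < r:                     # all live points are within r
            break
        iters += 1
        live.remove(worst)
        # New point from f restricted to distance < dstar: an interval here.
        lo, hi = max(A, theta_ref - dstar), min(B, theta_ref + dstar)
        live.append(random.uniform(lo, hi))
    # -iters/N estimates ln P(|theta - theta_ref| < r); subtract the
    # log ball volume to convert prior mass to density.
    ball = min(B, theta_ref + r) - max(A, theta_ref - r)
    return -iters / N - math.log(ball)

runs = 30
H_est = -sum(estimate_log_density(random.uniform(A, B)) for _ in range(runs)) / runs
print(H_est)   # ≈ ln 2 ≈ 0.693
```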
Movie
Play movie.mkv (it’s also on YouTube).
Toy experimental design example
Two observing strategies — even and uneven. Which is better for measuring a period?
[Figure: y vs t, showing the true signal with the ‘even data’ and ‘uneven data’ observing strategies]
Specifics
Let τ = log10(period).

I knew H(τ) because I chose the prior. I used the algorithm to estimate H(τ|d), so marginal posteriors were the distributions whose entropies I estimated. (If you only care about one parameter, define your distance function in terms of that parameter only!)

I then computed the mutual information

I(τ; d) = H(τ) − H(τ|d)    (20)
Result
I_even = 5.441 ± 0.038 nats    (21)
I_uneven = 5.398 ± 0.038 nats    (22)
i.e., any difference is trivial.
In practice...
When a period is short relative to observations, you can get a multimodal posterior pdf (e.g., see Larry Bretthorst’s work connecting the posterior pdf to the periodogram), and ‘learn a lot’ by ruling out most of the space, but still having many peaks.

I did not investigate this aspect of the problem — I assumed long periods.
Paper/software/endorsement
Paper: http://www.mdpi.com/1099-4300/19/8/422
Software: https://github.com/eggplantbren/InfoNest
Thanks
Ruth Angus (Flatiron Institute), Ewan Cameron (Oxford), James Curran (Auckland), Tom Elliott (Auckland), David Hogg (NYU), Kevin Knuth (SUNY Albany), Thomas Lumley (Auckland), Iain Murray (Edinburgh), John Skilling (Maximum Entropy Data Consultants), Jared Tobin (jtobin.io).