
Page 1: Nested sampling

Nested Sampling for General Bayesian Computation

Presented by WU Changye

12 February 2015

Page 2: Nested sampling


Nested Sampling

Posterior Simulation

Nested Sampling Termination and Size of N

Numerical Examples

Conclusion

Page 3: Nested sampling

Introduction

In the Bayesian paradigm, the parameter θ follows the prior distribution π, and the observations y follow the distribution L(y|θ) given θ. The posterior distribution f(θ|y), the distribution of θ given the observations y, then has the form

f(θ|y) = L(y|θ)π(θ) / ∫_Θ L(y|θ)π(θ) dθ

The objective of nested sampling is to compute the "evidence":

Z = ∫_Θ L(y|θ)π(θ) dθ
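Since Z is a prior expectation of the likelihood, the most naive estimate is a plain Monte Carlo average over prior draws. A minimal sketch on a toy Beta-Bernoulli model where Z is known in closed form (the model, the data n and k, and the sample size are all assumptions for illustration):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, k = 10, 3   # toy data: k successes in n Bernoulli trials (assumed example)

def L(theta):
    # Bernoulli likelihood of the data as a function of the success probability
    return theta**k * (1 - theta) ** (n - k)

theta = rng.uniform(size=1_000_000)    # draws from the Uniform(0,1) prior
Z_mc = L(theta).mean()                 # Z = E_pi[L(theta)]

# exact evidence: the Beta function B(k+1, n-k+1)
Z_exact = factorial(k) * factorial(n - k) / factorial(n + 1)
```

Plain prior sampling works here because the likelihood is not concentrated; the rest of the slides explain why this fails when it is.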

Page 4: Nested sampling

Since θ is a random variable,

Z = E_π(L(θ))

For simplicity, let L(θ) denote the likelihood L(y|θ). The cumulative distribution function of L(θ) is

F(λ) = ∫_{L(θ)<λ} π(θ) dθ

Define the measure µ on R induced by the likelihood function and the prior as follows:

µ(A) = P_π(L(θ) ∈ A)

Page 5: Nested sampling

Lemma 1: E_π(L(θ)) = E_µ(X).
Proof: Let g be the indicator function of a measurable set A in R. Then

E_π(g(L(θ))) = E_π(I_A(L(θ))) = ∫_{L(θ)∈A} π(θ) dθ

However, µ(dx) = ∫_Θ δ_{L(θ)}(dx) π(θ) dθ, so

E_µ(g(X)) = ∫_R I_A(x) µ(dx) = ∫_Θ ( ∫_R I_A(x) δ_{L(θ)}(dx) ) π(θ) dθ

Therefore,

E_µ(g(X)) = E_π(I_A(L(θ))) = E_π(g(L(θ)))

Page 6: Nested sampling

In the general case, let {g_n} be an increasing sequence of step functions converging to the identity function Id; then {g_n ∘ L} is an increasing sequence of step functions converging to L, and the desired conclusion follows by taking limits (monotone convergence).

Page 7: Nested sampling

Lemma 2: If X is a positive-valued random variable with p.d.f. f and c.d.f. F, then

∫_0^∞ (1 − F(x)) dx = ∫_0^∞ x f(x) dx = E(X).

Proof:

∫_0^∞ (1 − F(x)) dx = ∫_0^∞ (1 − P(X < x)) dx
                    = ∫_0^∞ P(X ≥ x) dx
                    = ∫_0^∞ ∫_x^∞ f(y) dy dx
                    = ∫_0^∞ f(y) ∫_0^y dx dy
                    = ∫_0^∞ y f(y) dy = E(X)

Page 8: Nested sampling

According to Lemmas 1 and 2,

Z = E_µ(X) = ∫_0^∞ x dF(x) = ∫_0^∞ (1 − F(x)) dx

Let ϕ^(-1)(x) = 1 − F(x) = P{θ : L(θ) > x}. Then

Z = ∫_0^∞ ϕ^(-1)(x) dx = ∫_0^1 ϕ(x) dx

Therefore, the evidence is represented as a one-dimensional integral.

Page 9: Nested sampling

To compute the integral

J = ∫_0^1 ϕ(x) dx

three sampling-based methods are available.

Page 10: Nested sampling

1) Importance sampling: for i = 1, ..., n, draw U_i ∼ U[0,1] and set

J_1 = (1/n) Σ_{i=1}^n ϕ(U_i)

2) Riemann approximation: for i = 1, ..., n, draw U_i ∼ U[0,1]; let U_(i) denote the order statistics of (U_1, ..., U_n), with U_(1) ≤ ... ≤ U_(n), and set

J_2 = Σ_{i=1}^{n−1} ϕ(U_(i)) (U_(i+1) − U_(i))

3) A more elaborate method: set x_0 = 1.
Step 1: for i = 1, ..., N, draw U_i^1 ∼ U[0,1] and set x_1 = max{U_1^1, ..., U_N^1}.
Step 2: for i = 1, ..., N, draw U_i^2 ∼ U[0,x_1] and set x_2 = max{U_1^2, ..., U_N^2}.
......
Step n: for i = 1, ..., N, draw U_i^n ∼ U[0,x_{n−1}] and set x_n = max{U_1^n, ..., U_N^n}. Then

J_3 = Σ_{i=1}^n ϕ(x_i) (x_{i−1} − x_i)
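The three estimators can be compared on a toy ϕ whose integral is known; ϕ(x) = e^(−10x), the sample size n, and the pair (N, number of steps) below are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: np.exp(-10 * x)        # a rapidly decreasing phi (assumed toy example)
J_exact = (1 - np.exp(-10)) / 10

n = 100_000
U = rng.uniform(size=n)

# 1) plain Monte Carlo average over uniforms
J1 = phi(U).mean()

# 2) Riemann approximation on the order statistics
Us = np.sort(U)
J2 = np.sum(phi(Us[:-1]) * np.diff(Us))

# 3) the third scheme: shrink x by taking maxima of N uniforms on [0, x]
N, steps = 100, 1500
x, J3 = 1.0, 0.0
for _ in range(steps):
    x_new = rng.uniform(0, x, size=N).max()
    J3 += phi(x_new) * (x - x_new)
    x = x_new
```

The third estimator is noisier here, but unlike the first two its points automatically concentrate where ϕ is still changing, which is what makes it usable when ϕ decreases over many orders of magnitude.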

Page 11: Nested sampling

Nested sampling uses the third method, because ϕ is a decreasing function and in many cases it decreases rapidly.

Figure: Graph of ϕ(x) and the trace of (x_i, ϕ(x_i))

Page 12: Nested sampling

First, we consider the distributions of x_1, ..., x_n. For u ∈ [0, 1],

P(x_1 < u) = P(U_1^1 < u, ..., U_N^1 < u) = Π_{i=1}^N P(U_i^1 < u) = u^N

As a result, the density function of x_1 is

f(x_1) = N x_1^(N−1)

By the same method, we have

f(x_k | x_{k−1}) = (N / x_{k−1}) (x_k / x_{k−1})^(N−1)
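The claim P(x_1 < u) = u^N is easy to check empirically, since x_1 is just the maximum of N uniforms; the value of N, the number of replicates, and the evaluation point are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 5, 200_000
x1 = rng.uniform(size=(n, N)).max(axis=1)   # x1 = max of N uniforms, n replicates

# empirical CDF at u should match u**N, and E[x1] should be N/(N+1)
u = 0.7
emp_cdf = (x1 < u).mean()
emp_mean = x1.mean()
```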

Page 13: Nested sampling

Let t_k = x_k / x_{k−1}. Then

P(t_k ≤ t) = ∫ P(x_k ≤ tx | x_{k−1} = x) f_{x_{k−1}}(x) dx
           = ∫ ∫_0^{tx} f_{x_k|x_{k−1}}(y|x) f_{x_{k−1}}(x) dy dx
           = ∫ ∫_0^{tx} (N/x) (y/x)^(N−1) f_{x_{k−1}}(x) dy dx
           = ∫ t^N f_{x_{k−1}}(x) dx = t^N

Besides,

P(t_k ≤ t | x_{k−1} = x) = P(x_k ≤ tx | x_{k−1} = x) = t^N

As a result, we have t_k ⊥ x_{k−1}.

Page 14: Nested sampling


Moreover, a point estimate for x_k can be written entirely in terms of point estimates for the t_k:

x_k = (x_k / x_{k−1}) × (x_{k−1} / x_{k−2}) × ··· × (x_1 / x_0) × x_0 = t_k · t_{k−1} ··· t_1 · x_0 = ( Π_{i=1}^k t_i ) · x_0

More appropriate to the large ranges common to many problems, log x_k becomes

log x_k = log( ( Π_{i=1}^k t_i ) · x_0 ) = Σ_{i=1}^k log t_i + log x_0

where the logarithmic shrinkage is distributed as

f(log t) = N e^(N log t),  log t ∈ (−∞, 0]

with mean and variance

E(log t) = −1/N,  V(log t) = 1/N²
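The stated moments follow from f(log t) = N e^(N log t), and a quick simulation confirms them (N and the number of replicates here are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 10, 500_000
t = rng.uniform(size=(n, N)).max(axis=1)   # each t_k is the max of N uniforms on [0,1]
logt = np.log(t)

# expect E(log t) = -1/N and V(log t) = 1/N^2
mean_logt = logt.mean()
var_logt = logt.var()
```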

Page 15: Nested sampling


Taking the mean as the point estimate for each log t_i finally gives

log(x_k / x_0) = −k/N ± √k / N

Parameterizing x_k in terms of the shrinkage proves immediately advantageous: because the log t_i are independent, the errors in the point estimates tend to cancel and the estimates for the x_k grow increasingly more accurate with k. This yields the deterministic assignment

x_k = exp(−k/N)

Page 16: Nested sampling

Next, we consider the distribution of ϕ(X), where X ∼ U[0,1]. Consider the random variable X = ϕ^(-1)(L(θ)), where θ ∼ π. Notice that

ϕ^(-1): [0, L_max] → [0, 1],  λ ↦ P(L(θ) > λ)

For u ∈ [0, 1],

P(X < u) = P(ϕ^(-1)(L(θ)) < u)
         = P(L(θ) > ϕ(u))
         = ϕ^(-1)(ϕ(u))
         = u

This means that ϕ^(-1)(L(θ)) follows U[0,1] and ϕ(X) ∼ L(θ).

Page 17: Nested sampling

Consider now the truncated distribution

π̃(θ) ∝ { π(θ)  if L(θ) > L_0 ;  0  otherwise }

Let X_0 = ϕ^(-1)(L_0) and X = ϕ^(-1)(L(θ)), where θ ∼ π̃. For u ∈ [0, X_0],

P(X < u) = P(ϕ^(-1)(L(θ)) < u | L(θ) > L_0)
         = P(L(θ) > ϕ(u)) / P(L(θ) > L_0)
         = ϕ^(-1)(ϕ(u)) / X_0
         = u / X_0

so X ∼ U[0, X_0]. As a result, ϕ(X) ∼ L(θ), where X ∼ U[0, X_0] and θ ∼ π̃.

Page 18: Nested sampling


Algorithm
The algorithm based on the method discussed in the previous section is as follows:
– Iteration 1: sample N points θ_{1,i} independently from the prior π(θ), determine θ_1 = argmin_{1≤i≤N} L(θ_{1,i}), and set ϕ_1 = L(θ_1).
– Iteration 2: obtain the N current values θ_{2,i} by reproducing the θ_{1,i}'s, except for θ_1, which is replaced by a draw from the prior distribution π conditional upon L(θ) ≥ ϕ_1; then select θ_2 = argmin_{1≤i≤N} L(θ_{2,i}), and set ϕ_2 = L(θ_2).
– Iterate the above step until a given stopping rule is satisfied, for instance when observing very small changes in the approximation Ẑ or when reaching the maximal value of L(θ) when it is known.
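The iterations above can be sketched end to end on a 1-D toy model with a known evidence. Everything below is an assumption for illustration: the model (N(0,1) prior, N(θ,1) likelihood), the rejection-based constrained sampler (only viable in low dimension), and the tuning values N and steps:

```python
import numpy as np

rng = np.random.default_rng(4)
y = 2.0   # toy 1-D model (assumed): theta ~ N(0,1), y | theta ~ N(theta,1)

def L(theta):
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

def draw_prior_above(Lmin):
    # rejection sampling from the prior subject to L(theta) > Lmin;
    # fine for this toy, a constrained MCMC move in general
    while True:
        cand = rng.standard_normal(1000)
        ok = cand[L(cand) > Lmin]
        if ok.size:
            return ok[0]

N, steps = 50, 400
live = rng.standard_normal(N)            # N points drawn from the prior
Z, x_prev = 0.0, 1.0
for k in range(1, steps + 1):
    i = int(np.argmin(L(live)))          # worst live point
    phi_k = L(live[i])
    x_k = np.exp(-k / N)                 # deterministic shrinkage estimate
    Z += phi_k * (x_prev - x_k)
    x_prev = x_k
    live[i] = draw_prior_above(phi_k)    # replace it under the constraint

Z_exact = np.exp(-y**2 / 4) / (2 * np.sqrt(np.pi))
```

The fixed step count stands in for a proper stopping rule, chosen large enough that the unaccumulated mass is negligible for this toy.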

Page 19: Nested sampling

The resulting evidence estimate is

Ẑ = Σ_{i=1}^J ϕ_i (x_{i−1} − x_i)

Page 20: Nested sampling

By-product of Nested Sampling

Skilling indicates that nested sampling provides simulations from the posterior distribution at no extra cost: "the existing sequence of points θ_1, θ_2, θ_3, ... already gives a set of posterior representatives, provided the i'th is assigned the appropriate importance ω_i L_i".

E_π(f(θ)|y) = ∫_Θ π(θ)L(θ)f(θ) dθ / ∫_Θ π(θ)L(θ) dθ

We can use a single run of nested sampling to obtain estimators of both the numerator and the denominator, the latter being the evidence Z. The estimator of the numerator is

Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i f(θ_i)    (1)
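A sketch of this by-product on the same assumed 1-D Gaussian toy model (rejection-based constrained sampler and tuning values again purely illustrative): with f(θ) = θ, the exact posterior mean is y/2, so a single run should recover both Z and E(θ|y) ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(5)
y = 2.0   # toy model (assumed): N(0,1) prior, N(theta,1) likelihood

def L(theta):
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

def draw_prior_above(Lmin):
    while True:                          # rejection from the prior (1-D toy only)
        cand = rng.standard_normal(1000)
        ok = cand[L(cand) > Lmin]
        if ok.size:
            return ok[0]

N, steps = 50, 400
live = rng.standard_normal(N)
Z, num, x_prev = 0.0, 0.0, 1.0
for k in range(1, steps + 1):
    i = int(np.argmin(L(live)))
    theta_k, phi_k = live[i], L(live[i])
    x_k = np.exp(-k / N)
    w = phi_k * (x_prev - x_k)          # weight (x_{k-1} - x_k) * phi_k
    Z += w                              # denominator: the evidence
    num += w * theta_k                  # numerator of (1) with f(theta) = theta
    x_prev = x_k
    live[i] = draw_prior_above(phi_k)

post_mean = num / Z                     # estimate of E(theta | y) = y/2 = 1
```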

Page 21: Nested sampling

Lemma 3 (N. Chopin & C. P. Robert): Let f̄(l) = E_π{f(θ)|L(θ) = l} for l > 0. Then, if f̄ is absolutely continuous,

∫_0^1 ϕ(x) f̄(ϕ(x)) dx = ∫ π(θ)L(θ)f(θ) dθ

Proof: Let ψ: x ↦ x f̄(x). Then

∫ π(θ)L(θ)f(θ) dθ = E_π[ψ{L(θ)}]
                  = ∫_0^{+∞} P_π(ψ{L(θ)} > l) dl
                  = ∫_0^{+∞} ϕ^(-1)(ψ^(-1)(l)) dl = ∫_0^1 ψ(ϕ(x)) dx

Page 22: Nested sampling

Termination

The author suggests that

max(L_1, ..., L_N) X_j < f Ẑ_j  ⟹  termination

where f is some small fraction.
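In code, the rule says that even assigning the largest live likelihood to all the remaining prior mass would change the accumulated evidence by a negligible fraction; the function name and default f below are illustrative choices, not from the slides:

```python
def should_terminate(L_live_max, X_j, Z_j, f=1e-3):
    # stop when the maximal possible remaining contribution, L_live_max * X_j,
    # is below the fraction f of the accumulated evidence Z_j
    return L_live_max * X_j < f * Z_j
```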

Page 23: Nested sampling

Size of N

The larger N is, the smaller the variability of the approximation.

Page 24: Nested sampling

How to sample N points from the constrained parameter space

Use an MCMC method that constructs a Markov chain whose invariant distribution is the truncated distribution.

Page 25: Nested sampling

A decentred Gaussian example

The prior is

π(θ) = Π_{k=1}^d (1/√(2π)) exp(−(θ^(k))²/2)

and the likelihood is

L(y|θ) = Π_{k=1}^d (1/√(2π)) exp(−(y_k − θ^(k))²/2)

In this example, we can calculate the evidence analytically:

Z = ∫_{R^d} L(θ)π(θ) dθ = exp(−Σ_{k=1}^d y_k²/4) / (2^d π^(d/2))
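The closed form can be checked against a plain Monte Carlo average over prior draws; the dimension d and the values of y below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
y = np.array([1.0, 2.0])        # assumed data, d = 2
d = y.size

# Monte Carlo: Z = E_pi[L(theta)] with theta drawn from the N(0, I_d) prior
theta = rng.standard_normal((1_000_000, d))
Lvals = np.exp(-0.5 * np.sum((y - theta) ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
Z_mc = Lvals.mean()

# the analytic evidence from the slide
Z_exact = np.exp(-np.sum(y**2) / 4) / (2**d * np.pi ** (d / 2))
```

For the decentred case (large |y|, e.g. y = 10), the same prior Monte Carlo collapses, which is exactly what the box-plots on the following slides illustrate.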

Page 26: Nested sampling

Figure: Graph of ϕ(x) and the trace of (xi , ϕ(xi )) with d = 1 and y = 10.

Page 27: Nested sampling

Figure: The prior distribution and the likelihood with d = 1 and y = 10.

Page 28: Nested sampling

Figure: box-plot of log Ẑ − log Z with d = 1 and y = 10, for nested sampling and Monte Carlo.

Page 29: Nested sampling

Figure: box-plot of log Ẑ − log Z with d = 5 and y = (3, 3, 3, 3, 3).

Page 30: Nested sampling

A Probit Model

We consider the arsenic dataset and a probit model studied in Chapter 5 of Gelman & Hill (2006). The observations are independent Bernoulli variables y_i such that P(y_i = 1|x_i) = Φ(x_i^T θ), where x_i is a vector of d covariates, θ is a parameter vector of size d, and Φ denotes the standard normal distribution function. In this particular example, d = 7.

Page 31: Nested sampling

The prior is

θ ∼ N(0, 10² I_d)

and the likelihood is

L(θ) = Π_{i=1}^n Φ(x_i^T θ)^{y_i} (1 − Φ(x_i^T θ))^{1−y_i}
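This likelihood is straightforward to code on the log scale; the standard normal CDF below is built from the error function, and the synthetic X and y are assumptions for illustration (the slides use the arsenic data, not reproduced here):

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probit_loglik(theta, X, y):
    # log L(theta) = sum_i [ y_i log Phi(x_i'theta) + (1-y_i) log(1 - Phi(x_i'theta)) ]
    p = np.array([Phi(e) for e in X @ theta])
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# tiny synthetic sanity check: at theta = 0, every p equals 1/2
rng = np.random.default_rng(7)
X = rng.standard_normal((8, 3))
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
ll0 = probit_loglik(np.zeros(3), X, y)
```

Working with log L rather than L is essential here, since the product of n Bernoulli terms underflows quickly.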

Page 32: Nested sampling

Figure: box-plot of log Ẑ with N = 20 for HMC and random-walk MCMC. The blue line marks the true value of log Z (Chib's method).

Page 33: Nested sampling

Posterior Samples

We use the Gaussian example to illustrate this result. Let f(θ) = exp(−3θ + 9d/2).

Figure: box-plot of the log-relative errors log Ẑ − log Z and log Ê(f) − log E(f)

Page 34: Nested sampling

Conclusion

– Nested sampling reverses the accepted approach to Bayesian computation by putting the evidence first.

– Nested sampling samples more sparsely from the prior in regions where the likelihood is low and more densely where the likelihood is high, resulting in greater efficiency than a sampler that draws directly from the prior.

– The procedure runs with an evolving collection of N points, where N can be chosen small for speed or large for accuracy.

– Nested sampling always reduces a multidimensional integral to the integral of a one-dimensional monotonic function, no matter how many dimensions θ occupies, and no matter how strange the shape of the likelihood function L(θ) is.

Page 35: Nested sampling

Problems

– How to generate N independent points in the constrained parameter space is an important problem. Techniques to do so effectively and efficiently may vary from problem to problem.

– Termination is another practical issue.

Page 36: Nested sampling

Thank you!
