
MAS472/6004: Computational Inference

Chapter III: Simulating random variables

1 / 70

3.1 Generating Random Variables

- Inference techniques used so far have been based on simulation.
- We now consider how to simulate X from f_X(x).
- In Semester 1 we used MCMC, but simpler methods are needed in order to do MCMC.
- Starting point: generate U from the U[0, 1] distribution.
- Then consider a transformation g(U) to obtain a random draw from f_X(x).

How could we generate U[0, 1] r.v.s with coin tosses?

2 / 70

Sampling from U(0, 1)

Need to simulate independent random variables uniformly distributed on [0, 1].

Definition: A sequence of pseudo-random numbers {u_i} is a deterministic sequence of numbers in [0, 1] having the same statistical properties as a similar sequence of random numbers (Ripley, 1987).

The sequence {u_i} is reproducible provided u_1 is known.

A good sequence would be “unpredictable to the uninitiated”.

3 / 70

Congruential generators (D.H. Lehmer, 1949)

The general form of a congruential generator is

N_i = (a N_{i-1} + c) mod M,    U_i = N_i / M,

where a and c are integers in [0, M - 1].

If c = 0, it is called a multiplicative congruential generator (otherwise, mixed).

These numbers are restricted to the M possible values 0, 1/M, 2/M, ..., (M - 1)/M. Clearly, they are rational numbers, but if M is large they will practically cover the reals in [0, 1].

N_1 is the seed. It can be re-set so that you can reproduce the same set of uniform random numbers. In R, use set.seed(i), where i is an integer.

4 / 70


As soon as some N_i repeats, say N_i = N_{i+T}, the whole subsequence repeats, i.e. N_{i+t} = N_{i+T+t}, t = 1, 2, .... The least such T is called the period. A good generator will have a long period.

The period cannot be longer than M, and also depends on a and c. Several useful theorems exist concerning the periods of congruential generators. For example, for c > 0, T = M if and only if

1. c and M have no common factors (except 1),

2. a ≡ 1 (mod p) for every prime number p that divides M,

3. a ≡ 1 (mod 4) if 4 divides M.

5 / 70

Usually M is chosen to make the modulus operation efficient, and then a and c are chosen to make the period as long as possible. Ripley suggests c = 0 or c = 1 is usually a good choice.

The NAG Fortran Library generator G05CAF uses

M = 2^59,  a = 13^13,  c = 0.

Another recommended one is

M = 2^32,  a = 69069,  c = 1,

so that N_i = (69069 N_{i-1} + 1) mod 2^32 and U_i = 2^{-32} N_i.
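As a concrete illustration, this mixed generator is easy to implement directly. The following is a minimal R sketch (the function name lcg is mine, not a library function); since 69069 x 2^32 is well below 2^53, the arithmetic stays exact in double precision.

    # Linear congruential generator: N_i = (a*N_{i-1} + c) mod M, U_i = N_i / M
    lcg <- function(n, seed, a = 69069, c = 1, M = 2^32) {
      N <- numeric(n)
      N[1] <- (a * seed + c) %% M
      for (i in 2:n) N[i] <- (a * N[i - 1] + c) %% M
      N / M
    }

    u <- lcg(1000, seed = 12345)
    hist(u)   # should look roughly uniform on [0, 1]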

6 / 70

Lattice structure

Notice that for a congruential generator

N_i - a N_{i-1} = c - bM,

where b is a non-negative integer. Therefore,

U_i - a U_{i-1} = c/M - b.

The LHS lies in (-a, 1) since U_i ∈ [0, 1). Therefore, b can take at most a + 1 distinct values.

If we plot the points (U_{i-1}, U_i), all the points will lie on at most a + 1 parallel lines.

7 / 70

[Figure: scatter plot of consecutive pairs (U_i, U_{i+1}) for the generator with M = 2^11, a = 51, c = 1; the points lie on a small number of parallel lines.]

All linear congruential generators exhibit this kind of lattice structure, not just for pairs (U_{i-1}, U_i), but also for triples (U_{i-2}, U_{i-1}, U_i), and in higher dimensions.

A good generator is expected to have a fine lattice structure, that is, the points (U_{i-k+1}, ..., U_{i-1}, U_i) ∈ [0, 1)^k must lie on many hyperplanes in R^k for all small k (k << M).
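The plot above can be reproduced directly; a minimal R sketch (variable names are mine) for the small generator M = 2^11, a = 51, c = 1 is:

    # Generator N_i = (51*N_{i-1} + 1) mod 2^11, plotted as consecutive pairs (U_i, U_{i+1})
    M <- 2^11; a <- 51; c <- 1
    N <- numeric(512); N[1] <- 1
    for (i in 2:512) N[i] <- (a * N[i - 1] + c) %% M
    U <- N / M
    plot(U[-512], U[-1], pch = 20, xlab = "U_i", ylab = "U_{i+1}",
         main = "M = 2^11, a = 51, c = 1")
    # The points fall on at most a + 1 = 52 parallel lines.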

8 / 70


RANDU - lattice structure

M = 2^31, a = 2^16 + 3 = 65539, and c = 0.

Once very popular, RANDU eventually turned out to be a rather poor generator.

9 / 70

RANDU - lattice structure II

Let U_i = N_i/M; then for this generator

U_{i+2} - 6U_{i+1} + 9U_i = k, an integer.

Since 0 ≤ U_i < 1,

-6 < U_{i+2} - 6U_{i+1} + 9U_i < 10.

Therefore k = -5, -4, ..., -1, 0, +1, ..., 9. Hence k can take only 15 integer values, and consequently the triples (U_i, U_{i+1}, U_{i+2}) must lie on at most 15 parallel planes.

This is an example of coarse lattice structure: unsatisfactory coverage of [0, 1)^3.

10 / 70

Generation from non-U(0, 1)

We have a sequence U_1, U_2, U_3, ... of independent uniform random numbers in [0, 1].

We want X_1, X_2, ... distributed independently and identically from some specified distribution. The answer is to transform the U_1, U_2, ... sequence into an X_1, X_2, ... sequence. The idea is to find a function g(U_1, U_2, U_3, ...) that has the required distribution. There are always many ways of doing this. A good algorithm should be quick, because millions of random numbers may be required.

11 / 70

3.2 The inversion method

Let X be any continuous random variable and define Y = F_X(X), where F_X is the distribution function of X: F_X(x) = P(X ≤ x).

Claim: Y ∼ U[0, 1].

Proof: Y ∈ [0, 1] and the distribution function of Y is

F_Y(y) = P(Y ≤ y) = P(F_X(X) ≤ y) = P(X ≤ F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y,

which is the distribution function of a uniform random variable on [0, 1].

So whatever the distribution of X, Y = F_X(X) is uniformly distributed on [0, 1]. The inversion method turns this backwards. Let U = F_X(X); then X = F_X^{-1}(U).

- So to generate X ∼ F_X, take a single uniform variable U and set X = F_X^{-1}(U).

12 / 70


Example: exponential distribution

Let X ∼ Exp(1/λ) (mean λ), i.e.

f(x) = λ^{-1} e^{-x/λ},  x ≥ 0,

F(x) = ∫_0^x λ^{-1} e^{-z/λ} dz = [-e^{-z/λ}]_0^x = 1 - e^{-x/λ}.

Set U = 1 - e^{-X/λ} and solve for X:

X = -λ ln(1 - U).

Note that 1 - U is uniformly distributed on [0, 1], so we might as well use

X = −λ lnU.
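A minimal R sketch of this inversion method for the exponential distribution (λ = 2 here is an arbitrary illustrative value):

    n <- 10000
    lambda <- 2                    # mean of the exponential
    u <- runif(n)
    x <- -lambda * log(u)          # inversion: X = -lambda * log(U)
    mean(x)                        # should be close to lambda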

Question: What are the limitations of the inversion method?

13 / 70

Discrete distributions

The inversion method works for discrete random variables in the following sense. Let X be discretely distributed with possible values x_i having probabilities p_i. So

P(X = x_i) = p_i,   Σ_{i=1}^k p_i = 1.

Then F_X(x) = Σ_{x_i ≤ x} p_i is a step function.

Inversion gives X = x_i if Σ_{x_j < x_i} p_j < U ≤ Σ_{x_j ≤ x_i} p_j, which clearly gives the right probability values.

- Think of this as splitting [0, 1] into intervals of length p_i. The interval in which U falls determines the value of X.

Question: What problems might we face using this method? E.g., consider a Poisson(100) distribution.

14 / 70

Discrete distributions - example

Let X ∼ Bin(4, 0.3). The probabilities are

P(X = 0) = .2401,  P(X = 1) = .4116,  P(X = 2) = .2646,
P(X = 3) = .0756,  P(X = 4) = .0081.

The algorithm says X = 0 if 0 ≤ U ≤ .2401,

X = 1 if .2401 < U ≤ .6517,

X = 2 if .6517 < U ≤ .9163,

X = 3 if .9163 < U ≤ .9919,

X = 4 if .9919 < U ≤ 1.

Carrying out the binomial algorithm means the following. Let U ∼ U(0, 1).

1. Test U ≤ .2401. If true, return X = 0.
2. If false, test U ≤ .6517. If true, return X = 1.
3. If false, test U ≤ .9163. If true, return X = 2.
4. If false, test U ≤ .9919. If true, return X = 3.
5. If false, return X = 4.
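A minimal R sketch of this discrete inversion scheme (the function name rbin4 is mine):

    # Discrete inversion for X ~ Bin(4, 0.3) using the cumulative probabilities above
    rbin4 <- function(n) {
      cum <- cumsum(dbinom(0:4, size = 4, prob = 0.3))   # .2401 .6517 .9163 .9919 1
      u <- runif(n)
      vapply(u, function(ui) which(ui <= cum)[1] - 1L, integer(1))
    }

    table(rbin4(10000)) / 10000    # compare with dbinom(0:4, 4, 0.3)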

15 / 70

Discrete distributions - example

Consider the speed of this. The expected number of steps (which roughly equates to speed) is

1 × .2401 + 2 × .4116 + 3 × .2646 + 4 × .0756 + 4 × .0081 = 1 + E(X) - 0.0081 = 2.1919.

To speed things up we can rearrange the order so that the later steps are less likely.

1. Test U ≤ .4116. If true return X = 1.

2. If false, test U ≤ .6762. If true return X = 2.

3. If false, test U ≤ .9163. If true return X = 0.

4. and 5. as before.

Expected number of steps: 1 × .4116 + 2 × .2646 + 3 × .2401 + 4 × (0.0756 + 0.0081) = 1.9959. Approximately a 10% speed increase.

16 / 70


3.3 Other Transformations

(a) If U ∼ U(0, 1), set V = (b - a)U + a; then V ∼ U(a, b), where a < b.

(b) If the Y_i are iid exponential with parameter λ, then

X = Σ_{i=1}^n Y_i = -(1/λ) Σ_{i=1}^n log U_i = -(1/λ) log( Π_{i=1}^n U_i )

has a Ga(n, λ) distribution.

(c) If X_1 ∼ Ga(p, 1) and X_2 ∼ Ga(q, 1), with X_1 and X_2 independent, then Y = X_1/(X_1 + X_2) ∼ Be(p, q).

(d) Composition: if

f = Σ_{i=1}^r p_i f_i,

where Σ p_i = 1 and each f_i is a density, then we can sample from f by first sampling I from the discrete distribution p = {p_1, ..., p_r} and then taking a sample from f_I.

17 / 70

The Box-Muller algorithm for the normal distribution

We cannot generate a normal random variable by inversion, because F_X is not known in closed form (nor is its inverse). The Box-Muller method (1958): let U_1, U_2 ∼ U[0, 1] and calculate

X_1 = √(-2 ln U_1) cos(2πU_2),
X_2 = √(-2 ln U_1) sin(2πU_2).

Then X_1 and X_2 are independent N(0, 1) variables. The method is not particularly fast, but it is easy to program and quite memorable.
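A minimal R sketch of the Box-Muller transform (the function name box_muller is mine):

    box_muller <- function(n) {
      m  <- ceiling(n / 2)
      u1 <- runif(m); u2 <- runif(m)
      r  <- sqrt(-2 * log(u1))
      c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))[1:n]
    }

    z <- box_muller(10000)
    qqnorm(z)   # points should lie close to a straight line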

18 / 70

3.4 Rejection Algorithm

Fundamental Theorem of Simulation: Simulating

X ∼ f(x)

is equivalent to simulating

(X, U) ∼ U{(x, u) : 0 < u < f(x)}.

Note that f(x, u) = I_{0<u<f(x)}, so that

∫ f(x, u) du = ∫_0^{f(x)} du = f(x),

as required.

Hence, f is the marginal density of the joint distribution (X, U) ∼ U{(x, u) : 0 < u < f(x)}.

19 / 70

Rejection Algorithm Explained

The problem with this result is that simulating uniformly from the set

{(x, u) : 0 < u < f(x)}

may not be possible. A solution is to simulate the pair (X, U) in a bigger set, where simulation is easier, and then take the pair if the constraint is satisfied.

[Figure: a density on [0, 1] enclosed in a uniform bounding region, illustrating the idea of simulating pairs (x, u) in a larger set.]

20 / 70


Rejection: Uniform bounding box

Suppose that f(x) is zero outside the interval [a, b] (so that ∫_a^b f(x) dx = 1) and that f is bounded above by m.

- Simulate the pair (Y, U) ∼ U([a, b] × [0, m]) (i.e. Y ∼ U[a, b] and U ∼ U[0, m] independently).
- Accept the pair if the constraint 0 < U < f(Y) is satisfied.

This results in the correct distribution for the accepted Y value, call it X:

P(X ≤ x) = P(Y ≤ x | U < f(Y)) = [ ∫_a^x ∫_0^{f(y)} du dy ] / [ ∫_a^b ∫_0^{f(y)} du dy ] = ∫_a^x f(y) dy.

Note: we can use the rejection algorithm even if we only know f up to a normalising constant (as is often the case in Bayesian statistics; see Chapter 4).

21 / 70

Example: Sampling from a beta distribution

Consider sampling from X ∼ Beta(α, β) for α, β > 1, which has pdf

f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α-1} (1 - x)^{β-1},  0 < x < 1.

We note that

f(x) ∝ f_1(x) = x^{α-1} (1 - x)^{β-1},  0 < x < 1,

and that M = sup_{0<x<1} x^{α-1}(1 - x)^{β-1} occurs at x = (α - 1)/(α + β - 2) (the mode), and hence

M = (α - 1)^{α-1} (β - 1)^{β-1} / (α + β - 2)^{α+β-2}.

The rejection algorithm is

1. Generate Y ∼ U(0, 1) and U ∼ U(0, M).

2. If U ≤ f_1(Y) = Y^{α-1}(1 - Y)^{β-1}, then let X = Y (accept); else go to 1 (reject).
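A minimal R sketch of this bounding-box rejection sampler (the function name rbeta_reject is mine):

    # Rejection sampling for Beta(alpha, beta), alpha, beta > 1, using a uniform bounding box
    rbeta_reject <- function(n, alpha, beta) {
      M <- (alpha - 1)^(alpha - 1) * (beta - 1)^(beta - 1) /
           (alpha + beta - 2)^(alpha + beta - 2)
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          y <- runif(1)
          u <- runif(1, 0, M)
          if (u <= y^(alpha - 1) * (1 - y)^(beta - 1)) { out[i] <- y; break }
        }
      }
      out
    }

    hist(rbeta_reject(5000, 2, 5), freq = FALSE)
    curve(dbeta(x, 2, 5), add = TRUE)   # check against the true density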

22 / 70

Generalising the Rejection Idea

If the support of f is not finite, then bounding it within a rectangle will not work. Instead of using a box to bound the density f(x) (i.e. requiring f(x) < m for some constant m), we can use a function m(x) such that f(x) ≤ m(x) for all x.

Suppose the larger bounding set is

L = {(y, u) : 0 < u < m(y)};

then all we require is that simulating a uniform from L is feasible. Note:

- The closer m is to f, the more efficient our algorithm.
- Because m(x) ≥ f(x), m cannot be a probability density. We write

m(x) = M g(x),   where   ∫ m(x) dx = ∫ M g(x) dx = M,

for some density g.

23 / 70

Generalising the Rejection Idea II

This suggests a more general implementation of the fundamental theorem:

Corollary: Let X ∼ f(x) and let g(x) be a density function that satisfies f(x) ≤ M g(x) for some constant M ≥ 1. Then, to simulate X ∼ f, it is sufficient to generate

Y ∼ g   and   U | Y = y ∼ U(0, M g(y)),

and set X = Y if U ≤ f(Y).

Proof:

P(X ∈ A) = P(Y ∈ A | U ≤ f(Y)) = [ ∫_A ∫_0^{f(y)} du/(M g(y)) g(y) dy ] / [ ∫ ∫_0^{f(y)} du/(M g(y)) g(y) dy ] = ∫_A f(y) dy.

24 / 70


The Rejection Algorithm

The rejection algorithm is usually stated in a slightly modified form:

Rejection Algorithm

If g is such that f/g is bounded, so that there exists M with M g(x) ≥ f(x) for all x, then

1. Generate Y from density g, and U from U(0, 1).

2. If U ≤ f(Y)/(M g(Y)), set X = Y. Otherwise, return to step 1.

produces simulations from f.

We keep sampling new Y and U until the condition is satisfied.

Exercise: Convince yourself that these two descriptions of the rejection algorithm are the same.

25 / 70

Example: Sampling from a beta distribution revisited

Use rejection to sample from X ∼ Beta(α, β). Let g(y) = α y^{α-1}, 0 < y < 1; then

f_1(x)/g(x) = (1 - x)^{β-1} / α

is bounded if and only if β ≥ 1. Then M = sup_x { f_1(x)/g(x) } = 1/α, attained at x = 0.

1. Simulate Y with pdf g(y) = α y^{α-1}, 0 < y < 1, and U ∼ U(0, 1).

2. If U ≤ f_1(Y)/(M g(Y)) = (1 - Y)^{β-1}, then set X = Y; else go to 1.

26 / 70

How to simulate Y with pdf g(y) = α y^{α-1}?

- We note that the cdf of Y is G(y) = y^α, 0 < y < 1.
- Therefore we can use inversion. Let Z ∼ U(0, 1), then solve Z = G(Y) = Y^α, and so Y = Z^{1/α}.

The full algorithm is:

1. Generate U ∼ U(0, 1) and Z ∼ U(0, 1). Let Y = Z^{1/α}.

2. If U ≤ (1 - Y)^{β-1}, then set X = Y; else go to 1.
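A minimal R sketch of this combined inversion-plus-rejection algorithm (the function name rbeta_reject2 is mine; it assumes β ≥ 1):

    # Rejection sampling for Beta(alpha, beta) with proposal g(y) = alpha * y^(alpha - 1)
    rbeta_reject2 <- function(n, alpha, beta) {
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          y <- runif(1)^(1 / alpha)              # inversion: Y = Z^(1/alpha) has pdf g
          if (runif(1) <= (1 - y)^(beta - 1)) { out[i] <- y; break }
        }
      }
      out
    }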

[Figure: illustration of the beta density and the rejection proposal on (0, 1).]

27 / 70

Efficiency of the rejection method

Each time we generate a (Y,U) pair,

P(Reject) = P(U ≥ f(Y)/(M g(Y))) = 1 - 1/M,   P(Accept) = 1/M.

The number of tries until we accept a Y is a geometric random variable with expectation M.

Note that M here must be calculated with the normalised density f, i.e. M = sup f(x)/g(x).

If we use an unnormalised density f_1(x), where ∫ f_1(x) dx = c, so that f(x) = (1/c) f_1(x), and we take

M = sup f_1(x)/g(x),

then the acceptance rate is

P(Accept) = c/M.

28 / 70


For maximum efficiency, we want M as small as possible, i.e. sup f(x)/g(x) as small as possible. This means finding a g that

(a) we can sample from efficiently, and

(b) mimics f as closely as possible.

There are many good generators based on rejection from a well-chosen g.

29 / 70

Rejection Example III

Let θ have a von Mises distribution with pdf

f(θ) = exp(k cos θ) / (2π I(k)),   0 < θ < 2π (k ≥ 0),

where I(k) is the normalising constant.

Let f_1(θ) = (1/2π) exp(k cos θ), 0 < θ < 2π.

Take Y ∼ U(0, 2π), so that g(y) = 1/(2π), 0 < y < 2π. Then

M = sup_θ { f_1(θ)/g(θ) } = sup_θ { exp(k cos θ) } = exp(k).

Let U ∼ U(0, 1). If

U ≤ f_1(Y)/(M g(Y)) = [exp(k cos Y)/(2π)] / [exp(k)/(2π)] = exp( k(cos Y - 1) ),

we accept θ = Y; otherwise reject.

30 / 70
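A minimal R sketch of this von Mises rejection sampler (the function name rvonmises is mine):

    # Rejection sampling from f(theta) proportional to exp(k * cos(theta)), 0 < theta < 2*pi,
    # using a U(0, 2*pi) proposal; acceptance probability is exp(k * (cos(y) - 1)).
    rvonmises <- function(n, k) {
      out <- numeric(n)
      for (i in 1:n) {
        repeat {
          y <- runif(1, 0, 2 * pi)
          if (runif(1) <= exp(k * (cos(y) - 1))) { out[i] <- y; break }
        }
      }
      out
    }

    hist(rvonmises(5000, k = 2), freq = FALSE)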

Truncated distributions

Suppose we wish to sample X from the following distribution:

f_X(x) ∝ g_X(x) for x ∈ A, and f_X(x) = 0 otherwise,

where g_X(x) is a known density that we can sample from, e.g. g_X(x) is the N(0, 1) density and A = [0, ∞). That is,

f_X(x) = k g_X(x) for x ∈ A, and f_X(x) = 0 otherwise,

where k is a normalising constant, given by

k^{-1} = ∫_A g_X(x) dx.

31 / 70

Consider using the rejection method to sample X from f_X(x). We sample Y from the full (non-truncated) density g_X(x). Then

f_X(x)/g_X(x) = k if x ∈ A, and 0 otherwise.

32 / 70


So M = sup_x f_X(x)/g_X(x) = k.

Rejection algorithm: sample u from U[0, 1] and y from g_X(y), and accept X = y if u ≤ f_X(y)/(M g_X(y)).

But since

f_X(x)/(M g_X(x)) = f_X(x)/(k g_X(x)) = 1 if x ∈ A, and 0 otherwise,

we will always have u ≤ f_X(y)/(M g_X(y)) if y ∈ A, and u ≥ f_X(y)/(M g_X(y)) if y ∉ A. So we don't need to sample u. We can just do:

1. Generate y from g_X(y).

2. If y ∈ A, accept X = y.

3. Otherwise, return to step 1.

As usual, the acceptance probability will be high if M is small, i.e. if ∫_A g_X(y) dy is near 1. So if a large amount of probability is truncated away (i.e. ∫_A g_X(y) dy is small), rejection sampling will be inefficient.
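A minimal R sketch of this discard-and-keep scheme for a standard normal truncated to A = [0, ∞) (the function name rtruncnorm0 is mine):

    # Sample a N(0, 1) variable truncated to [0, Inf) by discarding proposals outside A
    rtruncnorm0 <- function(n) {
      out <- numeric(0)
      while (length(out) < n) {
        y <- rnorm(n)              # sample from the full (non-truncated) density
        out <- c(out, y[y >= 0])   # keep only values in A
      }
      out[1:n]
    }

    mean(rtruncnorm0(10000))       # should be close to sqrt(2/pi), about 0.80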

33 / 70

3.5 Multivariate generators

Now suppose we want to generate a random vector X = (X_1, ..., X_p) from density f(x). We can note the following simple points.

1. If the elements of X are to be independent, i.e.

f(x) = f_1(x_1) f_2(x_2) ... f_p(x_p),

then we can separately generate X_1 from f_1, X_2 from f_2, ..., X_p from f_p, using different uniforms.

2. Inversion no longer works, as the theorem can't be generalised.

3. Rejection does work: if we can generate from g(x) (and g may be a product of independent components) and find M ≥ sup_x f(x)/g(x), we accept a proposal Y ∼ g if U ≤ f(Y)/(M g(Y)) and otherwise reject, exactly as before.

34 / 70

Sequential methods

We can obviously write

f(x) = f_1(x_1) f_2(x_2 | x_1) f_3(x_3 | x_1, x_2) ....

So we can first generate X_1 from f_1. Then, for that given value of X_1, generate X_2 from f_2, and so on.

35 / 70

Example

Suppose we wish to sample (x_1, x_2) from the density function

f(x_1, x_2) ∝ x_2^{-1/2} x_2^{-(α+1)} exp{ -(2β + λ(x_1 - µ)^2) / (2x_2) }.

First, consider the conditional distribution of x_1 given x_2:

f(x_1 | x_2) ∝ exp{ -λ(x_1 - µ)^2 / (2x_2) },

as we can ignore factors not depending on x_1. Thus we can recognise that

x_1 | x_2 ∼ N(µ, x_2/λ).

36 / 70


Next consider the marginal of x_2:

f(x_2) ∝ ∫ f(x_1, x_2) dx_1 ∝ x_2^{-1/2} x_2^{-(α+1)} e^{-β/x_2} (x_2/λ)^{1/2} ∝ x_2^{-(α+1)} e^{-β/x_2},

where the (x_2/λ)^{1/2} term is the missing normalising constant from the N(µ, x_2/λ) distribution.

We can recognise this as an inverse-gamma distribution: x_2 ∼ Γ^{-1}(α, β).

So to simulate random variables from f we can first simulate x_2 from an inverse-gamma distribution (e.g. by rejection sampling) and then simulate x_1 ∼ N(µ, x_2/λ) using, e.g., Box-Muller.
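A minimal R sketch of this sequential scheme (the parameter values are illustrative, and the inverse-gamma draw uses the reciprocal-of-gamma relationship rather than rejection sampling):

    n <- 10000
    alpha <- 3; beta <- 2; mu <- 0; lambda <- 1
    x2 <- 1 / rgamma(n, shape = alpha, rate = beta)    # x2 ~ inverse-gamma(alpha, beta)
    x1 <- rnorm(n, mean = mu, sd = sqrt(x2 / lambda))  # x1 | x2 ~ N(mu, x2 / lambda)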

37 / 70

Multivariate normal distributions

How can we generate X from a N(m, V) distribution, for some non-diagonal matrix V?

We know how to generate iid N(0, 1) rvs from the Box-Muller algorithm, so perhaps we can take a sequence of independent standard normal random variables Z_1, Z_2, ... and transform these in some way?

One technique involves the use of the Cholesky square root of the matrix V. For any (symmetric, square) positive definite matrix V, we can find a square root U (called the Cholesky decomposition), such that U^T U = V.

To find the Cholesky square root of a matrix V in R, type chol(V).

38 / 70

Multivariate normal distributions II

Set Z = (Z_1, ..., Z_n)^T, where Z_i ∼ N(0, 1) and n = dim X.

Consider

Y = m + U^T Z.

Then Y must have a multivariate normal distribution (why?), and

E(m + U^T Z) = m,   Var(m + U^T Z) = U^T I_n U = V,

(with I_n = Var(Z) the n × n identity matrix). Hence to generate X, we generate independent standard normal random variables Z, and then transform them by m + U^T Z to obtain X.
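A minimal R sketch of this transformation (the mean vector and covariance matrix below are illustrative):

    m <- c(1, 2)
    V <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
    U <- chol(V)                         # upper-triangular factor with t(U) %*% U = V
    z <- matrix(rnorm(2 * 10000), nrow = 2)
    x <- m + t(U) %*% z                  # each column is one draw from N(m, V)
    rowMeans(x)                          # should be close to m
    cov(t(x))                            # should be close to V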

39 / 70

3.6 Importance sampling

In order to estimate an integral of the form ∫ h(x) f(x) dx, we find that it is sometimes better to generate values not from the distribution f(x), but instead from some other distribution g(x), and then to account for this by using a weighting. This is the idea behind importance sampling.

To introduce the idea we consider a simple example.

40 / 70


Example of Monte Carlo/Importance Sampling

Let X be Cauchy: f(x) = 1/(π(1 + x^2)), -∞ < x < ∞.

Let θ = P(X > 2) = I = ∫_2^∞ 1/(π(1 + x^2)) dx (= 0.1476).

Use Monte Carlo Methods to estimate θ.

(i) Generate n Cauchy variates, X_1, ..., X_n. Let Y_1 be the number that are greater than 2, Y_1 = Σ I_{X_i > 2}. Then Y_1 ∼ B(n, θ), so that

E(Y_1) = nθ,   V(Y_1) = nθ(1 - θ).

Take θ_1 = Y_1/n. Then

E(θ_1) = E(Y_1)/n = nθ/n = θ

and

V(θ_1) = V(Y_1)/n^2 = nθ(1 - θ)/n^2 = θ(1 - θ)/n = 0.126/n.

41 / 70

Example of Monte Carlo/Importance Sampling - II

(ii) Note that θ = (1/2) P(|X| > 2); we want to use this to reduce the variance of our estimator of θ. Generate n Cauchy variates. Let Y_2 be the number that are greater than 2 in modulus; then Y_2 ∼ B(n, 2θ) and θ_2 = (1/2) Y_2/n, so that

E(θ_2) = (1/2) E(Y_2)/n = (1/2) · n2θ/n = θ

and

V(θ_2) = V(Y_2)/(2^2 n^2) = n2θ(1 - 2θ)/(2^2 n^2) = θ(1 - 2θ)/(2n) = 0.052/n.

42 / 70

Example of Monte Carlo/Importance Sampling - III

(iii) The relative inefficiency of these methods is due to the generation of values outside the domain of interest [2, ∞). Alternatively, note that we can write

θ = 1/2 - ∫_0^2 1/(π(1 + x^2)) dx.

This integral can be considered as the expectation of h(X) = 2/(π(1 + X^2)) where X ∼ U[0, 2], as the density of U[0, 2] is g(x) = 1/2. An alternative method of evaluation of θ is therefore

θ_3 = 1/2 - (1/n) Σ_{i=1}^n h(U_i),

where U_i ∼ U[0, 2].

43 / 70

Example of Monte Carlo/Importance Sampling - IV

We can see that

E(θ_3) = 1/2 - (1/n) Σ_{i=1}^n E h(U_i) = 1/2 - ∫_0^2 1/(π(1 + x^2)) dx = 1/2 - P(0 < X < 2),

where X ∼ Cauchy, so that it too is an unbiased estimator.

The variance of θ_3 is Var(h(U))/n, and we can see that

E h(U) = ∫_0^2 h(x) (1/2) dx = 0.5 - 0.1475 = 0.3525,

E h(U)^2 = ∫_0^2 h(x)^2 (1/2) dx = ∫_0^2 2/(π^2 (1 + x^2)^2) dx = (1/π^2) [ x/(x^2 + 1) + tan^{-1}(x) ]_0^2 = 0.1527.

Hence Var(h(U)) = 0.1527 - 0.3525^2 = 0.02851, and thus

Var(θ_3) = 0.02851/n.

44 / 70


Example of Monte Carlo/Importance Sampling - V

(iv) Finally, note that another possibility is to substitute y = 1/x:

θ = ∫_2^∞ 1/(π(1 + x^2)) dx = ∫_0^{1/2} y^{-2} / (π(1 + y^{-2})) dy = ∫_0^{1/2} h(y) dy.

This can be seen as the expectation of h(X) = X^{-2} / (2π(1 + X^{-2})) where X ∼ U[0, 1/2] (the extra factor of 1/2 absorbs the U[0, 1/2] density, which equals 2). We can estimate this as

θ_4 = (1/n) Σ_{i=1}^n h(U_i),

where U_1, ..., U_n ∼ U[0, 1/2]. Again, we have E θ_4 = θ, and now

E h(U)^2 = ∫_0^{1/2} h(x)^2 · 2 dx = (1/(4π^2)) [ x/(x^2 + 1) + tan^{-1}(x) ]_0^{1/2} = 0.02188.

Hence Var(θ_4) = (0.02188 - 0.1476^2)/n = 0.0000955/n.

45 / 70

Summary of Example

We found 4 unbiased estimators of θ, each with a different variance:

Var(θ_1) = 0.126/n,   Var(θ_2) = 0.052/n,
Var(θ_3) = 0.02851/n,   Var(θ_4) = 0.0000955/n.

The best estimator is the one with the smallest variance, namely θ_4. Compared with θ_1, the evaluation of θ_4 requires √(0.126/0.0000955) ≈ 36 times fewer simulations to achieve the same precision. By carefully considering our simulation method we can hope to get more accurate estimates. Estimators θ_2 and θ_4 are both types of importance sampling.

46 / 70

Importance Sampling

Consider calculating the integral

I = E_f h(X) = ∫ h(x) f(x) dx.

Importance sampling
Let X_1, X_2, ..., X_n be independently and identically distributed random variables with common density g(x). Define w(x) = f(x)/g(x), so that

E_g{ h(X_i) w(X_i) } = ∫ h(x) w(x) g(x) dx = ∫ h(x) f(x) dx = I.

Therefore

Î = (1/n) Σ_{i=1}^n w(X_i) h(X_i)    (1)

is an unbiased estimator of I.

47 / 70

Some comments:

- g(x) is called the importance function, and the w(X_i) are called the importance weights.
- The sum (1) will converge for the same reasons the Monte Carlo sum does.
- Notice that this sum is valid for any choice of the distribution g, as long as supp(f) ⊆ supp(g).
- This is a very general representation that expresses the fact that a given integral is not intrinsically associated with a given distribution.
- Because very little restriction is put on the choice of g, we can choose a distribution which is easy to sample from, and one which gives nice properties for the sum.

48 / 70


Cauchy example revisited

We can now understand the estimator θ_4 in the Cauchy example. Recall that we want to estimate

E I_{X>2} = ∫ h(x) f(x) dx,

where h(x) = I_{x>2} and f(x) = 1/(π(1 + x^2)). Noticing that, for large x, f(x) is similar to the density

g(x) = 2/x^2 for x > 2

suggests that g might be a good importance density. We can sample from g by letting X_i = 1/U_i where U_i ∼ U[0, 1/2] (inversion method). Thus our estimator is

(1/n) Σ_{i=1}^n h(x_i) f(x_i)/g(x_i) = (1/n) Σ_{i=1}^n x_i^2 / (2π(1 + x_i^2)) = (1/n) Σ_{i=1}^n u_i^{-2} / (2π(1 + u_i^{-2})) = θ_4.

49 / 70

The variance of the estimator

Since the X_i are iid, Var(Î) = σ^2/n, where

σ^2 = Var_g{ h(X) w(X) } = E{ h(X)^2 w(X)^2 } - E{ h(X) w(X) }^2
    = ∫ h(x)^2 w(x)^2 g(x) dx - I^2
    = ∫ h(x)^2 f(x)^2 / g(x) dx - I^2,   since g(x) = f(x)/w(x).

We do not of course know σ^2 in practice, but we can see that Î will be a better estimator if we can make w(X) less variable. Our objective, therefore, is to find a distribution g(x) that we know how to obtain independent samples from, and which mimics h(x)f(x) as closely as possible.

50 / 70

Optimal choice of g

Theorem: The choice g = g* = |h(x)| f(x) / ∫ |h(z)| f(z) dz minimises the variance of the estimator (1).

Proof: We've seen that it is sufficient to minimise

∫ h^2(x) f^2(x) / g(x) dx = E_g( h^2(X) f^2(X) / g^2(X) ),

and using Jensen's inequality we can see that

E_g( h^2(X) f^2(X) / g^2(X) ) ≥ ( E_g[ |h(X)| f(X) / g(X) ] )^2 = ( ∫ |h(x)| f(x) dx )^2,

and that this lower bound is achieved by choosing g = g*.

NB: We won't be able to calculate g*! But the theorem suggests that choosing g to look like hf will be a good choice.

51 / 70

Unnormalised densities

Suppose we only know f up to a normalising constant, i.e. we know

f(x) = f_1(x)/c,   where c = ∫ f_1(x) dx.

We can still use importance sampling.

Importance sampling with unnormalised densities
Let X_1, X_2, ..., X_n be independently and identically distributed random variables with common density g(x). Define w(x) = f_1(x)/g(x). Estimate I by

Î = Σ_{i=1}^n w(X_i) h(X_i) / Σ_{i=1}^n w(X_i).

Alternatively, we can write this as

Î = Σ_{i=1}^n w_i h(X_i),   where w_i = w(X_i) / Σ w(X_i).

52 / 70


(1/n) Σ w(X_i) is an unbiased estimator of c, as

E_g w(X) = ∫ [f_1(x)/g(x)] g(x) dx = ∫ f_1(x) dx = c.

When we use unnormalised densities, Î is a biased estimator of I; however, it is possible to prove that we still have Î → I almost surely as n → ∞.

This will be important when we use importance sampling to estimate Bayesian quantities.

53 / 70

Effective sample size

How variable the weights are tells us how efficient our choice of g is.

In the best case, where g = f, we have w(X) = 1 so that w_i = 1/n, which is the case in plain Monte Carlo. In this case Var(w(X)) = 0.

If f and g are very different, then the weights will be very variable, and we can find that one or two particles (X_i) dominate the sum.

We often calculate the effective sample size

ESS = 1 / Σ w_i^2.

- In the best case, w_i = 1/n and ESS = n, so we have an effective sample size equal to the true sample size.
- The worst case is when one of the w_i = 1 and all the others are equal to zero. Then ESS = 1, i.e. we effectively have only a single sample.

We want to choose g so that the ESS is large.
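A minimal self-normalised importance sampling sketch with the ESS calculation (the target, proposal, and parameter values below are illustrative, not from the notes):

    # Estimate E(X) = 1 for X ~ Exp(1) (target f) using an Exp(0.5) proposal g
    n <- 10000
    x <- rexp(n, rate = 0.5)                  # draws from g
    w_tilde <- dexp(x, 1) / dexp(x, 0.5)      # unnormalised weights f/g
    w <- w_tilde / sum(w_tilde)               # normalised weights
    I_hat <- sum(w * x)                       # importance sampling estimate of E(X)
    ess <- 1 / sum(w^2)                       # effective sample size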

54 / 70

3.7 Variance reduction techniques: Antithetic variables

The method of antithetic variables uses two correlated estimators and combines them to get an estimator with a lower variance (i.e. a better estimator). Suppose we have two different estimators θ_1 and θ_2 of θ,

- with the same mean and variance,
- but which are negatively correlated.

Define θ_3 = (1/2)(θ_1 + θ_2). Then

Var(θ_3) = (1/4)( Var(θ_1) + Var(θ_2) + 2 Cov(θ_1, θ_2) )
         = (1/2)( Var(θ_1) + Cov(θ_1, θ_2) )
         < (1/2) Var(θ_1).

This is twice the cost of computing θ_1, but the variance is more than halved!

55 / 70

Antithetic variables - II

We need to find two estimators which are negatively correlated. This can be done as follows:

- If U ∼ U[0, 1], then 1 - U ∼ U[0, 1] also.
- If F is the distribution function of X, then X_1 = F^{-1}(U) and X_2 = F^{-1}(1 - U) are both distributed according to F,
- and Cov(X_1, X_2) < 0.

Proof (non-examinable): Let h(u) = F^{-1}(u). Then h(u) is a non-decreasing function. We need to show

E h(U) h(1 - U) ≤ (E h(U))^2.

Let Q = E h(U). Then, since h is non-decreasing on [0, 1],

h(0) ≤ Q ≤ h(1).

56 / 70


Let f(y) = ∫_0^y h(1 - x) dx - Qy on [0, 1]. Then f(0) = f(1) = 0, and

f′(y) = h(1 - y) - Q

is a non-increasing function. Since f′(0) = h(1) - Q ≥ 0 and f′(1) = h(0) - Q ≤ 0, we must have

f(u) ≥ 0 on [0, 1].

Therefore

0 ≤ ∫_0^1 f(y) h′(y) dy = [f h]_0^1 - ∫ f′(y) h(y) dy = -∫_0^1 f′(y) h(y) dy.

Therefore

∫_0^1 f′(y) h(y) dy = ∫_0^1 h(y)( h(1 - y) - Q ) dy = ∫_0^1 h(y) h(1 - y) dy - Q^2 ≤ 0.

Hence ∫_0^1 h(y) h(1 - y) dy ≤ Q^2, as required.

57 / 70

Cauchy Example Revisited

Above we used

θ_3 = 1/2 - (2/n) Σ_{i=1}^n [ 1/(π(1 + u_i^2)) ]

as an estimator of P(X > 2), where X ∼ Cauchy. An estimator with a smaller variance can be found using antithetic variables:

(1/2) ( [ 1/2 - (2/n) Σ_{i=1}^n 1/(π(1 + u_i^2)) ] + [ 1/2 - (2/n) Σ_{i=1}^n 1/(π(1 + (2 - u_i)^2)) ] ),

which gives

θ_antithetic = 1/2 - (1/n) Σ_{i=1}^n [ 1/(π(1 + u_i^2)) + 1/(π(1 + (2 - u_i)^2)) ].

For n = 10 we find the variance of θ_3 is 2.7 × 10^{-4}, whereas the variance of θ_antithetic is 5.5 × 10^{-6}: a substantial improvement.
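A minimal R sketch comparing the plain and antithetic estimators over repeated simulations of size n = 10 (variable names are mine):

    sim <- replicate(5000, {
      u <- runif(10, 0, 2)
      c(plain = 0.5 - mean(2 / (pi * (1 + u^2))),
        anti  = 0.5 - mean(1 / (pi * (1 + u^2)) + 1 / (pi * (1 + (2 - u)^2))))
    })
    apply(sim, 1, var)   # variances should be roughly 2.7e-4 and 5.5e-6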

58 / 70

3.8 Bayesian inference

Unnormalised densities frequently occur when we are doing Bayesian inference.

Suppose we are interested in some posterior expectation, for example, the posterior mean:

I = E(θ | x) = ∫ θ f(θ | x) dθ,

where

f(θ | x) = f(θ) f(x | θ) / f(x)   by Bayes' theorem.

The denominator f(x) = ∫ f(θ) f(x | θ) dθ is often intractable and unknown, and so we instead work with the unnormalised density

f_1(θ | x) = f(θ) f(x | θ) = prior × likelihood.

59 / 70

Rejection sampling for Bayesian inference

You may have seen in MAS364 (or the Autumn of MAS6004) how to sample from a posterior distribution using MCMC. We can also use rejection sampling, or estimate posterior expectations using importance sampling.

So, to draw posterior samples of θ from f(θ | x) using proposal density g (assuming f_1(θ | x)/g(θ) ≤ M for all θ), we can do:

1. Simulate θ ∼ g(·).

2. Accept θ with probability f(θ) f(x | θ) / (M g(θ)); otherwise reject θ.

If we use g(θ) = f(θ), i.e. use the prior as the proposal, then this reduces to accepting θ with probability f(x | θ)/M, but this is usually inefficient (i.e. M is large, so the acceptance rate 1/M is small).

60 / 70


Importance sampling for Bayesian inference

Suppose we wish to estimate the posterior expectation

E(r(θ) | x) = ∫ r(θ) f(θ | x) dθ.

We could use importance sampling, using the prior distribution as the importance distribution, i.e. g = f. If we do not know f(x), then we can use the following importance sampling approach:

- Simulate θ_1, ..., θ_n from the prior f(θ).
- Set w(θ_i) = f(x | θ_i).
- Set w_i = w(θ_i) / Σ_j w(θ_j) and estimate E(r(θ) | x) by

Σ_{i=1}^n w_i r(θ_i).

This is inefficient if the prior is very different from the posterior, as we will spend too much time sampling θ_i where the likelihood is very small, and so the weights w(θ_i) will also be very small.

If this is the case, then the effective sample size will be small, and our estimates of E(θ | x) will be dominated by just a few of the θ samples.

61 / 70

Choice of g and the normal approximation

A more efficient alternative to using the prior distribution for g is to build a normal approximation to the posterior and use this as g.

Let h(θ) = log f(θ | x). Now define m to be the posterior mode of θ, so m maximises both f(θ | x) and h(θ).

We may need to use numerical optimisation to find m, e.g. using the optim command in R. We can then use a Taylor expansion of h(θ) around m,

h(θ) = h(m) + (θ - m)^T h′(m) + (1/2)(θ - m)^T M (θ - m) + ...,

to build a Gaussian approximation to the posterior (known as the Laplace approximation). Here, h′(m) is the vector of first derivatives of h(θ), and M is the matrix of second derivatives of h(θ), both evaluated at θ = m.

62 / 70

Since m maximises h(θ), we have h′(m) = 0. Hence

f(θ | x) = exp{h(θ)} ≈ exp{h(m)} exp{ -(1/2)(θ - m)^T V^{-1} (θ - m) },    (2)

where -V^{-1} = M.

Thus, our approximation of f(θ | x) is a multivariate normal distribution with mean vector m and variance matrix -M^{-1}. This will be a good approximation if the posterior mass is concentrated around m.

NB: We do not need f(x) to obtain M, since

h(θ) = log f(θ | x) = log f(θ) + log f(x | θ) - log f(x),

so log f(x) will disappear when we differentiate h(θ).

63 / 70

Assessing convergence

Suppose we wish to estimate E{r(θ) | x} for some r(θ). If f(x) is known, then

Ê{r(θ) | x} = (1/n) Σ_{i=1}^n r(θ_i) w(θ_i),

and we can use the central limit theorem to obtain a confidence interval for E{r(θ) | x}, as in MC integration.

We can check our estimate by:
1) increasing the sample size n, to check the stability of any estimate;
2) increasing the standard deviation in the g(θ) density, to check stability with respect to the choice of g; e.g., if we're using a normal approximation, we could multiply V by 4, etc.

64 / 70


Example: leukaemia data

Patients suffering from leukaemia are given a drug, 6-mercaptopurine (6-MP), and the number of days x_i until freedom from symptoms is recorded for patient i:

6*, 6, 6, 6, 7, 9*, 10*, 10, 11*, 13, 16, 17*, 19*, 20*, 22, 23, 25*, 32*, 32*, 34*, 35*.

A * denotes a censored observation. We will suppose that the time x to the event of interest follows a Weibull distribution:

f(x | α, β) = αβ(βx)^{α-1} exp{-(βx)^α}   for x > 0.

For censored observations, we have

P(x > t | α, β) = exp{-(βt)^α}.

65 / 70

Example: leukaemia data - Likelihood

Define:

- d: the number of uncensored observations,
- Σ_u log x_i: the sum of the logs of all uncensored observations.

Writing θ = (α, β)^T, the log likelihood is then given by

log f(x | θ) = d log α + α d log β + (α - 1) Σ_u log x_i - β^α Σ_{i=1}^n x_i^α.

Suppose our prior distributions for α and β are both exponential, with

f(α) = 0.001 exp(-0.001α),

f(β) = 0.001 exp(-0.001β).

66 / 70

Example: leukaemia data - Building an approximation to the posterior

1) Obtain the posterior mode of θ. Maximise the log posterior, i.e.

h(θ) = d log α + α d log β + (α - 1) Σ_u log x_i - β^α Σ_{i=1}^n x_i^α - 0.001α - 0.001β + K,

for some constant K.

In R, we can find the mode to be m = (1.354, 0.030) using the optim command.
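A minimal R sketch of this step (the data coding below, with starred observations treated as censored, and the starting values are my assumptions; the Hessian here is obtained numerically via optim rather than from the analytic derivatives on the next slide, and the mode should come out close to the values quoted above):

    x    <- c(6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16, 17, 19, 20, 22, 23, 25, 32, 32, 34, 35)
    cens <- c(1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1)  # 1 = censored
    d <- sum(cens == 0)
    sum_log_xu <- sum(log(x[cens == 0]))

    log_post <- function(theta) {
      alpha <- theta[1]; beta <- theta[2]
      if (alpha <= 0 || beta <= 0) return(-1e10)   # keep the search in the valid region
      d * log(alpha) + alpha * d * log(beta) + (alpha - 1) * sum_log_xu -
        beta^alpha * sum(x^alpha) - 0.001 * alpha - 0.001 * beta
    }

    fit <- optim(c(1, 0.05), log_post, control = list(fnscale = -1), hessian = TRUE)
    m <- fit$par       # posterior mode
    M <- fit$hessian   # matrix of second derivatives at the mode (step 2 below)
    V <- -solve(M)     # covariance matrix for the normal approximation (step 3 below)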

67 / 70

2) Derive the matrix of second derivatives of h(θ):

M = ( ∂²h/∂α²    ∂²h/∂α∂β
      ∂²h/∂α∂β   ∂²h/∂β²  ),

evaluated at θ = m.

∂²h/∂α² = -d/α² - Σ (βx_i)^α (log(βx_i))²,

∂²h/∂β² = (1/β²) { β^α α(1 - α) Σ_{i=1}^n x_i^α - dα },

∂²h/∂α∂β = (1/β) [ d - β^α { α log β Σ_{i=1}^n x_i^α + Σ_{i=1}^n x_i^α + α Σ_{i=1}^n x_i^α log x_i } ].

Evaluated at the mode,

M = ( -31.618     175.442
       175.442  -18806.085 ).

68 / 70


3) Obtain the normal approximation to use as g(θ). Here g(θ) is bivariate normal, with mean m and variance matrix V = -M^{-1}:

θ ∼ N{ (1.354, 0.030)^T,  ( 0.0334   0.0003
                            0.0003   0.00006 ) }.

4) Sample θ_1, ..., θ_n from g(θ) and compute the importance weights w(θ_1), ..., w(θ_n). The normalised weights are given by

w_i = w(θ_i) / Σ_{j=1}^n w(θ_j),   with   w(θ_i) = f(θ_i) f(x | θ_i) / g(θ_i).

NB: the Gaussian approximation may give us negative samples. Since α > 0 and β > 0, we should simply discard negative θ values, i.e. use a truncated normal density for g(θ).

Note that when we compute w(θ_i), it is not necessary to rescale g(θ) so that it integrates to 1, as any normalising constant in g(θ) will cancel.

69 / 70

5) Estimate the posterior mean of θ. We compute the estimate

Ê(θ | x) = Σ_{i=1}^n w_i θ_i.

In R, with n = 100000, this gives Ê(θ | x) = (1.346, 0.031)^T.

6) Check for convergence. We repeat steps 4 and 5 with more dispersion in g(θ):

g(θ)        Ê(θ | x)
N(m, V)     (1.346, 0.031)^T
N(m, 4V)    (1.384, 0.031)^T
N(m, 16V)   (1.380, 0.031)^T

Finally, we double the sample size (no effect observed). For percentiles, we can do resampling in R.

See computer class 5 for more details and code to implement this approach.

70 / 70