
Chapter 4

Simulation

In this chapter we examine how to simulate random numbers from a range of statistical distributions. Many computers have built-in routines for generating independent random numbers from the uniform distribution U[0, 1], so we shall focus on how these may be manipulated in order to obtain random numbers from other distributions.

Computer generated random numbers are not random at all: they are deterministic. However, provided the generator has been implemented properly, computer generated random numbers share the statistical properties of truly random numbers. To make clear the distinction between computer generated random numbers and truly random numbers, the computer generated ones are often referred to as pseudo-random. Various schemes are available for generating pseudo-random numbers, e.g. congruential generators, but these are not of direct interest here. Further details may be found in Ripley (1987).

4.1 Set up, notation, etc.

Suppose we wish to generate independent random variables X1, X2, . . . each with distribution function F(x) = Pr(Xi ≤ x) and density function f(x) = dF(x)/dx. We assume that we have access to an infinite supply of independent U[0, 1] random variables which we denote by U1, U2, . . . where, by definition, Pr(Ui ≤ u) = u for all u ∈ [0, 1] and each i.

We start by looking at some illustrative examples.

1. What is the distribution of X = ∑_{i=1}^{12} U_i − 6?

It is straightforward to show that X has mean zero and variance 1. Thus, by the central limit theorem, X should be approximately N(0, 1) distributed provided that the number of U_i's that are added together in constructing X (i.e. 12) is large enough. In fact, the approximation to N(0, 1) is pretty good, as may be shown using the R/S-Plus code below.

sim.fn1 <- function(nreps = 1000) {
  # nreps is the number of repetitions (default = 1000).
  # Each repetition yields one value of X.
  X <- NULL
  for (i in 1:nreps) {
    X[i] <- sum(runif(12)) - 6
  }
  answer <- X
  answer
}

Save the above in a file and then read it into R/S-Plus using the source function (see the web page on writing your own functions).

http://www.stat.nus.edu.sg/~stapkm/R/ownfunctions.html

Make sure that you understand how the above code works, and also that you understand the output from R/S-Plus. Note that R/S-Plus has an extensive built-in help system that is started with the command help.start() in an R/S-Plus session.

2. The Box-Muller algorithm. This is an exact method of transforming independent U[0, 1] random variables into N(0, 1) random variables. Here, we examine how the method works.

(a) Generate U1. Set Θ = 2πU1.

(b) Generate U2. Set E = −log U2 and R = √(2E). (Note: log denotes log_e, the natural logarithm.)

(c) Then X = R cos Θ and Y = R sin Θ are independent N(0, 1).

Proof: From above,

X = (−2 log U2)^{1/2} cos(2πU1)  and  Y = (−2 log U2)^{1/2} sin(2πU1),

and the joint density of (U1, U2) is 1 on [0, 1] × [0, 1]. Note that U2 = exp{−(X² + Y²)/2}. The joint density of (X, Y) is therefore

1 × |det(∂(u1, u2)/∂(x, y))| = |det(∂(x, y)/∂(u1, u2))|^{−1}.

This equals

\left| \det \begin{pmatrix} −2π(−2 \log u_2)^{1/2} \sin(2πu_1) & −u_2^{−1}(−2 \log u_2)^{−1/2} \cos(2πu_1) \\ 2π(−2 \log u_2)^{1/2} \cos(2πu_1) & −u_2^{−1}(−2 \log u_2)^{−1/2} \sin(2πu_1) \end{pmatrix} \right|^{−1}.

The determinant is 2π/u2, so the expression above equals

(2π)^{−1} u2 = (2π)^{−1} exp{−(x² + y²)/2}.

Thus X and Y are independent N(0, 1) random variables. □
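As a rough illustration (not part of the original notes; the function name box.muller is ours), the algorithm can be coded in a few lines of R:

box.muller <- function(n = 1000) {
  # Returns 2*n independent N(0,1) values, generated in pairs
  # via Theta = 2*pi*U1 and R = sqrt(-2*log(U2)).
  U1 <- runif(n)
  U2 <- runif(n)
  Theta <- 2 * pi * U1
  R <- sqrt(-2 * log(U2))
  c(R * cos(Theta), R * sin(Theta))
}

A quick check such as qqnorm(box.muller(5000)) should give an approximately straight line.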

4.2 General methods

Most algorithms for generating random variables from chosen distributions are based on a few general principles. We examine these here.

4.2.1 The inversion method

The idea here is simple but extremely powerful. Let F(x) = Pr(X ≤ x) denote the distribution function of the random variable X, and let F^{−1}(·) denote the function inverse of F(·), i.e. if F(x) = u then x = F^{−1}(u).

Method: Take U ∼ U [0, 1], and let X = F−1(U). Then X has distribution function F (·).

Proof: Clearly, Pr(X ≤ x) = Pr(F−1(U) ≤ x) = Pr(U ≤ F (x)) = F (x) since Pr(U ≤ u) = u. �

Note: the inverse function F^{−1}(·) is well defined for continuous random variables, but a more careful definition is required for discrete random variables. In the discrete case, the above mechanism works provided we define F^{−1}(·) by F^{−1}(u) = min{x : F(x) ≥ u}.

Exercises: Derive simulation schemes for the following distributions.

1. The exponential distribution: F (x) = 1− exp(−λx) for x ∈ [0,∞).

Answer: Solving F(x) = u gives x = F^{−1}(u) = −λ^{−1} log(1 − u). Thus X = −λ^{−1} log(1 − U) is exponentially distributed. Note that if U ∼ U[0, 1] then 1 − U ∼ U[0, 1] also, which implies that X = −λ^{−1} log U is exponentially distributed too. (A short R sketch of inversion for this case and for the Poisson case in exercise 5 is given after this list.)

2. The Weibull distribution: F (x) = 1− exp(−xβ) for x ∈ [0,∞).

3. The Cauchy distribution: F(x) = π^{−1} arctan x + 1/2 for x ∈ (−∞, ∞).

4. The generalised extreme value distribution: F(x) = exp[−{1 + ξ(x − µ)/σ}_+^{−1/ξ}], where µ, σ > 0 and ξ are location, scale and shape parameters respectively and s_+ = max(s, 0).

5. Poisson distribution: F(x) = ∑_{k=0}^{x} e^{−λ} λ^k / k!.

Answer: Here, F−1(u) is the smallest integer x for which F (x) ≥ u, etc.

6. The discrete distribution with probability mass function Pr(X = x) = cx for x ∈ {1, 2, 3, 4, 5}.
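Here is the hedged R sketch of inversion referred to above, covering the exponential case (exercise 1) and the discrete Poisson case (exercise 5); the function names rexp.inv and rpois.inv are ours:

rexp.inv <- function(n, lambda = 1) {
  # Continuous inversion: X = -log(U)/lambda, using U ~ U[0,1]
  # directly, since 1 - U is also U[0,1].
  -log(runif(n)) / lambda
}

rpois.inv <- function(n, lambda = 1) {
  # Discrete inversion: F^{-1}(u) is the smallest integer x with F(x) >= u.
  U <- runif(n)
  sapply(U, function(u) {
    x <- 0
    while (ppois(x, lambda) < u) x <- x + 1
    x
  })
}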

In principle, inversion can always be used provided that the distribution function F(·), and hence its inverse function F^{−1}(·), are known. In some cases, though, it is difficult to evaluate F^{−1}(·), and alternative techniques are more efficient.

4.2.2 The rejection method

Suppose we wish to simulate random variables X1, X2, . . . from the density f, and have a method available for generating Y1, Y2, . . . from the density g. The rejection method is based on rejecting or accepting each Yi value according to some probability, which we denote by h(Y). If Y is accepted then we set X = Y; if Y is rejected then the next Y value is considered, and so on. The form of the algorithm is as follows:

1. Generate Y from g.

2. With probability h(Y ), set X = Y ; else return to 1.

Clearly, if h(Y) is always small, then many of the Y values will be rejected and the algorithm will be inefficient. The clever part is choosing the acceptance probability h(Y) so that the accepted values are from the density f and the algorithm is also efficient.

Our analysis of the rejection method starts by noting that

Pr(Y ≤ x and Y is accepted) = ∫_{−∞}^{x} g(y)h(y) dy,

so

Pr(Y is accepted) = ∫_{−∞}^{∞} g(y)h(y) dy.

Combining these as a conditional probability, we therefore have

Pr(Y ≤ x | Y is accepted) = ∫_{−∞}^{x} g(y)h(y) dy / ∫_{−∞}^{∞} g(y)h(y) dy.

This shows that the accepted values have density g(x)h(x) / ∫_{−∞}^{∞} g(y)h(y) dy.

Now, if f(x)/g(x) ≤ M < ∞ for each x with g(x) > 0, for some fixed M > 0, we may take h(x) = f(x)/{g(x)M}. Under this choice of h(x), the accepted values have density

[ g(x)f(x)/{g(x)M} ] / ∫_{−∞}^{∞} g(y)f(y){g(y)M}^{−1} dy = f(x)M^{−1} / ∫_{−∞}^{∞} f(y)M^{−1} dy = f(x),

which is the target density. Furthermore, we have that

Pr(Y is accepted) = ∫_{−∞}^{∞} g(y)f(y){g(y)M}^{−1} dy = M^{−1}.

Hence the number of proposals until a Y is accepted is geometrically distributed with mean M.

General rejection sampling algorithm: To sample from f(x) where f(x) ≤ Mg(x) for all x. The following version of the algorithm is suitable for computation and also provides a geometric motivation for how rejection sampling works:

1. Generate Y from the density g(y), and then X from U [0,Mg(Y )].

2. Accept Y if X ≤ f(Y ).

3. Repeat.

Note that the event X ≤ f(Y) occurs with probability f(Y){g(Y)M}^{−1}, which is the acceptance probability given above.

We now examine the geometric aspects of rejection sampling. Suppose we wish to simulate from a Beta(2, 3) distribution. This has density function 12x(1 − x)², which has maximum value 16/9. Thus we may bound the Beta(2, 3) density with a rectangle of this height (or any height greater than 16/9). We generate points uniformly over this rectangle: those above the Beta(2, 3) density are rejected. The x-components of the points that remain are a sample from the Beta(2, 3) distribution. See Figure 4.1.

The efficiency of the method is governed by how many points are rejected – a feature that is determined by how similar f(x) and the bounding function are. The general form of the algorithm allows the bounding function, which is usually called the envelope, to be of the form Mg(x) instead of flat, but the idea is the same in essence.
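As a hedged sketch of this flat-envelope scheme (the function name rbeta23.rej is ours), the Beta(2, 3) example might be coded as:

rbeta23.rej <- function(nvals = 1000) {
  f <- function(x) 12 * x * (1 - x)^2      # Beta(2,3) density
  M1 <- 16/9                               # its maximum value on [0,1]
  out <- numeric(0)
  while (length(out) < nvals) {
    Y <- runif(1)                          # proposal from U[0,1]
    X <- runif(1, 0, M1)                   # uniform height under the flat envelope
    if (X <= f(Y)) out <- c(out, Y)        # keep the x-coordinate if it lies below the density
  }
  out
}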


[Figure 4.1 appears here: panel titled "Rejection sampling with Uniform envelope"; x-axis 0.0–1.0, y-axis 0.0–1.5.]

Figure 4.1: Rejection sampling for the Beta(2, 3) distribution using a uniform envelope.

An important feature of the rejection method is that we need to know f only up to a normalising constant. For example, suppose that f(x) = k f1(x) where k is unknown. Then provided f1(x)/g(x) ≤ M < ∞ for all x, we may ignore the unknown value k and define h1(x) = f1(x)/{g(x)M}. It is easy to check that setting X = Y with probability h1(Y) yields an observation from f.

Example: The Beta(a, b) distribution

This has density f(x) = k_{ab} f1(x), where k_{ab} is a constant and f1(x) = x^{a−1}(1 − x)^{b−1} on x ∈ [0, 1].

• The Beta(a, b) density is bounded if and only if a ≥ 1 and b ≥ 1. When this is the case, we may bound f1(x) with a uniform envelope. It is straightforward to show that the maximum value of f1(x) is M1 = (a − 1)^{a−1}(b − 1)^{b−1}/(a + b − 2)^{a+b−2}, so the algorithm is as follows: (1) Generate Y ∼ U[0, 1]. (2) Generate X ∼ U[0, M1]. (3) Accept Y if X ≤ f1(Y).

• If a ∈ (0, 1) and b ≥ 1, for example, then we cannot get away with a flat envelope. Instead, we take Y = U^{1/a} where U ∼ U[0, 1], so that g(x) = a x^{a−1}. Then f1(x)/g(x) = (1 − x)^{b−1}/a is bounded by a^{−1} for all x, and the algorithm is therefore: (1) Generate Y = U^{1/a}. (2) Generate X ∼ U[0, Y^{a−1}]. (3) Accept Y if X ≤ f1(Y). R/S-Plus code that implements this algorithm is given below, and the output is shown in Figure 4.2.

sim.fn2 <- function(a = 0.5, b = 2, nvals = 1000) {
  # Rejection sampling for the Beta(0.5, 2) distribution.
  X.accepted <- NULL
  i <- 0
  repeat {
    Y <- runif(1)^(1/a)
    X <- runif(1, 0, Y^(a - 1))
    if (X <= Y^(a - 1) * (1 - Y)^(b - 1)) {
      i <- i + 1
      X.accepted[i] <- Y
    }
    if (i >= nvals) break
  }
  answer <- X.accepted
  answer
}

simvals <- sim.fn2(0.5, 2, 5000)   # Sample of size 5000
par(mfcol = c(1, 2))               # Sets plot window to be 1 row and 2 columns
hist(simvals, xlim = c(0, 1), prob = T, breaks = 25)
xvals <- seq(0.01, 1, 0.01)
lines(xvals, dbeta(xvals, 0.5, 2))
qqplot(simvals, qbeta(ppoints(5000), 0.5, 2), ylab = 'True', main = "QQplot of simvals")

[Figure 4.2 appears here: left panel "Histogram of simvals" (Relative Frequency against simvals), right panel "QQplot of simvals" (True against simvals).]

Figure 4.2: Simulated Beta(0.5, 2) values (true Beta(0.5, 2) density superimposed) and QQ-Plot.

The skill in using the rejection method comes in finding a ‘good’ envelope function from which it is simple to simulate, where ‘good’ means that f and g match each other well. Rejection may be used to simulate from discrete distributions also, but the envelopes required are very unusual.

Example: Two independent N(0, 1) variables (The Polar Algorithm)

We have already encountered a way of generating two independent N(0, 1) variables with the Box-Muller algorithm. That method requires evaluating two trigonometric functions, which is ‘expensive’ in terms of computer time. The following algorithm avoids this by using a rejection step, and is usually faster than the Box-Muller algorithm.

The Polar Algorithm (for two independent N(0, 1) variables)

1. REPEAT: Generate independent V1 ∼ U[−1, 1] and V2 ∼ U[−1, 1] until W = V1² + V2² ≤ 1.

2. Let C = √(−2W^{−1} log W).

3. Set X = CV1 and Y = CV2; then X and Y are independent N(0, 1).

Proof: Step 1 is a rejection step that leaves (V1, V2) uniformly distributed over the unit disc v1² + v2² ≤ 1. Let (R, Θ) denote (V1, V2) in polar coordinates. Writing W = R², we have that (W, Θ) is uniform over [0, 1] × [0, 2π]. This follows because

r² = x² + y²,  θ = arctan(y/x),

∂θ/∂y = x/(x² + y²),  ∂θ/∂x = −y/(x² + y²),  ∂W/∂y = 2y,  ∂W/∂x = 2x,

so the Jacobian in the change-of-variable formula is a constant. Hence W and Θ are independent uniform variables.

Set E = −log W. We know, by the Box-Muller method, that X = √(2E) cos Θ and Y = √(2E) sin Θ are independent N(0, 1). Substituting, we obtain X = √(2E) cos Θ = √(−2 log W) (V1/√W) = CV1, and similarly Y = √(2E) sin Θ = CV2. □
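A minimal R sketch of the algorithm (the function name polar.pair is ours; a guard against the probability-zero event W = 0 is included):

polar.pair <- function() {
  # Returns one pair of independent N(0,1) values.
  repeat {
    V1 <- runif(1, -1, 1)
    V2 <- runif(1, -1, 1)
    W <- V1^2 + V2^2
    if (W <= 1 && W > 0) break             # accept only points inside the unit disc
  }
  C <- sqrt(-2 * log(W) / W)
  c(C * V1, C * V2)
}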

4.2.3 Ratio of Uniforms

The idea here is that independent uniform random variables, U and V say, are simulated, and those that fall outside some set are discarded. The ratio V/U is then calculated for those points inside the set. The ratio values obtained are used as observations from the required distribution. The clever part here is choosing the form of the set so that the ratios have the required distribution.

Example: The Cauchy distribution.

This has density function proportional to (1 + x²)^{−1} on x ∈ (−∞, ∞).

Let (U, V) be uniformly distributed on the unit disc. Then V/U has the same distribution as the ratio of two independent N(0, 1) variables (see the Polar algorithm, above), and this is the distribution of tan Θ where Θ is uniform on [0, 2π]. Thus V/U is Cauchy distributed.

Thus a simple algorithm for generating Cauchy variables is

1. REPEAT: Generate independent U ∼ U[−1, 1] and V ∼ U[−1, 1] until U² + V² ≤ 1.

2. Set X = V/U .
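A minimal R sketch of this algorithm (the function name rcauchy.ratio is ours):

rcauchy.ratio <- function(n = 1000) {
  out <- numeric(0)
  while (length(out) < n) {
    U <- runif(1, -1, 1)
    V <- runif(1, -1, 1)
    if (U^2 + V^2 <= 1) out <- c(out, V / U)   # keep the ratio for points inside the disc
  }
  out
}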

General ratio of uniforms

In principle, the ratio of uniforms method can be used to generate from an arbitrary density, and also when the density is known only up to a constant of proportionality. This is shown by the following theorem:


Theorem 1 Let h(x) be a non-negative function with ∫ h(x) dx < ∞, and define the set

C_h = {(u, v) : 0 ≤ u ≤ √(h(v/u))}.

Then C_h has finite area, and if (U, V) is uniformly distributed over C_h then X = V/U has density h(x)/∫ h(x) dx.

Proof: Let the area of the set C_h be ∆_h, so that ∆_h = ∫∫_{C_h} du dv. Changing variable according to (u, v) → (u, x = v/u) gives

∆_h = ∫∫_{C_h} du dv = ∫ ∫_{0}^{√(h(x))} u du dx = ∫ h(x)/2 dx < ∞.

Now, since (U, V) has joint density 1/∆_h, the joint density of (U, X) is u/∆_h, so X has marginal density

(1/∆_h) ∫_{0}^{√(h(x))} u du = h(x)/(2∆_h) = h(x)/∫ h(x) dx, as required. □

Note: This result is most useful when C_h is contained in some rectangle [0, a] × [b−, b+], as rejection sampling may then be used to sample (U, V) pairs uniformly from C_h. The algorithm is then

1. REPEAT: Generate independent U ∼ U [0, a] and V ∼ U [b−, b+] until (U, V ) ∈ Ch.

2. Let X = V/U .

The following result may be useful for constructing such a rectangle.

Theorem 2 Suppose h(x) and x²h(x) are bounded. Then C_h ⊂ [0, a] × [b−, b+] where

a = √(sup h),  b+ = √(sup{x²h(x) : x ≥ 0}),  b− = −√(sup{x²h(x) : x ≤ 0}).

Proof: It is obvious that 0 ≤ u ≤ √(h(v/u)) ≤ √(sup h). For v ≥ 0 to be possible, there must exist some u > 0 with 0 < u² ≤ h(v/u). Writing t = v/u, this implies there exists t > 0 with v² ≤ t²h(t). Thus (u, v) ∈ C_h implies v² ≤ b+², i.e. v ≤ b+. The case v < 0 follows similarly. □

Examples

1. The exponential distribution with density h(x) = e^{−x} on (0, ∞). Then a = 1, b− = 0 and b+ = 2/e, and (u, v) ∈ C_h is equivalent to u² ≤ e^{−v/u}, or equivalently, v ≤ −2u log u.

2. The normal distribution with density proportional to exp(−x²/2). Here, a = 1, b+² = b−² = 2/e, and (u, v) ∈ C_h is equivalent to v² ≤ −4u² log u.
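As a hedged illustration of example 1, the exponential case might be coded as follows (the function name rexp.ratio is ours; the bounding rectangle is [0, 1] × [0, 2/e]):

rexp.ratio <- function(n = 1000) {
  out <- numeric(0)
  while (length(out) < n) {
    u <- runif(1, 0, 1)
    v <- runif(1, 0, 2 / exp(1))
    if (v <= -2 * u * log(u)) out <- c(out, v / u)   # accept when (u, v) lies in C_h
  }
  out
}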

4.2.4 Composition

This method is very useful for simulating from mixture distributions. A mixture distribution is one that has density f(x) = ∑_{i=1}^{r} p_i f_i(x) where, for each i, f_i(x) is a density function and {p_1, . . . , p_r} is a set of weights satisfying ∑_{i=1}^{r} p_i = 1. The value of p_i may be interpreted as the probability that an arbitrary observation comes from the distribution with density f_i(x). Mixture distributions are used in a variety of situations, e.g. experiments with contamination, or inhomogeneous populations.


To sample from f(x) we choose an index i* from {1, . . . , r} according to the probabilities {p_1, . . . , p_r}, and then simulate a value from f_{i*}(x).

Example: Here we simulate from a mixture of two normals and display the results.

sim.fn3 <- function(p1 = 0.3, m1 = 0, m2 = 1, v1 = 1, v2 = 1, nvals = 100) {
  # p1 is the first mixing weight. p2 = 1 - p1.
  # The two distributions are N(m1, v1) and N(m2, v2).
  X.vals <- NULL
  i <- 0
  repeat {
    i <- i + 1
    u <- runif(1)
    if (u <= p1) {
      dum <- rnorm(1, m1, sqrt(v1))
    }
    else {
      dum <- rnorm(1, m2, sqrt(v2))
    }
    X.vals[i] <- dum
    if (i >= nvals) break
  }
  answer <- X.vals
  answer
}

hist(sim.fn3(0.3, 0, 5, 0.5, 1, 10000), xlab = 'Simulated values', prob = T,
     breaks = 25, ylim = c(0, 0.28), main = "")
title(main = 'Non-symmetric mixture of two normals')
x <- seq(-2, 8, 0.01)
lines(x, 0.3 * dnorm(x, 0, sqrt(0.5)) + 0.7 * dnorm(x, 5, 1))

4.3 Multivariate distributions

Generating from general multivariate distributions is much more complicated than generating from univariate distributions. The reason for this is that there are two separate aspects to consider: (1) each (univariate) marginal distribution, and (2) their dependence structure. Simulating from the joint distribution would be straightforward if the marginal distributions were independent, as each marginal distribution could be handled separately using the univariate techniques we have encountered already. However, when there is dependence, the problem becomes more complicated. We will focus on the multivariate normal distribution, and then give a brief but more general treatment.

4.3.1 The multivariate normal distribution

This is the most commonly used multivariate distribution and is one of the easiest from which to simulate. There are two general methods in common use: one relies on the special structure of the multivariate normal distribution, the other is based on a more general approach.


[Figure 4.3 appears here: histogram titled "Non-symmetric mixture of two normals" (Relative Frequency against Simulated values, range −2 to 8), with the true mixture density superimposed.]

Figure 4.3: Mixture of normals: p1 = 0.3, p2 = 1 − p1 = 0.7, µ1 = 0, µ2 = 5, σ1² = 0.5, σ2² = 1.

Let X be a p-dimensional multivariate normal random variable. Then the density of X is given by

{det(2πΣ)}^{−1/2} exp{−(1/2)(x − µ)′ Σ^{−1} (x − µ)},

where µ ∈ R^p is the mean and Σ is the p × p positive-definite variance-covariance matrix of X. The usual notation is to write X ∼ N(µ, Σ). Note: x′ means the transpose of x.

Method 1: This method is based on factorising the variance-covariance matrix as Σ = SS′ for some p × p matrix S. Providing we are able to do this factorisation, then we are able to simulate X as follows:

• Take Z1, . . . , Zp independent univariate N(0, 1), and let Z′ = (Z1, . . . , Zp).

• Set X = µ + SZ. Then X ∼ N(µ, Σ).

Proof: The distribution of X = µ + SZ is multivariate normal (with some mean and some variance-covariance matrix) because X is a linear function of Z, which is multivariate normal. The mean of X is given by

E(X) = E(µ + SZ) = µ + S E(Z) = µ  since E(Z) = 0,

and the variance-covariance matrix of X is given by

E{(X − µ)(X − µ)′} = E{(SZ)(SZ)′} = S E(ZZ′) S′ = S I_p S′ = Σ,

where I_p denotes the order-p identity matrix. Thus X ∼ N(µ, Σ). □

It is always possible to express the variance-covariance matrix Σ as Σ = SS′. For example, the Cholesky decomposition of Σ may be used to obtain a unique lower-triangular matrix L with LL′ = Σ. For more details, look at the function chol() in the R/S-Plus help system.
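A hedged sketch of Method 1 in R (the function name rmvnorm.chol is ours; note that chol() returns the upper-triangular factor, which we transpose):

rmvnorm.chol <- function(mu, Sigma) {
  # One draw from N(mu, Sigma) via X = mu + L Z with L L' = Sigma.
  p <- length(mu)
  L <- t(chol(Sigma))        # lower-triangular Cholesky factor
  Z <- rnorm(p)              # independent N(0,1)
  as.vector(mu + L %*% Z)
}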

Method 2: This method generates the p-dimensional variable via a sequence of p separate univariate simulations. We write X = µ + Y where Y = (Y1, . . . , Yp)′ ∼ N(0, Σ), and generate the value y1 from the univariate distribution of Y1, then we generate y2 from the distribution of Y2 | Y1 = y1, then y3 from the distribution of Y3 | {Y1 = y1, Y2 = y2}, etc. This conditioning approach may be used for general multivariate distributions. Note: A | B means A conditional on B.

For the multivariate normal distribution, it can be shown that all of the conditional distributions are univariate normal. More precisely, letting A_k denote the upper k × k sub-matrix of Σ and a′ = (Σ_{1k}, . . . , Σ_{k−1,k}), the conditional distribution of Y_k given W_k = (Y_1, . . . , Y_{k−1})′ is univariate normal with mean a′A_{k−1}^{−1} W_k and variance Σ_{kk} − a′A_{k−1}^{−1} a.

General multivariate distributions

Various special algorithms exist for simulating from specific multivariate distributions — see Ripley's book. The only general method we will look at is that used in Method 2 for the multivariate normal distribution.

To generate Y ∈ R^p with multivariate distribution F(y) = Pr(Y1 ≤ y1, . . . , Yp ≤ yp):

• Generate y1 from the marginal distribution of Y1, i.e. from the distribution F(y1, ∞, . . . , ∞).

• Generate y2 from the conditional distribution of Y2|Y1 = y1.

• Generate y3 from the conditional distribution of Y3|{Y1 = y1, Y2 = y2}, etc.

The reason this works is that the density of Y may be factorised into a sequence of conditional densities as follows:

f(y1, . . . , yp) = f(y1) · [f(y1, . . . , yp)/f(y1)] = f(y1) f(y2, . . . , yp | y1)
               = f(y1) · [f(y1, y2)/f(y1)] · [f(y1, . . . , yp)/f(y1, y2)] = f(y1) f(y2 | y1) f(y3, . . . , yp | y1, y2)
               ...
               = f(y1) f(y2 | y1) f(y3 | y1, y2) · · · f(yp | y1, . . . , yp−1).

The above scheme corresponds exactly to simulating sequentially from these conditional densities. There are subtle similarities between the method here and the composition method we used for mixtures.


4.4 Monte-Carlo integration

Why do we want all these random numbers? Several uses of them will be examined later in the course, but here we will look at using them to evaluate integrals. This method of evaluating integrals by simulation of random numbers is called Monte-Carlo integration.

In many statistical contexts, a quantity of interest may be expressed as an integral. For example, the mean and variance of a random variable X are given by µ = E(X) = ∫ x f(x) dx and E{(X − µ)²} = ∫ (x − µ)² f(x) dx respectively, where f(x) denotes the density of X. It may be difficult to evaluate such integrals theoretically, but if we have a method for simulating values from f(x) we may estimate them by replacing the theoretical expectation by a sample-based mean. Thus a Monte-Carlo based estimator of µ is µ̂ = n^{−1} ∑_{i=1}^{n} X_i, where X_1, . . . , X_n is a sample from f(x), and a Monte-Carlo based estimator of the variance is (n − 1)^{−1} ∑_{i=1}^{n} (X_i − µ̂)².

The idea is to express the quantity of interest as an integral with respect to a density f(x) and then estimate this integral using a random sample from f(x). Evaluating integrals using this Monte-Carlo approach is extremely simple but surprisingly versatile.

In general, suppose we are interested in a quantity θ that may be written as the expected value of some function φ(X), where the expectation is taken with respect to the distribution of X. Then θ = ∫ φ(x)f(x) dx, so if X1, . . . , Xn is a sample from f(x), the Monte-Carlo estimator of θ is

θ̂ = n^{−1} ∑_{i=1}^{n} φ(X_i).

Now E(θ̂) = n^{−1} ∑_{i=1}^{n} E{φ(X_i)} = E{φ(X_i)} = θ, so θ̂ is unbiased for θ. The variance of θ̂ is

n^{−1} ∫ {φ(x) − θ}² f(x) dx,

so the precision of θ̂ is proportional to n^{−1/2}. This result is very different to the precision obtained using numerical integration, which can use n points to achieve a precision of n^{−4} or better. Although Monte-Carlo integration is very simple to implement, this is at the expense of θ̂ possibly having high variability. Thus large samples may be needed in some cases in order to obtain adequate results. We cannot alter the fact that var(θ̂) = cn^{−1}, but we can strive to make the constant c small.

Example: Suppose we want to work out θ = Pr(X² + 2Y³ > 3) where (X, Y) is bivariate standard normally distributed with correlation ρ = 0.25. Then

θ = ∫∫ I_{[x²+2y³>3]}(x, y) f(x, y) dx dy     (4.1)

where

I_A(x, y) = 1 if (x, y) ∈ A, and 0 if (x, y) ∉ A,

and

f(x, y) = {2π√(1 − ρ²)}^{−1} exp{−(x² − 2ρxy + y²)/(2(1 − ρ²))}  for ρ = 0.25.

Evaluating the integral in equation (4.1) using analytic or numerical methods is not straightforward. However, estimating the integral using a Monte-Carlo approach is very easy.

Given below is a routine that generates from the bivariate standard normal distribution, a function that performs a single Monte-Carlo integration, and a function that repeatedly performs these integrations and may be used to assess the variability of the procedure. The R/S-Plus code at the end generates the histogram in Figure 4.4. The bivariate normal generation routine is based on the conditioning argument given previously.


sim.bvn <- function(nvals = 1000, rho = 0.25) {
  # Routine that generates from a bivariate standard normal
  # with correlation rho.
  X <- rnorm(nvals, 0, 1)                       # These are from N(0,1)
  Y <- rnorm(nvals, rho * X, sqrt(1 - rho^2))   # These are N(rho*x, 1 - rho^2)
  answer <- cbind(X, Y)
  answer
}

monte.carlo.fn1 <- function(nvals = 1000, rho = 0.25) {
  simvals <- sim.bvn(nvals, rho)
  X <- simvals[, 1]                             # X gets column 1 of simvals
  Y <- simvals[, 2]                             # Y gets column 2 of simvals
  dummy <- X^2 + 2 * Y^3
  answer <- sum(dummy > 3) / length(dummy)
  answer
}

monte.carlo.fn2 <- function(nreps = 100, nvals = 1000, rho = 0.25) {
  # nreps is how many repetitions
  # nvals is the size of each simulation
  # rho is the correlation
  est.probs <- NULL
  for (i in 1:nreps) {
    est.probs[i] <- monte.carlo.fn1(nvals, rho)
  }
  answer <- est.probs
  answer
}

simvals <- monte.carlo.fn2(100, 25000, 0.25)
hist(simvals)

Clearly, there is some variation in the estimates of θ, but if we require only a rough estimate of θ then this Monte-Carlo method may be adequate. Increasing the size of each simulation (i.e. increasing nvals) will reduce the variability, of course, but this may prove expensive in computer time. Remember: the variance is O(n^{−1}), so the standard deviation, which measures precision on the θ scale, is O(n^{−1/2}). Loosely then, doubling the sample size may reduce the standard deviation by a factor of about √2.

4.5 Variance reduction techniques

As seen above, it is natural to seek to reduce (and minimise, if possible) the variability in the Monte-Carlo estimator of θ, as then adequate answers may be obtained with a smaller computational effort. Several such variance reduction techniques exist – our study of them starts with a simple example from Ripley's book, and then goes on to look at some general methods.



Figure 4.4: Histogram of 100 independent Monte-Carlo estimates of θ = Pr(X² + 2Y³ > 3). Each estimate used a sample of 25,000 points.

4.5.1 An extended example

Suppose X is Cauchy distributed, so that the density and distribution function of X are f(x) = {π(1 + x²)}^{−1} and F(x) = π^{−1} arctan x + 1/2 respectively, and that we wish to estimate θ = Pr(X > 2).

Method 1: Generate Cauchy random variables X1, . . . , Xn and set θ̂ = n^{−1} ∑_{i=1}^{n} I(X_i > 2). This is the basic Monte-Carlo method, and will be used as our benchmark.

The variability of θ̂ may be obtained by noting that nθ̂ ∼ Binomial(n, θ), so that var(θ̂) = θ(1 − θ)/n. Now

θ = 1 − F(2) = 1/2 − π^{−1} arctan 2 ≈ 0.1476,

so var(θ̂) ≈ 0.126/n. Our intention is to improve on this, i.e. to develop Monte-Carlo estimators of θ that have greater precision.

Method 2: Clearly, θ = (1/2) Pr(|X| > 2). Hence, to estimate θ using this formula, we generate X1, . . . , Xn from the Cauchy distribution and set θ̂ = (2n)^{−1} ∑_{i=1}^{n} I(|X_i| > 2).

Here, 2nθ̂ ∼ Binomial(n, 2θ), so var(θ̂) ≈ 0.052/n, which is a variance reduction of approximately 2.4 times compared to the variability of the Method 1 estimator.

Method 3: It is clear that

1 − 2θ = ∫_{−2}^{2} f(x) dx = 2 ∫_{0}^{2} f(x) dx,

where f(x) is the Cauchy density function, so θ = 1/2 − ∫_{0}^{2} {2f(x)} (1/2) dx. The reason why we write the integral in this way is that we may then use a U[0, 2] simulation as the basis of a Monte-Carlo method.

• Generate X1, . . . , Xn from U[0, 2].

• Set θ̂ = 1/2 − n^{−1} ∑_{i=1}^{n} 2f(X_i).

Exercise: Verify the following results, where the expectation and variance are with respect to the U[0, 2] distribution: E{2f(X)} = π^{−1} arctan 2, var{2f(X)} = (2 + 5 arctan 2 − 5 arctan² 2)/(5π²), and hence var(θ̂) ≈ 0.029/n. This method has a variance that is a factor of about 4.4 times smaller than that of Method 1.

Method 4: Note that

θ = ∫_{2}^{∞} dx/{π(1 + x²)} = ∫_{0}^{1/2} y^{−2} dy/{π(1 + y^{−2})} = ∫_{0}^{1/2} dy/{π(1 + y²)} = ∫_{0}^{1/2} {f(y)/2} · 2 dy.

Thus, to estimate θ via this approach, we

• Generate X1, . . . , Xn from U[0, 1/2].

• Evaluate θ̂ = n^{−1} ∑_{i=1}^{n} f(X_i)/2.

Exercise: Show that the variance of f(X_i)/2 taken with respect to a U[0, 1/2] distribution is {2 + 5 arctan(1/2) − 20 arctan²(1/2)}/(20π²), and hence that var(θ̂) ≈ 9.55 × 10^{−5}/n. This method has a variance that is a factor of about 1300 times smaller than that of Method 1.
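As a hedged sketch (variable names are ours), the benchmark and Method 4 estimators can be compared directly in R; the exact value 1/2 − arctan(2)/π is included for reference:

n <- 10000
theta1 <- mean(rcauchy(n) > 2)                 # Method 1: indicator applied to a Cauchy sample
X <- runif(n, 0, 1/2)
theta4 <- mean(1 / (2 * pi * (1 + X^2)))       # Method 4: average of f(X)/2 for X ~ U[0, 1/2]
c(theta1, theta4, 0.5 - atan(2) / pi)          # the two estimates and the exact value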

4.5.2 Importance sampling

This is a general variance reduction method for Monte-Carlo integrals. Suppose we want to estimate θ = E{φ(X)}. Then some values of X may be more important than others for determining θ. Thus a good (i.e. less variable) estimate of θ may be obtained if these values are sampled more frequently than those which have lesser importance. A simple example is to take θ as the probability of some rare event — the only way to estimate θ to an acceptable accuracy may be to produce rare events more frequently.

We wish to calculate

θ = ∫ φ(x)f(x) dx = ∫ {φ(x)f(x)/g(x)} g(x) dx = ∫ ψ(x)g(x) dx,

where ψ(x) = φ(x)f(x)/g(x). Hence if X1, . . . , Xn is a sample from g(x), then

θ̂_g = n^{−1} ∑_{i=1}^{n} ψ(X_i)

is an unbiased estimator of θ. Think of this as a weighted sum of the φ(X_i) values, where the weights are proportional to f(X_i)/g(X_i). The variance of θ̂_g is given by

var(θ̂_g) = n^{−1} ∫ {ψ(x) − θ}² g(x) dx = n^{−1} ∫ {φ(x)f(x)/g(x) − θ}² g(x) dx,


which may be small provided g(x) is chosen to make ψ(x) = φ(x)f(x)/g(x) nearly constant. [It is straightforward to show that the variance of θ̂_g is minimised when g(x) ∝ |φ(x)f(x)|, but this is often not practical for applications.]

Example: We consider the previous Cauchy example, where θ = Pr(X > 2) and φ(x) = I(x > 2).

We aim to choose a density g(x) that has the same support as |φ(x)f(x)|, which is the region x > 2. Now, on x > 2, f(x) = {π(1 + x²)}^{−1} is approximately proportional to g(x) = 2/x², which is straightforward to sample from using X = 2/U where U ∼ U[0, 1]. Thus, on x > 2,

ψ(x) = f(x)/g(x) = {π(1 + x²)}^{−1} · x²/2 = {2π(1 + x^{−2})}^{−1},

and X^{−1} ∼ U[0, 1/2]. Thus, when applied to this example, importance sampling is equivalent to the approach of Method 4 given previously.
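A hedged R sketch of this importance-sampling estimator (variable names are ours):

n <- 10000
X <- 2 / runif(n)                              # X has density g(x) = 2/x^2 on x > 2
theta.hat <- mean(1 / (2 * pi * (1 + X^(-2)))) # average of psi(X) = f(X)/g(X)
theta.hat                                      # compare with 0.5 - atan(2)/pi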

4.5.3 Control variables and antithetic variables

Both control and antithetic variable methods are based on using quantities that vary with the quantity of interest in order to reduce the variability of an estimator.

Suppose we wish to estimate θ = E(Z) where Z = φ(X).

Control variables: A control variable is another observation W = ψ(X) that has a known mean and varies with Z = φ(X). The approach is to estimate θ by averaging values of Z − {W − E(W)}. Hence, using the sample X1, . . . , Xn, an unbiased estimator of θ is given by

θ̂ = E{ψ(X)} + n^{−1} ∑_{i=1}^{n} {φ(X_i) − ψ(X_i)} = E(W) + n^{−1} ∑_{i=1}^{n} (Z_i − W_i).

The variance of θ̂ is given by var(θ̂) = n^{−1}{var(Z) − 2cov(W, Z) + var(W)}, which can be low if cov(W, Z) is large, i.e. if W is chosen so that it matches Z closely.

Example: The Cauchy distribution, with θ = 1/2 − ∫_{0}^{2} {2f(x)} (1/2) dx as before.

We choose (arbitrarily) to use the four variables X, X², X³ and X⁴ as our control variables. Of course, we could have used fewer or more than these four, and chosen different variables. The first thing to do is evaluate the mean of each control variable with respect to the U[0, 2] density. Elementary calculations yield E(X) = 1, E(X²) = 4/3, E(X³) = 2, and E(X⁴) = 16/5. In theory, any function of the form

W = α1(X − 1) + α2(X² − 4/3) + α3(X³ − 2) + α4(X⁴ − 16/5), for α1, . . . , α4 ∈ R,

may be used as a control variable. However, to reduce the variance we would like to choose the αi values so that this expression agrees closely with Z, or equivalently, with 2f(X).

To choose appropriate αi values we generated 100 independent Xi values from U[0, 2], and used multiple regression to regress 2f(Xi) against the explanatory variables (Xi − 1), (Xi² − 4/3), (Xi³ − 2) and (Xi⁴ − 16/5). R/S-Plus code that performs this is given below, with the resulting output.

X <- runif(100,0,2)

vals <- 2 / (pi * (1 + X^{2}))

X1 <- X - 1

X2 <- X^2 - 4/3

X3 <- X^3 - 2

X4 <- X^4 - 16/5


lm(vals ~ X1 + X2 + X3 + X4)

Coefficients:

(Intercept) X1 X2 X3 X4

0.3522507 -0.03618723 -0.6609859 0.4760885 -0.1006712

Note: these coefficients depend on the random Xi sample — if you repeat the above commands you will get different values. Combining the above, for Xi ∼ U[0, 2], we obtain the estimator

θ̂ = 1/2 − n^{−1} ∑_{i=1}^{n} {2f(X_i) + 0.0362(X_i − 1) + 0.661(X_i² − 4/3) − 0.476(X_i³ − 2) + 0.101(X_i⁴ − 16/5)}.

I tried this scheme with n = 1,000 and obtained θ̂ = 0.1476.

Antithetic variables: These come in pairs. Suppose that Z* has the same distribution as Z, and that corr(Z, Z*) < 0, i.e. Z and Z* are negatively correlated. Then an unbiased estimator of θ is θ̂ = (Z + Z*)/2. This has variance

var(θ̂) = {2var(Z) + 2cov(Z, Z*)}/4 = {var(Z)/2}{1 + corr(Z, Z*)}.

Thus to reduce the variance of θ̂, we aim to make corr(Z, Z*) negative and close to −1 if possible. To quantify this further, we obtain a more precise estimator of θ from n pairs (Z_i, Z_i*) than we do from 2n observations of Z_i provided corr(Z, Z*) < 0.

A standard (and simple) method of generating random variables that have the same distribution and are negatively correlated is the following: set Z = F^{−1}(U) and Z* = F^{−1}(1 − U) where U ∼ U[0, 1].

Applying this to the Cauchy example with θ = 1/2 − ∫_{0}^{2} {2f(x)} (1/2) dx, we obtain the estimator

θ̂ = 1/2 − n^{−1} ∑_{i=1}^{n} [ 1/{π(1 + X_i²)} + 1/{π(1 + (2 − X_i)²)} ],

where the X_i, and hence the 2 − X_i, are uniformly distributed random variables on [0, 2].
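A hedged R sketch of this antithetic estimator (variable names are ours):

n <- 10000
X <- runif(n, 0, 2)                            # X and 2 - X are both U[0,2] and negatively correlated
theta.hat <- 0.5 - mean(1 / (pi * (1 + X^2)) + 1 / (pi * (1 + (2 - X)^2)))
theta.hat                                      # compare with 0.5 - atan(2)/pi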

Exercises

1. Write an R/S-Plus function that performs the Box-Muller algorithm. Extend your function so that it generates independent N(µ1, σ1²) and N(µ2, σ2²) variables.

Generate 5,000 pairs (X, Y) where X ∼ N(2, 1) and Y ∼ N(10, 2) and X and Y are independent. Calculate the correlation between X and Y using your sample, and comment on the value. Plot separate histograms of the simulated X and Y values and superimpose the true density functions.

2. Write R/S-Plus functions that simulate from the exponential, Weibull, and generalised extreme value distributions using inversion. Assess the performance of your functions using a suitable graphical method.

3. Write an R/S-Plus function that simulates values from the discrete distribution with pmf Pr(X = x) = cx for x ∈ {1, 3, 5, 7, 9}. Assess the performance of your function using a suitable graphical method.


4. Write an R/S-Plus function that simulates from the Beta(3, 4) distribution. Assess the performance of your function using a suitable graphical method.

5. Write an R/S-Plus function that simulates from the Beta(2, 0.5) distribution. Assess the performance of your function using a suitable graphical method.

6. Write an R/S-Plus function that implements the Polar algorithm.

7. Write an R/S-Plus function that uses the ratio of uniforms method to simulate from (1) the exponential distribution and (2) the normal distribution.

8. Let (X, Y) denote a bivariate standard normal random variable with correlation ρ. Show that the conditional distribution of Y given X = x is N(ρx, 1 − ρ²).

Write an S-Plus function that generates from a general bivariate normal random variable (i.e. means and variances that are not necessarily 0 and 1 respectively).

9. Let (X, Y) be bivariate normally distributed, X ∼ N(2, 3), Y ∼ N(1, 1), and corr(X, Y) = 0.2. Use Monte-Carlo integration to estimate Pr(X² + Y⁴ > 6).

10. Write R/S-Plus functions that perform each of the four Monte-Carlo integration approaches in Section 4.5.1. Confirm that the formulae given for the variances of the estimators are consistent with results from your functions.

11. Use control variables X² and X⁴ for the Cauchy example. What is the variance of your estimator θ̂?

12. For the Cauchy example again, implement the antithetic variable estimator of θ. How does its performance compare to the control variable estimator in the previous exercise?

13. Let U1, . . . , Un be independent U[0, 1] random variables, and set Z = √(12/n) ∑_{i=1}^{n} (U_i − 1/2). Show that Z is approximately N(0, 1) distributed for large n, and investigate how small n must be for Z to behave appreciably differently to an N(0, 1) variable.

14. Show that if X ∼ Γ(n, λ), where n ≥ 1 is an integer, then X has the same distribution as T1 + · · · + Tn where the Ti are independent exponential(λ) random variables. Hence show that X may be generated via X = −(1/λ) ∑_{i=1}^{n} log U_i where the U_i are independent U[0, 1] random variables. Estimate Pr(3 ≤ X ≤ 4) where X ∼ Γ(5, 2).

15. Show that if T1, T2, . . . are independent Exponential(λ) and N is defined to be the greatest integer for which ∑_{i=1}^{N} T_i ≤ t, then N ∼ Poisson(λt). Write an S-Plus function that implements this, i.e. generate U1, U2, . . . and count how many terms are needed before −∑_{i=1}^{N} log U_i exceeds λt. Test the performance of your function.

16. Apply the ratio-of-uniforms method to the t-distribution with density proportional to (1 + x²/ν)^{−(ν+1)/2}.

17. Simulate from the discrete distribution on {2, 3, . . . , 12} that represents the sum of the outcomes of two dice throws.

R/S-Plus

The designers of the R/S-Plus package have included built-in functions that generate random variables from most of the distributions you are ever likely to want. For example, look at the help pages for the functions rnorm, rt, rbinom, rgamma, rpois, etc. In practice then, it is usually not necessary to manipulate a sample of U[0, 1] variables into a sample from the required distribution, as R/S-Plus does this for you automatically. However, now you should understand how it does this.
